ProtoFuzz: A Protobuf Fuzzer

At Trail of Bits, we occasionally perform source code audits. A recent one targeted an application that used Protocol Buffers extensively.

Google’s Protocol Buffers (protobuf) is a common method of serializing data, typically found in distributed applications. Protobufs simplify the generally error-prone task of parsing binary data by letting a developer define the type of data, and letting a protobuf compiler (protoc) generate all the serialization and deserialization code automatically.

Directly fuzzing a service that expects protobuf-encoded structures is not likely to achieve satisfactory code coverage. First, protobuf deserialization code is fairly mature and has seen scrutiny. Second, we are not typically interested in flaws in the protobuf implementation itself; our main goal is to target the code behind protobuf decoding. Our aim, then, is to create valid protobuf-encoded structures composed of malicious values.

This library is in sufficiently widespread use that we found it worthwhile to create a generic Protobuf message generator to help with assessments. The message generator is a Python3 library with a simple interface: provided a protobuf definition, it creates Python generators for various permutations of all defined messages. We call it ProtoFuzz.

For the data itself, we use the fuzzdb database as the source of generated values, but it’s relatively straightforward to define your own collection of values.

Installation

To install on Ubuntu:

pip install py3-protobuffers
sudo add-apt-repository -y ppa:5-james-t/protobuf-ppa
sudo apt-get -qq update
sudo apt-get -y install protobuf-compiler
git clone --recursive git@github.com:trailofbits/protofuzz.git
cd protofuzz/
python3 setup.py install

Usage

Message generation is handled by ProtobufGenerator instances, each of which wraps a class produced by the Protobuf compiler. A ProtobufGenerator provides two capabilities: fuzzing strategies and field dependencies.

A fuzzing strategy defines how fields are permuted. So far just two are defined: linear and permutation. The linear strategy creates a stream of protobuf objects that is the equivalent of Python’s zip() across all values that can be generated. The permutation strategy produces a stream that is the Cartesian product of all the values that can be generated. The linear() strategy can be used to get a sense of the kinds of values that will be generated without creating a multitude of objects.
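As a rough sketch of the difference (the message-creation functions are described in the next section, and the exact values depend on the bundled fuzzdb data, so the output is only illustrative), the two strategies can be compared side by side:

from protofuzz import protofuzz

fuzzers = protofuzz.from_description_string("""
    message Point {
     required int32 x = 1;
     required int32 y = 2;
    }
""")

# linear(): zip-like; one object per position across the value streams
for obj in fuzzers['Point'].linear():
    print("linear: {}".format(obj))

# permute(): Cartesian product; every combination of x and y values
for obj in fuzzers['Point'].permute():
    print("permute: {}".format(obj))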

Field dependencies force the values of some fields to be created from the values of others via any callable object. This is used for fields that probably shouldn’t be fuzzed, like lengths, CRC checksums, magic values, etc.

The entry point into the library is the `protofuzz.protofuzz` module. It defines three functions:

protofuzz.from_description_string()

Create a dict of ProtobufGenerator objects from a string Protobuf definition.

from protofuzz import protofuzz
message_fuzzers = protofuzz.from_description_string("""
    message Address {
     required int32 house = 1;
     required string street = 2;
    }
""")
for obj in message_fuzzers['Address'].permute():
    print("Generated object: {}".format(obj))
Generated object: house: -1
street: "!"

Generated object: house: 0
street: "!"

Generated object: house: 256
street: "!"

protofuzz.from_file()

Create a dict of ProtobufGenerator objects from a path to a .proto file.

from protofuzz import protofuzz
message_fuzzers = protofuzz.from_file('test.proto')
for obj in message_fuzzers['Person'].permute():
    print("Generated object: {}".format(obj))
Generated object: name: "!"
id: -1
email: "!"
phone {
  number: "!"
  type: MOBILE
}

Generated object: name: "!\'"
id: -1
email: "!"
phone {
  number: "!"
  type: MOBILE
}
...

protofuzz.from_protobuf_class()

Create a ProtobufGenerator from an already-loaded Protobuf class.
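For example, if you already have a protoc-generated Python module loaded, you can wrap one of its message classes directly. The addressbook_pb2 module below is a hypothetical protoc output used only for illustration:

from protofuzz import protofuzz
import addressbook_pb2  # hypothetical module generated by protoc

fuzzer = protofuzz.from_protobuf_class(addressbook_pb2.Person)
for obj in fuzzer.permute():
    print("Generated object: {}".format(obj))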

Creating Linked Fields

Some fields shouldn’t be fuzzed. For example, fields like magic values, checksums, and lengths should not be mutated. To this end, protofuzz supports resolving selected field values from other fields. To create a linked field, use ProtobufGenerator’s add_dependency method. Dependencies can also be created between nested objects. For example,

fuzzer = protofuzz.from_description_string('''
message Contents {
  required string header = 1;
  required string body = 2;
}
message Payload {
  required int32 length = 1;
  required Contents contents = 2;
}
''')

fuzzer['Payload'].add_dependency('length', 'contents.body', len)
for idx, obj in zip(range(3), fuzzer['Payload'].permute()):
  print("Generated object: {}".format(obj))
Generated object: length: 1
contents {
  header: "!"
  body: "!"
}

Generated object: length: 2
contents {
  header: "!"
  body: "!\'"
}

Generated object: length: 29
contents {
  header: "!"
  body: "!@#$%%^#$%#$@#$%$$@#$%^^**(()"
}
...

Miscellaneous

Although not directly related to fuzzing, ProtoFuzz also includes a simple logging class, implemented as a ring buffer, to aid in fuzzing campaigns. See protofuzz.log.

Conclusion

We created ProtoFuzz to assist with security assessments. It gave us the ability to quickly test message-handling code with minimal ramp-up.

The library itself is implemented with minimal dependencies, making it appropriate for integration with continuous integration (CI) and testing tools.

If you have any questions, please feel free to reach out at yan@trailofbits.com or file an issue.

The DBIR’s ‘Forest’ of Exploit Signatures

If you follow the recommendations in the 2016 Verizon Data Breach Investigations Report (DBIR), you will expose your organization to more risk, not less. The report’s most glaring flaw is the assertion that the TLS FREAK vulnerability is among the ‘Top 10’ most exploited on the Internet. No experienced security practitioner believes that FREAK is widely exploited. Where else did Verizon get it wrong?

This question undermines the rest of the report. The DBIR is a collaborative effort involving 60+ organizations’ proprietary data. It’s the single best source of information for enterprise defenders, which is why it’s a travesty that its section on vulnerabilities used in data breaches contains misleading data, analysis, and recommendations.

Verizon must ‘be better.’ They have to set a higher standard for the data they accept from collaborators. I recommend they base their analysis on documented data breaches, partner with agent-based security vendors, and include a red team in the review process. I’ll elaborate on these points later.

Digging into the vulnerability data

For the rest of this post, I’ll focus on the DBIR’s Vulnerability section (pages 13-16). There, Verizon uses bad data to discuss trends in software exploits used in data breaches. This section was contributed by Kenna Security (formerly Risk I/O), a vulnerability management startup with $10 million in venture funding. Unlike the rest of the report, nothing in this section is based on data breaches.


The Kenna Security website claims they authored the Vulnerabilities section in the 2016 DBIR

It’s easy to criticize the analysis in the Vulnerabilities section. It repeats common tropes long attacked by the security community, like simple counting of known vulnerabilities (Figures 11, 12, and 13). Counting vulnerabilities fails to consider the number of assets, their importance to the business, or their impact. There’s something wrong with the underlying data, too.

Verizon notes in the section’s header that portions of the data come from vulnerability scanners. In footnote 8, they share some of the underlying data, a list of the top 10 exploited vulnerabilities as detected by Kenna. According to the report, these vulnerabilities represent 85% of successful exploit traffic on the Internet.


Footnote 8 lists the vulnerabilities most commonly used in data breaches

Jericho at OSVDB was the first to pick apart this list of CVEs. He noted that the DBIR never explains how successful exploitation is detected (their subsequent clarification doesn’t hold water), nor what successful exploitation means in the context of a vulnerability scanner. Worse, he points out that among the ‘top 10’ are obscure local privilege escalations, denial of service flaws for Windows 95, and seemingly arbitrary CVEs from Oracle CPUs.

Rory McCune at NCC was the second to note discrepancies in the top ten list. Rory zeroed in on the fact that one of Kenna’s top 10 was the FREAK TLS flaw, which requires a network man-in-the-middle position, a vulnerable server, and a vulnerable client to exploit, plus substantial computational power to pull it off at scale. Additionally, successful exploitation produces no easily identifiable network signature. In the face of all this evidence against the widespread exploitation of FREAK, Kenna’s extraordinary claims require extraordinary evidence.

When questioned about similar errors in the 2015 DBIR, Kenna’s Chief Data Scientist Michael Rohytman explained, “the dataset is based on the correlation of ids exploit signatures with open vulns.” Rohytman later noted that disagreements about the data likely stem from differing opinions about the meaning of “successful exploitation.”

These statements show that the vulnerability data is unlike all other data used in the DBIR. Rather than the result of a confirmed data breach, the “successful exploit traffic” of these “mega-vulns” was synthesized by correlating vulnerability scanner output with intrusion detection system (IDS) alerts. The result of this correlation does not describe the frequency or the tactics of real exploits used in the wild.

Obfuscating with fake science

Faced with a growing chorus of criticism, Verizon and Kenna published a blog post that ignores critics, attempts to obfuscate their analysis with appeals to authority, substitutes jargon for a counterargument, and reiterates dangerous enterprise security policies from the report.


Kenna’s blog post begins with appeals to authority and ad hominem attacks on critics

The first half of the Kenna blog post moves the goalposts. They present a new top ten list that, in many ways, is even more disconnected from data breaches than the original. Four of the ten are now Denial of Service (DoS) flaws which do not permit unauthorized access to data. Two more are FREAK which, if successfully exploited, only permit access to HTTPS traffic. Three are 15-year-old UPnP exploits that only affect Windows XP SP0 and lower. The final exploit is Heartbleed which, despite potentially devastating impact, can be traced to few confirmed data breaches since its discovery.

Kenna’s post does answer critics’ calls for the methodology used to define a ‘successful exploitation’: an “event” where 1) a scanner detects an open vulnerability, 2) an IDS triggers on that vulnerability, and 3) one or more post-exploitation indicators of compromise (IOCs) are triggered, presumably all on the same host. This approach fails to account for the biggest challenge with security products: false positives.
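To make the false-positive problem concrete, here is a minimal sketch (the names and inputs are hypothetical) of the three-step correlation described above, along with a benign host that satisfies it:

def successful_exploitation(host):
    return (
        host["scanner_reports_vuln"]      # step 1: a scanner detects an open vulnerability
        and host["ids_signature_fired"]   # step 2: an IDS signature for that vulnerability triggers
        and host["post_exploit_ioc"]      # step 3: a post-exploitation IOC triggers on the same host
    )

# A benign internet-wide scan, a noisy IDS rule, and an unrelated alert on the
# same host satisfy all three conditions, yet no exploitation occurred.
benign_host = {
    "scanner_reports_vuln": True,   # credential-less scan guesses a vulnerable version
    "ids_signature_fired": True,    # signature matches a harmless discovery probe
    "post_exploit_ioc": True,       # coincidental alert from another product
}
print(successful_exploitation(benign_host))  # True: counted as "successful exploit traffic"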


Kenna is using a synthetic benchmark for successful exploitation based on IDS signatures

Flaws in the data

As mentioned earlier, the TLS FREAK vulnerability is the most prominent error in the DBIR’s Vulnerabilities section. FREAK requires special access as a network Man-in-the-Middle (MITM). Successful exploitation only downgrades the protections offered by TLS. An attacker would then have to factor a 512-bit RSA modulus to decrypt the session data, an attack that cost US$75 per session around the time the report was in production. After decrypting the result, they’d just have a chat log; no access to either the client or the server. Given all this effort, the low pay-off, and the comparative ease and promise of other exploits, it’s impossible that the TLS FREAK flaw would have been one of the ten most exploited vulnerabilities in 2015.

The rest of the section’s data is based on correlations between intrusion detection systems and vulnerability scanners. This approach yields questionable results.

All available evidence (threat intel reports, the Microsoft SIR, etc.) shows that real attacks occur on the client side: Office, PDF, Flash, browsers, etc. These vulnerabilities, which figure so prominently in Microsoft data and DFIR reports about APTs, don’t appear in the DBIR. How can exploit kits and APTs be using Flash as a vector while Kenna’s top 10 fails to list a single Flash vulnerability? Because, by and large, these sorts of attacks are not visible to IDSes or vulnerability scanners. Kenna’s data comes from sources that cannot see the actual attacks.

Intrusion detection systems are designed to inspect traffic and apply a database of known signatures to the specific protocol fields. If a match appears, most products will emit an alert and move on to the next packet. This “first exit” mode helps with performance, but it can lead to attack shadowing, where the first signature to match the traffic generates the only alert. This problem gets worse when the first signature to match is a false positive.
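A toy sketch of this behavior (the signatures and matching logic are invented for illustration, not taken from any product) shows how a noisy, low-value rule that matches first can shadow a real attack:

SIGNATURES = [
    ("SNMP default community string", lambda pkt: b"public" in pkt),   # frequent false positive
    ("Shellcode NOP sled", lambda pkt: b"\x90" * 16 in pkt),           # the signature we care about
]

def first_exit_alert(packet):
    for name, matches in SIGNATURES:
        if matches(packet):
            return name  # alert on the first match and move on; later signatures never fire
    return None

# A packet containing both the benign-looking string and a NOP sled produces
# only the first, low-value alert.
print(first_exit_alert(b"public " + b"\x90" * 32))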

The SNMP vulnerabilities reported by Kenna (CVE-2002-0012, CVE-2002-0013) highlight the problem of relying on IDS data. The IDS signatures for these vulnerabilities are often triggered by benign security scans and network discovery tools. It is highly unlikely that a 14-year-old DoS attack would be one of the most exploited vulnerabilities across corporate networks.

Vulnerability scanners are notorious for false positives. These products often depend on credentials to gather system information, but fall back to less-reliable testing methods as a last resort. The UPnP issues reported by Kenna (CVE-2001-0877, CVE-2001-0876) are false positives from vulnerability scanning data. Similar to the SNMP issues, these vulnerabilities are often flagged on systems that are not Windows 98, ME, or XP, and are considered line noise by those familiar with vulnerability scanner output.

It’s unclear how the final step of Kenna’s three-step algorithm, detection of post-exploitation IOCs, supports correlation. In the republished top ten list, four of the vulnerabilities are DoS flaws and two enable HTTPS downgrades. What is a post-exploitation IOC for a DoS? In all of the cases listed, the target host would crash, stop receiving further traffic, and likely reboot. It’s more accurate to interpret post-exploitation IOCs to mean, “more than one IDS signature was triggered.”

The simplest explanation for Kenna’s results? A serious error in the correlation methodology.

Issues with the methodology

Kenna claims to have 200+ million successful exploit events in their dataset. In nearly all the cases we know about, attackers use very few exploits. Duqu duped Kaspersky with just two exploits. Phineas Phisher hacked Hacking Team with just one exploit. Stuxnet stuck with four exploits. The list goes on. There are not 50+ million breaches in a year. This is a sign of poor data quality. Working back from the three-step algorithm described earlier, I conclude that Kenna counted IDS signatures fired, not successful exploit events.

There are significant limitations to relying on data collected from scanners and IDS devices. Of the thousands of companies that employ these devices (and who share the resulting data with Kenna), only a marginal number go through the effort of configuring their systems properly. Without this configuration, the resulting data is a useless cacophony of false positives. Aggregating thousands of customers’ noisy datasets is no way to tune into a meaningful signal. But that’s precisely what Kenna asks the DBIR’s readers to accept as the basis for the Vulnerabilities section.

Let’s remember the hundreds of companies, public initiatives, and bots scanning the Internet. Take the University of Michigan’s Scans.io as one example. They scan the entire Internet dozens of times per day. Many of these scans would trigger Kenna’s three-part test to detect a successful exploit. Weighting the results by the number of times an IDS event triggers yields a disproportionate number of events. If the results aren’t normalized for another factor, the large numbers will skew results and insights.


Kenna weighted their results by the number of IDS events

Finally, there’s the issue of enterprises running honeypots. A honeypot responds positively to any attempt to hack into it. This would also “correlate” with Kenna’s three-part algorithm. There’s no indication that such systems were removed from the DBIR’s dataset.

In the course of performing research, scientists frequently build models of how they think the real world operates, then back-test them with empirical data. High-quality sources of empirical exploit incidence data are available from US-CERT, which coordinates security incidents for all US government agencies, and Microsoft, which has unique data sources like Windows Defender and crash reports from millions of PCs. From their reports, only the Heartbleed vulnerability appears in Kenna’s list. The rest of the data and recommendations from US-CERT and Microsoft are consistent with each other. Neither agrees with Kenna.

Ignore the DBIR’s vulnerability recommendations

“This is absolutely indispensable when we defenders are working together against a sentient attacker.” — Kenna Security

Even if you take the DBIR’s vulnerability analysis at face value, there’s no basis for assuming human attackers behave like bots. Scan and IDS data does not correlate to what real attackers would do. The only way to determine what attackers truly do is to study real attacks.


Kenna Security advocates a dangerous patch strategy based on faulty assumptions

Empirical data disagrees with this approach. Whenever new exploits and vulnerabilities come out, attacks spike. This misguided recommendation has the potential to cause real, preventable harm. In fact, the Vulnerabilities section of the DBIR both advocates this position and then refutes it only one page later.


The DBIR presents faulty information on page 13…


… then directly contradicts itself only one page later

Recommendations from this section fall victim to many of the same criticisms as pure vulnerability counting: they fail to consider the number of assets, their criticality, the impact of the vulnerabilities, and how they are used by real attackers. Without acknowledging the source of the data, Verizon and Kenna walk the reader down a dangerous path.

Improvements for the 2017 DBIR

“It would be a shame if we lost the forest for the exploit signatures.”
— Michael Rohytman, Chief Data Scientist, Kenna

This closing remark from Kenna’s rebuttal encapsulates the issue: exploit signatures were used in lieu of data from real attacks. They skipped important steps while collecting data over the past year, jumped to assumptions based on scanners and IDS devices, and appeared to hope that their conclusions would align with what security professionals see on the ground. Above all, this incident demonstrates the folly of applying data science without sufficient input from practitioners. The resulting analysis and recommendations should not be taken seriously.

Kenna’s 2015 contribution to the DBIR received similar criticism, but they didn’t change course for 2016. Instead, Verizon expanded the Vulnerabilities section and used it as the basis for recommendations. It’s alarming that Verizon and Kenna aren’t applying critical thinking to their own performance. They need to be more ambitious with how they collect and analyze their data.

Here’s how the Verizon 2017 DBIR could improve on its vulnerability reporting:

  1. Collect exploit data from confirmed data breaches. This is the basis for the rest of the DBIR’s data; their analysis of exploits should be just as rigorous. Contrary to what I was told on Twitter, there is enough data to achieve statistical relevance. With the 2017 report a year away, there’s enough time to correct the processes for collecting and analyzing exploit data. Information about vulnerability scans and IDS signatures doesn’t serve the information security community or their customers.
  2. That said, if Verizon wants to take more time to refine the quality of the data they receive from their partners, why not partner with agent-based security vendors in the meantime? Host-based collection is far closer to exploits than network data. CrowdStrike, FireEye, Bit9, Novetta, Symantec, and others all have agents on hosts that can detect successful exploitation based on process execution and memory inspection, which are more reliable indicators.
  3. Finally, include a red team in the review process of future reports. It wasn’t until the 2014 DBIR that attackers’ patterns were separated into nine categories, a practice that practitioners had developed years earlier. That insight would have been readily available if the team behind the DBIR had spoken to practitioners who understand how to break and defend systems. Involving a red team in the review process would strengthen the report’s conclusions and recommendations.

Be better

For the 2016 DBIR, Verizon accepted a huge amount of low-quality data from a vendor and reprinted the analysis verbatim. Clearly, no one who understands vulnerabilities was involved in the review process. The DBIR team tossed in some data-science vocabulary for credibility and a few distracting jokes, then asked for readers’ trust.

Worse, Verizon stands behind the report rather than acknowledging and correcting the errors.

Professionals and businesses around the world depend on this report to make important security decisions. It’s up to Verizon to remain the dependable source for our industry.

I’d like to thank HD Moore, Thomas Ptacek, Grugq, Dan Rosenberg, Mike Russell, Kelly Shortridge, Rafael Turner, the entire team at Trail of Bits, and many others that cannot be cited for their contributions and comments on this blog post.

UPDATE 1:

Rory McCune has posted a followup in which he notes that a huge spike in Kenna’s observed exploitation of FREAK occurred at exactly the same time that the University of Michigan was scanning the entire internet for it. This supports the theory that benign internet-wide scans made it into Kenna’s data set, where they were scaled by their frequency of occurrence.


Kenna’s data on FREAK overlaps precisely with internet-wide scans from the University of Michigan

Further, an author of FREAK has publicly disclaimed any notion that it was widely exploited.

UPDATE 2:

Rob Graham has pointed out that typical IDS signatures for FREAK do not detect attacks but rather only detect TLS clients that offer weak cipher suites. This supports the theory that the underlying data was not inspected nor were practitioners consulted prior to using this data in the DBIR.

UPDATE 3:

Justin Kennedy has shared exploit data from five years of penetration tests conducted against his clients and noted that FREAK and Denial of Service attacks never once assisted compromising a target. This supports the theory that exploitation data in the DBIR distorts the view on the ground.

UPDATE 4:

Threatbutt has immortalized this episode with the release of their Danger Zone Incident Retort (DZIR).

UPDATE 5:

Karim Toubba, Kenna Security CEO, has posted a mea culpa on their blog. He notes that they did not confirm any of their analysis with their own product before delivering it to Verizon for inclusion in the DBIR.

What is the point of Kenna's contribution if it was not backed by their insights?

Kenna’s contribution to the DBIR was not validated by their own product

Further, Karim notes that their platform ranks FREAK as a “25 out of 100”; however, even this ranking is orders of magnitude too high based on the available evidence. This raises the question of whether the problems exposed in Kenna’s analysis for the DBIR extend into their product as well.


Kenna’s product prioritizes FREAK an order of magnitude higher than it likely should be

Finally, I consider the criticisms in this blog post applicable to their entire contribution to the DBIR and not only their “top ten successfully exploited vulnerabilities” list. Attempts to pigeonhole this criticism to the top ten miss the fact that other sections of the report are based on the same or similar data from Kenna.

UPDATE 6:

Verizon published their own mea culpa.

 

The Problem with Dynamic Program Analysis

Developers have access to tools like AddressSanitizer and Valgrind that will tell them when the code that they’re running accesses uninitialized memory, leaks memory, or uses memory after it’s been freed. Despite the availability of these excellent tools, memory bugs still persist, still get shipped to users, and still get exploited in the wild.

Most of today’s bug-finding tools are dynamic: they identify bugs in programs while those programs are running. This is great because all programs have massive test suites that exercise every line of code… right? Wrong. Large test suites are the exception, not the rule. Test suites definitely help find and reduce bugs, but bugs still get through.

Perhaps the solution is to pay to have your code audited by professionals. More eyes on your code is a good thing™, but the underlying issue remains. Analyses run inside the heads of experts are still “dynamic”: thinking through every code path is just not tractable.

So dynamic analyses can miss bugs because they can’t check every possible program path. What can check every possible program path?

Finding use-after-frees in millions of lines of code

We use static analysis to analyze millions of lines of code, without ever running the code. The analysis technique, called data-flow tracking, enables us to analyze and summarize properties about every possible program path. This solves the aforementioned problem of missing bugs that occur when certain program paths are not exercised.

How does an analysis that sees everything actually work? Below we describe the 1-2-3 of an actual whole-program static analysis tool that we developed and regularly use. The tool, PointsTo, finds and reports on potential use-after-free bugs in large codebases.

Step 1: Convert to LLVM bitcode

PointsTo operates on the LLVM bitcode representation of a program. We chose LLVM bitcode because it is a convenient intermediate representation for performing program analyses. Unsurprisingly, the first stage of our analysis pipeline converts a program’s source code into an LLVM bitcode database. We use an internal tool named CompInfo to produce these databases. An alternative, open-source tool for doing something similar is whole-program-llvm.


Step 2: Create the data-flow graph

The key idea behind PointsTo is to analyze how pointers to allocated objects flow through the program. What we care about are assignments to and copies of pointers, pointer dereferences, and frees of pointers. These operations on pointers are represented using a data-flow graph.

Four steps to creating a data-flow graph

The most interesting step in the process is the why and how of transforming allocations and frees into special assignments. The “why” is that this transformation lets us repurpose an existing program analysis to find paths from FREE definitions to pointer dereferences. The “how” is more subtle: how does PointsTo know that it should change “new A” into an ALLOC and “delete a” into a FREE?
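As a toy illustration of why the transformation helps (this is not the PointsTo implementation, just a hand-built graph): once “delete a” is modeled as the assignment a = FREE, finding use-after-frees reduces to a reachability query from FREE definitions to dereferences.

from collections import defaultdict

# Data-flow edges for:  p = new A; q = p; delete p; use of q
graph = defaultdict(list)
graph["p = ALLOC"] = ["q = p", "p = FREE"]
graph["q = p"] = ["deref q"]
graph["p = FREE"] = ["deref q"]   # q aliases p, so the FREE flows to the use

def reaches_deref(node, seen=None):
    """Depth-first search for a dereference reachable from `node`."""
    if node.startswith("deref"):
        return True
    seen = seen or set()
    seen.add(node)
    return any(reaches_deref(succ, seen) for succ in graph[node] if succ not in seen)

# A FREE definition that reaches a dereference is a potential use-after-free.
print(reaches_deref("p = FREE"))  # True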

Imagine a hypothetical embedded system where programs are starved for memory and so the natural choice is to use a custom memory allocator called ration_memory. We created a Python modelling language to feed PointsTo information about higher-level function behaviors. Our modelling scripts tell PointsTo that “new A” returns a new object, and so we can use it to say the same thing about ration_memory.
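The actual modelling API is internal to PointsTo, so the following is only a hypothetical sketch of what such a script might look like; the model decorator and the returns_new_object/frees_argument calls are invented names used to illustrate the idea:

from pointsto.models import model  # hypothetical import

@model("ration_memory")
def ration_memory(call):
    # Treat the return value like the result of "new A": an ALLOC definition.
    call.returns_new_object()

@model("release_ration")
def release_ration(call):
    # Treat the first argument like the operand of "delete": a FREE definition.
    call.frees_argument(0)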

Segue: Hidden data flows

The transformation from source code into a data flow graph looked pretty simple, but that was because the source code we started with was simple. It had no function calls, and more importantly, it had no function pointers or method calls! What happens if callback below is a function pointer? What happens if callback frees x?

int *x = malloc(4);
callback(x);
*x += 1;

This is the secret sauce and namesake of PointsTo: we perform a context- and path-sensitive pointer analysis that tells us which function pointers point to which functions and when. Altogether, we can produce an error report that follows x through callback and back again.

Step 3: Dénouement

It’s time to report potential errors for expert analysis. PointsTo searches through the data-flow graph, looking for flows from assignments to FREE down to dereferences. These flows are converted into a program slice of the source code lines, showing the path that execution needs to follow in order to produce a use-after-free. Here’s an example program slice of a real bug:

LightHTTPD Use-After-Free

When describing this system to compiler folks, the usual first question is: but what about false-positives? What if we get a report about a use-after-free and it isn’t one? Here is where the priorities of program analysis for compilers and for vulnerabilities diverge.

False-positives in a compiler analysis can introduce bugs, and so compilers are usually conservative. That is, they trade false-positives for false-negatives. They might miss some optimization opportunities because they can’t prove something, but at least the program will be compiled correctly *cough*.

For vulnerability analysis, this is a bad trade. False-positives in a vulnerability analysis are inconvenient, but they’re a drop in the ocean when millions of lines of code need to be looked at. False-negatives, however, are unacceptable. A false-negative is a bug that is missed and might make it to production. A tool that always finds the bug and sometimes warns you about sketchy but correct code is an investment that saves time and money during code audits.

In summary, we conclude

Analyzing programs for bugs is hard. Industry best-practices like having extensive test suites should be followed. Developers should regularly run their programs through dynamic analysis tools to pick the low-hanging fruit. But more importantly, developers should understand that test suites and dynamic analyses are not a panacea. Bugs have a nasty habit of hiding behind rarely executed code paths. That’s why all paths need to be looked at. That’s why we made PointsTo.

PointsTo was a topic of discussion at a recent Empire Hacking, a bi-monthly meetup in NYC. The talk I gave there includes more information about the design and implementation of PointsTo and, for curious readers, the slides and video are reproduced below. We hope to release more videos from Empire Hacking in the future.

PointsTo was originally produced for Cyber Fast Track and we would like to thank DARPA for funding our work. Consultants at Trail of Bits use PointsTo and other internal tools for application security reviews. Contact us if you’re interested in a detailed audit of your code supported by tools like PointsTo and our CRS.

 

Software Security Ideas Ahead of Their Time

Every good security researcher has a well-curated list of blogs they subscribe to. At Trail of Bits, given our interest in software security and its intersections with programming languages, one of our favorites is The Programming Language Enthusiast by Michael Hicks.

As the blog describes itself: “Our primary activity is to describe and discuss research about — and the practical development and use of — programming languages and programming tools (PLPT). PLPT is a core area of computer science that bridges high-level algorithms/designs and their executable implementations. It is a field that has deep roots in mathematical logic and the theory of computation but also produces practical compilers and analysis tools.”

One of our employees, Andrew Ruef, who is also a PhD student at UMD, has written a guest blog post for the PL Enthusiast on the topic of software security ideas that were ahead of their time.

As researchers, we are often asked to look into a crystal ball. We try to anticipate future problems so that work we begin now will address problems before they become acute. Sometimes, a researcher foresees a problem and its possible solution, but chooses not to pursue it. In a sense, she has found, and discarded, an idea ahead of its time.

Recently, a friend of Andrew’s pointed him to a 20-year-old email exchange on the “firewalls” mailing list that blithely suggests, and discards, problems and solutions that are now quite relevant, and on the cutting edge of software security research. The situation is both entertaining and instructive, especially in that the ideas are quite squarely in the domain of programming languages research, but were not considered by PL researchers at the time (as far as we know).

Read on for a deep dive into the firewalls listserv from 1995, prior to the publication of Smashing the Stack for Fun and Profit, as a few casual observers correctly anticipate the next 20 years of software security researchers.

If you enjoyed Andrew’s post on the PL Enthusiast, we recommend a few of their other posts that touch upon software security.

Hacking for Charity: Automated Bug-finding in LibOTR

At the end of last year, we had some free time to explore new and interesting uses of the automated bug-finding technology we developed for the DARPA Cyber Grand Challenge. While the rest of the competitors are quietly preparing for the CGC Final Event, we can entertain you with tales of running our bug-finding tools against real Linux applications.

Like many good stories, this one starts with a bet:


On November 4, 2014, Thomas Ptacek (of Starfighter) bet Matthew Green (of Johns Hopkins) that libotr, a popular library used in secure messaging software, would have a high severity (e.g. remote code execution, information disclosure) bug in the next 12 months. Here at Trail of Bits, we like a good wager, especially when the proceeds go to charity. And we just happened to have an automated bug-finding system lying around, itching for something to do. The temptation was too much to resist: we decided to use our automated bug-finding system from the Cyber Grand Challenge to look for bugs in libotr.

Before we go on, we should state that this was not a security audit. We simply wanted to test how well our automated bug-finding system works on real Linux software and maybe win some money for charity.

We successfully enhanced our bug-finding system to support the libotr library and tested it extensively. Our system found no critical bugs in the code paths we tested; since no one else reported any bugs either, the bet ended with Matthew Green donating $1000 to Partners in Health.

Read on to discover the challenges encrypted communications systems present for automated testing, how we solved them, and our testing methodology. Of course, just because our system didn’t find bugs in libotr does not mean that libotr is bug-free.

Background

The automated bug-finding system, known as a Cyber Reasoning System (CRS), that we built for the Cyber Grand Challenge operates on binary code for the DECREE operating system. While DECREE is based on Linux, it differs considerably from plain Linux. DECREE has no signals, no shared memory, no threads, no sockets, no files, and only seven system calls. This means that DECREE is not binary or source compatible with Linux libraries like libotr.

After weighing our options, we decided the easiest and fastest way to test libotr was to port it to DECREE, instead of adding full Linux support to our CRS. We attempted the port in a generic manner, to ensure we could use the lessons learned to test future Linux software.

To port libotr, we had to solve two major issues: shared library dependencies (libotr depends on libgpgerror and libgcrypt) and libc support. We used LLVM to solve both problems at once. First, we used whole-program-llvm to compile libotr and all dependencies to LLVM bitcode. We then merged all the shared libraries at the bitcode level, and aggressively optimized the resulting bitcode. In one move, we eliminated the need for shared libraries, and drastically reduced the amount of libc we’d have to implement, because unused libc calls were optimized out of the resulting bitcode. To build a libc that works on DECREE, we combined libc implementations from the challenge binaries, stubbed functions that don’t make sense in DECREE, and created new implementations based on DECREE calls where appropriate.

Automated Testing

Encrypted communications applications are, by design, difficult to automatically audit. This makes perfect sense: if an automated system could reason about how ciphertext relates to plaintext, the encrypted communication system would already be broken. These systems are also difficult to audit by random testing (e.g. fuzzing), because recipients will verify the integrity of every message. Typically when testing encrypted systems, the encryption is turned off (or data is manipulated prior to encryption or after decryption). We wanted to simulate testing a black-box binary, so we did not modify libotr in any way. Instead, we thought the best path was to make our CRS simulate a man-in-the-middle (MITM) attack. Because we tested an unmodified libotr, our CRS could not effectively attack code past the message integrity checks. However, there was still plenty of attack surface: message control data, headers, and the possibility of flaws in decryption/authentication code. The problem was that our CRS was not designed to act as a MITM. We instead architected the test application (not libotr) to be easier to attack, which resulted in the convoluted architecture below.


The CRS acts as a man-in-the-middle between two applications communicating using libotr.

Creating the test application was more difficult than porting libotr to DECREE. The porting process was fairly straightforward and took about two weeks. The sample application took a bit longer, and was a much more frustrating experience: the official libotr distribution has no sample code, and the documentation leaves a lot to be desired.

Our testing was limited by the features of libotr exercised by our sample application (for instance, it doesn’t use SMP), and by the unusual test application we created. Additionally, some vulnerabilities may only occur after decryption, and modification of encrypted and authenticated data will never trigger these bugs.

Results

The results of testing libotr are very encouraging. We ran our CRS on 48 Xeon CPUs for 24 hours against our libotr sample application, and it did not identify any memory safety violations.


This negative result does not mean that libotr is bug free. We only tested a subset of libotr, and there are considerable parts that our CRS never audited. The lack of obvious bugs is however a very good sign.

Conclusion

The timeframe of the libotr bet has expired without any reported high severity vulnerabilities. We audited parts of libotr with our automated bug-finding tools, and also didn’t find memory corruption vulnerabilities. In the process of setting up this test, we learned how to port Linux applications to DECREE and verified that our CRS can identify real bugs in Linux programs. Better documentation, tests, and sample applications that exercise every libotr feature would simplify both automated and manual auditing. For this experiment we constrained ourselves to an unmodified libotr. We are planning a future test where we modify libotr to enable easier automated testing.

Summer @ Trail of Bits

This summer I’ve had the incredible opportunity to work with Trail of Bits as a high school intern. In return, I am obligated to write a blog post about this internship. So without further ado, here it is.

Starting with Fuzzing


The summer kicked off with fuzzing, a technique I had heard of but had never tried. The general concept is to throw input at a program until it crashes, then analyze the crash to find a vulnerability. Because of time constraints, it didn’t make sense to write my own fuzzer, so I began looking for pre-existing fuzzers online. The first tool that I found was CERT’s Failure Observation Engine (FOE), which seemed very promising. FOE has many options that allow for precise fine-tuning of the fuzzer, so it can be tweaked specifically for the target. However, my experience with FOE was fickle: with certain targets, the tool would run once and stop, instead of running continuously (as a fuzzer should). Just wanting to get started, I decided to move on to other tools instead. I settled on american fuzzy lop (afl) for Linux and Microsoft MiniFuzz for Windows. Each had its pros and cons. Afl works best with source code, which limits the scope to open-source software (there is experimental support for closed-source binaries, but it is considerably slower). Compiling from source with afl allows the fuzzer to ascertain code coverage and provide helpful feedback in its interface. MiniFuzz is the opposite: it runs on closed-source Windows software and provides very little feedback while it runs. However, its crash data is very helpful: it gives the values of all registers at the time of the program crash, something the other fuzzers did not provide. MiniFuzz was very click-and-run compared to afl’s more involved compilation setup.

Examining a Crash

Once the fuzzers were set up and running on targets (VideoLAN’s VLC, Wireshark, and ImageMagick, just to name a few), it was time to start analyzing the crashes. Afl reported several crashes in VLC. While verifying that these crashes were reproducible, I noticed that several were segfaults while trying to free the address 0x7d. This struck me as odd because the address was so small, so on a hunch I opened up the crashing input in a hex editor and searched for ‘7d’. Sure enough, deep in the file was a match: 0x0000007d. I changed this to something easily recognizable, 0x41414141, and ran the file through again. This time the segfault was on, you guessed it, 0x41414141! Encouraged by the knowledge that I could control an address in the program from a file, I set out to find the bug. This involved a long process of becoming undesirably acquainted with both gdb and the VLC source code. The bug allows for the freeing of two arbitrary, user-controlled pointers.

The Bug in Detail

VLC reads in different parts of a file as boxes, which it categorizes in a tagged union. The bug is the result of a type confusion: when the size of the stsd box in the file is increased, VLC considers the following box, an stts box, to be the stsd box’s child. VLC reads boxes from the file by indexing into a function table based on the type of the box and the type of its parent. But with the wrong parent, it finds no match and instead uses a default read, which reads the box in as a generic vide type box. Later, when freeing the box, VLC looks up the function by checking only the box’s own type, so it triggers the stts-specific free routine. VLC thus tries to free an stts box that was read in as a generic vide box, and frees two addresses straight from the stts box.


CVE-2015-5949

Controlling two freed addresses is plausibly exploitable, so it was time to report the bug. I went through oCERT, who were very helpful in communicating the bug to the VLC developers to get the issue fixed and a CVE assigned (CVE-2015-5949). After some back and forth it was settled, and it was time to move on to something new.

Switching Gears to the Web

With half a summer done and another half to learn something new, I began to explore web security. I had slightly more of a background in this from some CTFs and from NYU Hack Night, but I wanted to get a more in-depth and practical understanding. Unlike fuzzing, where it was easy to hit the ground running, web security required a bit more knowledge beforehand. I spent a week trying to learn as much as possible from The Web Application Hacker’s Handbook and the corresponding MDSec labs. Armed with a solid foundation, I put this training to good use.

Bounty Hunting

HackerOne has a directory of companies that have bug bounty programs, and this seemed the best place to start. I sorted by date joined and picked the newest companies, since they probably had not been looked at much yet. Using BurpSuite, an indispensable tool, I poked through these websites looking for anything amiss. Looking through sites like ok.ru, marktplaats.nl, and united.com, I searched for vulnerable functions and security issues, and submitted a few reports. I’ve had some success, but those reports are still going through disclosure.

Security Assessment

To conclude the internship, I performed a security assessment of a tech startup in NYC, applying the skills I’ve acquired. I found bugs in application logic, access controls, and session management, the most severe of which was a logic flaw that posed significant financial risk to the company. I then had the opportunity to present these bugs in a meeting with the company. The report was well-received and the company is now implementing fixes.

Signing Off

This experience at Trail of Bits has been fantastic. I’ve gotten a solid foundation in application and web security, and it’s been a great place to work. I’m taking a week off to look at colleges, but I’ll be back working part time during my senior year.

How We Fared in the Cyber Grand Challenge

The Cyber Grand Challenge qualifying event was held on June 3rd, at exactly noon Eastern time. At that instant, our Cyber Reasoning System (CRS) was given 131 purposely built insecure programs. During the following 24-hour period, our CRS was able to identify vulnerabilities in 65 of those programs and rewrite 94 of them to eliminate the bugs built into their code. This proves, without a doubt, that automating the actions of a talented software auditor is not only possible but practical.

Despite the success of our CRS at finding and patching vulnerabilities, we did not qualify for the final event, to be held next year. There was a fatal flaw that lowered our overall score to 9th, below the 7th place threshold for qualification. In this blog post we’ll discuss how our CRS works, how it performed against competitor systems, what doomed its score, and what we are going to do next.

Cyber Grand Challenge Background

The goal of the Cyber Grand Challenge (CGC) is to combine the speed and scale of automation with the reasoning capabilities of human experts. Multiple teams create Cyber Reasoning Systems (CRSs) that autonomously reason about arbitrary networked programs, prove the existence of flaws in those programs, and automatically formulate effective defenses against those flaws. How well these systems work is evaluated through head-to-head tournament-style competition.

The competition has two main events: the qualifying event and the final event. The qualifying event was held on June 3, 2015. The final event is set to take place during August 2016. Only the top 7 competitors from the qualifying event proceed to the final event.

During the qualifying event, each competitor was given the same 131 challenges, or purposely built vulnerable programs, each of which contained at least one intentional vulnerability. For 24 hours, the competing CRSes faced off against each other and were scored according to four criteria. The full details are in the CGC Rules, but here’s a quick summary:

  • The CRS had to work without human intervention. Any teams found to use human assistance were disqualified.
  • The CRS had to patch bugs in challenges. Points were gained for every bug successfully patched. Challenges with no patched bugs received zero points.
  • The CRS could prove bugs exist in challenges. The points from patched challenges were doubled if the CRS could generate an input that crashed the challenge.
  • The patched challenges had to function and perform almost as well as the originals. Points were lost based on performance and functionality loss in the patched challenges.

A spreadsheet with all the qualifying event scores and other data used to make the graphs in this post is available from DARPA (Trail of Bits is the ninth place team). With the scoring in mind, let’s review the Trail of Bits CRS architecture and the design decisions we made.

Preparation

We’re a small company with a distributed workforce, so we couldn’t physically host a lot of servers. Naturally, we went with cloud computing to do processing; specifically, Amazon EC2. Those who saw our tweets know we used a lot of EC2 time. Most of that usage was purely out of caution.

We didn’t know how many challenges would be in the qualifying event — just that it would be “more than 100.” We prepared for a thousand, with each accompanied by multi-gigabyte network traffic captures. We were also terrified of an EC2 region-wide failure, so we provisioned three different CRS instances, one in each US-based EC2 region, affectionately named Biggie (us-east-1), Tupac (us-west-2), and Dre (us-west-1).

It turns out that there were only 131 challenges and no gigantic network captures in the qualifying event. During the qualifying event, all EC2 regions worked normally. We could have comfortably done the qualifying event with 17 c4.8xlarge EC2 instances, but instead we used 297. Out of our abundance of caution, we over-provisioned by a factor of ~17x.

Bug Finding

The Trail of Bits CRS was ranked second by the number of verified bugs found (Figure 1). This result is impressive considering that we started with nothing while several other teams already had existing bug finding systems prior to CGC.

Figure 1: Teams in the qualifying event ranked by number of bugs found. Orange bars signify finalists.

Our CRS used a multi-pronged strategy to find bugs (Figure 2). First, there was fuzzing. Our fuzzer is implemented with a custom dynamic binary translator (DBT) capable of running several 32-bit challenges in a single 64-bit address space. This is ideal for challenges that feature multiple binaries communicating with one another. The fuzzer’s instrumentation and mutation are separated, allowing for pluggable mutation strategies. The DBT framework can also snapshot binaries at any point during execution. This greatly improves fuzzing speed, since it’s possible to avoid replaying previous inputs when exploring new input space.


Figure 2: Our bug finding architecture. It is a feedback-based architecture that explores the state space of a program using fuzzing and symbolic execution.

In addition to fuzzing, we had not one but two symbolic execution engines. The first operated on the original unmodified binaries, and the second operated on the translated LLVM from mcsema. Each symbolic execution engine had its own strengths, and both contributed to bug finding.

The fuzzer and symbolic execution engines operate in a feedback loop mediated by a system we call MinSet. The MinSet uses branch coverage to maintain a minimum set of maximal coverage inputs. The inputs come from any source capable of generating them: PCAPs, fuzzing, symbolic execution, etc. Every tool gets original inputs from MinSet, and feeds any newly generated inputs into MinSet. This feedback loop lets us explore the possible input space with both fuzzers and symbolic execution in parallel. In practice this is very effective. We log the provenance of our crashes, and most of them look something like:

Network Capture ⇒ Fuzzer ⇒ SymEx1 ⇒ Fuzzer ⇒ Crash
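The following is a minimal sketch of the coverage-based minimization idea behind MinSet, not our actual implementation; cover() stands in for the branch-coverage measurement provided by the DBT and the symbolic executors:

def minimize(inputs, cover):
    """Greedily keep a small set of inputs with (approximately) maximal branch coverage."""
    covered, kept = set(), []
    # Consider inputs with the most coverage first.
    for inp in sorted(inputs, key=lambda i: len(cover(i)), reverse=True):
        new_branches = cover(inp) - covered
        if new_branches:            # keep only inputs that add coverage
            kept.append(inp)
            covered |= new_branches
    return kept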

Some bugs can only be triggered when the input replays a previous nonce, which would be different on every execution of the challenge. Our bug finding system can produce inputs that contain variables based on program outputs, enabling our CRS to handle such cases.

Additionally, our symbolic executors are able to identify which inputs affect program state at the point of a crash. This is a key requirement for the success of any team competing in the final as it enables the CRS to create a more controlled crash.

Patching

Our CRS’s patching effectiveness, as measured by the security score, ranks fourth (Figure 3).


Figure 3: Teams in the qualifying event ranked by patch effectiveness (security score). Orange bars signify finalists.

Our CRS patches bugs by translating challenges into LLVM bitcode with mcsema. Patches are applied to the LLVM bitcode, optimized, and then converted back into executable code. The actual patching works by gracefully terminating the challenge when invalid memory accesses are detected. Patching the LLVM bitcode representation of challenges provides us with enormous power and flexibility:

  • We can easily validate any memory access and keep track of all memory allocations.
  • Complex algorithms, such as dataflow tracking, dominator trees, dead store elimination, loop detection, etc., are very simple to implement using the LLVM compiler infrastructure.
  • Our patching method can be used on real-world software, not just CGC challenges.

We created two main patching strategies: generic patching and bug-based patching. Generic patching is an exclusion-based strategy: it first assumes that every memory access must be verified, and then excludes accesses that are provably safe. The benefit of generic patching is that it patches all possible invalid memory accesses in a challenge. Bug-based patching is an inclusion-based strategy: it first assumes only one memory access (where the CRS found a bug) must be verified, and then includes nearby accesses that may be unsafe. Each patching strategy has multiple heuristics to determine which accesses should be included or excluded from verification.
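An illustrative sketch of the two selection strategies is below; is_provably_safe() and is_near() stand in for the LLVM-level analyses and heuristics described above:

def generic_patch_set(all_accesses, is_provably_safe):
    # Exclusion-based: start by verifying every memory access,
    # then drop the accesses that are provably safe.
    return {a for a in all_accesses if not is_provably_safe(a)}

def bug_based_patch_set(all_accesses, buggy_access, is_near):
    # Inclusion-based: start from the access where a bug was found,
    # then pull in nearby accesses that may also be unsafe.
    return {buggy_access} | {a for a in all_accesses if is_near(a, buggy_access)}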

The inclusion and exclusion heuristics generate patched challenges with different security/performance tradeoffs. The patched challenges generated by these heuristics were tested for performance and security to determine which heuristic performed best while still fixing the bug. For the qualifying event, we evaluated both generic and bug-based patching, but ultimately chose a generic-only patching strategy. Bug-based patching was slightly more performant, but generic patching was more comprehensive and it patched bugs that our CRS couldn’t find.

Functionality and Performance

Functionality and performance scores combine to create an availability score. The availability score is used as a scaling factor for points gained by patching and bug finding. This scaling factor only matters for successfully patched challenges, since those are the only challenges that can score points. The following graphs only consider functionality and performance of successfully patched challenges.

Functionality

Out of the 94 challenges that our CRS successfully patched, 56 retained full functionality, 30 retained partial functionality, and 8 were nonfunctional. Of the top 10 teams in the qualifying event, our CRS ranks 5th in terms of fully functional patched challenges (Figure 4). We suspect our patched challenges lost functionality due to problems in mcsema, our x86 to LLVM translator. We hope to verify and address these issues once DARPA open-sources the qualifying event challenges.


Figure 4: The count of perfectly functional, partially functional, and nonfunctional challenges submitted by each of the top 10 teams in the qualifying event. Orange bars signify finalists.

Performance

The performance of patched challenges is how our CRS snatched defeat from the jaws of victory. Of the top ten teams in the qualifying event, our CRS placed last in terms of patched challenge performance (Figure 5).


Figure 5: Average and median performance scores of the top ten qualifying event participants. Orange bars signify finalists.

Our CRS produces slow binaries for two reasons: one technical, one operational. The technical reason is that the performance of our patched challenges is an artifact of our patching process, which translates challenges into LLVM bitcode and then re-emits them as executable binaries. The operational reason is that our patching was developed late and optimized for the wrong performance measurements.

So, why did we optimize for the wrong performance measurements? The official CGC performance measurement tools were kept secret, because the organizers wanted to ensure that no one could cheat by gaming the measurements. Therefore, we had to measure performance ourselves, and our metrics showed that the CPU overhead of our patched challenges was usually negligible. The main flaw we observed was that our patched challenges used too much memory. Because of this, we spent time and effort optimizing our patching to use less memory, at the cost of more CPU time.

It turns out we optimized for the wrong thing, because our self-measurement did not agree with the official measurement tools (Table 1). When self-measuring, our worst-performing patching method had a median CPU overhead of 33% and a median memory overhead of 69%. The official qualifying event measured us at 76% CPU overhead and 28% memory overhead. Clearly, our self-measurements were considerably different from official measurements.

Measurement                           Median CPU Overhead   Median Memory Overhead
Worst Self-Measured Patching Method   33%                   69%
Official Qualifying Event             76%                   28%

Table 1: Self-measured CPU and memory overhead versus the official qualifying event measurements.

We also computed our overall score using our own performance metrics. Our self-measured score was 106, which would have put us in second place. The real overall score was 21.36, putting us in ninth.

An important aspect of software development is choosing where to focus your efforts, and we chose poorly. CGC participants had access to the official measuring system during two scored events held during the year, one in December 2014 and one in April 2015. We should have evaluated our patching system thoroughly during both scored events. Unfortunately, our patching wasn't fully operational until after the second scored event, so we had no way to verify the accuracy of our self-measurement. The performance penalty of our patching isn't a fundamental issue; had we known how bad it was, we would have fixed it. However, according to our own measurements the patching was acceptable, so we focused our efforts elsewhere.

What’s Next?

According to the CGC FAQ (Question 46), teams are allowed to combine after the qualifying event. We hope to join forces with another team that qualified for the CGC final event, and use the best of both our technologies to win. The technology behind our CRS will provide a significant advantage to any team that partners with us. If you would like to discuss a potential partnership for the CGC final, please contact us at cgc@trailofbits.com.

If we cannot find a partner for the CGC final, we will focus our efforts on adapting our CRS to automatically find and patch vulnerabilities in real software. Our system is up to the task: it has already proven that it can find bugs, and all of its core components were derived from software that works on real Linux binaries. Several components even have Windows and 64-bit support, and adding support for other platforms is a possibility. If you are interested in commercial applications of our technology, please get in touch with us at cgc@trailofbits.com.

Finally, we plan to contribute back fixes and updates to the open source projects utilized in our CRS. We used numerous open source projects during development, and have made several custom fixes and modifications. We look forward to contributing these back to the community so that everyone benefits from our improvements.

Dear DARPA: Challenge Accepted.

We are proud to have submitted one of only seven accepted funded-track proposals to DARPA's Cyber Grand Challenge.

Computer security experts from academia, industry and the larger security community have organized themselves into more than 30 teams to compete in DARPA's Cyber Grand Challenge, a first-of-its-kind tournament designed to speed the development of automated security systems able to defend against cyberattacks as fast as they are launched. DARPA also announced today that it has reached an agreement to hold the 2016 Cyber Grand Challenge final competition in conjunction with DEF CON, one of the largest computer security conferences in the world.

Our participation in this program aligns with our development of Javelin, an automated system for simulating attacks against enterprise networks. We have assembled a world-class team of experts in software security, capture the flag, and program analysis to compete in this challenge. As much as we wish the other teams luck in this competition, Trail of Bits is playing to win. Game on!

Using Static Analysis and Clang To Find Heartbleed

Background

Friday night I sat down with a glass of Macallan 15 and decided to write a static checker that would find the Heartbleed bug. I decided that I would write it as an out-of-tree clang analyzer plugin and evaluate it on a few very small functions that had the spirit of the Heartbleed bug in them, and then finally on the vulnerable OpenSSL code-base itself.

The Clang project ships an analysis infrastructure with its compiler; it's invoked via scan-build (for example, by running scan-build make in place of make). scan-build hooks whatever existing make system you have to interpose the clang analyzer into the build process, so the analyzer is invoked with the same arguments as the compiler. This way, the analyzer can 'visit' every compilation unit in the program that compiles under clang. There are some limitations to the clang analyzer that I'll touch on in the discussion section.

This exercise added to my list of things that I can only do while drinking: I have the best success with first-order logic while drinking beer, and I have the best success with clang analyzer while drinking scotch.

Strategy

Coverity recently proposed one approach to identifying Heartbleed statically: taint the return values of calls to ntohl and ntohs as input data. One problem with doing static analysis on a big state machine like OpenSSL is that your analysis either has to know the state machine well enough to track which values are attacker-influenced across the whole program, or it has to rely on some kind of annotation in the program that tells the analysis where input data is used.

I like this observation because it is pretty actionable. You mark ntohl calls as producing tainted data, which is a heuristic, but a pretty good one because programmers probably won’t htonl their own data.

What our clang analyzer plugin should do is identify locations in the program where variables are written using ntohl, taint them, and then alert when those tainted values are used as the size parameter to memcpy. Except that isn't quite right: the use could be safe. So we'll also check the constraints of the tainted values at the location of the call: if the tainted value hasn't been constrained in some way by the program logic and it's used as an argument to memcpy, alert on a bug. This could also miss some bugs, but I'm writing this over a 24h period with some scotch, so increasing precision can come later.

Clang analyzer details

The clang analyzer implements a type of symbolic execution to analyze C/C++ programs. Plugging in to this framework as an analyzer requires bending your mind around the clang analyzer view of program state. This is where I consumed the most scotch.

The analyzer, under the hood, performs a symbolic/abstract exploration of program state. This exploration is flow and path sensitive, so it is different from traditional compiler data flow analysis. The analysis maintains a “state” object for each path through the program, and in this state object are constraints and facts about the program’s execution on that path. This state object can be queried by your analyzer, and, your analyzer can change the state to include information produced by your analysis.

This was one of my biggest hurdles when writing the analyzer – once I have a “symbolic variable” in a particular state, how do I query the range of that symbolic variable? Say there is a program fragment that looks like this:

int data = ntohl(pkt_data);
if(data >= 0 && data < sizeof(global_arr)) {
 // CASE A
...
} else {
 // CASE B
 ...
}

When looking at this program from the analyzer's point of view, the state "splits" at the if into two different states A and B. In state A, there is a constraint that data is between certain bounds, and in state B there is a constraint that data is NOT within those bounds. How do you access this information from your checker?

If your checker calls the “dump” method on its given “state” object, data like the following will be printed out:

Ranges of symbol values:
 conj_$2{int} : { [-2147483648, -2], [0, 2147483647] }
 conj_$9{uint32_t} : { [0, 6] }

In this example, conj_$9{uint32_t} is our 'data' value above, and we are looking at state A. We have a range on 'data' that places it between 0 and 6. How can we, as the checker, observe that there's a difference between this range and an unconstrained range of, say, [-2147483648, 2147483647]?

The answer is, we create a formula that tests the symbolic value of ‘data’ against some conditions that we enforce, and then we ask the state what program states exist when this formula is true and when it is false. If a new formula contradicts an existing formula, the state is infeasible and no state is generated. So we create a formula that says, roughly, “data > 500” to ask if data could ever be greater than 500. When we ask the state for new states where this is true and where it is false, it will only give us a state where it is false.

This is the kind of idiom used inside of clang analyzer to answer questions about constraints on state. The array bounds checkers use this trick to identify states where the size of an array is not used as a constraint on indexes into the array.

Implementation

Your analyzer is implemented as a C++ class. You define the "check" functions you want called as the analyzer explores program state. For example, if your analyzer wants to consider the arguments to a function call before the function is called, you create a member method with a signature that looks like this:

void checkPreCall(const CallEvent &Call, CheckerContext &C) const;

Your analyzer can then match on the function about to be (symbolically) invoked. So our implementation works in three stages:

  1. Identify calls to ntohl/ntohs
  2. Taint the return value of those calls
  3. Identify unconstrained uses of tainted data

We accomplish the first and second with a checkPostCall visitor that roughly does this:

void NetworkTaintChecker::checkPostCall(const CallEvent &Call,
CheckerContext &C) const {
  const IdentifierInfo *ID = Call.getCalleeIdentifier();

  if(ID == NULL) {
    return;
  }

  if(ID->getName() == "ntohl" || ID->getName() == "ntohs") {
    ProgramStateRef State = C.getState();
    SymbolRef       Sym = Call.getReturnValue().getAsSymbol();

    if(Sym) {
      ProgramStateRef newState = State->addTaint(Sym);
      C.addTransition(newState);
    }
  }
}

Pretty straightforward, we just get the return value, if present, taint it, and add the state with the tainted return value as an output of our visit via ‘addTransition’.

For the third goal, we have a checkPreCall visitor that considers a function call parameters like so:

void NetworkTaintChecker::checkPreCall(const CallEvent &Call,
CheckerContext &C) const {
  ProgramStateRef State = C.getState();
  const IdentifierInfo *ID = Call.getCalleeIdentifier();

  if(ID == NULL) {
    return;
  }
  if(ID->getName() == "memcpy") {
    SVal            SizeArg = Call.getArgSVal(2);
    ProgramStateRef state = C.getState();

    if(state->isTainted(SizeArg)) {
      SValBuilder       &svalBuilder = C.getSValBuilder();
      Optional<NonLoc>  SizeArgNL = SizeArg.getAs<NonLoc>();

      if(this->isArgUnConstrained(SizeArgNL, svalBuilder, state) == true) {
        ExplodedNode  *loc = C.generateSink();
        if(loc) {
          BugReport *bug = new BugReport(*this->BT,
              "Tainted, unconstrained value used in memcpy size", loc);
          C.emitReport(bug);
        }
      }
    }
  }
}

Also relatively straightforward: the logic to check whether a value is unconstrained is hidden in 'isArgUnConstrained', so if a tainted symbolic value has insufficient constraints on it along the current path, we report a bug.
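As a rough illustration, here is a minimal sketch of what such a helper might look like, built on the assume idiom described earlier. The structure and the 5000 threshold are assumptions for this post; the actual helper in the published checker may differ.

// Sketch only: this would be another member of the NetworkTaintChecker class
// shown above. It returns true if the current path's constraints still allow
// the value to exceed an arbitrary threshold, i.e. the program never
// meaningfully bounded it before the memcpy.
bool NetworkTaintChecker::isArgUnConstrained(Optional<NonLoc> Arg,
                                             SValBuilder &builder,
                                             ProgramStateRef state) const {
  if(!Arg) {
    return false;
  }

  // Build the formula "arg > 5000" as a symbolic value.
  NonLoc Threshold = builder.makeIntVal(5000, false /* isUnsigned */);
  SVal GT = builder.evalBinOpNN(state, BO_GT, *Arg, Threshold,
                                builder.getConditionType());

  Optional<DefinedOrUnknownSVal> Cond = GT.getAs<DefinedOrUnknownSVal>();
  if(!Cond) {
    return false;
  }

  // Ask for the states in which the formula is true and in which it is false.
  // If the "true" state is feasible, the argument could be larger than the
  // threshold on this path, so we treat it as unconstrained.
  std::pair<ProgramStateRef, ProgramStateRef> P = state->assume(*Cond);
  ProgramStateRef stateAboveThreshold = P.first;

  if(stateAboveThreshold) {
    return true;
  }
  return false;
}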

Some implementation pitfalls

It turns out that OpenSSL doesn't use ntohs/ntohl; it has n2s / n2l macros that re-implement the byte-swapping logic. If this were in LLVM IR, it would be tractable to write a "byte-swapping recognizer" that uses a small amount of logic to prove when a piece of code approximates the semantics of a byte-swap.
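For reference, OpenSSL's n2s looks approximately like this (reproduced from memory of ssl_locl.h, so treat it as illustrative); since it is a macro that expands to shifts and ORs, there is no ntohs call expression left for the checker to match:

/* Approximate form of OpenSSL's n2s: read a 16-bit big-endian value from the
 * byte cursor c into s, then advance c. n2l does the same for 32-bit values. */
#define n2s(c, s) ((s = (((unsigned int)(c[0])) << 8) | \
                        (((unsigned int)(c[1]))     )), c += 2)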

There is also some behavior that I have not figured out in clang's creation of the AST for OpenSSL, where calls to ntohs are replaced with __builtin_pre(__x), which has no IdentifierInfo and thus no name. To work around this, I replaced the n2s macro with a call to a function named xyzzy (this breaks linking, but the analysis runs at compile time, so reports are still produced) and adapted my function check from above to look for xyzzy instead. This worked well enough to identify the Heartbleed bug.

Solution output with demo programs and OpenSSL

First let’s look at some little toy programs. Here is one toy example with output:

$ cat demo2.c

...

int data_array[] = { 0, 18, 21, 95, 43, 32, 51};

int main(int argc, char *argv[]) {
  int   fd;
  char  buf[512] = {0};

  fd = open("dtin", O_RDONLY);

  if(fd != -1) {
    int size;
    int res;

    res = read(fd, &size, sizeof(int));

    if(res == sizeof(int)) {
      size = ntohl(size);

      if(size < sizeof(data_array)) {
        memcpy(buf, data_array, size);
      }

      memcpy(buf, data_array, size);
    }

    close(fd);
  }

  return 0;
}

$ ../docheck.sh
scan-build: Using '/usr/bin/clang' for static analysis
/usr/bin/ccc-analyzer -o demo2 demo2.c
demo2.c:30:7: warning: Tainted, unconstrained value used in memcpy size
      memcpy(buf, data_array, size);
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
scan-build: 1 bugs found.
scan-build: Run 'scan-view /tmp/scan-build-2014-04-26-223755-8651-1' to
examine bug reports.

And finally, to see it catching Heartbleed in both locations it was present in OpenSSL, see the following:

[Screenshots: the checker reporting "Tainted, unconstrained value used in memcpy size" at both vulnerable memcpy calls in OpenSSL]

Discussion

The approach needs some improvement: we reason about whether a tainted value is "appropriately" constrained in a very coarse-grained way. Sometimes that's the best you can do though – if your analysis doesn't know how large a particular buffer is, perhaps it's enough to show an analyst "hey, this value could be larger than 5000 and it is used as a parameter to memcpy, is that okay?"

I really don’t like the limitation in clang analyzer of operating on ASTs. I spent a lot of time fighting with the clang AST representation of ntohs and I still don’t understand what the source of the problem was. I kind of just want to consider a programs semantics in a virtual machine with very simple semantics, so LLVM IR seems ideal to me. This might just be my PL roots showing though.

I really do like the clang analyzer's interface to path constraints. That interface is pretty powerful, and once you get your head around framing your problem as asking states whether new states satisfying your constraints are feasible, it's pretty straightforward to write new analyses.

Edit: Code Post

I’ve posted the code for the checker to Github, here.

Introducing Javelin

Javelin

Javelin shows you how modern attackers would approach and exploit your enterprise. By simulating real-time, real-world attack techniques, Javelin identifies which employees are most likely to be targets of spearphishing campaigns, uncovers security infrastructure weaknesses, and compares overall vulnerability against industry competitors. Javelin benchmarks the efficacy of defensive strategies, and provides customized recommendations for improving security and accelerating threat detection. Highly automated, low touch, and designed for easy adoption, Javelin will harden your existing security and information technology infrastructure.

Read more about Javelin on the Javelin Blog.
