MUI: Visualizing symbolic execution with Manticore and Binary Ninja

By Alan Chang, University of Oxford

During my summer internship, I had the wonderful opportunity to work on the Manticore User Interface (MUI). The MUI project aims to combine the strengths of Manticore, a powerful symbolic execution library, and Binary Ninja, a popular binary analysis tool, to provide a more intuitive and visual interface for working with symbolic execution.

Exploring an ELF binary with the MUI

A gentle introduction to symbolic execution

When conducting vulnerability research and reverse engineering, researchers often wish to thoroughly test a piece of software and explore all its possible states. Symbolic execution is one method that can address this matter. As shown in the following code snippet, the program execution can go into either the if case or the else case.

Code snippet and constraints graph
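
The snippet has roughly the following shape (a minimal Python stand-in for the example; handle_small and handle_large are placeholder functions, not part of the original code):

def branch_example(x):
    if x < 5:
        handle_small(x)  # ①: reachable only under the constraint x < 5
    else:
        handle_large(x)  # ②: reachable only under the constraint x >= 5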

To test this code with a concrete input for x (e.g., x = 4), only one of the two states will be reached during execution. This means multiple runs of the program will be necessary to fully explore the state space. However, if we consider x to be symbolic, similar to the notion of a variable in mathematics, both states can be explored simultaneously, and we just have to keep track of the constraints on our symbolic variables at each point of the exploration process.

Specifically in this example, the if-else statement will create two states, one with the constraint x < 5 exploring ① and another with x ≥ 5 exploring ②. This is the key concept behind symbolic execution, which can help us figure out whether a given code segment is reachable and what input would be necessary.

The problem of state explosion

The symbolic execution technique, however, is not without its drawbacks. Notably, there’s the issue of state explosion. When there are many conditional branches and loops in a program, the number of states that need to be explored can grow exponentially. It can quickly become infeasible to explore them all. The example below illustrates this point. With just three loop iterations, the code snippet will end up with eight states to explore, and this number grows exponentially as the iteration count increases.

Example of state explosion
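
A loop of the kind shown in the example might look like the following illustrative Python sketch, where data stands for three symbolic input bytes:

def count_large(data):
    count = 0
    for i in range(3):
        if data[i] > 10:  # with symbolic input, each iteration forks every live state in two
            count += 1
    return count          # 2 * 2 * 2 = 8 states reach this point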

This issue of state explosion is further exacerbated by the fact that most symbolic execution libraries lack a method of visualizing the state exploration process. That means that it’s often difficult to even pinpoint where state explosions occur, let alone to begin fixing them.

The MUI project aims to address these issues by providing an interactive user interface to better visualize this state exploration process in symbolic execution and to keep the human in the loop. More specifically, the MUI is a Binary Ninja plugin with custom Qt widgets that provide a visual and intuitive interface to interact with Manticore, the open-source symbolic execution library developed by Trail of Bits.

MUI features and demonstration

To illustrate some of the MUI features, let’s try and solve a simple crackme challenge inside the Manticore repository. The objective is straightforward: we need to determine the correct input for this program.

Running the challenge with an incorrect input

First attempt: Using the find and avoid commands

Opening the ELF binary in Binary Ninja, we can quickly spot the two puts calls in the main function. These two function calls are used for the success and failure cases, respectively. Now, our objective can be rephrased as finding an input so that the code execution reaches the success case.

We can convey this objective to the MUI using the find command; the instruction is highlighted in green. Similarly, we can tell the MUI to avoid the failure case using the avoid command.
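
For comparison, expressing the same find/avoid objective in a standalone Manticore script would look roughly like the sketch below; the binary name and both addresses are placeholders rather than the real values from the challenge:

from manticore.native import Manticore

SUCCESS_ADDR = 0x400A00  # "find": address of the success puts() call (placeholder)
FAILURE_ADDR = 0x400B00  # "avoid": address of the failure puts() call (placeholder)

m = Manticore("./crackme")

@m.hook(SUCCESS_ADDR)
def on_success(state):
    # Solve the symbolic stdin bytes to a concrete input and stop exploring
    for symbol in state.input_symbols:
        print("solution:", state.solve_one(symbol))
    m.kill()

@m.hook(FAILURE_ADDR)
def on_failure(state):
    state.abandon()  # prune any state that reaches the failure path

m.run()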

Now, with an objective specified, we can run the MUI to find the solution to this crackme challenge.

As shown in the gif, we can use Manticore to explore the state space of the challenge within the Binary Ninja UI, and we obtain the solution coldlikeminisodas. Giving this answer back to the program verifies that we have indeed solved the challenge.

Running the program with a correct input

Turning our attention back to the MUI, we can use the custom State List widget to see the way the MUI got to this solution and all the states that it explored. State 34 in the list denotes the final state in which we reached the success case.

State List widget in action

To further visualize the relation between each of the states, we can use the Graph View widget. This widget shows a provenance tree containing all the states. Double-clicking on a state node will bring us to the last instruction before the state was forked or terminated. Using the tab shortcut, the tree graph can be expanded to show other terminated states.

Graph View widget in action

A second solution: Custom hooks

In addition to all the cool features that we have already demonstrated, the MUI still has more tricks up its sleeve. Let’s solve this challenge again using a different method.

Body of the main function shown again for convenience

If we spend some time understanding the decompiled code, we can see that our user input is compared against the correct answer one character at a time; as soon as one character of our input does not match, we see the failure message. With this knowledge, we can prevent all state forking by explicitly telling the MUI which path we want to take. This can be achieved with a custom hook and the code snippet below:

global bv, m, addr

def hook(state):
    # Recover one character of the flag from the AL register
    flag_byte = state.cpu.AL - 0xa

    # Accumulate the characters in Manticore's shared, thread-safe context
    with m.locked_context() as context:
        if 'solution' in context:
            context["solution"] += chr(flag_byte)
        else:
            context["solution"] = chr(flag_byte)
        print(f'flag: {context["solution"]}')

    # Force execution down the path we want by setting RIP directly,
    # so no new state is forked at the comparison
    state.cpu.RIP = 0x400a51

m.hook(addr)(hook)

The custom hook feature here is a kind of fallback that gives you the full power of the Manticore API without having to write a complete script in a different environment. This allows researchers to do everything inside of Binary Ninja and reduce the amount of context switching required.

How about EVM?

Manticore is well known for its support for smart contracts, and the MUI plugin also offers basic EVM support. Documentation can be found in the project README file.

To get around the lack of built-in EVM support in Binary Ninja, the MUI leverages a few other open-source tools developed by Trail of Bits. Smart contracts are compiled using crytic-compile, an abstraction layer for smart contract build systems, and the generation and visualization of disassembly code and CFGs are handled by ethersplay, an EVM disassembler.
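
For reference, driving Manticore’s EVM support directly from a script looks roughly like the sketch below; the contract and the account setup are purely illustrative, and in the MUI workflow the compilation and disassembly steps are handled by crytic-compile and ethersplay instead:

from manticore.ethereum import ManticoreEVM

source = """
pragma solidity ^0.5.0;
contract Adder {
    function add(uint256 x) public pure returns (uint256) {
        require(x < 100);
        return x + 1;
    }
}
"""

m = ManticoreEVM()
owner = m.create_account(balance=10 ** 18)                  # funded externally owned account
contract = m.solidity_create_contract(source, owner=owner)  # compiled behind the scenes
contract.add(m.make_symbolic_value())                       # explore add() with a symbolic argument
m.finalize()                                                # write the discovered test cases to the workspace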

With these tools, the EVM feature set in the MUI is now on par with the default Manticore CLI, and more features are under active development. But even with the same features, the MUI really outshines the CLI tool in its usability and discoverability. Instead of looking through documentation to find the right command-line argument, users can now see all the available Manticore run options and pick the ones they need using a dynamically generated and up-to-date UI panel. Furthermore, the tight integration with Binary Ninja also means that these options can be persistently saved inside Binary Ninja Database (BNDB) project files, offering more convenience.

EVM run dialog

Conclusions

I’m really proud of what I was able to achieve with the MUI project during my internship. The MUI already supports many use cases that researchers can adopt to improve their workflows. We have achieved what we set out to do with this project: to provide an intuitive and visual interface for working with symbolic execution.

This internship has been a great learning opportunity for me. Through this internship, I gained a deeper understanding of how symbolic execution works and learned about a lot of different topics ranging from project planning and documentation to Qt UI development, from multi-threaded applications to the Git workflow, and much more.

I would like to thank my mentors, Eric Kilmer and Sonya Schriner, for their tremendous help during my internship. Both of them provided me with guidance when I needed it but also gave me enough freedom to explore and innovate during the development of the MUI. My internship experience would not have been the same without them. I’m truly grateful for this internship opportunity at Trail of Bits, and I cannot wait to see what the future brings for the MUI project.

How to choose an interesting project

By Trent Brunson, Head of Research & Engineering
Originally published on October 15, 2021

Come join our team today! Trail of Bits is hiring full-time Senior Software Engineers and Software Security Research Engineers.

Over the last nine years, I’ve interviewed hundreds of applicants for research and engineering positions. One of my favorite icebreakers is, “What kind of project would you choose to work on if you were given a $500,000 budget and one year to work on it with no oversight or consequences?” (There’s no wrong answer.) Surprisingly, most people have never indulged themselves in this thought experiment. Why? For one, thinking of a good project isn’t easy!

Trail of Bits engineers are encouraged to dedicate a portion of their workweek to an internal research and development (IRAD) project of their choosing, so they face a similar challenge of having to commit to a project they think they might like. But it’s easy to become rudderless in a sea of security topics, where you may find yourself aimlessly scrolling through HackerNews links, scanning through blogs, and poring over preprints on arXiv.org. So here, I’d like to share a simple exercise that I go through when I’m looking for a new project.

This isn’t about persuading others to buy into your idea or justifying why your project is worthwhile. This is about you, a hard-working and curious soul, discovering topics that you genuinely find interesting and want to pursue—free of judgment, free of consequences.

As you make your decision, think about the factors that may influence your choice:

  • What skills you have
  • What skills you wish you had
  • How much time you have available
  • Where you are at in your career
  • What it will do for your career
  • Whether you will have a team to help
  • What impact you are looking to make

Where you are at and where you want to be

Start collecting and organizing information about yourself by making these five lists:

1) Your current skill set. Write down your strengths in areas in which you consider yourself knowledgeable or well read. List broad topic areas to open up the possibility of trying something completely new, or if you want to steer your thinking toward a specific topic area, only include subcategories within a specific domain.

2) What you’re interested in. It’s really simple. What do you think is cool? Maybe you read an article or a blog you thought was clever. Maybe you admired someone’s conference presentation. For this, I prefer to categorize these interests according to how much exposure I’ve had. This way, I can see where I might need to set aside time to learn the basics before making real progress.

3) How long the project will be. This isn’t necessarily meant to be the final date on which you walk away from your work to go do something else. I see this as more of a timeline in which you can stop and ask yourself whether you’re happy continuing or whether you want to choose a different path.

4) How many hours per week you will work on it. This is meant for you to take a look at your current situation and realistically determine your level of dedication. How many hours per week do you see yourself focusing on your project, knowing your schedule, prior commitments, attention span, and ability to work without distractions?

5) Desired outcome. This is meant to tie everything together and ask yourself what it is you want to produce with your effort. The outcome may be subtle, like the satisfaction of learning something new, or ambitious, like publishing a book or writing a dissertation.

Arranging these lists side-by-side helps you see the big picture and discover the different pathways that may lead to a project. I did this for myself to demonstrate how it might look:

The topics in green are ones that I understand fairly well; I could work my way through an academic publication on these topics without much trouble. Those in yellow are ones that I’ve had some exposure to but would need to do some extra Googling and reading to understand some of the subtleties. Those in red are, for the most part, completely uncharted waters for me. They sound cool, but I would have no idea what I would be getting into.

Be resourceful—use mad libs

Reading this chart from left to right, I can begin to think about all the different possibilities.

Using my ____________ skills, I can learn more about ______________ in ___ months if I commit at least ___ hours per week to produce a ______________.

When you start to build statements from your lists, it should become clear what is and isn’t feasible and what you are and aren’t willing to commit to. Here are some examples:

Using my C++ skills, I can learn more about LLVM in 6 months if I commit at least 5 hours per week to produce a peer-reviewed publication.

Sounds nice, but a peer-reviewed publication might be a bit of a stretch. I’ve completed the LLVM Kaleidoscope Tutorial and written some analysis passes before, but I’ve never taken a compilers course, nor am I familiar with compiler and programming language research. So a blog post or pull request might be more attainable with a 6-month, 120-hour commitment. Also, an LLVM project could be good for my career.

Using my statistics and numerical analysis skills, I can learn more about open-source intelligence in 12 months if I commit at least 8 hours per week to produce a new open-source tool.

I’ve been really interested in Bellingcat’s work ever since I read about how they tracked the downing of flight MH17 over Ukraine to a Russian missile system. Really cool stuff. I think this project and commitment level are similar to a typical IRAD project. At that level of commitment, I would want it to have an impact on my career, so I would need to try and link the project to one of Trail of Bits’ core values. The next step is to narrow the search and see where today’s open-source intelligence tools fall short.

Using my natural language processing skills, I can learn more about topic modeling in 9 months if I commit at least 3 hours per week to produce a blog post.

Three hours per week sounds reasonable for something like this for a personal project. It doesn’t really align with my career goals, but it’s something I’ve read about for the past few years and want to know more about. There are elements of statistics, programming, NLP, and machine learning involved. How cool is that!

Apply today!

At Trail of Bits, I encourage my team to allocate 20% of their workweek to an IRAD project. But when discussing ideas, it’s common to hear people say that they don’t think their idea is good or novel enough, that it isn’t likely to succeed, or that it’s either too ambitious or not ambitious enough.

What makes this exercise for choosing a project so effective is that all the work is simply put into drawing a line from what you do know to what you want to know. Your commitment level is likely to be predetermined by whatever situation you’re in. And the final goal or outcome will be informed by the other four parameters. If done in earnest, this method should produce a whole line-up of possible projects you could find yourself enjoying. I hope you try it, and I hope you find it motivating.

Once again, I’d like to invite our readers to check out our Careers page for all of our current open positions. And if you’re interested in the Senior Software Engineer or Software Security Research Engineer opening, I look forward to hearing more about the IRAD projects you hope to work on at Trail of Bits!

Motivating global stabilization

By Samuel Moelius, Staff Engineer
Originally published on October 12, 2021

Consensus protocols have come to play a critical role in many applications. Fischer, Lynch, and Paterson’s classic impossibility result showed that under reasonable assumptions, it can be impossible for a protocol to reach consensus. In Dwork, Lynch, and Stockmeyer’s paper “Consensus in the Presence of Partial Synchrony” (the DLS paper), the authors circumvent this impossibility result by introducing the following “global stabilization time” (GST) assumption.

For each execution there is a global stabilization time (GST), unknown to the processors, such that the message system respects the upper bound Δ from time GST onward.

In other words, GST is a point in time after which all network messages are delivered with a delay of at most Δ. Dwork, Lynch, and Stockmeyer showed that, under this assumption, one can construct a protocol that is guaranteed to reach consensus.
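
Stated a bit more formally (one common way to read the definition), the assumption says that there is a finite time GST, fixed for each execution but unknown to the processors, such that every message m sent at time send(m) is delivered no later than

    max(send(m), GST) + Δ

That is, a message sent after GST is delivered within Δ, and any message still in flight at GST is delivered by GST + Δ.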

But at face value, the assumption can seem unrealistic. After all, real networks do not work this way! Think of the network in your home or office. It is unlikely that at some magical point in time, the network will become reliably responsive, from then through eternity.

So why should you care? Well, even though the GST assumption can seem unrealistic, it has tremendous utility, as we explain in this post:

  1. One can see consensus as a game between two players and the GST assumption as a means of limiting the moves of one of those players. From this perspective, the assumption is quite natural, elegant even.
  2. Protocols that achieve consensus under this assumption exhibit a very useful property: they can recover from any configuration that could result from delays. Thus, proving a protocol correct under this assumption is not just of theoretical significance—it has concrete, practical implications.

We begin by reviewing Fischer, Lynch, and Paterson’s classic impossibility result, which laid the groundwork for the DLS paper. We then discuss the GST assumption in the context of a game between two players, with GST as a means of limiting the moves of one of those players. Finally, we shine a light on the practical implications of proving a protocol correct under the GST assumption.

The FLP impossibility result

In “Impossibility of Distributed Consensus with One Faulty Process,” Fischer, Lynch, and Paterson showed that under very mild assumptions, it can be impossible for a set of processes to reach consensus. This has come to be known as the FLP impossibility result. The DLS paper introduced the GST assumption, which makes it possible to circumvent this result, as highlighted in EATCS’s citation awarding the paper the Edsger W. Dijkstra Prize in Distributed Computing in 2007:

The eventual synchrony [assuming GST] approach introduced in this paper … has since been established as the leading approach for circumventing the FLP impossibility result and solving asynchronous consensus, atomic broadcast, and state-machine replication.

In this section, we describe the main ideas of Fischer, Lynch, and Paterson’s proof to provide context for the work presented in the DLS paper.

The model

In the FLP model, two or more processes exchange messages in an attempt to agree on a value, either 0 or 1. Each process has a write-once output register that is initially blank. Once a process believes a value has been agreed upon, it writes the value to its output register.

The processes reach consensus if at least one process writes a value to its output register, and no other process writes the opposite value to its own output register. Note that this definition of consensus is extremely lenient. For example, one might expect that all processes have to write the same value to their output registers. Adopting such a lenient definition strengthens the FLP result.

A configuration consists of the internal state of all processes (including their output registers), along with all messages that have been sent but not yet delivered. An event e is a pair (p, m) consisting of a message, m, addressed to a process, p. In our discussion of the proof in the section that follows, we use language like “e is delivered” instead of “m is delivered to p.” Furthermore, we say “e0, e1, … are delivered in configuration C0” to mean the following:

  • e0 is delivered in configuration C0, resulting in some configuration C1.
  • e1 is delivered in configuration C1, resulting in some configuration C2.
  • And so on.

Things can go awry for a protocol in two ways:

  • Any message may be delayed an arbitrary but finite amount of time.
  • One process may crash; that is, one process may stop responding to messages.

Note that if a process doesn’t receive a message that it expects to receive, it has no way of knowing whether the message is simply delayed or whether the sender has crashed.

Fischer, Lynch, and Paterson showed that in this model, consensus cannot be guaranteed.

The proof

Here, we try to give the intuition for the FLP result’s proof. Our description is intentionally a little fast and loose. For a more detailed explanation, we recommend Henry Robinson’s blog post, “A Brief Tour of FLP Impossibility.”

Consider the authors’ central lemma, which says essentially the following. If C is a bivalent configuration—meaning that the decision values 0 and 1 are possible—and e is an event that is deliverable in C, then e can be delivered in a way that results in a bivalent configuration. In more precise terms, there is a sequence of events ending with e that are deliverable in C and that result in a bivalent configuration. Using this lemma, the authors show that it is possible to deliver all messages in a way that no value is ever agreed upon.

The lemma’s proof goes essentially as follows. Let C be any configuration from which either 0 or 1 could be agreed upon, and let e be any event that is deliverable in C. Consider a configuration that results from delivering a sequence of events ending in e. By way of contradiction, suppose that any such configuration allows only 0 or 1 to be agreed upon, but no such configuration allows both possibilities (i.e., no such configuration is bivalent).

The authors show that, under these assumptions, both types of configurations must exist: configurations that allow only 0 to be agreed upon, and configurations that allow only 1 to be agreed upon. In fact, the authors show that there must exist a configuration C0, an event e’ for the same process as e, and a value i ∈ {0, 1}, such that the following holds:

  • If e is delivered in C0, then only i can be agreed upon.
  • If e’ and then e are delivered in C0, then only 1 – i can be agreed upon.

(It may be helpful to glance ahead at figure 1 at this point.)

Essentially, delivering e in C0 locks the processes to value i, while delivering e’ and then e in C0 locks the processes to value 1 – i. (We found jbapple’s StackExchange answer helpful in understanding this part of the proof.)

Now, let p be the recipient process named in e and e’. Because process p could crash, there must exist a sequence of events σ that is deliverable in C0, that does not involve p, and that leads to an agreement. Let A be the resulting configuration, and let j be the value agreed upon. Note that agreeing on j here does not involve e or e’, as neither is in σ.

On the other hand, while process p could crash, it might not; messages to or from p might simply be delayed. Recall that other processes cannot tell the difference between these two cases. So configuration A should be reachable if messages to or from p are simply delayed. Moreover, events e and e’ should be deliverable in A.

Consider the configuration that results from delivering e in A. This configuration should be the same as the configuration that results from delivering e and then σ in C0, as σ does not involve p. Next, consider the configuration that results from delivering e’ and then e in A. Again, because σ does not involve p, the result should be the same as the result of delivering e’, then e, and then σ in C0. (See Figure 1.)

Figure 1: Diagram used in the proof of the FLP result’s central lemma

To summarize, from C0, there are three key waypoints:

  • If e is delivered, the processes agree on i (D0).
  • If e’ and then e are delivered, the processes agree on 1 – i (D1).
  • If σ is delivered, the processes agree on j (A). (Remember, neither e nor e’ is in σ.)

But e and e’ are deliverable in A. Delivering e in A should lead to an agreement on i (E0), while delivering e’ and then e in A should lead to an agreement on 1 – i (E1). However, j was already agreed upon in A, and j cannot be both i and 1 – i. Thus, a contradiction results.

Limiting the adversary

Let us return to the DLS paper, which contains the following passage:

It is helpful to view each situation as a game between a protocol designer and an adversary. … If Δ holds eventually [i.e., the GST assumption holds], the adversary picks Δ, the designer (knowing Δ) supplies a consensus protocol, and the adversary picks a time T when Δ must start holding.

We can take this metaphor a bit further. After committing to time T (GST), the adversary simulates the supplied consensus protocol. Recall that there are two ways in which things can go awry for the protocol: messages may be delayed, and one process may crash. One can think of delaying a message or crashing a process as moves available to the adversary. The adversary wins if it can get the protocol to violate one of its correctness conditions. Specifically, the adversary wins if it can get two processes to write different values to their output registers, or if it can get the simulation to run forever without any process ever writing a value to its output register.

With this in mind, one might interpret the FLP result to mean that, unhindered by the GST assumption, the adversary cannot be defeated. In other words, in the game described above, the adversary always has a (possibly infinite) winning sequence of moves.

Let us give a face to the adversary. What kind of adversary could delay messages or cause a process to crash? We like to think of an electrical storm.

Imagine a bunch of residential homes trying to communicate over telephone lines. (This used to be a thing.) The storm can introduce noise into the lines through electrical interference, causing messages to have to be retransmitted and, thus, delayed (Figure 2). Furthermore, a lightning bolt could strike one unlucky home, causing it to … well, stop responding to messages (Figure 3).

Figure 2: As an electrical storm, the adversary can use electrical interference to cause messages to have to be retransmitted and, thus, delayed.

Figure 3: The adversary can cause one unlucky home (process) to crash (i.e., stop responding to messages).

How does GST fit into this analogy? The arrival of GST would mean that the sky has cleared. Here, the analogy is imperfect, so we have to skew reality a bit. Normally, one can look out one’s window and see that the sky has cleared, but for our purposes, the residents in figures 2 and 3 have no windows! Recall that in the passage cited above, the protocol designer supplies the protocol before the adversary chooses GST. So, effectively, the adversary knows GST, but the processes that carry out the protocol do not.

Our next question is, how might we level the playing field? What limitations could we impose upon the adversary to give the protocol a chance at winning?

The adversary is already limited in some ways. For example, while it can delay infinitely many messages, it can delay any message for only a finite amount of time. What if this finiteness restriction were extended to the set of messages that the adversary can affect? That is, what if the adversary were limited to delaying only a finite number of messages for a finite amount of time?

This proposal is, in fact, another way of stating the GST assumption, and the DLS paper shows that an adversary that is limited in this way can be defeated. In other words, there is a protocol that is guaranteed to reach a consensus under the GST assumption. In the context of the electrical storm metaphor, if the sky eventually clears, a decision can be reached (Figure 4).

Figure 4: Intuitively, the DLS result says that if the sky eventually clears, a decision can be reached.

How does assuming GST circumvent the FLP result? Recall that the proof’s central lemma relies on a process’s inability to tell whether another process p has crashed, or whether messages to or from p are delayed. Under the GST assumption, this reasoning is valid before GST but not after. To be more precise, a process cannot distinguish the following two cases:

  • Process p has crashed.
  • Messages to or from p are delayed and GST has not yet arrived.

Recall that the processes cannot tell when GST has arrived, so the difference between having the GST assumption and not having it is quite subtle. Nonetheless, the DLS paper shows that this difference is enough to guarantee that consensus can be reached.

From this perspective, the GST assumption is quite elegant. It is a minor adjustment to a logical formula. It expands a concept that was already present within the model, namely finiteness.

Now, we’re not suggesting that one should seek mathematical elegance over truth. We’re simply noting that, when viewed in this way, the GST assumption is elegant. Moreover, taking the next section’s observations into account, this elegance is simply icing on the cake.

Recovering from delays

In thinking of a consensus protocol as playing a game against an adversary, what does it mean for a protocol to win?

Recall that the adversary can delay any message an arbitrary but finite amount of time up until GST. At GST, the adversary must walk away. Because the adversary chooses GST, it can leave the protocol in any configuration that could result from delays.

Next, recall that a correct protocol must eventually reach a decision; that is, some process must write a value to its output register. So if a protocol is proved correct under the GST assumption, it must reach a decision regardless of its configuration at GST. As explained above, the protocol’s configuration at GST could be any configuration that could result from delays.

So if a protocol can defeat the adversary, it can reach a decision from any configuration that could result from delays.

This point is worth emphasizing, as the existence of such protocols is rather remarkable. Imagine all of the ways in which messages could be delayed. Note that the timeliness of a message’s delivery could affect what messages are subsequently produced. In other words, a process might decide which message to send based on whether a timeout occurs before some other message is received. So, very likely, the number of configurations that could result from delays is enormous. Pick any one such configuration, put the protocol into that configuration, and let it run. If the protocol is proved correct under the GST assumption, it will eventually reach a decision.

Keep in mind that “any configuration that could result from delays” does not mean any configuration. For example, if p could never send m, the adversary cannot put the protocol into a configuration in which p has sent m. Still, the number of possible configurations could be huge, and being able to reach a decision from any of them is quite amazing.

Conclusion

While the GST assumption may seem contrived, it is simply an elegant adjustment to a logical formula. Furthermore, proving a protocol correct under the GST assumption has profound implications: no matter how delays are imposed upon the protocol, so long as they subside, the protocol can recover and reach a decision.

Of course, the GST assumption may not be applicable to all protocols, such as those that are expected to run in environments in which delays are unending and too frequent for the GST assumption to be meaningful. A protocol such as this may require some other assumption to demonstrate its correctness.

Furthermore, proving a protocol correct does not imply that its implementation is correct. A developer may think they are assuming GST when, in reality, they are assuming something stronger, such as an assumption about when GST will arrive.

At Trail of Bits, we recommend a security audit for every consensus protocol implementation. If we could be of help to you in this regard, please reach out!

As a final note, while preparing this post, we came across a post by Ittai Abraham, “Flavours of Partial Synchrony,” which looks at the GST assumption from a slightly different perspective. We encourage you to read his post as well.

Acknowledgments

This research was conducted by Trail of Bits based upon work supported by DARPA under Contract No. HR001120C0084 (Distribution Statement A, Approved for Public Release: Distribution Unlimited). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Government or DARPA.

Announcing osquery 5: Now with EndpointSecurity on macOS

By Sharvil Shah, Senior Software Engineer
Originally published on October 6, 2021

TL;DR: Version 5.0.1 of osquery, a cross-platform, open-source endpoint visibility agent, is now available. This release is an exciting milestone for the project, as it introduces an EndpointSecurity-based process events table for macOS. Read on to learn how we integrated EndpointSecurity into osquery and how you can begin using it in your organization.

Apple’s New Rules for macOS Security

Over the years, Apple has been gradually taking pages from its iOS playbook to spruce up macOS security, beginning five years ago with the introduction of System Integrity Protection (SIP) to contain the root user in OS X 10.11 El Capitan. Since then, Apple has accelerated its efforts to improve macOS security by introducing stricter requirements for Gatekeeper and by enforcing code signing and notarization of application binaries and packages.

Entitlements are another feature strengthening macOS security. Granted by Apple and baked in with a corresponding code signature, an entitlement allows an application or binary to use restricted APIs or frameworks. These new locked-down APIs replace the APIs that were formerly available only in kernel-mode “kernel extensions.” Because osquery is a user-mode-only executable, in line with the out-of-the-kernel OS integrity trend that many platforms are adopting, the project was already well positioned to adopt these new APIs.

What is EndpointSecurity?

Apple has gradually deprecated kernel extensions with its recent releases of macOS. To replace kernel extensions, Apple developed the EndpointSecurity framework and API. When combined with the required entitlements, the EndpointSecurity framework enables user-mode processes to subscribe to events of interest from the macOS kernel in real time. EndpointSecurity replaces kauth, the kernel-mode authorization framework, and OpenBSM, the legacy framework used to grab the audit trail from the kernel.

Compared to OpenBSM, EndpointSecurity is more reliable, is more performant, and anecdotally captures more process events. For a more in-depth review of EndpointSecurity, check out our Sinter blog post, our team’s first demonstration of EndpointSecurity.

These security features are a great boon to end users. We were on a steep learning curve as we retrofitted osquery—which has always been deployed as a basic, standalone CLI executable—with new signing and packaging procedures, but we believe it was well worth the effort.

How to Use osquery with EndpointSecurity: A Mini Tutorial

With the 5.0.1 release of osquery, we have implemented the es_process_events table. Check the schema for this table before following along with the tutorial.

Following along in osqueryi

The simplest way to get started with osquery is by using osqueryi, the interactive osquery shell. Download the official macOS installer package from osquery.io and install it as you would any other application.

With the release of version 5.0.1, osquery is now installed as an app bundle in /opt/osquery/lib/osquery.app, and osqueryi is a symlink in /usr/local/bin.

Next up, grant your terminal emulator application—whether it be Terminal.app, iTerm2.app, or any other terminal emulator—Full Disk Access permissions in System Preferences. Full Disk Access is part of Apple’s Transparency, Consent, and Control (TCC) framework, another macOS security feature, and is required to enable EndpointSecurity. In the next section, we explain how to grant this permission automatically for Macs that are enrolled in a mobile device management (MDM) solution.

Finally, run osqueryi with root permissions and provide the --disable_events=false and --disable_endpointsecurity=false flags to launch osquery interactively, with ephemeral events and the EndpointSecurity-based es_process_events table enabled.

Below is an example of osqueryi capturing recent process events that have occurred since the last time osqueryi was launched.

➜  ~ sudo osqueryi --disable_events=false --disable_endpointsecurity=false
Using a virtual database. Need help, type '.help'
osquery> .mode line
osquery> select * from es_process_events;
 
        version = 4
        seq_num = 178
 global_seq_num = 574
            pid = 8522
           path = /Applications/Xcode.app/Contents/Developer/usr/share/xcs/Nginx/sbin/nginx
         parent = 1
original_parent = 1
        cmdline = /Library/Developer/XcodeServer/CurrentXcodeSymlink/Contents/Developer/usr/share/xcs/Nginx/sbin/nginx -c /Library/Developer/XcodeServer/CurrentXcodeSymlink/Contents/Developer/usr/share/xcs/xcsnginx/xcsnginx.conf
  cmdline_count = 3
            env = XPC_SERVICE_NAME=com.apple.xcsnginx PATH=/usr/bin:/bin:/usr/sbin:/sbin XPC_FLAGS=1 LOGNAME=_xcsnginx USER=_xcsnginx HOME=/var/_xcsnginx SHELL=/bin/false TMPDIR=/var/folders/xl/xl5_qxqd1095w75dfmq92c4w0000f3/T/
      env_count = 8
            cwd = /Applications/Xcode.app/Contents/Developer/usr/share/xcs/Nginx
            uid = 451
           euid = 450
            gid = 450
           egid = 450
       username = _xcsnginx
     signing_id = com.apple.nginx
        team_id =
         cdhash = 7fde0ccc9dcdb7d994e82a880d684c5418368460
platform_binary = 1
      exit_code =
      child_pid =
           time = 1631617834
     event_type = exec
 
 
        version = 4
        seq_num = 193
 global_seq_num = 552
            pid = 8077
           path = /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/Metadata.framework/Versions/A/Support/mdworker_shared
         parent = 1
original_parent = 1
        cmdline =
  cmdline_count = 0
            env =
      env_count = 0
            cwd =
            uid = 501
           euid = 20
            gid = 20
           egid = 20
       username = sharvil
     signing_id = com.apple.mdworker_shared
        team_id =
         cdhash = 993abbb7ffcd0d3216808513f8212f4fa1fa07d7
platform_binary = 1
      exit_code = 9
      child_pid =
           time = 1631617820
     event_type = exit

Deploying a PPPC Profile for an MDM-Managed Mac

While osqueryi is a great tool for interactively introspecting, monitoring, and developing queries suitable to your environment, most deployments of osquery are in daemon mode with a configuration file.

For a Mac host that is enrolled in MDM, you can grant Full Disk Access permissions automatically and silently by pushing a Privacy Preferences Policy Control (PPPC) configuration profile. For this profile, you need both the systemPolicyAllFiles key, which grants Full Disk Access, and a CodeRequirement key.

Use the codesign tool to output the CodeRequirement value, and copy everything in the output after designated =>.

> codesign  -dr - /opt/osquery/lib/osquery.app/Contents/MacOS/osqueryd
Executable=/opt/osquery/lib/osquery.app/Contents/MacOS/osqueryd
designated => identifier "io.osquery.agent" and anchor apple generic and 
certificate 1[field.1.2.840.113635.100.6.2.6] /* exists */ and certificate 
leaf[field.1.2.840.113635.100.6.1.13] /* exists */ and certificate 
leaf[subject.OU] = "3522FA9PXF"

To complete systemPolicyAllFiles, the Identifier should be io.osquery.agent and IdentifierType should be bundleID.

Pulling it all together, below is an example of a complete PPPC profile granting Full Disk Access to osquery. Please note that you will need to update the PayloadOrganization and other relevant fields, shown here in bold.

You may need to consult the documentation of your MDM provider for more information on PPPC profiles.

Migrating from osquery 4.x to 5.x

With the release of version 5.0.1, osquery now installs on macOS as an app bundle in /opt/osquery/lib/osquery.app, and the new package identifier is io.osquery.agent. If you are upgrading osquery from version 4.9, you need to stop the osquery launchd service and restart it after version 5.0.1 is installed, since osquery itself doesn’t provide a mechanism to clean up artifacts, binaries, and configuration files for older versions. Also of note is that the package installer does not install a LaunchDaemon to start osqueryd. You may use the provided osqueryctl start script to copy the sample LaunchDaemon plist and associated configuration to start the osqueryd daemon.

Similar changes apply to Linux and Windows hosts. Please consult the installation guide on the osquery wiki to learn more.

A Stronger Foundation

We developed a working proof-of-concept of osquery with an EndpointSecurity integration rather quickly; in fact, we merged the feature into osquery months before the version 5.0.1 release. But to actually “switch on” the EndpointSecurity functionality, we had to conquer a mountain of technical debt.

Finally, we had to socialize all these breaking changes and navigate all the community’s feedback. Indeed, the number of changes we needed to make warranted a new major version of osquery, from version 4.9 to 5.0.1.

In June 2019, Facebook and the Linux Foundation formed the osquery Foundation, a new community entity intended to accelerate the development and open participation around the osquery project. A multi-stakeholder Technical Steering Committee (which includes Trail of Bits) has been updating and maintaining osquery ever since. Throughout the project’s development, one of the biggest technical obstacles to the project’s independence was the lack of an automated packaging and code-signing pipeline. Thanks to the community’s efforts this year to integrate EndpointSecurity into osquery, this pipeline is finally in place. Facebook’s original osquery developers can now fully hand over the keys (literally and figuratively) to the community.

The osquery project is made possible only by the continuing involvement and support of its open-source community. We especially want to thank our client sponsor, Atlassian, and our fellow contributors to the osquery community, past and present.

Future Directions

Now that we have integrated EndpointSecurity into osquery, the tool comes with a variety of new detection capabilities on macOS. It should now be much easier to add a file event monitor, kernel extension loading events, and even memory mapping events. The automated code-signing and packaging groundwork that we’ve laid for EndpointSecurity could pave the way for other permissioned/entitled macOS event monitoring frameworks, like NetworkExtension, to be integrated into osquery.

Trail of Bits has been at the forefront of osquery development for years. Where do you want us to take the project next? Drop us a line to chat about how we can help implement your idea!

All your tracing are belong to BPF

By Alessandro Gario, Senior Software Engineer
Originally published August 11, 2021

TL;DR: These simpler, step-by-step methods equip you to apply BPF tracing technology to real-world problems—no specialized tools or libraries required.

BPF, a tracing technology in the Linux kernel originally aimed at the network stack, has recently become popular thanks to new extensions that enable novel use cases outside of its original scope. Today it can be used to implement program performance analysis tools, system and program dynamic tracing utilities, and much more.

In this blog post we’ll show you how to use the Linux implementation of BPF to write tools that access system and program events. The excellent tools from IO Visor make it possible for users to easily harness BPF technology without the considerable time investment of writing specialized tools in native code languages.

What the BPF?

BPF itself is just a way to express a program, and a runtime interpreter for executing that program “safely.” It’s a set of specifications for a virtual architecture, detailing how virtual machines dedicated to running its code should behave. The latest extensions to BPF have not only introduced new, really useful helper functions (such as reading a process’ memory), but also new registers and more stack space for the BPF bytecode.

Our main goal is to help you to take advantage of BPF and apply it to real-world problems without depending on external tools or libraries that may have been written with different goals and requirements in mind.

You can find the examples in this post in our repository. Please note that the code is simplified to focus on the concepts. This means that, where possible, we skip error checking and proper resource cleanup.

BPF program limitations

Even though we won’t be handwriting BPF assembly, it’s useful to know the code limitations since the in-kernel verifier will reject our instructions if we break its rules.

BPF programs are extremely simple, being made of only a single function. Instructions are sent to the kernel as an array of opcodes, meaning there’s no executable file format involved. Without sections, it’s not possible to have things like global variables or string literals; everything has to live on the stack, which can only hold up to 512 bytes. Branches are allowed, but it is only since kernel version 5.3 that jump opcodes can go backward—provided the verifier can prove the code will not execute forever.

The only other way to use loops without requiring recent kernel versions is to unroll them, but this will potentially use a lot of instructions, and older Linux versions will not load any program that exceeds the 4096 opcode count limit (see BPF_MAXINSNS under linux/bpf_common.h). Error handling is mandatory in some cases: the verifier will reject a program that uses a resource that can fail to initialize without first checking for that failure.

These limitations are extremely important since these programs can get hooked on kernel code. By challenging the correctness of the code, the verifier prevents malformed programs from causing system crashes or slowdowns.

External resources

To make BPF programs truly useful, they need ways to communicate with user mode processes and to manage long-term data; this is done via maps and perf event outputs.

Although many map types exist, they all essentially behave like key-value databases, and they are commonly used to share data with user mode code and/or with other BPF programs. Some of these types store data in per-CPU storage, making it easy to save and retrieve state when the same BPF program is run concurrently from different CPU cores.

Perf event outputs are generally used to send data to user mode programs and services, and are implemented as circular buffers.

Event sources

Without some data to process, our programs will just sit around doing nothing. BPF probes on Linux can be attached to several different event sources. For our purposes, we’re mainly interested in function tracing events.

Dynamic instrumentation

Similar to code hooking, BPF programs can be attached to any function. The probe type depends on where the target code lives. Kprobes are used when tracing kernel functions, while Uprobes are used when working with user mode libraries or binaries.

While Kprobe and Uprobe events are emitted when entering the monitored functions, Kretprobe and Uretprobe events are generated whenever the function returns. This works correctly even if the function being traced has multiple exit points. This kind of event does not forward typed function parameters and only comes with a pt_regs structure that contains the register values at the time of the call. Knowledge of the function prototype and the system ABI is required to map the function arguments back to the right registers; for example, on x86-64 the first integer argument of a traced function is found in the rdi register.

Static instrumentation

It’s not always ideal to rely on function hooking when writing a tool, because the risk of breakage increases as the kernel or software gets updated. In most cases, it’s best to use a more stable event source such as a tracepoint.

There are two types of tracepoints:

  • One for user mode code (USDT, a.k.a. User-Level Statically Defined Tracepoints)
  • One for kernel mode code (interestingly, they are referred to as just “tracepoints”).

Both types of tracepoints are defined in the source code by the programmer, essentially defining a stable interface that shouldn’t change unless strictly necessary.

If DebugFS has been enabled and mounted, registered tracepoints will all appear under the /sys/kernel/debug/tracing/events folder. Similar to Kprobes and Kretprobes, each system call defined in the Linux kernel comes with two different tracepoints. The first one, sys_enter, is activated whenever a program in the system transitions to a syscall handler inside the kernel, and carries information about the parameters that have been received. The second (and last) one, sys_exit, only contains the exit code of the function and is invoked whenever the syscall function terminates.

BPF development prerequisites

Even though there’s no plan to use external libraries, we still have a few dependencies. The most important thing is to have access to a recent LLVM toolchain compiled with BPF support. If your system does not satisfy this requirement, it is possible—and actually encouraged—to make use of the osquery toolchain. You’ll also need CMake, as that’s what I use for the sample code.

When running inside the BPF environment, our programs make use of special helper functions that require a kernel version newer than 4.18. While it’s possible to avoid using them, it would severely limit what we can do from our code.

Using Ubuntu 20.04 or equivalent is a good bet, as it comes with both a good kernel version and an up-to-date LLVM toolchain with BPF support.

Some LLVM knowledge is useful, but the code doesn’t require any advanced LLVM expertise. The Kaleidoscope language tutorial on the official site is a great introduction if needed.

Writing our first program

There are many new concepts to introduce, so we’ll start simple: our first example loads a program that returns without doing anything.

First, we create a new LLVM module and a function that contains our logic:

std::unique_ptr<llvm::Module> createBPFModule(llvm::LLVMContext &context) {
  auto module = std::make_unique<llvm::Module>("BPFModule", context);
  module->setTargetTriple("bpf-pc-linux");
  module->setDataLayout("e-m:e-p:64:64-i64:64-n32:64-S128");
  
  return module;
}
  
std::unique_ptr<llvm::Module> generateBPFModule(llvm::LLVMContext &context) {
  // Create the LLVM module for the BPF program
  auto module = createBPFModule(context);
  
  // BPF programs are made of a single function; we don't care about parameters
  // for the time being
  llvm::IRBuilder<> builder(context);
  auto function_type = llvm::FunctionType::get(builder.getInt64Ty(), {}, false);
  
  auto function = llvm::Function::Create(
      function_type, llvm::Function::ExternalLinkage, "main", module.get());  
      
  // Ask LLVM to put this function in its own section, so we can later find it
  // more easily after we have compiled it to BPF code
  function->setSection("bpf_main_section");
  
  // Create the entry basic block and terminate it with a return instruction
  auto entry_bb = llvm::BasicBlock::Create(context, "entry", function);
  
  builder.SetInsertPoint(entry_bb);
  builder.CreateRet(builder.getInt64(0));
  
  return module;
}

Since we’re not going to handle event arguments, the function we created does not accept any parameters. Not much else is happening here except the return instruction. Remember, each BPF program has exactly one function, so it’s best to ask LLVM to store them in separate sections. This makes it easier to retrieve them once the module is compiled.

We can now JIT our module to BPF bytecode using the ExecutionEngine class from LLVM:

SectionMap compileModule(std::unique_ptr<llvm::Module> module) {
  // Create a new execution engine builder and configure it
  auto exec_engine_builder =
      std::make_unique<llvm::EngineBuilder>(std::move(module));
  
  exec_engine_builder->setMArch("bpf");
  
  SectionMap section_map;
  exec_engine_builder->setMCJITMemoryManager(
      std::make_unique<SectionMemoryManager>(section_map));
      
  // Create the execution engine and build the given module
  std::unique_ptr<llvm::ExecutionEngine> execution_engine(
      exec_engine_builder->create());
      
  execution_engine->setProcessAllSections(true);
  execution_engine->finalizeObject();
  
  return section_map;
}

Our custom SectionMemoryManager class mostly acts as a passthrough to the original SectionMemoryManager class from LLVM—it’s only there to keep track of the sections that the ExecutionEngine object creates when compiling our IR.

Once the code is built, we get back a vector of bytes for each function that was created inside the module:

int loadProgram(const std::vector<std::uint8_t> &program) {
  // The program needs to be aware how it is going to be used. We are
  // only interested in tracepoints, so we'll hardcode this value
  union bpf_attr attr = {};
  attr.prog_type = BPF_PROG_TYPE_TRACEPOINT;
  attr.log_level = 1U;
  
  // This is the array of (struct bpf_insn) instructions we have received
  // from the ExecutionEngine (see the compileModule() function for more
  // information)
  auto instruction_buffer_ptr = program.data();
  std::memcpy(&attr.insns, &instruction_buffer_ptr, sizeof(attr.insns));
  
  attr.insn_cnt =
      static_cast<__u32>(program.size() / sizeof(struct bpf_insn));
      
  // The license is important because we will not be able to call certain
  // helpers within the BPF VM if it is not compatible
  static const std::string kProgramLicense{"GPL"};
  
  auto license_ptr = kProgramLicense.c_str();
  std::memcpy(&attr.license, &license_ptr, sizeof(attr.license));
  
  // The verifier will provide a text disasm of our BPF program in here.
  // If there is anything wrong with our code, we'll also find some
  // diagnostic output
  std::vector<char> log_buffer(4096, 0);
  attr.log_size = static_cast<__u32>(log_buffer.size());
  
  auto log_buffer_ptr = log_buffer.data();
  std::memcpy(&attr.log_buf, &log_buffer_ptr, sizeof(attr.log_buf));
  
  auto program_fd =
      static_cast<int>(::syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)));
      
  if (program_fd < 0) {
    std::cerr << "Failed to load the program: " << log_buffer.data() << "\n";
  }
  
  return program_fd;
}

Loading the program is not hard, but as you may have noticed, there is no helper function defined for the bpf() system call we’re using. The tracepoint is the easiest event type to set up, and it’s what we’re using for the time being.

Once the BPF_PROG_LOAD command is issued, the in-kernel verifier will validate our program and also provide a disassembly of it inside the log buffer we’ve provided. The operation will fail if kernel output is longer than the bytes available, so only provide a log buffer in production code if the load has already failed.

Another important field in the attr union is the program license; specifying any value other than GPL may disable some of the features that are exposed to BPF. I’m not a licensing expert, but it should be possible to use different licenses for the generator and the generated code (but please speak to a lawyer and/or your employer first!).

We can now assemble the main() function using the helpers we built:

int main() {
  initializeLLVM();
  
  // Generate our BPF program
  llvm::LLVMContext context;
  auto module = generateBPFModule(context);
  
  // JIT the module to BPF code using the execution engine
  auto section_map = compileModule(std::move(module));
  if (section_map.size() != 1U) {
    std::cerr << "Unexpected section count\n";
    return 1;
  }
  
  // We have previously asked LLVM to create our function inside a specific
  // section; get our code back from it and load it
  const auto &main_program = section_map.at("bpf_main_section");
  auto program_fd = loadProgram(main_program);
  if (program_fd < 0) {
    return 1;
  }
  
  releaseLLVM();
  return 0;
}

If everything works correctly, no error is printed when the binary is run as the root user. You can find the source code for the empty program in the 00-empty folder of the companion code repository.

But…this program isn’t very exciting, since it doesn’t do anything! Now we’ll update it so we can execute it when a certain system event happens.

Creating our first useful program

In order to actually execute our BPF programs, we have to attach them to an event source.

Creating a new tracepoint event is easy; it only involves reading and writing some files from under the debugfs folder:

int createTracepointEvent(const std::string &event_name) {
  const std::string kBaseEventPath = "/sys/kernel/debug/tracing/events/";
  
  // This special file contains the id of the tracepoint, which is
  // required to initialize the event with perf_event_open  
  std::string event_id_path = kBaseEventPath + event_name + "/id";
  
  // Read the tracepoint id and convert it to an integer
  auto event_file = std::fstream(event_id_path, std::ios::in);
  if (!event_file) {
    return -1;
  }
  
  std::stringstream buffer;
  buffer << event_file.rdbuf();
  
  auto str_event_id = buffer.str();
  auto event_identifier = static_cast<std::uint32_t>(
      std::strtol(str_event_id.c_str(), nullptr, 10));
      
  // Create the event
  struct perf_event_attr perf_attr = {};
  perf_attr.type = PERF_TYPE_TRACEPOINT;
  perf_attr.size = sizeof(struct perf_event_attr);
  perf_attr.config = event_identifier;
  perf_attr.sample_period = 1;
  perf_attr.sample_type = PERF_SAMPLE_RAW;
  perf_attr.wakeup_events = 1;
  perf_attr.disabled = 1;
  
  int process_id{-1};
  int cpu_index{0};
  
  auto event_fd =
      static_cast<int>(::syscall(__NR_perf_event_open, &perf_attr, process_id,
                                 cpu_index, -1, PERF_FLAG_FD_CLOEXEC));
  
  return event_fd;
}

To create the event file descriptor, we have to find the tracepoint identifier, which is in a special file called (unsurprisingly) “id.”

For our last step, we attach the program to the tracepoint event we just created. This is trivial and can be done with a couple of ioctl calls on the event’s file descriptor:

bool attachProgramToEvent(int event_fd, int program_fd) {
  if (ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, program_fd) < 0) {
    return false;
  }

  if (ioctl(event_fd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
    return false;
  }

  return true;
}

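Putting the pieces together, the event creation and attachment can be wired up from main() roughly as sketched below; the tracepoint name is just an example (it’s the one we’ll use later in this post), and program_fd is the descriptor returned by loadProgram:

// Sketch: create the tracepoint event and attach our loaded program to it
auto event_fd = createTracepointEvent("syscalls/sys_enter_fchmodat");
if (event_fd < 0) {
  std::cerr << "Failed to create the tracepoint event\n";
  return 1;
}

if (!attachProgramToEvent(event_fd, program_fd)) {
  std::cerr << "Failed to attach the program to the event\n";
  return 1;
}
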
Our program should finally succeed in running our BPF code, but no output is generated yet since our module only really contained a return opcode. The easiest way to generate some output is to use the bpf_trace_printk helper to print a fixed string:

void generatePrintk(llvm::IRBuilder<> &builder) {
  // The bpf_trace_printk() function prototype can be found inside
  // the /usr/include/linux/bpf.h header file
  std::vector<llvm::Type *> argument_type_list = {builder.getInt8PtrTy(),
                                                  builder.getInt32Ty()};

  auto function_type =
      llvm::FunctionType::get(builder.getInt64Ty(), argument_type_list, true);

  auto function =
      builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_trace_printk),
                             llvm::PointerType::getUnqual(function_type));

  // Allocate 8 bytes on the stack
  auto buffer = builder.CreateAlloca(builder.getInt64Ty());

  // Copy the string characters to the 64-bit integer
  static const std::string kMessage{"Hello!!"};

  std::uint64_t message{0U};
  std::memcpy(&message, kMessage.c_str(), sizeof(message));

  // Store the characters inside the buffer we allocated on the stack
  builder.CreateStore(builder.getInt64(message), buffer);

  // Print the characters
  auto buffer_ptr = builder.CreateBitCast(buffer, builder.getInt8PtrTy());

#if LLVM_VERSION_MAJOR < 11
  auto function_callee = function;
#else
  auto function_callee = llvm::FunctionCallee(function_type, function);
#endif

  builder.CreateCall(function_callee, {buffer_ptr, builder.getInt32(8U)});
}

Importing new helper functions from BPF is quite easy. The first thing we need is the prototype, which can be taken from the linux/bpf.h header. The one for bpf_trace_printk reads as follows:

 * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
 * 	Description
 * 		This helper is a "printk()-like" facility for debugging. It
 * 		prints a message defined by format *fmt* (of size *fmt_size*)
 * 		to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
 * 		available. It can take up to three additional **u64**
 * 		arguments (as an eBPF helpers, the total number of arguments is
 * 		limited to five).

Once the function type matches, we only have to assemble a call that uses the helper function ID as the destination address: BPF_FUNC_trace_printk. The generatePrintk function can now be added to our program right before we create the return instruction inside generateBPFModule.
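
As a rough sketch, the call site inside generateBPFModule would look like the following; the surrounding function body is simplified here and may not match the full sample exactly:

// Inside generateBPFModule(), once the entry basic block has been created and
// the IRBuilder insert point has been set:
generatePrintk(builder);

// The return instruction still closes the program, as before
builder.CreateRet(builder.getInt64(0));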

The full source code for this program can be found in the 01-hello_open folder.

Running the program again will print the “Hello!!” string to the /sys/kernel/debug/tracing/trace_pipe file every time the tracepoint event is emitted. Text output can be useful, but due to the BPF VM limitations this helper is not as flexible as printf is in a standard C program.

In the next section, we’ll take a look at maps and how to use them as data storage for our programs.

Profiling system calls

Using maps to store data

Maps are a major component in most programs, and can be used in a number of different ways. Since they’re accessible from both kernel and user mode, they can be useful in storing data for later processing either from additional probes or user programs. Given the limitations that BPF imposes, they’re also commonly used to provide scratch space for handling temporary data that does not fit on the stack.

There are many map types; some are specialized for certain uses, such as storing stack traces. Others are more generic, and suitable for use as custom data containers.

Concurrency and thread safety are not just user mode problems, and BPF comes with two really useful per-CPU map types that give each processor its own private copy of the stored values. These maps are commonly used to replace the stack, as a per-CPU map can be easily referenced by programs without having to worry about synchronization.

It’s rather simple to create and use maps since they all share the same interface, regardless of type. The following list, taken from the BPF header file comments, documents the most common operations:

  • BPF_MAP_CREATE: Create a map and return a file descriptor that refers to the map. The close-on-exec file descriptor flag (see fcntl(2)) is automatically enabled for the new file descriptor.
  • BPF_MAP_LOOKUP_ELEM: Look up an element by key in a specified map and return its value.
  • BPF_MAP_UPDATE_ELEM: Create or update an element (key/value pair) in a specified map.
  • BPF_MAP_DELETE_ELEM: Look up and delete an element by key in a specified map.

    The only important thing to remember is that when operating on per-CPU maps the value is not just a single entry, but an array of values that has as many items as CPU cores.

    Creating a map

    Before we can create our map, we have to determine which type we want to use. The following enum declaration has been taken from the linux/bpf.h header file:

    enum bpf_map_type {
      BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
      BPF_MAP_TYPE_HASH,
      BPF_MAP_TYPE_ARRAY,
      BPF_MAP_TYPE_PROG_ARRAY,
      BPF_MAP_TYPE_PERF_EVENT_ARRAY,
      BPF_MAP_TYPE_PERCPU_HASH,
      BPF_MAP_TYPE_PERCPU_ARRAY,
      BPF_MAP_TYPE_STACK_TRACE,
      BPF_MAP_TYPE_CGROUP_ARRAY,
      BPF_MAP_TYPE_LRU_HASH,
      BPF_MAP_TYPE_LRU_PERCPU_HASH,
      BPF_MAP_TYPE_LPM_TRIE,
      BPF_MAP_TYPE_ARRAY_OF_MAPS,
      BPF_MAP_TYPE_HASH_OF_MAPS,
      BPF_MAP_TYPE_DEVMAP,
      BPF_MAP_TYPE_SOCKMAP,
      BPF_MAP_TYPE_CPUMAP,
    };
    

    Most of the time we’ll use hash maps and arrays. To create one, we have to fill in a bpf_attr union, initializing the key and value sizes as well as the maximum number of entries the map can hold.

    int createMap(bpf_map_type type, std::uint32_t key_size,
                  std::uint32_t value_size, std::uint32_t key_count) {
    
      union bpf_attr attr = {};
    
      attr.map_type = type;
      attr.key_size = key_size;
      attr.value_size = value_size;
      attr.max_entries = key_count;
    
      return static_cast<int>(
          syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)));
    }
    

    Not every available operation always makes sense for all map types. For example, it’s not possible to delete entries when working with an array. Lookup operations are also going to behave differently, as they will only fail when the specified index is beyond the last element.

    Here’s the code to read a value from a map:

    // Error codes for map operations; depending on the map type, reads may
    // return NotFound if the specified key is not present
    enum class ReadMapError { Succeeded, NotFound, Failed };
    
    // Attempts to read a key from the specified map. Values in per-CPU maps
    // actually have multiple entries (one per CPU)
    ReadMapError readMapKey(std::vector<std::uint8_t> &value, int map_fd,
                            const void *key) {
    
      union bpf_attr attr = {};
    
      // Use memcpy to avoid strict aliasing issues
      attr.map_fd = static_cast<__u32>(map_fd);
      std::memcpy(&attr.key, &key, sizeof(attr.key));
    
      auto value_ptr = value.data();
      std::memcpy(&attr.value, &value_ptr, sizeof(attr.value));
    
      auto err =
          ::syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(union bpf_attr));
    
      if (err >= 0) {
        return ReadMapError::Succeeded;
      }
    
      if (errno == ENOENT) {
        return ReadMapError::NotFound;
      } else {
        return ReadMapError::Failed;
      }
    }
    

    Writing a BPF program to count syscall invocations

    In this example we’ll build a probe that counts how many times the tracepoint we’re tracing gets called. We’ll create a counter for each processor core, using a per-CPU array map that only contains a single item.

    auto map_fd = createMap(BPF_MAP_TYPE_PERCPU_ARRAY, 4U, 8U, 1U);
    if (map_fd < 0) {
      return 1;
    }
    

    Referencing this map from the BPF code is not too hard but requires some additional operations:

    1. Convert the map file descriptor to a map address
    2. Use the bpf_map_lookup_elem helper function to retrieve the pointer to the desired map entry
    3. Check the returned pointer to make sure the operation has succeeded (the validator will reject our program otherwise)
    4. Update the counter value

    The map address can be obtained through a special LLVM intrinsic called “pseudo.”

    // Returns the pseudo intrinsic, useful to convert file descriptors (like maps
    // and perf event outputs) to map addresses so they can be used from the BPF VM
    llvm::Function *getPseudoFunction(llvm::IRBuilder<> &builder) {
      auto &insert_block = *builder.GetInsertBlock();
      auto &module = *insert_block.getModule();
    
      auto pseudo_function = module.getFunction("llvm.bpf.pseudo");
    
      if (pseudo_function == nullptr) {
        // clang-format off
        auto pseudo_function_type = llvm::FunctionType::get(
          builder.getInt64Ty(),
    
          {
            builder.getInt64Ty(),
            builder.getInt64Ty()
          },
    
          false
        );
        // clang-format on
    
        pseudo_function = llvm::Function::Create(pseudo_function_type,
                                                 llvm::GlobalValue::ExternalLinkage,
                                                 "llvm.bpf.pseudo", module);
      }
    
      return pseudo_function;
    }
    
    // Converts the given (map or perf event output) file descriptor to a map
    // address
    llvm::Value *mapAddressFromFileDescriptor(int fd, llvm::IRBuilder<> &builder) {
      auto pseudo_function = getPseudoFunction(builder);
    
      // clang-format off
      auto map_integer_address_value = builder.CreateCall(
        pseudo_function,
    
        {
          builder.getInt64(BPF_PSEUDO_MAP_FD),
          builder.getInt64(static_cast<std::uint64_t>(fd))
        }
      );
      // clang-format on
    
      return builder.CreateIntToPtr(map_integer_address_value,
                                    builder.getInt8PtrTy());
    }
    

    Importing the bpf_map_lookup_elem helper function follows the same procedure we used to import bpf_trace_printk. Looking at the linux/bpf.h header, the prototype reads:

     * void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
     * 	Description
     * 		Perform a lookup in *map* for an entry associated to *key*.
     * 	Return
     * 		Map value associated to *key*, or **NULL** if no entry was
     * 		found.
    

    Notice how the key parameter is passed by pointer and not by value. We’ll have to allocate the actual key on the stack using CreateAlloca. Since allocations should always happen in the first (entry) basic block, our function will accept a pre-filled buffer as key. The return type is a void pointer, but we can save work if we directly declare the function with the correct value type.

    // Attempts to retrieve a pointer to the specified key inside the map_fd map
    llvm::Value *bpfMapLookupElem(llvm::IRBuilder<> &builder, llvm::Value *key,
                                  llvm::Type *value_type, int map_fd) {
    
      std::vector<llvm::Type *> argument_type_list = {builder.getInt8PtrTy(),
                                                      key->getType()};
    
      auto function_type = llvm::FunctionType::get(value_type->getPointerTo(),
                                                   argument_type_list, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_map_lookup_elem),
                                 llvm::PointerType::getUnqual(function_type));
    
      auto map_address = mapAddressFromFileDescriptor(map_fd, builder);
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(function_callee, {map_address, key});
    }
    

    Back to the BPF program generator, we can now call the new bpfMapLookupElem to retrieve the first value in our array map:

    auto map_key_buffer = builder.CreateAlloca(builder.getInt32Ty());
    builder.CreateStore(builder.getInt32(0U), map_key_buffer);
    
    auto counter_ptr =
        bpfMapLookupElem(builder, map_key_buffer, builder.getInt32Ty(), map_fd);
    

    Since we are using a per-CPU array map, the pointer returned by this function references a private array entry for the core we’re running on. Before we can use it, however, we have to test whether the function has succeeded; otherwise, the verifier will reject the program. This is trivial and can be done with a comparison instruction and a new basic block.

    auto null_ptr = llvm::Constant::getNullValue(counter_ptr->getType());
    auto cond = builder.CreateICmpEQ(null_ptr, counter_ptr);
    
    auto error_bb = llvm::BasicBlock::Create(context, "error", function);
    auto continue_bb = llvm::BasicBlock::Create(context, "continue", function);
    
    builder.CreateCondBr(cond, error_bb, continue_bb);
    
    builder.SetInsertPoint(error_bb);
    builder.CreateRet(builder.getInt64(0));
    
    builder.SetInsertPoint(continue_bb);
    

    The pointer to the counter value can now be dereferenced without causing a validation error from the verifier.

    auto counter = builder.CreateLoad(counter_ptr);
    auto new_value = builder.CreateAdd(counter, builder.getInt32(1));
    
    builder.CreateStore(new_value, counter_ptr);
    builder.CreateRet(builder.getInt64(0));
    

    There is no need to import and use the bpf_map_update_elem() helper function since we can directly increment the value from the pointer we received. We only have to load the value from the pointer, increment it, and then store it back where it was.

    Once we have finished with our tracer, we can retrieve the counters and inspect them:

    auto processor_count = getProcessorCount();
    std::vector<std::uint8_t> value(processor_count * sizeof(std::uint64_t));
    
    std::uint32_t key{0U};
    auto map_error = readMapKey(value, map_fd, &key);
    
    if (map_error != ReadMapError::Succeeded) {
      std::cerr << "Failed to read from the map\n";
      return 1;
    }
    
    std::vector<std::uint64_t> per_cpu_counters(processor_count);
    std::memcpy(per_cpu_counters.data(), value.data(), value.size());
    

    When dealing with per-CPU maps, it is important to not rely on get_nprocs_conf and use /sys/devices/system/cpu/possible instead. On VMware Fusion for example, the vcpu.hotadd setting will cause Linux to report 128 possible CPUs when enabled, regardless of how many cores have been actually assigned to the virtual machine.
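
    A possible implementation of the getProcessorCount helper used throughout these snippets is sketched below; the body is only illustrative, and it assumes the “possible” file contains a single range such as 0-3:

    #include <cstddef>
    #include <cstdlib>
    #include <fstream>
    #include <string>

    std::size_t getProcessorCount() {
      std::ifstream possible_cpus("/sys/devices/system/cpu/possible");

      std::string range;
      if (!(possible_cpus >> range)) {
        // Conservative fallback if the file can't be read
        return 1U;
      }

      // The value after the dash (if any) is the highest possible CPU index
      auto separator = range.find('-');
      auto last_index =
          (separator == std::string::npos) ? range : range.substr(separator + 1);

      return static_cast<std::size_t>(
                 std::strtoul(last_index.c_str(), nullptr, 10)) +
             1U;
    }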

    The full sample code can be found in the 02-syscall_counter folder.

    One interesting experiment is to attach this program to the system call tracepoint used by the chmod command line tool to update file modes. The strace debugging utility can help determine which syscall is being used. In this case we are going to be monitoring the following tracepoint: syscalls/sys_enter_fchmodat.

    The taskset command can be used to force the fchmodat syscall to be issued from a specific processor:

    taskset 1 chmod u+x /path/to/file # affinity mask 0x1 -> CPU 0
    taskset 2 chmod u+x /path/to/file # affinity mask 0x2 -> CPU 1
    

    Using perf event outputs

    Maps can be a really powerful way to store data for later processing, but it’s impossible for user mode programs to know when and where new data is available for reading.

    Perf event outputs can help solve this problem, since they enable the program to be notified whenever new data is available. Additionally, since they behave like a circular buffer, we do not have the same size limitations we have when setting map values.

    In this section, we’ll build an application that can measure how much time it takes to handle a system call. To make this work, we’ll attach a program to both the entry and exit points of a tracepoint to gather timestamps.

    Initialization

    Before we start creating our perf output, we have to create a structure to hold our resources. In total, we’ll have a file descriptor for the map and then a perf output per processor, along with its own memory mapping.

    struct PerfEventArray final {
      int fd;
    
      std::vector<int> output_fd_list;
      std::vector<void *> mapped_memory_pointers;
    };
    

    To initialize it, we have to create a BPF map of the type PERF_EVENT_ARRAY first. This special data structure maps a specific CPU index to a private perf event output specified as a file descriptor. For it to function properly, we must use the following parameters when creating the map:

    1. Key size must be set to 4 bytes (CPU index).
    2. Value size must be set to 4 bytes (size of a file descriptor specified with an int).
    3. Entry count must be set to a value greater than or equal to the number of processors.

    auto processor_count = getProcessorCount();
    
    // Create the perf event array map
    obj.fd = createMap(BPF_MAP_TYPE_PERF_EVENT_ARRAY, 4U, 4U, processor_count);
    if (obj.fd < 0) {
      return false;
    }
    

    When we looked at maps in the previous sections, we only focused on reading. For the next steps we also need to write new values, so let’s take a look at how to set keys.

    ReadMapError setMapKey(std::vector<std::uint8_t> &value, int map_fd,
                           const void *key) {
    
      union bpf_attr attr = {};
      attr.flags = BPF_ANY; // Always set the value
      attr.map_fd = static_cast<__u32>(map_fd);
    
      // Use memcpy to avoid strict aliasing issues
      std::memcpy(&attr.key, &key, sizeof(attr.key));
    
      auto value_ptr = value.data();
      std::memcpy(&attr.value, &value_ptr, sizeof(attr.value));
    
      auto err = ::syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
    
      if (err < 0) {
        return ReadMapError::Failed;
      }
    
      return ReadMapError::Succeeded;
    }
    

    This is not too different from how we read map values, but this time we don’t have to deal with the chance that the key may not be present. As always when dealing with per-CPU maps, the data pointer should be considered as an array containing one value per CPU.

    The next step is to create a perf event output for each online processor with the perf_event_open system call, using the special PERF_COUNT_SW_BPF_OUTPUT config value.

    struct perf_event_attr attr {};
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_BPF_OUTPUT;
    attr.sample_period = 1;
    attr.sample_type = PERF_SAMPLE_RAW;
    attr.wakeup_events = 1;
    
    std::uint32_t processor_index;
    for (processor_index = 0U; processor_index < processor_count;
         ++processor_index) {
    
      // clang-format off
      auto perf_event_fd = ::syscall(
        __NR_perf_event_open,
        &attr,
        -1,               // Process ID (unused)
        processor_index,  // 0 -> getProcessorCount()
        -1,               // Group ID (unused)
        0                 // Flags (unused)
      );
      // clang-format on
    
      if (perf_event_fd == -1) {
        return false;
      }
    
      obj.output_fd_list.push_back(static_cast<int>(perf_event_fd));
    }
    

    Now that we have the file descriptors, we can populate the perf event array map we created:

    // Set the perf event output file descriptors inside the map
    processor_index = 0U;
    
    for (auto perf_event_fd : obj.output_fd_list) {
      std::vector<std::uint8_t> value(4);
      std::memcpy(value.data(), &perf_event_fd, sizeof(perf_event_fd));
    
      auto err = setMapKey(value, obj.fd, &processor_index);
      if (err != ReadMapError::Succeeded) {
        return false;
      }
    
      ++processor_index;
    }
    

    Finally, we create a memory mapping for each perf output:

    // Create a memory mapping for each output
    auto size = static_cast<std::size_t>(1 + std::pow(2, page_count));
    size *= static_cast<std::size_t>(getpagesize());
    
    for (auto &perf_event_fd : obj.output_fd_list) {
      auto ptr = mmap(nullptr,                // Desired base address (unused)
                      size,                   // Mapped memory size
                      PROT_READ | PROT_WRITE, // Memory protection
                      MAP_SHARED,             // Flags
                      perf_event_fd,          // The perf output handle
                      0                       // Offset (unused)
      );
    
      if (ptr == MAP_FAILED) {
        return false;
      }
    
      obj.mapped_memory_pointers.push_back(ptr);
    }
    

    This is the memory we’ll read from when capturing the BPF program output.

    Writing a BPF program to profile system calls

    Now that we have a file descriptor of the perf event array map, we can use it from within the BPF code to send data with the bpf_perf_event_output helper function. Here’s the prototype from linux/bpf.h:

     * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
     * 	Description
     * 		Write raw *data* blob into a special BPF perf event held by
     * 		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
     * 		event must have the following attributes: **PERF_SAMPLE_RAW**
     * 		as **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
     * 		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
     *
     * 		The *flags* are used to indicate the index in *map* for which
     * 		the value must be put, masked with **BPF_F_INDEX_MASK**.
     * 		Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
     * 		to indicate that the index of the current CPU core should be
     * 		used.
     *
     * 		The value to write, of *size*, is passed through eBPF stack and
     * 		pointed by *data*.
     *
     * 		The context of the program *ctx* needs also be passed to the
     * 		helper.
    

    The ctx parameter must always be set to the value of the first argument received in the entry point function of the BPF program.

    The map address is obtained with the LLVM pseudo intrinsic that we imported in the previous section. Data and size are self-explanatory, but it is important to remember that the memory pointer must reside inside the BPF program (i.e., we can’t pass a user pointer).

    The last parameter, flags, can be used as a CPU index mask to select the perf event output this data should be sent to. A special value can be passed to ask the BPF VM to automatically use the index of the processor we’re running on.

    // Sends the specified buffer to the map_fd perf event output
    llvm::Value *bpfPerfEventOutput(llvm::IRBuilder<> &builder, llvm::Value *ctx,
                                    int map_fd, std::uint64_t flags,
                                    llvm::Value *data, llvm::Value *size) {
    
      // clang-format off
      std::vector<llvm::Type *> argument_type_list = {
        // Context
        ctx->getType(),
    
        // Map address
        builder.getInt8PtrTy(),
    
        // Flags
        builder.getInt64Ty(),
    
        // Data pointer
        data->getType(),
    
        // Size
        builder.getInt64Ty()
      };
      // clang-format on
    
      auto function_type =
          llvm::FunctionType::get(builder.getInt32Ty(), argument_type_list, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_perf_event_output),
                                 llvm::PointerType::getUnqual(function_type));
    
      auto map_address = mapAddressFromFileDescriptor(map_fd, builder);
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(
          function_callee, {ctx, map_address, builder.getInt64(flags), data, size});
    }
    

    The file descriptor and flags parameters are most likely known at compile time, so we can make the function a little more user friendly by accepting integer types. The buffer size, however, is often determined at runtime, so it’s best to use an llvm::Value pointer.

    While it’s possible to just send the raw timestamps whenever we enter and leave the system call of our choice, it’s much easier and more efficient to compute what we need directly inside the BPF code. To do this we’ll use a hash map shared across two different BPF programs: one for the sys_enter event, and another for the sys_exit event.

    From the enter program, we’ll save the system timestamp in the map. When the exit program is invoked, we’ll retrieve it and use it to determine how much time it took. The resulting value is then sent to the user mode program using the perf output.

    Creating the map is easy, and we can re-use the map helpers we wrote in the previous sections. Both the timestamp and the map key are 64-bit values, so we’ll use 8 bytes for both:

    auto map_fd = createMap(BPF_MAP_TYPE_HASH, 8U, 8U, 100U);
    if (map_fd < 0) {
      std::cerr << "Failed to create the map\n";
      return 1;
    }
    

    Writing the enter program

    We will need to generate a key for our map. A combination of the process ID and thread ID is a good candidate for this:

     * u64 bpf_get_current_pid_tgid(void)
     * 	Return
     * 		A 64-bit integer containing the current tgid and pid, and
     * 		created as such:
     * 		*current_task*\ **->tgid << 32 \|**
     * 		*current_task*\ **->pid**.
    

    Then the system timestamp needs to be acquired. Even though the bpf_ktime_get_ns helper counts time from system boot rather than wall-clock time, that is good enough for our purposes, since we only use it to compute a time difference.

     * u64 bpf_ktime_get_ns(void)
     *  Description
     *    Return the time elapsed since system boot, in nanoseconds.
     *  Return
     *    Current *ktime*.
    

    By now you should be well versed in importing them, so here are the two definitions:

    // Returns a 64-bit integer that contains both the process and thread id
    llvm::Value *bpfGetCurrentPidTgid(llvm::IRBuilder<> &builder) {
      auto function_type = llvm::FunctionType::get(builder.getInt64Ty(), {}, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_get_current_pid_tgid),
                                 llvm::PointerType::getUnqual(function_type));
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(function_callee, {});
    }
    
    // Returns the amount of nanoseconds elapsed from system boot
    llvm::Value *bpfKtimeGetNs(llvm::IRBuilder<> &builder) {
      auto function_type = llvm::FunctionType::get(builder.getInt64Ty(), {}, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_ktime_get_ns),
                                 llvm::PointerType::getUnqual(function_type));
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(function_callee, {});
    }
    

    We can now use the newly defined functions to generate a map key and acquire the system timestamp:

    // Map keys and values are passed by pointer; create two buffers on the
    // stack and initialize them
    auto map_key_buffer = builder.CreateAlloca(builder.getInt64Ty());
    auto timestamp_buffer = builder.CreateAlloca(builder.getInt64Ty());
    
    auto current_pid_tgid = bpfGetCurrentPidTgid(builder);
    builder.CreateStore(current_pid_tgid, map_key_buffer);
    
    auto timestamp = bpfKtimeGetNs(builder);
    builder.CreateStore(timestamp, timestamp_buffer);
    

    For this program, we have replaced the array map we used in the previous sections with a hash map. Unlike array entries, hash map entries do not exist until they are created, so we can no longer just call the bpf_map_lookup_elem() helper and write through the returned pointer; a lookup for a key that has never been stored fails with ENOENT.

    To fix this, we have to import a new helper named bpf_map_update_elem():

    * int bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
    * 	Description
    * 		Add or update the value of the entry associated to *key* in
    * 		*map* with *value*. *flags* is one of:
    *
    * 		**BPF_NOEXIST**
    * 			The entry for *key* must not exist in the map.
    * 		**BPF_EXIST**
    * 			The entry for *key* must already exist in the map.
    * 		**BPF_ANY**
    * 			No condition on the existence of the entry for *key*.
    *
    * 		Flag value **BPF_NOEXIST** cannot be used for maps of types
    * 		**BPF_MAP_TYPE_ARRAY** or **BPF_MAP_TYPE_PERCPU_ARRAY**  (all
    * 		elements always exist), the helper would return an error.
    * 	Return
    * 		0 on success, or a negative error in case of failure.
    

    We’ll keep the map file descriptor and flag values as integers, since we know their values before the module is compiled.

    // Updates the value of the specified key inside the map_fd BPF map
    llvm::Value *bpfMapUpdateElem(llvm::IRBuilder<> &builder, int map_fd,
                                  llvm::Value *key, llvm::Value *value,
                                  std::uint64_t flags) {
    
      // clang-format off
      std::vector<llvm::Type *> argument_type_list = {
        // Map address
        builder.getInt8PtrTy(),
    
        // Key
        key->getType(),
    
        // Value
        value->getType(),
    
        // Flags
        builder.getInt64Ty()
      };
      // clang-format on
    
      auto function_type =
          llvm::FunctionType::get(builder.getInt64Ty(), argument_type_list, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_map_update_elem),
                                 llvm::PointerType::getUnqual(function_type));
    
      auto map_address = mapAddressFromFileDescriptor(map_fd, builder);
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(function_callee,
                                {map_address, key, value, builder.getInt64(flags)});
    }
    

    We can now store the timestamp inside the map and close the enter program:

    // Save the timestamp inside the map
    bpfMapUpdateElem(builder, map_fd, map_key_buffer, timestamp_buffer, BPF_ANY);
    builder.CreateRet(builder.getInt64(0));
    

    Writing the exit program

    In this program, we’ll retrieve the timestamp we stored and use it to measure how much time we’ve spent inside the system call. Once we have the result, we’ll send it to user mode using the perf output.

    When creating the llvm::Function for this program, we must define at least one argument. This value will be required later for the ctx parameter that we have to pass to the bpf_perf_event_output() helper.
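
    A possible way to declare such a function is sketched below; the function and section names are illustrative rather than the exact ones used by the sample:

    #include <llvm/IR/IRBuilder.h>
    #include <llvm/IR/Module.h>

    // Declares the exit program as i64 (*)(i8 *ctx); its single argument is the
    // tracepoint context we will later forward to bpf_perf_event_output()
    llvm::Function *declareExitProgram(llvm::Module &module) {
      auto &context = module.getContext();
      llvm::IRBuilder<> builder(context);

      auto function_type = llvm::FunctionType::get(
          builder.getInt64Ty(), {builder.getInt8PtrTy()}, false);

      auto function =
          llvm::Function::Create(function_type, llvm::Function::ExternalLinkage,
                                 "on_sys_exit", &module);

      // Keep the code in its own section so it can be found after compilation
      function->setSection("sys_exit_section");
      return function;
    }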

    First, we have to acquire the map entry; as always, we must check for any possible error or the verifier will not let us load our program.

    // Create the entry basic block
    auto entry_bb = llvm::BasicBlock::Create(context, "entry", function);
    builder.SetInsertPoint(entry_bb);
    
    // Map keys are passed by pointer; create a buffer on the stack and initialize
    // it
    auto map_key_buffer = builder.CreateAlloca(builder.getInt64Ty());
    auto current_pid_tgid = bpfGetCurrentPidTgid(builder);
    builder.CreateStore(current_pid_tgid, map_key_buffer);
    
    // Check the pointer and make sure the lookup has succeeded; this is
    // mandatory, or the BPF verifier will refuse to load our program
    auto timestamp_ptr =
        bpfMapLookupElem(builder, map_key_buffer, builder.getInt64Ty(), map_fd);
    
    auto null_ptr = llvm::Constant::getNullValue(timestamp_ptr->getType());
    auto cond = builder.CreateICmpEQ(null_ptr, timestamp_ptr);
    
    auto error_bb = llvm::BasicBlock::Create(context, "error", function);
    auto continue_bb = llvm::BasicBlock::Create(context, "continue", function);
    
    builder.CreateCondBr(cond, error_bb, continue_bb);
    
    // Terminate the program if the pointer is not valid
    builder.SetInsertPoint(error_bb);
    builder.CreateRet(builder.getInt64(0));
    
    // In this new basic block, the pointer is valid
    builder.SetInsertPoint(continue_bb);
    

    Next, we want to read our previous timestamp and subtract it from the current time:

    // Read back the old timestamp and obtain the current one
    auto enter_timestamp = builder.CreateLoad(timestamp_ptr);
    auto exit_timestamp = bpfKtimeGetNs(builder);
    
    // Measure how much time was spent servicing the system call
    auto time_consumed = builder.CreateSub(exit_timestamp, enter_timestamp);
    

    When sending data, the bpf_perf_event_output() helper expects a pointer, so we have to store our result somewhere in memory. We can re-use the map value address we obtained from the lookup and avoid allocating more of the very limited stack space we have. Remember that we also have to pass the first program argument to the ctx parameter; the arg_begin method of an llvm::Function will return exactly that:

    builder.CreateStore(time_consumed, timestamp_ptr);
    
    // Send the result to the perf event array
    auto ctx = function->arg_begin();
    bpfPerfEventOutput(builder, ctx, perf_fd, static_cast<std::uint32_t>(-1UL),
                       timestamp_ptr, builder.getInt64(8U));
    

    The flag value we pass (the lower 32 bits of -1UL, which is the special BPF_F_CURRENT_CPU constant) tells BPF to automatically send this data to the perf event output associated with the CPU we’re running on.

    Reading data from the perf outputs

    In our user mode program, we can access the perf buffers through the memory mappings we created. The list of perf event output descriptors can be passed to the poll() function as an array of pollfd structures. When one of the file descriptors we are polling becomes readable, the corresponding memory mapping will contain the data sent by the BPF program.

    // Uses poll() to wait for the next event happening on the perf event output
    bool waitForPerfData(std::vector<std::size_t> &readable_outputs,
                         const PerfEventArray &obj, int timeout) {
    
      readable_outputs = {};
    
      // Collect all the perf event output file descriptors inside a
      // pollfd structure
      std::vector<struct pollfd> poll_fd_list;
      for (auto fd : obj.output_fd_list) {
        struct pollfd poll_fd = {};
        poll_fd.fd = fd;
        poll_fd.events = POLLIN;
    
        poll_fd_list.push_back(std::move(poll_fd));
      }
    
      // Use poll() to determine which outputs are readable
      auto err = ::poll(poll_fd_list.data(), poll_fd_list.size(), timeout);
      if (err < 0) {
        if (errno == EINTR) {
          return true;
        }
    
        return false;
    
      } else if (err == 0) {
        return true;
      }
    
      // Save the index of the outputs that can be read inside the vector
      for (auto it = poll_fd_list.begin(); it != poll_fd_list.end(); ++it) {
        auto ready = ((it->revents & POLLIN) != 0);
    
        if (ready) {
          auto index = static_cast<std::size_t>(it - poll_fd_list.begin());
          readable_outputs.push_back(index);
        }
      }
    
      return true;
    }
    

    Inside the memory we have mapped, the perf_event_mmap_page header will describe the properties and boundaries of the allocated circular buffer.

    The structure is too big to be reported here, but the most important fields are:

    __u64  data_head;   /* head in the data section */
    __u64  data_tail;   /* user-space written tail */
    __u64  data_offset; /* where the buffer starts */
    __u64  data_size;   /* data buffer size */
    

    The base of the data allocation is located at the offset data_offset; to find the start of our buffer, however, we have to add it to the data_tail value, making sure to wrap around whenever we exceed the data allocation size specified by the data_size field:

    buffer_start = mapped_memory + data_offset + (data_tail % data_size)

    Similarly, the data_head field can be used to find the end of the buffer:

    buffer_end = mapped_memory + data_offset + (data_head % data_size)

    If the end of the buffer is at a lower offset compared to the start, then data is wrapping at the data_size edge and the read has to happen with two operations.

    When extracting data, the program is expected to confirm the read by updating the data_tail value and adding the number of bytes processed, while the kernel will advance the data_head field automatically as new bytes are received. Data is lost when the data_head offset wraps around and crosses data_tail; a special structure inside this buffer will warn the program if this happens.

    Program data is packaged inside the data we have just extracted, preceded by two headers. The first one is the perf_event_header structure:

    struct perf_event_header {
      u32 type;
      u16 misc;
      u16 size;
    };
    

    The second one is an additional 32-bit size field that accounts for itself plus the data that follows. Multiple consecutive writes from the BPF program may end up grouped under the same header. Records are, however, grouped by type, which tells us what kind of data to expect after the header. When using BPF, we only have to deal with either our own data or a notification of type PERF_RECORD_LOST, which informs the program that a bpf_perf_event_output() call has overwritten data in the ring buffer before we had a chance to read it.

    Here’s some annotated code that shows how the whole procedure works:

    using PerfBuffer = std::vector<std::uint8_t>;
    using PerfBufferList = std::vector<PerfBuffer>;
    
    // Reads from the specified perf event array, appending new bytes to the
    // perf_buffer_context. When a new complete buffer is found, it is moved
    // inside the 'data' vector
    bool readPerfEventArray(PerfBufferList &data,
                            PerfBufferList &perf_buffer_context,
                            const PerfEventArray &obj, int timeout) {
    
      // Keep track of the offsets we are interested in to avoid
      // strict aliasing issues
      static const auto kDataOffsetPos{
          offsetof(struct perf_event_mmap_page, data_offset)};
    
      static const auto kDataSizePos{
          offsetof(struct perf_event_mmap_page, data_size)};
    
      static const auto kDataTailPos{
          offsetof(struct perf_event_mmap_page, data_tail)};
    
      static const auto kDataHeadPos{
          offsetof(struct perf_event_mmap_page, data_head)};
    
      data = {};
    
      if (perf_buffer_context.empty()) {
        auto processor_count = getProcessorCount();
        perf_buffer_context.resize(processor_count);
      }
    
      // Use poll() to determine which perf event outputs are readable
      std::vector<std::size_t> readable_outputs;
      if (!waitForPerfData(readable_outputs, obj, timeout)) {
        return false;
      }
    
      for (auto perf_output_index : readable_outputs) {
        // Read the static header fields
        auto perf_memory = static_cast<std::uint8_t *>(
            obj.mapped_memory_pointers.at(perf_output_index));
    
        std::uint64_t data_offset{};
        std::memcpy(&data_offset, perf_memory + kDataOffsetPos, 8U);
    
        std::uint64_t data_size{};
        std::memcpy(&data_size, perf_memory + kDataSizePos, 8U);
    
        auto edge = perf_memory + data_offset + data_size;
    
        for (;;) {
          // Read the dynamic header fields
          std::uint64_t data_head{};
          std::memcpy(&data_head, perf_memory + kDataHeadPos, 8U);
    
          std::uint64_t data_tail{};
          std::memcpy(&data_tail, perf_memory + kDataTailPos, 8U);
    
          if (data_head == data_tail) {
            break;
          }
    
          // Determine where the buffer starts and where it ends, taking into
          // account the fact that it may wrap around
          auto start = perf_memory + data_offset + (data_tail % data_size);
          auto end = perf_memory + data_offset + (data_head % data_size);
    
          auto byte_count = data_head - data_tail;
          auto read_buffer = PerfBuffer(byte_count);
    
          if (end < start) {
            auto bytes_until_wrap = static_cast<std::size_t>(edge - start);
            std::memcpy(read_buffer.data(), start, bytes_until_wrap);
    
            auto remaining_bytes =
                static_cast<std::size_t>(end - (perf_memory + data_offset));
    
            std::memcpy(read_buffer.data() + bytes_until_wrap,
                        perf_memory + data_offset, remaining_bytes);
    
          } else {
            std::memcpy(read_buffer.data(), start, byte_count);
          }
    
          // Append the new data to our perf buffer
          auto &perf_buffer = perf_buffer_context[perf_output_index];
    
          auto insert_point = perf_buffer.size();
          perf_buffer.resize(insert_point + read_buffer.size());
    
          std::memcpy(&perf_buffer[insert_point], read_buffer.data(),
                      read_buffer.size());
    
          // Confirm the read
          std::memcpy(perf_memory + kDataTailPos, &data_head, 8U);
        }
      }
    
      // Extract the data from the buffers we have collected
      for (auto &perf_buffer : perf_buffer_context) {
        // Get the base header
        struct perf_event_header header = {};
        if (perf_buffer.size() < sizeof(header)) {
          continue;
        }
    
        std::memcpy(&header, perf_buffer.data(), sizeof(header));
        if (header.size > perf_buffer.size()) {
          continue;
        }
    
        if (header.type == PERF_RECORD_LOST) {
          std::cout << "One or more records have been lost\n";
    
        } else {
          // Determine the buffer boundaries
          auto buffer_ptr = perf_buffer.data() + sizeof(header);
          auto buffer_end = perf_buffer.data() + header.size;
    
          for (;;) {
            if (buffer_ptr + 4U >= buffer_end) {
              break;
            }
    
            // Note: this is data_size itself + bytes used for the data
            std::uint32_t data_size = {};
            std::memcpy(&data_size, buffer_ptr, 4U);
    
            buffer_ptr += 4U;
            data_size -= 4U;
    
            if (buffer_ptr + data_size >= buffer_end) {
              break;
            }
    
            auto program_data = PerfBuffer(data_size);
            std::memcpy(program_data.data(), buffer_ptr, data_size);
            data.push_back(std::move(program_data));
    
            // Advance past the record we just consumed
            buffer_ptr += data_size;
          }
        }
    
        // Erase the chunk we consumed from the buffer
        perf_buffer.erase(perf_buffer.begin(), perf_buffer.begin() + header.size);
      }
    
      return true;
    }
    

    Writing the main function

    While it is entirely possible (and sometimes useful, in order to share types) to use a single LLVM module and context for both the enter and exit programs, we will create two different modules to avoid changing the previous sample code we’ve built.

    The program generation goes through the usual steps, but now we are loading two programs instead of one, so the previous code has been updated to reflect that.
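
    In rough terms, the setup for each of the two programs can be wrapped in a helper like the sketch below. The generateEnterModule and generateExitModule names are placeholders for the two generator functions, the other helpers are the ones defined earlier in this post, and we assume both modules place their code in the same bpf_main_section section used before:

    // Sketch: generate, compile, load, and attach one of the two programs
    bool setupProgram(llvm::LLVMContext &context, const std::string &event_name,
                      bool is_exit_program, int map_fd, int perf_fd) {
      auto module = is_exit_program
                        ? generateExitModule(context, map_fd, perf_fd)
                        : generateEnterModule(context, map_fd);

      auto section_map = compileModule(std::move(module));

      auto program_fd = loadProgram(section_map.at("bpf_main_section"));
      if (program_fd < 0) {
        return false;
      }

      auto event_fd = createTracepointEvent(event_name);
      if (event_fd < 0) {
        return false;
      }

      return attachProgramToEvent(event_fd, program_fd);
    }

    // ...inside main(), after the hash map and the perf event array are ready
    if (!setupProgram(context, std::string("syscalls/sys_enter_") + kSyscallName,
                      false, map_fd, -1) ||
        !setupProgram(context, std::string("syscalls/sys_exit_") + kSyscallName,
                      true, map_fd, perf_event_array.fd)) {
      return 1;
    }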

    The new and interesting part is the main loop where the perf event output data is read and processed:

    // Incoming data is appended here
    PerfBufferList perf_buffer;
    
    std::uint64_t total_time_used{};
    std::uint64_t sample_count{};
    
    std::cout << "Tracing average time used to service the following syscall: "
              << kSyscallName << "\n";
    
    std::cout << "Collecting samples for 10 seconds...\n";
    
    auto start_time = std::chrono::system_clock::now();
    
    for (;;) {
      // Data that is ready for processing is moved inside here
      PerfBufferList data;
      if (!readPerfEventArray(data, perf_buffer, perf_event_array, 1)) {
        std::cerr << "Failed to read from the perf event array\n";
        return 1;
      }
    
      // Inspect the buffers we have received
      for (const auto &buffer : data) {
        if (buffer.size() != 8U) {
          std::cout << "Unexpected buffer size: " << buffer.size() << "\n";
          continue;
        }
    
        // Read each sample and update the counters; use memcpy to avoid
        // strict aliasing issues
        std::uint64_t time_used{};
        std::memcpy(&time_used, buffer.data(), 8U);
    
        total_time_used += time_used;
        ++sample_count;
    
        std::cout << time_used << "ns\n";
      }
    
      // Exit after 10 seconds
      auto elapsed_msecs =
          std::chrono::duration_cast<std::chrono::milliseconds>(
              std::chrono::system_clock::now() - start_time)
              .count();
    
      if (elapsed_msecs > 10000) {
        break;
      }
    }
    
    // Print a summary of the data we have collected
    std::cout << "Total time used: " << total_time_used << " nsecs\n";
    std::cout << "Sample count: " << sample_count << "\n";
    std::cout << "Average: " << (total_time_used / sample_count) << " nsecs\n";
    

    The full source code can be found in the 03-syscall_profiler folder.

    Running the sample program as root should print something similar to the following output:

    Tracing average time used to service the following syscall: fchmodat
    Collecting samples for 10 seconds...
    178676ns
    72886ns
    80481ns
    147897ns
    171152ns
    80803ns
    69208ns
    75273ns
    76981ns
    Total time used: 953357 nsecs
    Sample count: 9
    Average: 105928 nsecs
    

    Writing a BPF program to do ANYTHING

    BPF is in active development and is becoming more and more useful with each update, enabling new use cases that extend the original vision. Recently, the addition of bpf_override_return allowed us to write a simple system-wide syscall fault injector using nothing but BPF and a kernel that supports that functionality.

    If you want to keep up with how this technology evolves, one of the best places to start is Brendan Gregg’s blog. The IO Visor Project repository also contains a ton of code and documentation that is extremely useful if you plan on writing your own BPF-powered tools.

    Want to integrate BPF into your products? We can help! Contact us today, and check out our ebpfpub library.

  • PrivacyRaven: Implementing a proof of concept for model inversion

    By Philip Wang, Intern
    Originally published August 3, 2021

    During my Trail of Bits winternship and springternship, I had the pleasure of working with Suha Hussain and Jim Miller on PrivacyRaven, a Python-based tool for testing deep-learning frameworks against a plethora of privacy attacks. I worked on improving PrivacyRaven’s versatility by adding compatibility for services such as Google Colab and expanding its privacy attack and assurance functionalities.

    What is PrivacyRaven?

    PrivacyRaven is a machine-learning assurance and research tool that simulates privacy attacks against trained machine-learning models. It supports model extraction and label-only membership inference attacks, with support for model inversion currently in development.

    In a model extraction attack, a user attempts to steal or extract a trained deep-learning model that outputs either a probability vector corresponding to each class or a simple classification. For instance, consider a classifier that detects particular emotions in human faces. This classifier will return either a vector specifying the likelihood of each emotion or simply the most likely emotion as its classification. Importantly, the user will have only black-box query access to the classifier and will not receive any other information about it.

    To perform a model extraction attack, the user first queries the model with random unlabeled data to identify all of the classifications returned by the model. This information can then be used to approximate the target classifier. Specifically, the attacker uses a public data source to obtain synthetic data (i.e., similar data such as a dataset of facial images and emotion classifications) and trains a substitute fixed-architecture model on that data.

    If successful, such an attack could have drastic consequences, especially for services that hide their actual models behind a paid API; for example, an attacker, for a low cost, could approximate a service’s model with a small decrease in accuracy, thereby gaining a massive economic advantage over the victim service. Since PrivacyRaven operates under the most restrictive threat model, simulating attacks in which the user has only query access, model extraction is also a critical component of PrivacyRaven’s other attacks (i.e., membership inference attacks).

    I worked on implementing support for a model inversion attack aimed at recovering private data used to train a target model. The model inversion attack uses model extraction to secure white-box access to a substitute classifier that faithfully approximates the target.

    What is model inversion?

    In a model inversion attack, a malicious user targets a classifier that predicts a vector of confidence values for each class; the user then attempts to recover the classifier’s training data, compromising its privacy.

    It turns out that a user with background knowledge about the model may actually be able to obtain a reasonable approximation of the target model’s training data. (For an example of such an approximation, see the image recovered from a facial recognition system in the figure below.) This is the core idea of several papers, including “Adversarial Neural Network Inversion via Auxiliary Knowledge Alignment,” by Ziqi Yang et al., which formed the basis of my implementation of model inversion.

    Model inversion attacks based on background knowledge alignment have plenty of use cases and consequences in the real world. Imagine that a malicious user targeted the aforementioned facial emotion classifier. The user could construct his or her own dataset by scraping relevant images from a search engine, run the images through the classifier to obtain their corresponding confidence values, and construct probability vectors from the values; the user could then train an inversion model capable of reconstructing approximations of the images from the given vectors.

    An inversion model is designed to be the inverse of the target model (hence the use of “inversion”). Instead of inputting images and receiving an emotion classification, the attacker can supply arbitrary emotion-prediction vectors and obtain reconstructed images from the training set.

    Using the above example, let’s walk through how the user would run a model inversion attack in greater detail. Assume that the user has access to an emotion classifier that outputs confidence values for some of its classes and knows that the classifier was trained on images of faces; the user therefore has background knowledge on the classifier.

    The user, via this background knowledge on the model, creates an auxiliary dataset by scraping images of faces from public sites and processing them. The user also chooses an inversion model architecture capable of upscaling the constructed prediction vectors to a “reconstructed” image.

    To train the inversion model, the user queries the classifier with each image in the auxiliary dataset to obtain the classifier’s confidence values for the images. That information is used to construct a prediction vector. Since this classifier might output only the top confidence values for the input, the inversion process assumes that it does in fact truncate the prediction vectors and that the rest of the entries are zeroed out. For example, if the classifier is trained on 5 emotions but outputs only the top 2, with confidence values of 0.5 and 0.3, the user will be able to construct the vector (0.5, 0, 0.3, 0, 0) from those values.

    The user can then input the prediction vector into the inversion model, which will upscale the vector to an image. As a training objective, the user would like to minimize the inversion model’s mean squared error (MSE) loss function, which is calculated pixelwise between images in the auxiliary set and their reconstructions outputted by the model; the user then repeats this training process for many epochs.
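
    For reference, the pixelwise objective can be written as follows (this notation is ours, not the paper’s), where x is an auxiliary image with N pixels and x̂ is the inversion model’s reconstruction of it:

    \mathrm{MSE}(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2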

    An MSE close to 0 means that the reconstructed image is a sound approximation of the ground truth, and an MSE of 0 means that the reconstructed and ground truth images are identical. Once the model has been sufficiently trained, the user can feed prediction vectors into the trained inversion model to obtain a reconstructed data point representative of each class.

    Note that the model inversion architecture itself is similar to an autoencoder, as mentioned in the paper. Specifically, the classifier may be thought of as the encoder, and the inversion network, as the decoder, with the prediction vector belonging to the latent space. The key differences are that, in model inversion, the classifier is given and fixed, and the training data of the classifier is not available to train the inversion network.

    My other contributions to PrivacyRaven

    While I focused on implementing model inversion support in PrivacyRaven, in the first few weeks of my internship, I helped improve PrivacyRaven’s documentation and versatility. To better demonstrate some of PrivacyRaven’s capabilities, I added detailed example Python scripts; these include scripts showcasing how to mount attacks and register custom callbacks to obtain more thorough information during attack runs. I also added Docker and Google Colab support, which allows users to containerize and ship attack setups as well as to coordinate and speed up an attack’s development.

    Caveats and room to grow

    PrivacyRaven is designed around usability and is intended to abstract away as much of the tedium of data and neural network engineering as possible. However, model inversion is a relatively fragile attack that depends on fine-tuning certain parameters, including the output dimensionality of both the classifier and inversion model as well as the inversion model’s architecture. Thus, striking a balance between usability and model inversion fidelity proved to be a challenge.

    Another difficulty stemmed from the numerous assumptions that need to be satisfied for a model inversion attack to produce satisfactory results. One assumption is that the user will be able to recover the number of classes that the target classifier was trained on by querying it with numerous data points.

    For example, if the user were trying to determine the number of classes that an object recognition classifier was trained on, the user could query the classifier with a large number of images of random objects and add the classes identified in that process to a set. However, this might not always work if the number of training classes is large; if the user isn’t able to recover all of the classes, the quality of the inversion will likely suffer, though the paper by Yang et al. doesn’t analyze the extent of the impact on inversion quality.
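
    A simple version of that probing step might look like the following sketch (query_classifier and probe_images are assumed helpers, not PrivacyRaven functions):

    def estimate_num_classes(query_classifier, probe_images):
        """Estimate how many classes a black-box classifier was trained on by
        recording every distinct label observed across many probe queries."""
        observed_labels = set()
        for image in probe_images:
            observed_labels.add(query_classifier(image))  # label of the top class
        return len(observed_labels)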

    In addition, Yang et al. were not explicit about the reasoning behind the design of their classifier and inversion model architectures. When conducting their experiments, the authors used the CelebA and MNIST datasets and resized the images in them. They also used two separate inversion architectures for the datasets, with the CelebA inversion architecture upscaling from a prediction vector of length 530 to a 64 x 64 image and the MNIST inversion architecture upscaling from a prediction vector of length 10 to a 32 x 32 image. As you can imagine, generalizing this attack such that it can be used against arbitrary classifiers is difficult, as the optimal inversion architecture changes for each classifier.

    Finally, the authors focused on model inversion in a white-box scenario, which isn’t directly adaptable to PrivacyRaven’s black-box-only threat model. As previously mentioned, PrivacyRaven assumes that the user has no knowledge about the classifier beyond its output; while the general model inversion process remains largely the same, a black-box scenario requires the user to make many more assumptions, particularly on the dimensions of the training data and the classifier’s output. Each additional assumption about dimensionality needs to be considered and addressed, and this inherent need for customization makes designing a one-size-fits-all API for model inversion very difficult.

    Next steps

    PrivacyRaven does not yet have a fully stable API for model inversion, but I have completed a proof-of-concept implementation of the paper. Some design decisions for the model inversion’s API still need to mature, but the plan is for the API to support both white-box and black-box model inversion attacks and to make the model inversion and extraction parameters as customizable as possible without sacrificing usability. I believe that, with this working proof of concept of model inversion, the development of the model inversion API should be a relatively smooth process.

    Inversion results

    The inversion results produced by the proof of concept are displayed below. This attack queries an MNIST-trained victim classifier with extended-MNIST (EMNIST) data to train a substitute model that can then be used to perform a white-box inversion attack. The inversion model was trained for 300 epochs with a batch size of 100, producing a final MSE loss of 0.706. The inversion quality changed substantially depending on the orientation of the auxiliary set images. The images on the left are samples of the auxiliary set images taken from the MNIST set, with their labels in parentheses; their reconstructions are on the right.

    Images of a 0, for example, tended to have fairly accurate reconstructions:

    Other auxiliary set images had reconstructions that looked similar to them but contradicted their labels. For example, the below images appear to depict a 4 but in reality depict a 2 rotated by 90 degrees.

    Other images also had poor or ambiguous reconstructions, such as the following rotated images of an 8 and 9.

    Overall, these results demonstrate that model inversion is a fragile attack that doesn’t always produce high-quality reconstructions. However, one must also consider that the above inversion attack was conducted using only black-box queries to the classifier. Recall that in PrivacyRaven’s current model inversion pipeline, a model extraction attack is first executed to grant the user access to a white-box approximation of the target classifier. Since information is lost during both model extraction and inversion, the reconstruction quality of a black-box model inversion attack is likely to be significantly worse than that of its white-box counterpart. As such, the inversion model’s ability to produce faithful reconstructions for some images even under the most restrictive assumptions does raise significant privacy concerns for the training data of deep-learning classifiers.

    Takeaways

    I thoroughly enjoyed working on PrivacyRaven, and I appreciate the support and advice that Jim and Suha offered me to get me up to speed. I am also grateful for the opportunity to learn about the intersection of machine learning and security, particularly in privacy assurance, and to gain valuable experience with deep-learning frameworks including PyTorch. My experience working in machine-learning assurance kindled within me a newfound interest in the field, and I will definitely be delving further into deep learning and privacy in the future.

    Write Rust lints without forking Clippy

    By Samuel Moelius, Staff Engineer
    Originally published May 20, 2021

    This blog post introduces Dylint, a tool for loading Rust linting rules (or “lints”) from dynamic libraries. Dylint makes it easy for developers to maintain their own personal lint collections.

    Previously, the simplest way to write a new Rust lint was to fork Clippy, Rust’s de facto linting tool. But this approach has drawbacks in how one runs and maintains new lints. Dylint minimizes these distractions so that developers can focus on actually writing lints.

    First, we’ll go over the current state of Rust linting and the workings of Clippy. Then, we’ll explain how Dylint improves upon the status quo and offer some tips on how to begin using it. Skip to the last section if you want to get straight to writing lints.

    Rust linting and Clippy

    Tools like Clippy take advantage of the Rust compiler’s dedicated support for linting. A Rust linter’s core component, called a “driver,” links against an appropriately named library, rustc_driver. By doing so, the driver essentially becomes a wrapper around the Rust compiler.

    To run the linter, one sets the RUSTC_WORKSPACE_WRAPPER environment variable to point to the driver and then runs cargo check. Cargo notices that the environment variable has been set and calls the driver instead of calling rustc. When the driver is called, it sets a callback in the Rust compiler’s Config struct. The callback registers some number of lints, which the Rust compiler then runs alongside its built-in lints.
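
    As a rough illustration of that mechanism (this is not Dylint’s or Clippy’s actual code, and the driver path is hypothetical), a harness could be as simple as:

    import os
    import subprocess

    def run_linter(driver_path, project_dir):
        """Run `cargo check` with a rustc driver substituted via RUSTC_WORKSPACE_WRAPPER."""
        env = dict(os.environ)
        # Cargo invokes the wrapper (passing the path to rustc as its first argument)
        # for workspace members instead of calling rustc directly.
        env["RUSTC_WORKSPACE_WRAPPER"] = driver_path
        subprocess.run(["cargo", "check"], cwd=project_dir, env=env, check=True)

    # Example (hypothetical path):
    # run_linter("/path/to/my-driver", ".")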

    Clippy performs a few checks to ensure it is enabled but otherwise works in the above manner. (See Figure 1 for Clippy’s architecture.) Although it may not be immediately clear upon installation, Clippy is actually two binaries: a Cargo command and a rustc driver. You can verify this by typing the following:

    which cargo-clippy
    which clippy-driver
    

    Now suppose you want to write your own lints. What should you do? Well, you’ll need a driver to run them, and Clippy has a driver, so forking Clippy seems like a reasonable step to take. But there are drawbacks to this solution, namely in running and maintaining the lints that you’ll develop.

    First, your fork will have its own copies of the two binaries, and it’s a hassle to ensure that they can be found. You’ll have to make sure that at least the Cargo command is in your PATH, and you’ll probably have to rename the binaries so that they won’t interfere with Clippy. Clearly, these steps don’t pose insurmountable problems, but you would probably rather avoid them.

    Second, all lints (including Clippy’s) are built upon unstable compiler APIs. Lints compiled together must use the same version of those APIs. To understand why this is an issue, we’ll refer to clippy_utils—a collection of utilities that the Clippy authors have generously made public. Note that clippy_utils uses the same compiler APIs that lints do, and similarly provides no stability guarantees. (See below.)

    Suppose you have a fork of Clippy to which you want to add a new lint. Clearly, you’ll want your new lint to use the most recent version of clippy_utils. But suppose that version uses compiler version B, while your fork of Clippy uses compiler version A. Then you’ll be faced with a dilemma: Should you use an older version of clippy_utils (one that uses compiler version A), or should you upgrade all of the lints in your fork to use compiler version B? Neither is a desirable choice.

    Dylint addresses both of these problems. First, it provides a single Cargo command, saving you from having to manage multiple such commands. Second, for Dylint, lints are compiled together to produce dynamic libraries. So in the above situation, you could simply store your new lint in a new dynamic library that uses compiler version B. You could use this new library alongside your existing libraries for as long as you’d like and upgrade your existing libraries to the newer compiler version if you so choose.

    Dylint provides an additional benefit related to reusing intermediate compilation results. To understand it, we need to examine how Dylint works.

    How Dylint works

    Like Clippy, Dylint provides a Cargo command. The user specifies to that command the dynamic libraries from which the user wants to load lints. Dylint runs cargo check in a way that ensures the lints are registered before control is handed over to the Rust compiler.

    The lint-registration process is more complicated for Dylint than for Clippy, however. All of Clippy’s lints use the same compiler version, so only one driver is needed. But a Dylint user could choose to load lints from libraries that use different compiler versions.

    Dylint handles such situations by building new drivers on-the-fly as needed. In other words, if a user wants to load lints from a library that uses compiler version A and no driver can be found for compiler version A, Dylint will build a new one. Drivers are cached in the user’s home directory, so they are rebuilt only when necessary.

    This brings us to the additional benefit alluded to in the previous section. Dylint groups libraries by the compiler version they use. Libraries that use the same compiler version are loaded together, and their lints are run together. This allows intermediate compilation results (e.g., symbol resolution, type checking, trait solving, etc.) to be shared among the lints.

    For example, in Figure 2, if libraries U and V both used compiler version A, the libraries would be grouped together. The driver for compiler version A would be invoked only once. The driver would register the lints in libraries U and V before handing control over to the Rust compiler.

    To understand why this approach is beneficial, consider the following. Suppose that lints were stored directly in compiler drivers rather than dynamic libraries, and recall that a driver is essentially a wrapper around the Rust compiler. So if one had two lints in two compiler drivers that used the same compiler version, running those two drivers on the same code would amount to compiling that code twice. By storing lints in dynamic libraries and grouping them by compiler version, Dylint avoids these inefficiencies.

    An application: Project-specific lints

    Did you know that Clippy contains lints whose sole purpose is to lint Clippy’s code? It’s true. Clippy contains lints to check, for example, that every lint has an associated LintPass, that certain Clippy wrapper functions are used instead of the functions they wrap, and that every lint has a non-default description. It wouldn’t make sense to apply these lints to code other than Clippy’s. But there’s no rule that all lints must be general purpose, and Clippy takes advantage of this liberty.

    Dylint similarly includes lints whose primary purpose is to lint Dylint’s code. For example, while developing Dylint, we found ourselves writing code like the following:

    let rustup_toolchain = std::env::var("RUSTUP_TOOLCHAIN")?;
    ...
    std::env::remove_var("RUSTUP_TOOLCHAIN");

    This was bad practice. Why? Because it was only a matter of time until we fat-fingered the string literal:

    std::env::remove_var("RUSTUP_TOOLCHIAN"); // Oops

    A better approach is to use a constant instead of a string literal, as in the below code:

    const RUSTUP_TOOLCHAIN: &str = "RUSTUP_TOOLCHAIN";
    ...
    std::env::remove_var(RUSTUP_TOOLCHAIN);
    

    So while working on Dylint, we wrote a lint to check for this bad practice and to make an appropriate suggestion. We applied (and still apply) that lint to the Dylint source code. The lint is called env_literal and the core of its current implementation is as follows:

    impl<'tcx> LateLintPass<'tcx> for EnvLiteral {
       fn check_expr(&mut self, cx: &LateContext<'tcx>, expr: &Expr<'_>) {
          if_chain! {
             if let ExprKind::Call(callee, args) = expr.kind;
             if is_expr_path_def_path(cx, callee, &REMOVE_VAR)
                || is_expr_path_def_path(cx, callee, &SET_VAR)
                || is_expr_path_def_path(cx, callee, &VAR);
             if !args.is_empty();
             if let ExprKind::Lit(lit) = &args[0].kind;
             if let LitKind::Str(symbol, _) = lit.node;
             let ident = symbol.to_ident_string();
             if is_upper_snake_case(&ident);
             then {
                span_lint_and_help(
                   cx,
                   ENV_LITERAL,
                   args[0].span,
                   "referring to an environment variable with a string literal is error prone",
                   None,
                   &format!("define a constant `{}` and use that instead", ident),
                );
             }
          }
       }
    }

    Here is an example of a warning it could produce:

    warning: referring to an environment variable with a string literal is error prone
     --> src/main.rs:2:27
      |
    2 |     let _ = std::env::var("RUSTFLAGS");
      |                           ^^^^^^^^^^^
      |
      = note: `#[warn(env_literal)]` on by default
      = help: define a constant `RUSTFLAGS` and use that instead
    

    Recall that neither the compiler nor clippy_utils provides stability guarantees for its APIs, so future versions of env_literal may look slightly different. (In fact, a change to clippy_utils’ APIs resulted in a change to env_literal’s implementation while this article was being written!) The current version of env_literal can always be found in the examples directory of the Dylint repository.

    Clippy “lints itself” in a slightly different way than Dylint, however. Clippy’s internal lints are compiled into a version of Clippy with a particular feature enabled. But for Dylint, the env_literal lint is compiled into a dynamic library. Thus, env_literal is not a part of Dylint. It’s essentially input.

    Why is this important? Because you can write custom lints for your project and use Dylint to run them just as Dylint runs its own lints on itself. There’s nothing significant about the source of the lints that Dylint runs on the Dylint repository. Dylint would just as readily run your repository’s lints on your repository.

    The bottom line is this: If you find yourself writing code you do not like and you can detect that code with a lint, Dylint can help you weed out that code and prevent its reintroduction.

    Get to linting

    Install Dylint with the following command:

    cargo install cargo-dylint

    We also recommend installing the dylint-link tool to facilitate linking:

    cargo install dylint-link

    The easiest way to write a Dylint library is to fork the dylint-template repository. The repository produces a loadable library right out of the box. You can verify this as follows:

    git clone https://github.com/trailofbits/dylint-template
    cd dylint-template
    cargo build
    DYLINT_LIBRARY_PATH=$PWD/target/debug cargo dylint fill_me_in --list

    All you have to do is implement the LateLintPass trait and accommodate the symbols asking to be filled in.

    Helpful resources for writing lints include the following:

    Adding a new lint (targeted at Clippy but still useful)

    Also consider using the clippy_utils crate mentioned above. It includes functions for many low-level tasks, such as looking up symbols and printing diagnostic messages, and makes writing lints significantly easier.

    We owe a sincere thanks to the Clippy authors for making the clippy_utils crate available to the Rust community. We would also like to thank Philipp Krones for providing helpful comments on an earlier version of this post.

    Discovering goroutine leaks with Semgrep

    By Alex Useche, Security Engineer
    Originally published May 10, 2021

    While learning how to write multithreaded code in Java or C++ can make computer science students reconsider their career choices, calling a function asynchronously in Go is just a matter of prefixing a function call with the go keyword. However, writing concurrent Go code can also be risky, as vicious concurrency bugs can slowly sneak into your application. Before you know it, there could be thousands of hanging goroutines slowing down your application, ultimately causing it to crash. This blog post provides a Semgrep rule that can be used in a bug-hunting quest and includes a link to a repository of specialized Semgrep rules that we use in our audits. It also explains how to use one of those rules to find a particularly pesky type of bug in Go: goroutine leaks.

    The technique described in this post is inspired by GCatch, a tool that uses interprocedural analysis and the Z3 solver to detect misuse-of-channel bugs that may lead to hanging goroutines. The technique and development of the tool are particularly exciting because of the lack of research on concurrency bugs caused by the incorrect use of Go-specific structures such as channels.

    Although the process of setting up this sort of tool, running it, and using it in a practical context is inherently complex, it is worthwhile. When we closely analyzed confirmed bugs reported by GCatch, we noticed patterns in their origins. We were then able to use those patterns to discover alternative ways of identifying instances of these bugs. Semgrep, as we will see, is a good tool for this job, given its speed and the ability to easily tweak Semgrep rules.

    Goroutine leaks explained

    Perhaps the best-known concurrency bugs in Go are race conditions, which often result from improper memory aliasing when working with goroutines inside of loops. Goroutine leaks are also common concurrency bugs but are seldom discussed, partially because their consequences become apparent only once many leaked goroutines have accumulated and begun to affect performance and reliability in a noticeable way.

    Goroutine leaks typically result from the incorrect use of channels to synchronize a message passed between goroutines. This problem often occurs when unbuffered channels are used for logic in cases when buffered channels should be used. This type of bug may cause goroutines to hang in memory and eventually exhaust a system’s resources, resulting in a system crash or a denial-of-service condition.

    Let’s look at a practical example:

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    func main() {
        requestData(1)
        time.Sleep(time.Second * 1)
        fmt.Printf("Number of hanging goroutines: %d", runtime.NumGoroutine() - 1)
    }

    func requestData(timeout time.Duration) string {
        dataChan := make(chan string)

        go func() {
            newData := requestFromSlowServer()
            dataChan <- newData // block
        }()
        select {
        case result := <-dataChan:
            fmt.Printf("[+] request returned: %s", result)
            return result
        case <-time.After(timeout):
            fmt.Println("[!] request timeout!")
            return ""
        }
    }

    func requestFromSlowServer() string {
        time.Sleep(time.Second * 1)
        return "very important data"
    }

    In the above code, the channel write operation dataChan <- newData blocks the anonymous goroutine that encloses it. That goroutine will remain blocked until a read operation occurs on dataChan. This is because read and write operations on unbuffered channels block goroutines, and every write operation must have a corresponding read operation.

    There are two scenarios that cause anonymous goroutine leaks:

    • If the second case, case <- time.After(timeout), occurs before the read operation on dataChan, the requestData function will exit, and the anonymous goroutine inside of it will be leaked.
    • If both cases are triggered at the same time, the scheduler will randomly select one of the two cases. If the second case is selected, the anonymous goroutine will be leaked.

    When running the code, you’ll get the following output:

    [!] request timeout!
    Number of hanging goroutines: 1
    Program exited.

    The hanging goroutine is the anonymous goroutine declared inside requestData.

    Using buffered channels would fix the above issue. While reading or writing to an unbuffered channel results in a goroutine block, executing a send (a write) to a buffered channel results in a block only when the channel buffer is full. Similarly, a receive operation will cause a block only when the channel buffer is empty.

    To prevent a goroutine leak, all we need to do is give dataChan a buffer capacity when it is declared, which gives us the following:

    func requestData(timeout time.Duration) string {
        dataChan := make(chan string, 1)

        go func() {
            newData := requestFromSlowServer()
            dataChan <- newData // no longer blocks: the buffer has room for one value
        }()
    

    After running the updated program, we can confirm that there are no more hanging goroutines.

    [!] request timeout!
    Number of hanging goroutines: 0
    Program exited.
    

    This bug may seem minor, but in certain situations, leaked goroutines can have serious consequences. For an example of a goroutine leak, see this PR in the Kubernetes repository: the author of that patch experienced an API server crash resulting from a goroutine leak while the server was running 1,496 goroutines.

    Finding the bug

    The process of debugging concurrency issues is so complex that a tool like Semgrep may seem ill-equipped for it. However, when we closely examined common Go concurrency bugs found in the wild, we identified patterns that we could easily leverage to create Semgrep rules. Those rules enabled us to find even complex bugs of this kind, largely because Go concurrency bugs can often be described by a few sets of simple patterns.

    Before using Semgrep, it is important to recognize the limitations on the types of issues that it can solve. When searching for concurrency bugs, the most significant limitation is Semgrep’s inability to conduct interprocedural analysis. This means that we’ll need to target bugs that are contained within individual functions. This is a manageable problem when working in Go and won’t prevent us from using Semgrep, since Go programmers often rely on anonymous goroutines defined within individual functions.

    Now we can begin to construct our Semgrep rule, basing it on the following typical manifestation of a goroutine leak:

    1. An unbuffered channel, C, of type T is declared.
    2. A write/send operation to channel C is executed in an anonymous goroutine, G.
    3. C is read/received in a select block (or another location outside of G).
    4. The program follows an execution path in which the read operation of C does not occur before the enclosing function is terminated.

    It is the last step that generally causes a goroutine leak.

    Bugs that result from the above conditions tend to cause patterns in the code, which we can detect using Semgrep. Regardless of the forms that these patterns take, there will be an unbuffered channel declared in the program, which we’ll want to analyze:

    - pattern-inside: |
           $CHANNEL := make(...)
           ...
    

    We’ll also need to exclude instances in which the channel is declared as a buffered channel:

    - pattern-not-inside: |
           $CHANNEL := make(..., $T)
           ...
    

    To detect the goroutine leak from our example, we can use the following pattern:

    - pattern: |
             go func(...){
               ...
               $CHANNEL <- $X
               ...
             }(...)
             ...
             select {
             case ...
             case $Y := <- $CHANNEL:
             ...
             }
    

    This code tells Semgrep to look for a send operation to the unbuffered channel, $CHANNEL, executed inside an anonymous goroutine, as well as a subsequent receive operation inside of a select block. Slight variations in this pattern may occur, so we will need to account for those in our Semgrep rule.

    For instance, the receive expression could use the assignment (=) operator rather than the declaration (:=) operator, which would require a new pattern. We won't go over every possible variation here, but you can skip ahead and view the completed rule if you’d like. The finished rule also includes cases in which there could be more send operations than receive operations for an unbuffered channel, $CHANNEL.

    We should also exclude instances of false positives, such as that from the Moby repository, in which a read operation on the blocking channel prevents the channel from causing a block before exiting the function.

    - pattern-not-inside: |
           ...
           select {
           case ...
           case ...:
             ...
             ... =<- $CHANNEL
             ...
           }
    - pattern-not-inside: |
           ...
           select {
           case ...
           case ...:
             ...
             <-$CHANNEL
             ...
           }
    

    Once we have completed our pattern, we can run it against the code. (Try it out using the Semgrep playground.) Running the pattern from the command line returns the following output:

    $ semgrep --config ./hanging-goroutine.yml
    
    running 1 rules...
    test.go
    severity:warning rule:semgrep.go.hanging-goroutine: Potential goroutine leak due to unbuffered channel send inside loop or unbuffered channel read in select block.
    
    18: go func() {
    19:     newData := requestFromSlowServer()
    20:     dataChan <- newData // block
    21: }()
    22: select {
    23: case result := <-dataChan:
    24:     fmt.Printf("[+] request returned: %s", result)
    25:     return result
    26: case <-time.After(timeout):
    27:     fmt.Println("[!] request timeout!")
    28:     return ""
    

    We ran this pattern against an unpatched release of the Docker codebase and compared the matches with those reported by GCatch and documented in its repository. Our Semgrep rule missed only 5 of the goroutine leak bugs that GCatch found and that were reported to the Docker team via PRs. We also used this rule to find bugs in the Kubernetes, Minikube, and amazon-ecs-agent (Amazon Elastic Container Service Agent) repositories, as well as in two open-source Microsoft projects (azure-container-networking and hcsshim), and submitted the patches as PRs.

    GCatch uses a technique that is smarter and more sophisticated than the one above. As a result, it can analyze multiple execution paths to find more complex instances of this bug. However, there are advantages to using Semgrep instead of a complex tool:

    • Semgrep can analyze code more quickly, because it focuses on discovering pattern matches rather than conducting taint and data flow analysis.
    • Semgrep rules are very easy to understand and update.
    • The setup is more straightforward and reliable.

    Of course, the drawback is that we miss complex issues, such as cases in which the send operation occurs inside a separate named function (as opposed to an anonymous function). However, Semgrep has experimental support for data flow analysis, taint tracking, and basic cross-function analysis, and we look forward to testing and developing more complicated rules as support for those features continues to mature.

    Finding other types of concurrency bugs

    Our new semgrep-rules repository contains the Semgrep rule for the above bug, as well as other rules we developed and use in our code audits to find Go concurrency bugs. These include Semgrep rules used to catch race conditions in anonymous goroutines and unsafe uses of sleep functions for synchronization.

    Like the goroutine leak we examined, these types of bugs are manifested in repeatable patterns, making them detectable by Semgrep. Keep an eye on the repository, as we will be adding and refining Semgrep rules for Go and other languages. We will also continue researching ways to debug concurrency bugs in Go and will look forward to sharing more findings in the future.

    Solar: Context-free, interactive analysis for Solidity

    By Aaron Yoo, University of California, Los Angeles

    As an intern at Trail of Bits, I worked on Solar, a proof-of-concept static analysis framework. Solar is unique because it enables context-free interactive analysis of Solidity smart contracts. A user can direct Solar to explore program paths (e.g., to expand function calls or follow if statements) and assign constraints or values to variables, all without reference to a concrete execution. As a complement to Solar, I created an integrated development environment (IDE)-like tool that illustrates the types of interactivity and analysis that Solar can provide.

    Solar user interface.

    The Solar UI has two main panels, “source” and “IR” (intermediate representation). The source panel displays the source code of the functions being analyzed and the IR panel translates those functions into an enhanced variant of SlithIR, our open-source IR. The green highlights show the lines of the function that are being analyzed. The two panes display similar information intended for different uses: The source pane serves as a map, aiding in navigation and providing context. The IR pane enables interactivity as well as visualization of the information deduced by Solar.

    Analyses built on top of Solar are called “strategies.” Strategies are modular, and each defines a different mode of operation for Solar. Three strategies are shown in the above screenshot: concrete, symbolic, and constraint. Although each analysis has its own workflow, Solar needs all three to support the simplify, downcall, and upcall primitives, which are explained below.

    Solar primitives

    Simplify

    The simplify primitive instructs the framework to make as many inferences as possible based on the current information. Solar can receive new information either through user input or by deriving it as an axiom from the program. In the screencast below, Solar axiomatically deduces the value of the constant, 2. If the user tells Solar to assume that the value of x is 5 and then clicks the “simplify” button, Solar will deduce the return value.

    Downcall

    The downcall primitive instructs the framework to inline a function call. Function calls are inlined so that the user can see the entire path in one pane. In a traditional IDE, the user would have to navigate to the function in question. In our framework, the downcalled function is inlined directly into the IR pane, and its source is brought into the source pane.

    Upcall

    The upcall primitive inlines the current context into a calling function. In other words, upcalling implies traversal up the call stack and is generally the opposite of downcalling. By upcalling, Solar can traverse program paths up through function calls, as demonstrated below. (Pay special attention to the green-highlighted lines.)

    Together, these three primitives give Solar its defining properties: context insensitivity and interactivity. Solar is context-free (context-insensitive) because the user can start analysis from any function. It is interactive because the exact program path is determined by the upcalls and downcalls—which are chosen by the user.

    Demonstration for reasoning about integer overflow

    Solar can help a user reason about nontrivial program properties such as integer overflow. Consider the following program:

    contract Overflow {
        function z(uint32 x, uint32 y) public returns (uint32) {
            uint32 ret = 0;
            if (x < 1000 && y < 1000) {
                will_it_overflow(x, y);
            }
            return ret;
        }
    
        function will_it_overflow(uint32 x, uint32 y) public returns (uint32) {
            return x * y;
        }
    }

    Here, we want to find out whether will_it_overflow will ever cause an integer overflow. Integer overflow occurs when the mathematical result of an arithmetic operation cannot fit into the physical space allocated for its storage.
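
    For a quick sense of the arithmetic (illustrative Python, not Solar output): a 32-bit unsigned integer can hold values only up to 2**32 - 1 = 4,294,967,295, so any product larger than that wraps around.

    MODULUS = 2 ** 32            # number of distinct values a uint32 can hold

    x, y = 100_000, 100_000
    product = x * y              # 10,000,000,000 mathematically
    wrapped = product % MODULUS  # what a uint32 multiplication actually stores

    print(product >= MODULUS)    # True: the mathematical result does not fit in 32 bits
    print(wrapped)               # 1410065408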

    Looking at the will_it_overflow function, it’s clear that integer overflow may be possible, as two 32-bit numbers are multiplied and placed into a 32-bit result. However, based on the call sites of will_it_overflow, if z calls will_it_overflow, there can never be an integer overflow; this is because z verifies that arguments to will_it_overflow are small. Let’s see how Solar would reach this conclusion.

    Performing this analysis with Solar requires use of the constraint strategy, which works by attempting to find a single valid execution of the program. The user can constrain the execution with arbitrary polynomial constraints. To start the analysis, we select the will_it_overflow function from the left pane to indicate it as the desired starting point. Here is the initial analysis view:

    Solar provides one possible execution that evaluates all values to zero. The next step is constraining the values of x and y. We can provide the following constraints (in terms of IR variables, not source variables) to the constraint strategy:

    1.   x_1 < (2 ** 32)
    2.   y_1 < (2 ** 32)
    3.   x_1 * y_1 >= (2 ** 32)

    The first two constraints bind x_1 and y_1 to 32-bit integers. The third causes the solver to try to find an execution in which x_1 * y_1 overflows. It is common practice to prove properties with a SAT/SMT solver by showing that the negation of the property is unsatisfiable. With that in mind, we are going to use Solar to show that there are no executions in which x_1 * y_1 overflows, thereby implying that will_it_overflow does not overflow. After simplifying with these constraints in place, we get the following:

    Again, Solar provides a single possible execution based on the constraints. Upcalling once puts us at line 5:

    Because the green-highlighted lines do not form a logical contradiction, Solar can still return a valid program execution. However, upcalling again returns a different result:

    The question marks on the right indicate that the solver cannot find any possible execution given the program path. This is because line 4 contradicts the inequality we wrote earlier: x_1 * y_1 >= (2 ** 32). The above steps constitute informal proof that overflow is not possible if will_it_overflow is called from z.
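
    The same unsatisfiability argument can be reproduced outside Solar with an off-the-shelf SMT solver. The sketch below uses the z3-solver Python package and encodes the call-site guard from z together with the overflow condition; the variable names are ours, chosen for illustration:

    from z3 import Ints, Solver, unsat

    x, y = Ints("x y")
    solver = Solver()

    # Non-negative uint32 arguments, constrained by the guard in z() at the call site.
    solver.add(x >= 0, y >= 0)
    solver.add(x < 1000, y < 1000)

    # Negation of the property we want to prove: "the multiplication overflows."
    solver.add(x * y >= 2 ** 32)

    # unsat means no execution satisfies the overflow condition under the guard,
    # so will_it_overflow cannot overflow when called from z.
    print(solver.check() == unsat)  # True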

    Conclusion

    I am proud of what Solar became. Although it is a prototype, Solar represents a novel type of analysis platform that prioritizes interactivity. Given the potential applications for code auditing and IDE-style semantic checking, I am excited to see what the future holds for Solar and its core ideas. I would like to give a big thank-you to my mentor, Peter Goodman, for making this internship fun and fulfilling. Peter accomplished perhaps the most challenging task for a mentor: striking the delicate balance between providing me guidance and freedom of thought. I would also like to extend thanks to Trail of Bits for hosting the internship. I look forward to seeing the exciting projects that future interns create!

    A Year in the Life of a Compiler Fuzzing Campaign

    By Alex Groce, Northern Arizona University

    In the summer of 2020, we described our work fuzzing the Solidity compiler, solc. So now we’d like to revisit this project, since fuzzing campaigns tend to “saturate,” finding fewer new results over time. Did Solidity fuzzing run out of gas? Is fuzzing a high-stakes project worthwhile, especially if it has its own active and effective fuzzing effort?

    The first bugs from that fuzzing campaign were submitted in February of 2020 using an afl variant. Since then, we’ve submitted 74 reports. Sixty-seven have been confirmed as bugs, and 66 of those have been fixed. Seven were duplicates or not considered to be true bugs.


    Given that 21 of these bugs were submitted since last December, it’s fair to say that our fuzzing campaign makes a strong case for why independent fuzzing is still important and can find different bugs, even when OSSFuzz-based testing is also involved.

    Why is it useful to keep fuzzing such a project, perhaps indefinitely? The answer has three parts. First, more fuzzing covers more code execution paths, and long fuzzing runs are especially helpful. It would be hard to get anywhere near path-coverage saturation with any fuzzer we know of: even when running afl for 30 or more days, our tests still find new paths every 1-2 hours and sometimes find new edges. Some of our reported bugs were discovered only after a month of fuzzing.

    A production compiler is fantastically complex, so fuzzing in depth is useful. It takes time to generate inputs like the one that triggered the most recent bug we found while fuzzing the Solidity level of solc:

    pragma experimental SMTChecker;
    contract A {
        function f() internal virtual {
            v();
        }
        function v() internal virtual {
        }
    }
    
    contract B is A {
        function f() internal virtual override {
            super.f();
        }
    }
    
    contract C is B {
        function v() internal override {
            if (0==1)
                f();
        }
    }
    
    

    This code should compile without error, but it didn’t. Until the bug was fixed, it caused the SMT Checker to crash, throwing a fatal “Super contract not available” error, due to incorrect contract context used for variable access in virtual calls inside branches.

    Compilers should undergo fuzz testing for long stretches of time because of their complexity and number of possible execution paths. A rule of thumb is that afl hasn’t really started in earnest on any non-trivial target until it hits one million executions, and compilers likely require much more than this. In our experience, compiler runs vary anywhere from less than one execution per second to as many as 40 executions per second. Just getting to one million executions can take a few days!

    The second reason we want to fuzz independently alongside OSSFuzz is to approach the target from a different angle. Instead of strictly using a dictionary- or grammar-based approach with traditional fuzzer mutation operators, we used ideas from any-language mutation testing to add “code-specific” mutation operators to afl, and rely mostly (but not exclusively) on those, as opposed to afl’s even more generic mutations, which tend to be focused on binary format data. Doing something different is likely going to be a good solution to fuzzer saturation.

    Finally, we keep grabbing the latest code and starting fresh fuzzing runs on new versions of solc. Since the OSSFuzz continuous integration doesn’t include our techniques, bugs that are hard for other fuzzers but easy for our code-mutation approach will sometimes appear, and our fuzzer will find them almost immediately.

    But we don’t grab every new release and start over, because we don’t want to lose the ground gained in our long fuzzing campaigns. We also don’t simply pick up where we left off each time, since the tens of thousands of test cases that afl generates are likely full of uninteresting paths that would make finding bugs in the new code harder. We sometimes resume from an existing run, but only infrequently.

    Finding bugs in a heavily fuzzed program like solc is not easy. The next best independent fuzzing effort, that of Charalambos Mitropoulos, which is also mentioned by the solc team in their post on OSSFuzz fuzzing, has discovered only 8 bugs, even though it has been ongoing since October 2019.

    Other Languages, Other Compilers

    Our success with solc inspired us to fuzz other compilers. First, we tried fuzzing the compiler for Vyper, a language intended to provide a safer, Python-like alternative to Solidity for writing Ethereum blockchain smart contracts. Our previous Vyper fuzzing uncovered some interesting bugs using an essentially grammar-based approach with the TSTL (Template Scripting Testing Language) Python library via python-afl. We found a few bugs during this campaign but chose not to go to extremes, because of the poor speed and throughput of the instrumented Python testing.

    In contrast, my collaborator, Rijnard van Tonder at Sourcegraph, had much greater success fuzzing the Diem project’s Move language—the language for the blockchain formerly known as Facebook’s Libra. Here, the compiler is fast and the instrumentation is cheap. Rijnard has reported 14 bugs in the compiler, so far, all of which have been confirmed and assigned, and 11 of which have been fixed. Given that the fuzzing began just two months ago, this is an impressive bug haul!

    Using Rijnard’s notes on fuzzing Rust code using afl.rs, I tried our tools on Fe, a new smart contract language supported by the Ethereum foundation. Fe is, in a sense, a successor to Vyper, but with more inspiration from Rust and a much faster compiler. I began fuzzing Fe on the date of its first alpha release and submitted my first issue nine days later.

    To support my fuzzing campaign, the Fe team changed failures in the Yul backend, which uses solc to compile Yul, to produce Rust panics visible to afl, and we were off to the races. So far, this effort has produced 31 issues, slightly over 18% of all GitHub issues for Fe, including feature requests. Of these, 14 have been confirmed as bugs, and ten of those have been fixed; the remaining bugs are still under review.

    We didn’t just fuzz smart contract languages. Rijnard also fuzzed the compiler for Zig, a new systems programming language that aims at simplicity and transparency, and found two bugs (confirmed but not yet fixed).

    The Future of Our Fuzzing Campaigns

    We uncovered 88 bugs that were fixed during our afl compiler fuzzing campaign, plus an additional 14 confirmed but not yet fixed bugs.

    What’s interesting is that the fuzzers aren’t using dictionaries or grammars. They know nothing about any of these languages beyond what is expressed by a modest corpus of example programs from the test cases. So how can we fuzz compilers this effectively?

    The fuzzers operate at a regular-expression level. They don’t even use context-free language information. Most of the fuzzing has used fast C string-based heuristics to make “code-like” changes, such as removing code between brackets, changing arithmetic or logical operators, or just swapping lines of code, as well as changing if statements to while and removing function arguments. In other words, they apply the kind of changes a mutation testing tool would. This approach works well even though Vyper and Fe aren’t very C-like and only Python’s whitespace, comma, and parentheses usage are represented.
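
    For a flavor of what such regular-expression-level, “code-like” mutations look like, here is a simplified Python sketch (not the actual afl-compiler-fuzzer implementation, which is considerably more involved):

    import random
    import re

    OPERATOR_SWAPS = [("+", "-"), ("*", "+"), ("<", ">"), ("==", "!=")]

    def mutate(source):
        """Apply one random 'code-like' mutation of the kind described above:
        swap an operator, empty a bracketed block, or swap two lines."""
        choice = random.randrange(3)
        if choice == 0:
            # Change an arithmetic, comparison, or equality operator.
            candidates = [(old, new) for old, new in OPERATOR_SWAPS if old in source]
            if candidates:
                old, new = random.choice(candidates)
                return source.replace(old, new, 1)
            return source
        if choice == 1:
            # Remove the contents of one innermost {...} block.
            return re.sub(r"\{[^{}]*\}", "{}", source, count=1)
        # Swap two randomly chosen lines.
        lines = source.splitlines()
        if len(lines) >= 2:
            i, j = random.sample(range(len(lines)), 2)
            lines[i], lines[j] = lines[j], lines[i]
        return "\n".join(lines)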

    Custom dictionaries and language-aware mutation rules may be more effective, but the goal is to provide compiler projects with effective fuzzing without requiring many resources. We also want to see the impact that a good fuzzing strategy can have on a project during the early stages of development, as with the Fe language. Some of the bugs we’ve reported highlighted tricky corner cases for developers much earlier than might have otherwise been the case. We hope that discussions such as this one will help produce a more robust language and compiler with fewer hacks made to accommodate design flaws detected too late to easily change.

    We plan to keep fuzzing most of these compilers since the solc effort has shown that a fuzzing campaign can remain viable for a long time, even if there are other fuzzing efforts targeting the same compiler.

    Compilers are complex and most are also rapidly changing. For example, Fe is a brand-new language that isn’t really fully designed, and Solidity is well-known for making dramatic changes to both user-facing syntax and compiler internals.

    We’re also talking to Bhargava Shastry, who leads the internal fuzzing effort for Solidity, and adopting some of the semantic checks they apply in their protobuf-based fuzzing of the Yul optimization level. We have started directly fuzzing Yul via solc’s strict-assembly option and have already found one amusing bug that was quickly fixed and incited quite a bit of discussion! We hope that the ability to find more than just inputs that crash solc will take this fuzzing to the next level.

    The larger issue is that the bugs fuzzing can find are limited by a compiler’s inability to detect many of its own wrong-code errors. Differential comparison of two compilers, or of a compiler with its own output when optimizations are turned off, usually requires a much more restricted form of test program, since programs must be compiled and executed to compare results, and that restriction limits the bugs you can find.
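
    For context, the skeleton of such a differential comparison is simple; the sketch below (with hypothetical compiler flags and paths, and only for programs that can be compiled and executed) compares a compiler’s output with optimizations off and on:

    import subprocess

    def differs_between_opt_levels(compiler, source_file):
        """Compile and run a test program with optimizations off and on, and report
        whether the two runs disagree (a potential wrong-code bug)."""
        results = []
        for opt_flag in ("-O0", "-O2"):  # hypothetical optimization flags
            binary = "./prog_" + opt_flag.lstrip("-")
            subprocess.run([compiler, opt_flag, "-o", binary, source_file], check=True)
            run = subprocess.run([binary], capture_output=True, text=True)
            results.append((run.returncode, run.stdout))
        return results[0] != results[1]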

    One way to get around this problem is to make the compiler crash more often. We imagine a world where compilers include something like a testing option that enables aggressive and expensive checks, such as sanity checks on register allocation. Although these checks would likely be too expensive for normal runs, they could be turned on both for some fuzzing runs, since the programs compiled are usually small, and, perhaps even more importantly, for final production compilation of extremely critical code (Mars rover code, nuclear-reactor control code, or high-value smart contracts) to make sure no wrong-code bugs creep into such systems.

    Finally, we want to show compiler developers, and developers of other tools that take source code as input, that effective fuzzing doesn’t have to be a high-cost effort requiring significant developer time. Finding crashing inputs for a compiler is often easy, using nothing more than some spare CPU cycles, a decent set of source code examples in the language, and the afl-compiler-fuzzer tool!

    We hope you enjoyed learning about our long-term compiler fuzzing project, and we’d love to hear about your own fuzzing experiences on Twitter @trailofbits.