Announcing osquery 5: Now with EndpointSecurity on macOS

By Sharvil Shah, Senior Software Engineer
Originally published on October 6, 2021

TL;DR: Version 5.0.1 of osquery, a cross-platform, open-source endpoint visibility agent, is now available. This release is an exciting milestone for the project, as it introduces an EndpointSecurity-based process events table for macOS. Read on to learn how we integrated EndpointSecurity into osquery and how you can begin using it in your organization.

Apple’s New Rules for macOS Security

Over the years, Apple has been gradually taking pages from its iOS playbook to spruce up macOS security, beginning with the introduction of System Integrity Protection (SIP) in OS X 10.11 El Capitan (2015) to contain the root user. Since then, Apple has accelerated its efforts to improve macOS security by introducing stricter requirements for Gatekeeper and by enforcing code signing and notarization of application binaries and packages.

Entitlements are another feature strengthening macOS security. Granted by Apple and baked in with a corresponding code signature, an entitlement allows an application or binary to use restricted APIs or frameworks. These new locked-down APIs replace the ones that were formerly available only to kernel-mode “kernel extensions.” Because osquery has always been a user-mode-only executable, following the same out-from-the-kernel OS integrity trend that many platforms are adopting, the project was already well positioned to adopt these new APIs.

What is EndpointSecurity?

Apple has gradually deprecated kernel extensions with its recent releases of macOS. To replace kernel extensions, Apple developed the EndpointSecurity framework and API. When combined with the required entitlements, the EndpointSecurity framework enables user-mode processes to subscribe to events of interest from the macOS kernel in real time. EndpointSecurity replaces kauth, the kernel-mode authorization framework, and OpenBSM, the legacy framework used to grab the audit trail from the kernel.

Compared to OpenBSM, EndpointSecurity is more reliable, is more performant, and anecdotally captures more process events. For a more in-depth review of EndpointSecurity, check out our Sinter blog post, our team’s first demonstration of EndpointSecurity.

These security features are a great boon to end users. We were on a steep learning curve as we retrofitted osquery—which has always been deployed as a basic, standalone CLI executable—with new signing and packaging procedures, but we believe it was well worth the effort.

How to Use osquery with EndpointSecurity: A Mini Tutorial

With the 5.0.1 release of osquery, we have implemented the es_process_events table. Check the schema for this table before following along with the tutorial.

Following along in osqueryi

The simplest way to get started with osquery is by using osqueryi, the interactive osquery shell. Download the official macOS installer package from osquery.io and install it as you would any other application.

With the release of version 5.0.1, osquery is now installed as an app bundle in /opt/osquery/lib/osquery.app, and osqueryi is a symlink in /usr/local/bin.

Next up, grant your terminal emulator application—whether it be Terminal.app, iTerm2.app, or any other terminal emulator—Full Disk Access permissions in System Preferences. Full Disk Access is part of Apple’s Transparency, Consent, and Control (TCC) framework, another macOS security feature, and is required to enable EndpointSecurity. In the next section, we explain how to grant this permission automatically for Macs that are enrolled in a mobile device management (MDM) solution.

Finally, run osqueryi with root permissions and provide the --disable_events=false and --disable_endpointsecurity=false flags to launch osquery interactively, with ephemeral events and the EndpointSecurity-based es_process_events table enabled.

Below is an example of osqueryi capturing recent process events that have occurred since the last time osqueryi was launched.

➜  ~ sudo osqueryi --disable_events=false --disable_endpointsecurity=false
Using a virtual database. Need help, type '.help'
osquery> .mode line
osquery> select * from es_process_events;
 
        version = 4
        seq_num = 178
 global_seq_num = 574
            pid = 8522
           path = /Applications/Xcode.app/Contents/Developer/usr/share/xcs/Nginx/sbin/nginx
         parent = 1
original_parent = 1
        cmdline = /Library/Developer/XcodeServer/CurrentXcodeSymlink/Contents/Developer/usr/share/xcs/Nginx/sbin/nginx -c /Library/Developer/XcodeServer/CurrentXcodeSymlink/Contents/Developer/usr/share/xcs/xcsnginx/xcsnginx.conf
  cmdline_count = 3
            env = XPC_SERVICE_NAME=com.apple.xcsnginx PATH=/usr/bin:/bin:/usr/sbin:/sbin XPC_FLAGS=1 LOGNAME=_xcsnginx USER=_xcsnginx HOME=/var/_xcsnginx SHELL=/bin/false TMPDIR=/var/folders/xl/xl5_qxqd1095w75dfmq92c4w0000f3/T/
      env_count = 8
            cwd = /Applications/Xcode.app/Contents/Developer/usr/share/xcs/Nginx
            uid = 451
           euid = 450
            gid = 450
           egid = 450
       username = _xcsnginx
     signing_id = com.apple.nginx
        team_id =
         cdhash = 7fde0ccc9dcdb7d994e82a880d684c5418368460
platform_binary = 1
      exit_code =
      child_pid =
           time = 1631617834
     event_type = exec
 
 
        version = 4
        seq_num = 193
 global_seq_num = 552
            pid = 8077
           path = /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/Metadata.framework/Versions/A/Support/mdworker_shared
         parent = 1
original_parent = 1
        cmdline =
  cmdline_count = 0
            env =
      env_count = 0
            cwd =
            uid = 501
           euid = 20
            gid = 20
           egid = 20
       username = sharvil
     signing_id = com.apple.mdworker_shared
        team_id =
         cdhash = 993abbb7ffcd0d3216808513f8212f4fa1fa07d7
platform_binary = 1
      exit_code = 9
      child_pid =
           time = 1631617820
     event_type = exit

Deploying a PPPC Profile for an MDM-Managed Mac

While osqueryi is a great tool for interactively introspecting, monitoring, and developing queries suitable to your environment, most deployments of osquery are in daemon mode with a configuration file.

For a Mac host that is enrolled in MDM, you can grant Full Disk Access permissions automatically and silently by pushing a Privacy Preferences Policy Control (PPPC) configuration profile. For this profile, you need both the systemPolicyAllFiles key, which grants Full Disk Access, and a CodeRequirement key.

Use the codesign tool to output CodeRequirement and copy everything in the output after designated =>.

> codesign  -dr - /opt/osquery/lib/osquery.app/Contents/MacOS/osqueryd
Executable=/opt/osquery/lib/osquery.app/Contents/MacOS/osqueryd
designated => identifier "io.osquery.agent" and anchor apple generic and 
certificate 1[field.1.2.840.113635.100.6.2.6] /* exists */ and certificate 
leaf[field.1.2.840.113635.100.6.1.13] /* exists */ and certificate 
leaf[subject.OU] = "3522FA9PXF"

To complete systemPolicyAllFiles, the Identifier should be io.osquery.agent and IdentifierType should be bundleID.

Pulling it all together, below is an example of a complete PPPC profile granting Full Disk Access to osquery. Please note that you will need to update the PayloadOrganization and the other organization-specific fields for your environment.
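
Below is a minimal sketch of such a profile, assuming the standard com.apple.TCC.configuration-profile-policy payload. The identifiers, UUIDs, and organization fields are placeholders, and the exact keys should be verified against Apple’s Privacy payload documentation and your MDM vendor’s requirements; the CodeRequirement string is the one taken from the codesign output above.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>PayloadContent</key>
  <array>
    <dict>
      <key>PayloadType</key>
      <string>com.apple.TCC.configuration-profile-policy</string>
      <key>PayloadIdentifier</key>
      <string>com.example.osquery.tcc</string>
      <key>PayloadUUID</key>
      <string>REPLACE-WITH-A-UUID</string>
      <key>PayloadVersion</key>
      <integer>1</integer>
      <key>Services</key>
      <dict>
        <key>SystemPolicyAllFiles</key>
        <array>
          <dict>
            <key>Allowed</key>
            <true/>
            <key>Identifier</key>
            <string>io.osquery.agent</string>
            <key>IdentifierType</key>
            <string>bundleID</string>
            <key>CodeRequirement</key>
            <string>identifier "io.osquery.agent" and anchor apple generic and certificate 1[field.1.2.840.113635.100.6.2.6] /* exists */ and certificate leaf[field.1.2.840.113635.100.6.1.13] /* exists */ and certificate leaf[subject.OU] = "3522FA9PXF"</string>
          </dict>
        </array>
      </dict>
    </dict>
  </array>
  <key>PayloadDisplayName</key>
  <string>osquery Full Disk Access</string>
  <key>PayloadIdentifier</key>
  <string>com.example.osquery.fda</string>
  <key>PayloadOrganization</key>
  <string>Example Organization</string>
  <key>PayloadScope</key>
  <string>System</string>
  <key>PayloadType</key>
  <string>Configuration</string>
  <key>PayloadUUID</key>
  <string>REPLACE-WITH-A-UUID</string>
  <key>PayloadVersion</key>
  <integer>1</integer>
</dict>
</plist>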

You may need to consult the documentation of your MDM provider for more information on PPPC profiles.

Migrating from osquery 4.x to 5.x

With the release of version 5.0.1, osquery now installs on macOS as an app bundle in /opt/osquery/lib/osquery.app, and the new package identifier is io.osquery.agent. If you are upgrading osquery from version 4.9, you need to stop the osquery launchd service and restart it after version 5.0.1 is installed, since osquery itself doesn’t provide a mechanism to clean up artifacts, binaries, and configuration files for older versions. Also of note is that the package installer does not install a LaunchDaemon to start osqueryd. You may use the provided osqueryctl start script to copy the sample LaunchDaemon plist and associated configuration to start the osqueryd daemon.
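
As a rough sketch, the upgrade on a single host looks something like the following; the LaunchDaemon plist path is a placeholder, since the exact label depends on how your 4.x agent was deployed:

# Stop and unload the old 4.x daemon (substitute the plist your 4.x package installed)
sudo launchctl unload /Library/LaunchDaemons/<old-osqueryd-label>.plist

# Install the 5.0.1 package, then copy the sample LaunchDaemon plist and start osqueryd
sudo installer -pkg osquery-5.0.1.pkg -target /
sudo osqueryctl start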

Similar changes apply to Linux and Windows hosts. Please consult the installation guide on the osquery wiki to learn more.

A Stronger Foundation

We developed a working proof-of-concept of osquery with an EndpointSecurity integration rather quickly; in fact, we merged the feature into osquery months before the version 5.0.1 release. But to actually “switch on” the EndpointSecurity functionality, we had to conquer a mountain of technical debt around signing, packaging, and distribution.

Finally, we had to socialize all these breaking changes and navigate all the community’s feedback. Indeed, the number of changes we needed to make warranted a new major version of osquery, from version 4.9 to 5.0.1.

In June 2019, Facebook and the Linux Foundation formed the osquery Foundation, a new community entity intended to accelerate development of, and open participation in, the osquery project. A multi-stakeholder Technical Steering Committee (which includes Trail of Bits) has been updating and maintaining osquery ever since. Throughout that time, one of the biggest technical obstacles to the project’s independence was the lack of an automated packaging and code-signing pipeline. Thanks to the community’s efforts this year to integrate EndpointSecurity into osquery, this pipeline is finally in place. Facebook’s original osquery developers can now fully hand over the keys (literally and figuratively) to the community.

The osquery project is undoubtedly made possible only by the continuing involvement and support of its open-source community. We especially want to thank our client sponsor, Atlassian, and our fellow contributors to the osquery community, past and present.

Future Directions

Now that we have integrated EndpointSecurity into osquery, the tool comes with a variety of new detection capabilities on macOS. It should now be much easier to add a file event monitor, kernel extension loading events, and even memory mapping events. The automated code-signing and packaging groundwork that we’ve laid for EndpointSecurity could pave the way for other permissioned/entitled macOS event monitoring frameworks, like NetworkExtension, to be integrated into osquery.

Trail of Bits has been at the forefront of osquery development for years. Where do you want us to take the project next? Drop us a line to chat about how we can help implement your idea!

All your tracing are belong to BPF

By Alessandro Gario, Senior Software Engineer
Originally published August 11, 2021

TL;DR: These simple, step-by-step methods equip you to apply BPF tracing technology to real-world problems—no specialized tools or libraries required.

BPF, which started life in the Linux kernel as a packet-filtering technology for the network stack, has recently become popular thanks to new extensions that enable novel use cases outside of its original scope. Today it can be used to implement program performance analysis tools, system and program dynamic tracing utilities, and much more.

In this blog post we’ll show you how to use the Linux implementation of BPF to write tools that access system and program events. The excellent tools from IO Visor make it possible for users to easily harness BPF technology without the considerable time investment of writing specialized tools in native code languages.

What the BPF?

BPF itself is just a way to express a program, and a runtime interpreter for executing that program “safely.” It’s a set of specifications for a virtual architecture, detailing how virtual machines dedicated to running its code should behave. The latest extensions to BPF have not only introduced new, really useful helper functions (such as reading a process’ memory), but also new registers and more stack space for the BPF bytecode.

Our main goal is to help you to take advantage of BPF and apply it to real-world problems without depending on external tools or libraries that may have been written with different goals and requirements in mind.

You can find the examples in this post in our repository. Please note that the code is simplified to focus on the concepts. This means that, where possible, we skip error checking and proper resource cleanup.

BPF program limitations

Even though we won’t be handwriting BPF assembly, it’s useful to know the code limitations since the in-kernel verifier will reject our instructions if we break its rules.

BPF programs are extremely simple, being made of only a single function. Instructions are sent to the kernel as an array of opcodes, meaning there’s no executable file format involved. Without sections, it’s not possible to have things like global variables or string literals; everything has to live on the stack, which can only hold up to 512 bytes. Branches are allowed, but it is only since kernel version 5.3 that jump opcodes can go backward—provided the verifier can prove the code will not execute forever.

The only other way to use loops without requiring recent kernel versions is to unroll them, but this will potentially use a lot of instructions, and older Linux versions will not load any program that exceeds the 4096 opcode count limit (see BPF_MAXINSNS under linux/bpf_common.h). Error handling in some cases is mandatory, and the verifier will prevent you from using resources that may fail initialization by rejecting the program.

These limitations are extremely important since these programs are hooked into kernel code. Because the verifier challenges the correctness of the code before it runs, malformed programs are rejected instead of being allowed to crash or slow down the system.

External resources

To make BPF programs truly useful, they need ways to communicate with a user mode process and manage long-term data, namely maps and perf event outputs.

Although many map types exist, they all essentially behave like key-value databases, and are commonly used to share data with user mode and/or other BPF programs. Some of these types store data in per-CPU storage, making it easy to save and retrieve state when the same BPF program is run concurrently from different CPU cores.

Perf event outputs are generally used to send data to user mode programs and services, and are implemented as circular buffers.

Event sources

Without some data to process, our programs will just sit around doing nothing. BPF probes on Linux can be attached to several different event sources. For our purpose, we’re mainly interested in function tracing events.

Dynamic instrumentation

Similar to code hooking, BPF programs can be attached to any function. The probe type depends on where the target code lives. Kprobes are used when tracing kernel functions, while Uprobes are used when working with user mode libraries or binaries.

While Kprobe and Uprobe events are emitted when entering the monitored functions, Kretprobe and Uretprobe events are generated whenever the function returns. This works correctly even if the function being traced has multiple exit points. These events do not forward typed function parameters; they only come with a pt_regs structure that contains the register values at the time of the call, so knowledge of the function prototype and the system ABI is required to map the arguments back to the right registers.
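
For example, on x86_64 Linux the System V calling convention passes the first six integer arguments in rdi, rsi, rdx, rcx, r8, and r9 (with rcx replaced by r10 at syscall entry), so a generator could map argument indexes back to registers with a small helper like the hypothetical one below:

#include <cstddef>
#include <stdexcept>

// Hypothetical helper: returns the x86_64 register that holds the Nth argument
// of a regular (System V) function call; this is where a Kprobe or Uprobe
// handler has to look inside the pt_regs structure it receives
const char *x86_64CallArgumentRegister(std::size_t index) {
  // Note: for syscall entry, the fourth slot is r10 instead of rcx
  static const char *kRegisterList[] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};

  if (index >= 6U) {
    throw std::out_of_range("Arguments beyond the sixth are passed on the stack");
  }

  return kRegisterList[index];
}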

Static instrumentation

It’s not always ideal to rely on function hooking when writing a tool, because the risk of breakage increases as the kernel or software gets updated. In most cases, it’s best to use a more stable event source such as a tracepoint.

There are two types of tracepoints:

  • One for user mode code (USDT, a.k.a. User-Level Statically Defined Tracepoints)
  • One for kernel mode code (interestingly, they are referred to as just “tracepoints”).

Both types of tracepoints are defined in the source code by the programmer, essentially defining a stable interface that shouldn’t change unless strictly necessary.

If DebugFS has been enabled and mounted, registered tracepoints will all appear under the /sys/kernel/debug/tracing folder. Similar to Kprobes and Kretprobes, each system call defined in the Linux kernel comes with two different tracepoints. The first one, sys_enter, is activated whenever a program in the system transitions to a syscall handler inside the kernel, and carries information about the parameters that have been received. The second (and last) one, sys_exit, only contains the exit code of the function and is invoked whenever the syscall function terminates.
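
For instance, assuming DebugFS is mounted in the usual location, you can list the syscall tracepoints and read the id of one of them like this:

sudo ls /sys/kernel/debug/tracing/events/syscalls/ | head
sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_openat/id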

BPF development prerequisites

Even though there’s no plan to use external libraries, we still have a few dependencies. The most important thing is to have access to a recent LLVM toolchain compiled with BPF support. If your system does not satisfy this requirement, it is possible—and actually encouraged—to make use of the osquery toolchain. You’ll also need CMake, as that’s what I use for the sample code.

When running inside the BPF environment, our programs make use of special helper functions that require a kernel version of at least 4.18. While it’s possible to avoid using them, doing so would severely limit what we can do from our code.

Using Ubuntu 20.04 or equivalent is a good bet, as it comes with both a good kernel version and an up-to-date LLVM toolchain with BPF support.

Some LLVM knowledge is useful, but the code doesn’t require any advanced LLVM expertise. The Kaleidoscope language tutorial on the official site is a great introduction if needed.

Writing our first program

There are many new concepts to introduce, so we’ll start simple: our first example loads a program that returns without doing anything.

First, we create a new LLVM module and a function that contains our logic:

std::unique_ptr<llvm::Module> createBPFModule(llvm::LLVMContext &context) {
  auto module = std::make_unique<llvm::Module>("BPFModule", context);
  module->setTargetTriple("bpf-pc-linux");
  module->setDataLayout("e-m:e-p:64:64-i64:64-n32:64-S128");
  
  return module;
}
  
std::unique_ptr<llvm::Module> generateBPFModule(llvm::LLVMContext &context) {
  // Create the LLVM module for the BPF program
  auto module = createBPFModule(context);
  
  // BPF programs are made of a single function; we don't care about parameters
  // for the time being
  llvm::IRBuilder<> builder(context);
  auto function_type = llvm::FunctionType::get(builder.getInt64Ty(), {}, false);
  
  auto function = llvm::Function::Create(
      function_type, llvm::Function::ExternalLinkage, "main", module.get());  
      
  // Ask LLVM to put this function in its own section, so we can later find it
  // more easily after we have compiled it to BPF code
  function->setSection("bpf_main_section");
  
  // Create the entry basic block, where we will emit a return instruction to
  // terminate the program
  auto entry_bb = llvm::BasicBlock::Create(context, "entry", function);
  
  builder.SetInsertPoint(entry_bb);
  builder.CreateRet(builder.getInt64(0));
  
  return module;
}

Since we’re not going to handle event arguments, the function we created does not accept any parameters. Not much else is happening here except the return instruction. Remember, each BPF program has exactly one function, so it’s best to ask LLVM to store them in separate sections. This makes it easier to retrieve them once the module is compiled.

We can now JIT our module to BPF bytecode using the ExecutionEngine class from LLVM:

SectionMap compileModule(std::unique_ptr<llvm::Module> module) {
  // Create a new execution engine builder and configure it
  auto exec_engine_builder =
      std::make_unique<llvm::EngineBuilder>(std::move(module));
  
  exec_engine_builder->setMArch("bpf");
  
  SectionMap section_map;
  exec_engine_builder->setMCJITMemoryManager(
      std::make_unique<SectionMemoryManager>(section_map));
      
  // Create the execution engine and build the given module
  std::unique_ptr<llvm::ExecutionEngine> execution_engine(
      exec_engine_builder->create());
      
  execution_engine->setProcessAllSections(true);
  execution_engine->finalizeObject();
  
  return section_map;
}

Our custom SectionMemoryManager class mostly acts as a passthrough to the original SectionMemoryManager class from LLVM—it’s only there to keep track of the sections that the ExecutionEngine object creates when compiling our IR.
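
A minimal sketch of that idea, assuming SectionMap maps section names to byte vectors (the class in the companion repository may differ in the details):

#include <llvm/ExecutionEngine/SectionMemoryManager.h>

#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

using SectionMap = std::unordered_map<std::string, std::vector<std::uint8_t>>;

// Passthrough memory manager that records where the ExecutionEngine placed
// each code section so that the generated bytes can be copied out later
class SectionMemoryManager final : public llvm::SectionMemoryManager {
public:
  explicit SectionMemoryManager(SectionMap &section_map)
      : section_map(section_map) {}

  std::uint8_t *allocateCodeSection(std::uintptr_t size, unsigned alignment,
                                    unsigned section_id,
                                    llvm::StringRef section_name) override {
    auto buffer = llvm::SectionMemoryManager::allocateCodeSection(
        size, alignment, section_id, section_name);

    section_list.push_back({section_name.str(), buffer, size});
    return buffer;
  }

  bool finalizeMemory(std::string *error_message = nullptr) override {
    // By the time this is called the relocated code is in place, so snapshot
    // each recorded section into the output map
    for (const auto &section : section_list) {
      auto &output = section_map[section.name];
      output.resize(section.size);
      std::memcpy(output.data(), section.buffer, section.size);
    }

    return llvm::SectionMemoryManager::finalizeMemory(error_message);
  }

private:
  struct Section final {
    std::string name;
    std::uint8_t *buffer{nullptr};
    std::uintptr_t size{0U};
  };

  SectionMap &section_map;
  std::vector<Section> section_list;
};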

Once the code is built, we get back a vector of bytes for each function that was created inside the module:

int loadProgram(const std::vector<std::uint8_t> &program) {
  // The program needs to be aware how it is going to be used. We are
  // only interested in tracepoints, so we'll hardcode this value
  union bpf_attr attr = {};
  attr.prog_type = BPF_PROG_TYPE_TRACEPOINT;
  attr.log_level = 1U;
  
  // This is the array of (struct bpf_insn) instructions we have received
  // from the ExecutionEngine (see the compileModule() function for more
  // information)
  auto instruction_buffer_ptr = program.data();
  std::memcpy(&attr.insns, &instruction_buffer_ptr, sizeof(attr.insns));
  
  attr.insn_cnt =
      static_cast<__u32>(program.size() / sizeof(struct bpf_insn));
      
  // The license is important because we will not be able to call certain
  // helpers within the BPF VM if it is not compatible
  static const std::string kProgramLicense{"GPL"};
  
  auto license_ptr = kProgramLicense.c_str();
  std::memcpy(&attr.license, &license_ptr, sizeof(attr.license));
  
  // The verifier will provide a text disasm of our BPF program in here.
  // If there is anything wrong with our code, we'll also find some
  // diagnostic output
  std::vector<char> log_buffer(4096, 0);
  attr.log_size = static_cast<__u32>(log_buffer.size());
  
  auto log_buffer_ptr = log_buffer.data();
  std::memcpy(&attr.log_buf, &log_buffer_ptr, sizeof(attr.log_buf));
  
  auto program_fd =
      static_cast<int>(::syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)));
      
  if (program_fd < 0) {
    std::cerr << "Failed to load the program: " << log_buffer.data() << "\n";
  }
  
  return program_fd;
}

Loading the program is not hard, but as you may have noticed, there is no helper function defined for the bpf() system call we’re using. The tracepoint is the easiest event type to set up, and it’s what we’re using for the time being.

Once the BPF_PROG_LOAD command is issued, the in-kernel verifier will validate our program and also provide a disassembly of it inside the log buffer we’ve provided. The operation will fail if kernel output is longer than the bytes available, so only provide a log buffer in production code if the load has already failed.
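
In practice, this means attempting the load without a log buffer first and only retrying with one when the initial attempt fails. A minimal sketch of that pattern, assuming attr has already been filled in with the program type, instructions, and license exactly as in loadProgram above:

int loadProgramQuietly(union bpf_attr attr) {
  // No log buffer on the first attempt
  attr.log_level = 0U;
  attr.log_size = 0U;
  attr.log_buf = 0U;

  auto program_fd =
      static_cast<int>(::syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)));

  if (program_fd >= 0) {
    return program_fd;
  }

  // The load failed; retry with a log buffer so we can print the verifier output
  std::vector<char> log_buffer(4096, 0);
  attr.log_level = 1U;
  attr.log_size = static_cast<__u32>(log_buffer.size());

  auto log_buffer_ptr = log_buffer.data();
  std::memcpy(&attr.log_buf, &log_buffer_ptr, sizeof(attr.log_buf));

  program_fd =
      static_cast<int>(::syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)));

  std::cerr << "Failed to load the program: " << log_buffer.data() << "\n";
  return program_fd;
}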

Another important field in the attr union is the program license; specifying any value other than GPL may disable some of the features that are exposed to BPF. I’m not a licensing expert, but it should be possible to use different licenses for the generator and the generated code (but please speak to a lawyer and/or your employer first!).

We can now assemble the main() function using the helpers we built:

int main() {
  initializeLLVM();
  
  // Generate our BPF program
  llvm::LLVMContext context;
  auto module = generateBPFModule(context);
  
  // JIT the module to BPF code using the execution engine
  auto section_map = compileModule(std::move(module));
  if (section_map.size() != 1U) {
    std::cerr << "Unexpected section count\n";
    return 1;
  }
  
  // We have previously asked LLVM to create our function inside a specific
  // section; get our code back from it and load it
  const auto &main_program = section_map.at("bpf_main_section");
  auto program_fd = loadProgram(main_program);
  if (program_fd < 0) {
    return 1;
  }
  
  releaseLLVM();
  return 0;
}

If everything works correctly, no error is printed when the binary is run as the root user. You can find the source code for the empty program in the 00-empty folder of the companion code repository.

But…this program isn’t very exciting, since it doesn’t do anything! Now we’ll update it so we can execute it when a certain system event happens.

Creating our first useful program

In order to actually execute our BPF programs, we have to attach them to an event source.

Creating a new tracepoint event is easy; it only involves reading and writing some files from under the debugfs folder:

int createTracepointEvent(const std::string &event_name) {
  const std::string kBaseEventPath = "/sys/kernel/debug/tracing/events/";
  
  // This special file contains the id of the tracepoint, which is
  // required to initialize the event with perf_event_open  
  std::string event_id_path = kBaseEventPath + event_name + "/id";
  
  // Read the tracepoint id and convert it to an integer
  auto event_file = std::fstream(event_id_path, std::ios::in);
  if (!event_file) {
    return -1;
  }
  
  std::stringstream buffer;
  buffer << event_file.rdbuf();
  
  auto str_event_id = buffer.str();
  auto event_identifier = static_cast<std::uint32_t>(
      std::strtol(str_event_id.c_str(), nullptr, 10));
      
  // Create the event
  struct perf_event_attr perf_attr = {};
  perf_attr.type = PERF_TYPE_TRACEPOINT;
  perf_attr.size = sizeof(struct perf_event_attr);
  perf_attr.config = event_identifier;
  perf_attr.sample_period = 1;
  perf_attr.sample_type = PERF_SAMPLE_RAW;
  perf_attr.wakeup_events = 1;
  perf_attr.disabled = 1;
  
  int process_id{-1};
  int cpu_index{0};
  
  auto event_fd =
      static_cast<int>(::syscall(__NR_perf_event_open, &perf_attr, process_id,
                                 cpu_index, -1, PERF_FLAG_FD_CLOEXEC));  
  
  return event_fd;
}

To create the event file descriptor, we have to find the tracepoint identifier, which is in a special file called (unsurprisingly) “id.”

For our last step, we attach the program to the tracepoint event we just created. This is trivial and can be done with a couple of ioctl calls on the event’s file descriptor:

bool attachProgramToEvent(int event_fd, int program_fd) {
  if (ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, program_fd) < 0) {
    return false;
  }

  if (ioctl(event_fd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
    return false;
  }

  return true;
}

Our program should finally succeed in running our BPF code, but no output is generated yet since our module only really contained a return opcode. The easiest way to generate some output is to use the bpf_trace_printk helper to print a fixed string:

void generatePrintk(llvm::IRBuilder<> &builder) {
  // The bpf_trace_printk() function prototype can be found inside
  // the /usr/include/linux/bpf.h header file
  std::vector<llvm::Type *> argument_type_list = {builder.getInt8PtrTy(),
                                                  builder.getInt32Ty()};

  auto function_type =
      llvm::FunctionType::get(builder.getInt64Ty(), argument_type_list, true);

  auto function =
      builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_trace_printk),
                             llvm::PointerType::getUnqual(function_type));

  // Allocate 8 bytes on the stack
  auto buffer = builder.CreateAlloca(builder.getInt64Ty());

  // Copy the string characters to the 64-bit integer
  static const std::string kMessage{"Hello!!"};

  std::uint64_t message{0U};
  std::memcpy(&message, kMessage.c_str(), sizeof(message));

  // Store the characters inside the buffer we allocated on the stack
  builder.CreateStore(builder.getInt64(message), buffer);

  // Print the characters
  auto buffer_ptr = builder.CreateBitCast(buffer, builder.getInt8PtrTy());

#if LLVM_VERSION_MAJOR < 11
  auto function_callee = function;
#else
  auto function_callee = llvm::FunctionCallee(function_type, function);
#endif

  builder.CreateCall(function_callee, {buffer_ptr, builder.getInt32(8U)});
}

Importing new BPF helper functions is quite easy. The first thing we need is the prototype, which can be taken from the linux/bpf.h include header. The one for bpf_trace_printk reads as follows:

 * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
 * 	Description
 * 		This helper is a "printk()-like" facility for debugging. It
 * 		prints a message defined by format *fmt* (of size *fmt_size*)
 * 		to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
 * 		available. It can take up to three additional **u64**
 * 		arguments (as an eBPF helpers, the total number of arguments is
 * 		limited to five).

Once the function type matches, we only have to assemble a call that uses the helper function ID as the destination address: BPF_FUNC_trace_printk. The generatePrintk function can now be added to our program right before we create the return instruction inside generateBPFModule.

The full source code for this program can be found in the 01-hello_open folder.

Running the program again will show the “Hello!!” string inside the /sys/kernel/debug/tracing/trace_pipe file every time the tracepoint event is emitted. Text output can be useful, but due to the BPF VM limitations the printk helper is not as useful as it would be in a standard C program.
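
To watch that output while the probe is active, read the trace pipe from another terminal (root is required):

sudo cat /sys/kernel/debug/tracing/trace_pipe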

In the next section, we’ll take a look at maps and how to use them as data storage for our programs.

Profiling system calls

Using maps to store data

Maps are a major component in most programs, and can be used in a number of different ways. Since they’re accessible from both kernel and user mode, they can be useful in storing data for later processing either from additional probes or user programs. Given the limitations that BPF imposes, they’re also commonly used to provide scratch space for handling temporary data that does not fit on the stack.

There are many map types; some are specialized for certain uses, such as storing stack traces. Others are more generic, and suitable for use as custom data containers.

Concurrency and thread safety are not just user mode problems, and BPF comes with two really useful special map types that provide dedicated per-CPU storage. These maps are commonly used to replace the stack, as a per-CPU map can be easily referenced by programs without having to worry about synchronization.

It’s rather simple to create and use maps since they all share the same interface, regardless of type. The following table, taken from the BPF header file comments, documents the most common operations:

  • BPF_MAP_CREATE: Create a map and return a file descriptor that refers to the map. The close-on-exec file descriptor flag (see fcntl(2)) is automatically enabled for the new file descriptor.
  • BPF_MAP_LOOKUP_ELEM: Look up an element by key in a specified map and return its value.
  • BPF_MAP_UPDATE_ELEM: Create or update an element (key/value pair) in a specified map.
  • BPF_MAP_DELETE_ELEM: Look up and delete an element by key in a specified map.

    The only important thing to remember is that when operating on per-CPU maps the value is not just a single entry, but an array of values that has as many items as CPU cores.

    Creating a map

    Before we can create our map, we have to determine which type we want to use. The following enum declaration has been taken from the linux/bpf.h header file:

    enum bpf_map_type {
      BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
      BPF_MAP_TYPE_HASH,
      BPF_MAP_TYPE_ARRAY,
      BPF_MAP_TYPE_PROG_ARRAY,
      BPF_MAP_TYPE_PERF_EVENT_ARRAY,
      BPF_MAP_TYPE_PERCPU_HASH,
      BPF_MAP_TYPE_PERCPU_ARRAY,
      BPF_MAP_TYPE_STACK_TRACE,
      BPF_MAP_TYPE_CGROUP_ARRAY,
      BPF_MAP_TYPE_LRU_HASH,
      BPF_MAP_TYPE_LRU_PERCPU_HASH,
      BPF_MAP_TYPE_LPM_TRIE,
      BPF_MAP_TYPE_ARRAY_OF_MAPS,
      BPF_MAP_TYPE_HASH_OF_MAPS,
      BPF_MAP_TYPE_DEVMAP,
      BPF_MAP_TYPE_SOCKMAP,
      BPF_MAP_TYPE_CPUMAP,
    };
    

    Most of the time we’ll use hash maps and arrays. We have to create a bpf_attr union, initializing the key and value sizes as well as the maximum number of entries the map can hold.

    int createMap(bpf_map_type type, std::uint32_t key_size,
                  std::uint32_t value_size, std::uint32_t key_count) {
    
      union bpf_attr attr = {};
    
      attr.map_type = type;
      attr.key_size = key_size;
      attr.value_size = value_size;
      attr.max_entries = key_count;
    
      return static_cast<int>(
          syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)));
    }
    

    Not every available operation always makes sense for all map types. For example, it’s not possible to delete entries when working with an array. Lookup operations are also going to behave differently, as they will only fail when the specified index is beyond the last element.

    Here’s the code to read a value from a map:

    // Error codes for map operations; depending on the map type, reads may
    // return NotFound if the specified key is not present
    enum class ReadMapError { Succeeded, NotFound, Failed };
    
    // Attempts to read a key from the specified map. Values in per-CPU maps
    // actually have multiple entries (one per CPU)
    ReadMapError readMapKey(std::vector<std::uint8_t> &value, int map_fd,
                            const void *key) {
    
      union bpf_attr attr = {};
    
      // Use memcpy to avoid strict aliasing issues
      attr.map_fd = static_cast<__u32>(map_fd);
      std::memcpy(&attr.key, &key, sizeof(attr.key));
    
      auto value_ptr = value.data();
      std::memcpy(&attr.value, &value_ptr, sizeof(attr.value));
    
      auto err =
          ::syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(union bpf_attr));
    
      if (err >= 0) {
        return ReadMapError::Succeeded;
      }
    
      if (errno == ENOENT) {
        return ReadMapError::NotFound;
      } else {
        return ReadMapError::Failed;
      }
    }
    

    Writing a BPF program to count syscall invocations

    In this example we’ll build a probe that counts how many times the tracepoint we’re tracing gets called. We’ll create a counter for each processor core, using a per-CPU array map that only contains a single item.

    auto map_fd = createMap(BPF_MAP_TYPE_PERCPU_ARRAY, 4U, 8U, 1U);
    if (map_fd < 0) {
      return 1;
    }
    

    Referencing this map from the BPF code is not too hard but requires some additional operations:

    1. Convert the map file descriptor to a map address
    2. Use the bpf_map_lookup_elem helper function to retrieve the pointer to the desired map entry
    3. Check the returned pointer to make sure the operation has succeeded (the validator will reject our program otherwise)
    4. Update the counter value

    The map address can be obtained through a special LLVM intrinsic called “pseudo.”

    // Returns the pseudo intrinsic, useful to convert file descriptors (like maps
    // and perf event outputs) to map addresses so they can be used from the BPF VM
    llvm::Function *getPseudoFunction(llvm::IRBuilder<> &builder) {
      auto &insert_block = *builder.GetInsertBlock();
      auto &module = *insert_block.getModule();
    
      auto pseudo_function = module.getFunction("llvm.bpf.pseudo");
    
      if (pseudo_function == nullptr) {
        // clang-format off
        auto pseudo_function_type = llvm::FunctionType::get(
          builder.getInt64Ty(),
    
          {
            builder.getInt64Ty(),
            builder.getInt64Ty()
          },
    
          false
        );
        // clang-format on
    
        pseudo_function = llvm::Function::Create(pseudo_function_type,
                                                 llvm::GlobalValue::ExternalLinkage,
                                                 "llvm.bpf.pseudo", module);
      }
    
      return pseudo_function;
    }
    
    // Converts the given (map or perf event output) file descriptor to a map
    // address
    llvm::Value *mapAddressFromFileDescriptor(int fd, llvm::IRBuilder<> &builder) {
      auto pseudo_function = getPseudoFunction(builder);
    
      // clang-format off
      auto map_integer_address_value = builder.CreateCall(
        pseudo_function,
    
        {
          builder.getInt64(BPF_PSEUDO_MAP_FD),
      builder.getInt64(static_cast<std::uint64_t>(fd))
        }
      );
      // clang-format on
    
      return builder.CreateIntToPtr(map_integer_address_value,
                                    builder.getInt8PtrTy());
    }
    

    Importing the bpf_map_lookup_elem helper function follows the same procedure we used to import the bpf_trace_printk one. Looking at the linux/bpf.h, the prototype reads:

     * void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
     * 	Description
     * 		Perform a lookup in *map* for an entry associated to *key*.
     * 	Return
     * 		Map value associated to *key*, or **NULL** if no entry was
     * 		found.
    

    Notice how the key parameter is passed by pointer and not by value. We’ll have to allocate the actual key on the stack using CreateAlloca. Since allocations should always happen in the first (entry) basic block, our function will accept a pre-filled buffer as key. The return type is a void pointer, but we can save work if we directly declare the function with the correct value type.

    // Attempts to retrieve a pointer to the specified key inside the map_fd map
    llvm::Value *bpfMapLookupElem(llvm::IRBuilder<> &builder, llvm::Value *key,
                                  llvm::Type *value_type, int map_fd) {
    
      std::vector<llvm::Type *> argument_type_list = {builder.getInt8PtrTy(),
                                                      key->getType()};
    
      auto function_type = llvm::FunctionType::get(value_type->getPointerTo(),
                                                   argument_type_list, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_map_lookup_elem),
                                 llvm::PointerType::getUnqual(function_type));
    
      auto map_address = mapAddressFromFileDescriptor(map_fd, builder);
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(function_callee, {map_address, key});
    }
    

    Back to the BPF program generator, we can now call the new bpfMapLookupElem to retrieve the first value in our array map:

    auto map_key_buffer = builder.CreateAlloca(builder.getInt32Ty());
    builder.CreateStore(builder.getInt32(0U), map_key_buffer);
    
    auto counter_ptr =
        bpfMapLookupElem(builder, map_key_buffer, builder.getInt64Ty(), map_fd);
    

    Since we are using a per-CPU array map, the pointer returned by this function references a private array entry for the core we’re running on. Before we can use it, however, we have to test whether the function has succeeded; otherwise, the verifier will reject the program. This is trivial and can be done with a comparison instruction and a new basic block.

    auto null_ptr = llvm::Constant::getNullValue(counter_ptr->getType());
    auto cond = builder.CreateICmpEQ(null_ptr, counter_ptr);
    
    auto error_bb = llvm::BasicBlock::Create(context, "error", function);
    auto continue_bb = llvm::BasicBlock::Create(context, "continue", function);
    
    builder.CreateCondBr(cond, error_bb, continue_bb);
    
    builder.SetInsertPoint(error_bb);
    builder.CreateRet(builder.getInt64(0));
    
    builder.SetInsertPoint(continue_bb);
    

    The pointer to the counter value can now be dereferenced without causing a validation error from the verifier.

    auto counter = builder.CreateLoad(counter_ptr);
    auto new_value = builder.CreateAdd(counter, builder.getInt64(1));
    
    builder.CreateStore(new_value, counter_ptr);
    builder.CreateRet(builder.getInt64(0));
    

    There is no need to import and use the bpf_map_update_elem() helper function since we can directly increment the value from the pointer we received. We only have to load the value from the pointer, increment it, and then store it back where it was.

    Once we have finished with our tracer, we can retrieve the counters and inspect them:

    auto processor_count = getProcessorCount();
    std::vector<std::uint8_t> value(processor_count * sizeof(std::uint64_t));
    
    std::uint32_t key{0U};
    auto map_error = readMapKey(value, map_fd, &key);
    
    if (map_error != ReadMapError::Succeeded) {
      std::cerr << "Failed to read from the map\n";
      return 1;
    }
    
    std::vector<std::uint64_t> per_cpu_counters(processor_count);
    std::memcpy(per_cpu_counters.data(), value.data(), value.size());
    

    When dealing with per-CPU maps, it is important to not rely on get_nprocs_conf and use /sys/devices/system/cpu/possible instead. On VMware Fusion for example, the vcpu.hotadd setting will cause Linux to report 128 possible CPUs when enabled, regardless of how many cores have been actually assigned to the virtual machine.
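
    A getProcessorCount() implementation along those lines could look like the following sketch, which parses the inclusive CPU range from that file (the helper in the companion repository may differ):

    #include <cstddef>
    #include <cstdlib>
    #include <fstream>
    #include <string>

    // Returns the number of possible CPUs by parsing a range such as "0-127"
    // (or a single "0") from /sys/devices/system/cpu/possible
    std::size_t getProcessorCount() {
      std::fstream possible_cpus("/sys/devices/system/cpu/possible", std::ios::in);
      if (!possible_cpus) {
        return 1U;
      }

      std::string range;
      std::getline(possible_cpus, range);

      auto separator = range.find('-');
      auto last_cpu =
          (separator == std::string::npos) ? range : range.substr(separator + 1);

      return static_cast<std::size_t>(
                 std::strtoul(last_cpu.c_str(), nullptr, 10)) + 1U;
    }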

    The full sample code can be found in the 02-syscall_counter folder.

    One interesting experiment is to attach this program to the system call tracepoint used by the chmod command line tool to update file modes. The strace debugging utility can help determine which syscall is being used. In this case we are going to be monitoring the following tracepoint: syscalls/sys_enter_fchmodat.

    The taskset command can be used to force the fchmodat syscall to be called from a specific processor (the argument is a CPU affinity mask, so 1 selects the first CPU and 2 selects the second):

    taskset 1 chmod /path/to/file # mask 0x1 -> first CPU
    taskset 2 chmod /path/to/file # mask 0x2 -> second CPU
    

    Using perf event outputs

    Maps can be a really powerful way to store data for later processing, but it’s impossible for user mode programs to know when and where new data is available for reading.

    Perf event outputs can help solve this problem, since they enable the program to be notified whenever new data is available. Additionally, since they behave like a circular buffer, we do not have the same size limitations we have when setting map values.

    In this section, we’ll build an application that can measure how much time it takes to handle a system call. To make this work, we’ll attach a program to both the entry and exit points of a tracepoint to gather timestamps.

    Initialization

    Before we start creating our perf output, we have to create a structure to hold our resources. In total, we’ll have a file descriptor for the map and then a perf output per processor, along with its own memory mapping.

    struct PerfEventArray final {
      int fd;
    
      std::vector<int> output_fd_list;
      std::vector<void *> mapped_memory_pointers;
    };
    

    To initialize it, we have to create a BPF map of the type PERF_EVENT_ARRAY first. This special data structure maps a specific CPU index to a private perf event output specified as a file descriptor. For it to function properly, we must use the following parameters when creating the map:

    1. Key size must be set to 4 bytes (CPU index).
    2. Value size must be set to 4 bytes (size of a file descriptor specified with an int).
    3. Entry count must be set to a value greater than or equal to the number of processors.

    auto processor_count = getProcessorCount();
    
    // Create the perf event array map
    obj.fd = createMap(BPF_MAP_TYPE_PERF_EVENT_ARRAY, 4U, 4U, processor_count);
    if (obj.fd < 0) {
      return false;
    }
    

    When we looked at maps in the previous sections, we only focused on reading. For the next steps we also need to write new values, so let’s take a look at how to set keys.

    ReadMapError setMapKey(std::vector<std::uint8_t> &value, int map_fd,
                           const void *key) {
    
      union bpf_attr attr = {};
      attr.flags = BPF_ANY; // Always set the value
      attr.map_fd = static_cast<__u32>(map_fd);
    
      // Use memcpy to avoid strict aliasing issues
      std::memcpy(&attr.key, &key, sizeof(attr.key));
    
      auto value_ptr = value.data();
      std::memcpy(&attr.value, &value_ptr, sizeof(attr.value));
    
      auto err = ::syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
    
      if (err < 0) {
        return ReadMapError::Failed;
      }
    
      return ReadMapError::Succeeded;
    }
    

    This is not too different from how we read map values, but this time we don’t have to deal with the chance that the key may not be present. As always when dealing with per-CPU maps, the data pointer should be considered as an array containing one value per CPU.

    The next step is to create a perf event output for each online processor with the perf_event_open system call, using the special PERF_COUNT_SW_BPF_OUTPUT config value.

    struct perf_event_attr attr {};
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_BPF_OUTPUT;
    attr.sample_period = 1;
    attr.sample_type = PERF_SAMPLE_RAW;
    attr.wakeup_events = 1;
    
    std::uint32_t processor_index;
    for (processor_index = 0U; processor_index < processor_count;
         ++processor_index) {
    
      // clang-format off
      auto perf_event_fd = ::syscall(
        __NR_perf_event_open,
        &attr,
        -1,               // Process ID (unused)
        processor_index,  // 0 -> getProcessorCount()
        -1,               // Group ID (unused)
        0                 // Flags (unused)
      );
      // clang-format on
    
      if (perf_event_fd == -1) {
        return false;
      }
    
      obj.output_fd_list.push_back(static_cast<int>(perf_event_fd));
    }
    

    Now that we have the file descriptors, we can populate the perf event array map we created:

    // Set the perf event output file descriptors inside the map
    processor_index = 0U;
    
    for (auto perf_event_fd : obj.output_fd_list) {
      std::vector<std::uint8_t> value(4);
      std::memcpy(value.data(), &perf_event_fd, sizeof(perf_event_fd));
    
      auto err = setMapKey(value, obj.fd, &processor_index);
      if (err != ReadMapError::Succeeded) {
        return false;
      }
    
      ++processor_index;
    }
    

    Finally, we create a memory mapping for each perf output:

    // Create a memory mapping for each output
    auto size = static_cast<std::size_t>(1 + std::pow(2, page_count));
    size *= static_cast<std::size_t>(getpagesize());
    
    for (auto &perf_event_fd : obj.output_fd_list) {
      auto ptr = mmap(nullptr,                // Desired base address (unused)
                      size,                   // Mapped memory size
                      PROT_READ | PROT_WRITE, // Memory protection
                      MAP_SHARED,             // Flags
                      perf_event_fd,          // The perf output handle
                      0                       // Offset (unused)
      );
    
      if (ptr == MAP_FAILED) {
        return false;
      }
    
      obj.mapped_memory_pointers.push_back(ptr);
    }
    

    This is the memory we’ll read from when capturing the BPF program output; the first page of each mapping holds the perf_event_mmap_page header, and the remaining 2^page_count pages hold the ring buffer data.

    Writing a BPF program to profile system calls

    Now that we have a file descriptor of the perf event array map, we can use it from within the BPF code to send data with the bpf_perf_event_output helper function. Here’s the prototype from linux/bpf.h:

     * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
     * 	Description
     * 		Write raw *data* blob into a special BPF perf event held by
     * 		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
     * 		event must have the following attributes: **PERF_SAMPLE_RAW**
     * 		as **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
     * 		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
     *
     * 		The *flags* are used to indicate the index in *map* for which
     * 		the value must be put, masked with **BPF_F_INDEX_MASK**.
     * 		Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
     * 		to indicate that the index of the current CPU core should be
     * 		used.
     *
     * 		The value to write, of *size*, is passed through eBPF stack and
     * 		pointed by *data*.
     *
     * 		The context of the program *ctx* needs also be passed to the
     * 		helper.
    

    The ctx parameter must always be set to the value of the first argument received in the entry point function of the BPF program.

    The map address is obtained with the LLVM pseudo intrinsic that we imported in the previous section. Data and size are self-explanatory, but it is important to remember that the memory pointer must reside inside the BPF program (i.e., we can’t pass a user pointer).

    The last parameter, flags, can be used as a CPU index mask to select the perf event output this data should be sent to. A special value can be passed to ask the BPF VM to automatically use the index of the processor we’re running on.

    // Sends the specified buffer to the map_fd perf event output
    llvm::Value *bpfPerfEventOutput(llvm::IRBuilder<> &builder, llvm::Value *ctx,
                                    int map_fd, std::uint64_t flags,
                                    llvm::Value *data, llvm::Value *size) {
    
      // clang-format off
      std::vector<llvm::Type *> argument_type_list = {
        // Context
        ctx->getType(),
    
        // Map address
        builder.getInt8PtrTy(),
    
        // Flags
        builder.getInt64Ty(),
    
        // Data pointer
        data->getType(),
    
        // Size
        builder.getInt64Ty()
      };
      // clang-format on
    
      auto function_type =
          llvm::FunctionType::get(builder.getInt32Ty(), argument_type_list, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_perf_event_output),
                                 llvm::PointerType::getUnqual(function_type));
    
      auto map_address = mapAddressFromFileDescriptor(map_fd, builder);
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(
          function_callee, {ctx, map_address, builder.getInt64(flags), data, size});
    }
    

    The file descriptor and flags parameters are most likely known at compile time, so we can make the function a little more user friendly by accepting integer types. The buffer size, however, is often determined at runtime, so it’s best to use an llvm::Value pointer.

    While it’s possible to just send the raw timestamps whenever we enter and leave the system call of our choice, it’s much easier and more efficient to compute what we need directly inside the BPF code. To do this we’ll use a hash map shared across two different BPF programs: one for the sys_enter event, and another one for the sys_exit.

    From the enter program, we’ll save the system timestamp in the map. When the exit program is invoked, we’ll retrieve it and use it to determine how much time it took. The resulting value is then sent to the user mode program using the perf output.

    Creating the map is easy, and we can re-use the map helpers we wrote in the previous sections. Both the timestamp and the map key are 64-bit values, so we’ll use 8 bytes for both:

    auto map_fd = createMap(BPF_MAP_TYPE_HASH, 8U, 8U, 100U);
    if (map_fd < 0) {
      std::cerr << "Failed to create the map\n";
      return 1;
    }
    

    Writing the enter program

    We will need to generate a key for our map. A combination of the process ID and thread ID is a good candidate for this:

     * u64 bpf_get_current_pid_tgid(void)
     * 	Return
     * 		A 64-bit integer containing the current tgid and pid, and
     * 		created as such:
     * 		*current_task*\ **->tgid << 32 \|**
     * 		*current_task*\ **->pid**.
    

    Then the system timestamp needs to be acquired. Even though the ktime_get_ns helper function counts time from boot rather than wall-clock time, that’s fine for our purposes, since we only use it to compute the execution time.

     * u64 bpf_ktime_get_ns(void)
     *  Description
     *    Return the time elapsed since system boot, in nanoseconds.
     *  Return
     *    Current *ktime*.
    

    By now you should be well versed in importing them, so here are the two definitions:

    // Returns a 64-bit integer that contains both the process and thread id
    llvm::Value *bpfGetCurrentPidTgid(llvm::IRBuilder<> &builder) {
      auto function_type = llvm::FunctionType::get(builder.getInt64Ty(), {}, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_get_current_pid_tgid),
                                 llvm::PointerType::getUnqual(function_type));
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(function_callee, {});
    }
    
    // Returns the amount of nanoseconds elapsed from system boot
    llvm::Value *bpfKtimeGetNs(llvm::IRBuilder<> &builder) {
      auto function_type = llvm::FunctionType::get(builder.getInt64Ty(), {}, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_ktime_get_ns),
                                 llvm::PointerType::getUnqual(function_type));
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(function_callee, {});
    }
    

    We can now use the newly defined functions to generate a map key and acquire the system timestamp:

    // Map keys and values are passed by pointer; create two buffers on the
    // stack and initialize them
    auto map_key_buffer = builder.CreateAlloca(builder.getInt64Ty());
    auto timestamp_buffer = builder.CreateAlloca(builder.getInt64Ty());
    
    auto current_pid_tgid = bpfGetCurrentPidTgid(builder);
    builder.CreateStore(current_pid_tgid, map_key_buffer);
    
    auto timestamp = bpfKtimeGetNs(builder);
    builder.CreateStore(timestamp, timestamp_buffer);
    

    For this program we have replaced the array map we used in the previous sections with a hash map. We can no longer rely on the bpf_map_lookup_elem() helper alone to obtain a slot to write into, since the lookup will fail with ENOENT if the element does not exist yet.

    To fix this, we have to import a new helper named bpf_map_update_elem():

    * int bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
    * 	Description
    * 		Add or update the value of the entry associated to *key* in
    * 		*map* with *value*. *flags* is one of:
    *
    * 		**BPF_NOEXIST**
    * 			The entry for *key* must not exist in the map.
    * 		**BPF_EXIST**
    * 			The entry for *key* must already exist in the map.
    * 		**BPF_ANY**
    * 			No condition on the existence of the entry for *key*.
    *
    * 		Flag value **BPF_NOEXIST** cannot be used for maps of types
    * 		**BPF_MAP_TYPE_ARRAY** or **BPF_MAP_TYPE_PERCPU_ARRAY**  (all
    * 		elements always exist), the helper would return an error.
    * 	Return
    * 		0 on success, or a negative error in case of failure.
    

    We’ll keep the map file descriptor and flag values as integers, since we know their values before the module is compiled.

    // Updates the value of the specified key inside the map_fd BPF map
    llvm::Value *bpfMapUpdateElem(llvm::IRBuilder<> &builder, int map_fd,
                                  llvm::Value *key, llvm::Value *value,
                                  std::uint64_t flags) {
    
      // clang-format off
      std::vector<llvm::Type *> argument_type_list = {
        // Map address
        builder.getInt8PtrTy(),
    
        // Key
        key->getType(),
    
        // Value
        value->getType(),
    
        // Flags
        builder.getInt64Ty()
      };
      // clang-format on
    
      auto function_type =
          llvm::FunctionType::get(builder.getInt64Ty(), argument_type_list, false);
    
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_map_update_elem),
                                 llvm::PointerType::getUnqual(function_type));
    
      auto map_address = mapAddressFromFileDescriptor(map_fd, builder);
    
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
    
      return builder.CreateCall(function_callee,
                                {map_address, key, value, builder.getInt64(flags)});
    }
    

    We can now store the timestamp inside the map and close the enter program:

    // Save the timestamp inside the map
    bpfMapUpdateElem(builder, map_fd, map_key_buffer, timestamp_buffer, BPF_ANY);
    builder.CreateRet(builder.getInt64(0));
    

    Writing the exit program

    In this program, we’ll retrieve the timestamp we stored and use it to measure how much time we’ve spent inside the system call. Once we have the result, we’ll send it to user mode using the perf output.

    When creating the llvm::Function for this program, we must define at least one argument. This value will be required later for the ctx parameter that we have to pass to the bpf_perf_event_output() helper.

    First, we have to acquire the map entry; as always, we must check for any possible error or the verifier will not let us load our program.

    // Create the entry basic block
    auto entry_bb = llvm::BasicBlock::Create(context, "entry", function);
    builder.SetInsertPoint(entry_bb);
    
    // Map keys are passed by pointer; create a buffer on the stack and initialize
    // it
    auto map_key_buffer = builder.CreateAlloca(builder.getInt64Ty());
    auto current_pid_tgid = bpfGetCurrentPidTgid(builder);
    builder.CreateStore(current_pid_tgid, map_key_buffer);
    
    // Check the pointer and make sure the lookup has succeeded; this is
    // mandatory, or the BPF verifier will refuse to load our program
    auto timestamp_ptr =
        bpfMapLookupElem(builder, map_key_buffer, builder.getInt64Ty(), map_fd);
    
    auto null_ptr = llvm::Constant::getNullValue(timestamp_ptr->getType());
    auto cond = builder.CreateICmpEQ(null_ptr, timestamp_ptr);
    
    auto error_bb = llvm::BasicBlock::Create(context, "error", function);
    auto continue_bb = llvm::BasicBlock::Create(context, "continue", function);
    
    builder.CreateCondBr(cond, error_bb, continue_bb);
    
    // Terminate the program if the pointer is not valid
    builder.SetInsertPoint(error_bb);
    builder.CreateRet(builder.getInt64(0));
    
    // In this new basic block, the pointer is valid
    builder.SetInsertPoint(continue_bb);
    

    Next, we want to read our previous timestamp and subtract it from the current time:

    // Read back the old timestamp and obtain the current one
    auto enter_timestamp = builder.CreateLoad(timestamp_ptr);
    auto exit_timestamp = bpfKtimeGetNs(builder);
    
    // Measure how much it took to go from the first instruction to the return
    auto time_consumed = builder.CreateSub(exit_timestamp, enter_timestamp);
    

    The bpf_perf_event_output() helper expects a pointer to a buffer, so we have to store our result somewhere in memory. We can re-use the map value address we already obtained, avoiding any additional allocation on the very limited BPF stack:

    builder.CreateStore(time_consumed, timestamp_ptr);
    

    Remember, we have to pass the first program argument to the ctx parameter; the arg_begin method of an llvm::Function returns exactly that:

    // Send the result to the perf event array
    auto ctx = function->arg_begin();
    bpfPerfEventOutput(builder, ctx, perf_fd, static_cast<std::uint32_t>(-1UL),
                       timestamp_ptr, builder.getInt64(8U));
    

    Using -1UL as the flag value means that BPF will automatically send this data to the perf event output associated with the CPU we’re running on.

    Reading data from the perf outputs

    In our user-mode program, we can access the perf buffers through the memory mappings we created. The list of perf event output descriptors can be passed to the poll() function as an array of pollfd structures. When one of the descriptors we registered becomes readable, the corresponding memory mapping will contain the data sent by the BPF program.

    // Uses poll() to wait for the next event happening on the perf event output
    bool waitForPerfData(std::vector<std::size_t> &readable_outputs,
                         const PerfEventArray &obj, int timeout) {
    
      readable_outputs = {};
    
      // Collect all the perf event output file descriptors inside a
      // pollfd structure
      std::vector<pollfd> poll_fd_list;
      for (auto fd : obj.output_fd_list) {
        struct pollfd poll_fd = {};
        poll_fd.fd = fd;
        poll_fd.events = POLLIN;
    
        poll_fd_list.push_back(std::move(poll_fd));
      }
    
      // Use poll() to determine which outputs are readable
      auto err = ::poll(poll_fd_list.data(), poll_fd_list.size(), timeout);
      if (err < 0) {
        if (errno == EINTR) {
          return true;
        }
    
        return false;
    
      } else if (err == 0) {
        return true;
      }
    
      // Save the index of the outputs that can be read inside the vector
      for (auto it = poll_fd_list.begin(); it != poll_fd_list.end(); ++it) {
        auto ready = ((it->revents & POLLIN) != 0);
    
        if (ready) {
          auto index = static_cast<std::size_t>(it - poll_fd_list.begin());
          readable_outputs.push_back(index);
        }
      }
    
      return true;
    }
    

    Inside the memory we have mapped, the perf_event_mmap_page header will describe the properties and boundaries of the allocated circular buffer.

    The structure is too big to be reported here, but the most important fields are:

    __u64  data_head;   /* head in the data section */
    __u64  data_tail;   /* user-space written tail */
    __u64  data_offset; /* where the buffer starts */
    __u64  data_size;   /* data buffer size */
    

    The base of the data area is located at offset data_offset inside the mapping; to find the start of the unread data, we add data_tail to that base, making sure to wrap around whenever we exceed the allocation size specified by the data_size field:

    buffer_start = mapped_memory + data_offset + (data_tail % data_size)

    Similarly, the data_head field can be used to find the end of the buffer:

    buffer_end = mapped_memory + data_offset + (data_head % data_size)

    If the end of the buffer is at a lower offset compared to the start, then data is wrapping at the data_size edge and the read has to happen with two operations.

    When extracting data, the program is expected to confirm the read by advancing the data_tail value by the number of bytes it processed, while the kernel advances the data_head field automatically as new bytes are received. Data is lost when the data_head offset wraps around and crosses data_tail; a special record inside the buffer (described below) warns the program when this happens.

    Program data is packaged inside the data we have just extracted, preceded by two headers. The first one is the perf_event_header structure:

    struct perf_event_header {
      u32 type;
      u16 misc;
      u16 size;
    };
    

    The second one is an additional 32-bit size field that accounts for itself and the data that follows. Multiple consecutive writes from the BPF program may be coalesced under the same object. Data is, however, grouped by type, which can be used to determine what kind of record to expect after the header. When using BPF, we’ll only have to deal with either our own data or a notification of type PERF_RECORD_LOST, which is used to inform the program that a bpf_perf_event_output() call has overwritten data in the ring buffer before we had a chance to read it.

    Here’s some annotated code that shows how the whole procedure works:

    using PerfBuffer = std::vector<std::uint8_t>;
    using PerfBufferList = std::vector<PerfBuffer>;
    
    // Reads from the specified perf event array, appending new bytes to the
    // perf_buffer_context. When a new complete buffer is found, it is moved
    // inside the 'data' vector
    bool readPerfEventArray(PerfBufferList &data,
                            PerfBufferList &perf_buffer_context,
                            const PerfEventArray &obj, int timeout) {
    
      // Keep track of the offsets we are interested in to avoid
      // strict aliasing issues
      static const auto kDataOffsetPos{
          offsetof(struct perf_event_mmap_page, data_offset)};
    
      static const auto kDataSizePos{
          offsetof(struct perf_event_mmap_page, data_size)};
    
      static const auto kDataTailPos{
          offsetof(struct perf_event_mmap_page, data_tail)};
    
      static const auto kDataHeadPos{
          offsetof(struct perf_event_mmap_page, data_head)};
    
      data = {};
    
      if (perf_buffer_context.empty()) {
        auto processor_count = getProcessorCount();
        perf_buffer_context.resize(processor_count);
      }
    
      // Use poll() to determine which perf event outputs are readable
      std::vector<std::size_t> readable_outputs;
      if (!waitForPerfData(readable_outputs, obj, timeout)) {
        return false;
      }
    
      for (auto perf_output_index : readable_outputs) {
        // Read the static header fields
        auto perf_memory = static_cast<std::uint8_t *>(
            obj.mapped_memory_pointers.at(perf_output_index));
    
        std::uint64_t data_offset{};
        std::memcpy(&data_offset, perf_memory + kDataOffsetPos, 8U);
    
        std::uint64_t data_size{};
        std::memcpy(&data_size, perf_memory + kDataSizePos, 8U);
    
        auto edge = perf_memory + data_offset + data_size;
    
        for (;;) {
          // Read the dynamic header fields
          std::uint64_t data_head{};
          std::memcpy(&data_head, perf_memory + kDataHeadPos, 8U);
    
          std::uint64_t data_tail{};
          std::memcpy(&data_tail, perf_memory + kDataTailPos, 8U);
    
          if (data_head == data_tail) {
            break;
          }
    
          // Determine where the buffer starts and where it ends, taking into
          // account the fact that it may wrap around
          auto start = perf_memory + data_offset + (data_tail % data_size);
          auto end = perf_memory + data_offset + (data_head % data_size);
    
          auto byte_count = data_head - data_tail;
          auto read_buffer = PerfBuffer(byte_count);
    
          if (end < start) {
            auto bytes_until_wrap = static_cast<std::size_t>(edge - start);
            std::memcpy(read_buffer.data(), start, bytes_until_wrap);
    
            auto remaining_bytes =
                static_cast<std::size_t>(end - (perf_memory + data_offset));
    
            std::memcpy(read_buffer.data() + bytes_until_wrap,
                        perf_memory + data_offset, remaining_bytes);
    
          } else {
            std::memcpy(read_buffer.data(), start, byte_count);
          }
    
          // Append the new data to our perf buffer
          auto &perf_buffer = perf_buffer_context[perf_output_index];
    
          auto insert_point = perf_buffer.size();
          perf_buffer.resize(insert_point + read_buffer.size());
    
          std::memcpy(&perf_buffer[insert_point], read_buffer.data(),
                      read_buffer.size());
    
          // Confirm the read
          std::memcpy(perf_memory + kDataTailPos, &data_head, 8U);
        }
      }
    
      // Extract the data from the buffers we have collected
      for (auto &perf_buffer : perf_buffer_context) {
        // Get the base header
        struct perf_event_header header = {};
        if (perf_buffer.size() < sizeof(header)) {
          continue;
        }
    
        std::memcpy(&header, perf_buffer.data(), sizeof(header));
        if (header.size > perf_buffer.size()) {
          continue;
        }
    
        if (header.type == PERF_RECORD_LOST) {
          std::cout << "One or more records have been lost\n";
    
        } else {
          // Determine the buffer boundaries
          auto buffer_ptr = perf_buffer.data() + sizeof(header);
          auto buffer_end = perf_buffer.data() + header.size;
    
          for (;;) {
            if (buffer_ptr + 4U >= buffer_end) {
              break;
            }
    
            // Note: this is data_size itself + bytes used for the data
            std::uint32_t data_size = {};
            std::memcpy(&data_size, buffer_ptr, 4U);
    
            buffer_ptr += 4U;
            data_size -= 4U;
    
            if (buffer_ptr + data_size >= buffer_end) {
              break;
            }
    
            auto program_data = PerfBuffer(data_size);
            std::memcpy(program_data.data(), buffer_ptr, data_size);
            data.push_back(std::move(program_data));
    
            buffer_ptr += 8U;
            data_size -= 8U;
          }
        }
    
        // Erase the chunk we consumed from the buffer
        perf_buffer.erase(perf_buffer.begin(), perf_buffer.begin() + header.size);
      }
    
      return true;
    }
    

    Writing the main function

    While it is entirely possible (and sometimes useful, in order to share types) to use a single LLVM module and context for both the enter and exit programs, we will create two different modules to avoid changing the previous sample code we’ve built.

    The program generation goes through the usual steps, but now we are loading two programs instead of one, so the previous code has been updated to reflect that.

    The new and interesting part is the main loop where the perf event output data is read and processed:

    // Incoming data is appended here
    PerfBufferList perf_buffer;
    
    std::uint64_t total_time_used{};
    std::uint64_t sample_count{};
    
    std::cout << "Tracing average time used to service the following syscall: "
              << kSyscallName << "\n";
    
    std::cout << "Collecting samples for 10 seconds...\n";
    
    auto start_time = std::chrono::system_clock::now();
    
    for (;;) {
      // Data that is ready for processing is moved inside here
      PerfBufferList data;
      if (!readPerfEventArray(data, perf_buffer, perf_event_array, 1)) {
        std::cerr << "Failed to read from the perf event array\n";
        return 1;
      }
    
      // Inspect the buffers we have received
      for (const auto &buffer : data) {
        if (buffer.size() != 8U) {
          std::cout << "Unexpected buffer size: " << buffer.size() << "\n";
          continue;
        }
    
        // Read each sample and update the counters; use memcpy to avoid
        // strict aliasing issues
        std::uint64_t time_used{};
        std::memcpy(&time_used, buffer.data(), 8U);
    
        total_time_used += time_used;
        ++sample_count;
    
        std::cout << time_used << "ns\n";
      }
    
      // Exit after 10 seconds
      auto elapsed_msecs = std::chrono::duration_cast<std::chrono::milliseconds>(
                               std::chrono::system_clock::now() - start_time)
                               .count();
    
      if (elapsed_msecs > 10000) {
        break;
      }
    }
    
    // Print a summary of the data we have collected
    std::cout << "Total time used: " << total_time_used << " nsecs\n";
    std::cout << "Sample count: " << sample_count << "\n";
    std::cout << "Average: " << (total_time_used / sample_count) << " nsecs\n";
    

    The full source code can be found in the 03-syscall_profiler folder.

    Running the sample program as root should print something similar to the following output:

    Tracing average time used to service the following syscall: fchmodat
    Collecting samples for 10 seconds...
    178676ns
    72886ns
    80481ns
    147897ns
    171152ns
    80803ns
    69208ns
    75273ns
    76981ns
    Total time used: 953357 nsecs
    Sample count: 9
    Average: 105928 nsecs
    

    Writing a BPF program to do ANYTHING

    BPF is in active development and is becoming more and more useful with each update, enabling new use cases that extend the original vision. Recently added BPF functionality allowed us to write a simple system-wide syscall fault injector using nothing but BPF and a kernel that supports the required bpf_override_return helper.

    If you want to keep up with how this technology evolves, one of the best places to start is Brendan Gregg’s blog. The IO Visor Project repository also contains a ton of code and documentation that is extremely useful if you plan on writing your own BPF-powered tools.

    Want to integrate BPF into your products? We can help! Contact us today, and check out our ebpfpub library.

  • PrivacyRaven: Implementing a proof of concept for model inversion

    By Philip Wang, Intern
    Originally published August 3, 2021

    During my Trail of Bits winternship and springternship, I had the pleasure of working with Suha Hussain and Jim Miller on PrivacyRaven, a Python-based tool for testing deep-learning frameworks against a plethora of privacy attacks. I worked on improving PrivacyRaven’s versatility by adding compatibility for services such as Google Colab and expanding its privacy attack and assurance functionalities.

    What is PrivacyRaven?

    PrivacyRaven is a machine-learning assurance and research tool that simulates privacy attacks against trained machine-learning models. It supports model extraction and label-only membership inference attacks, with support for model inversion currently in development.

    In a model extraction attack, a user attempts to steal or extract a trained deep-learning model that outputs either a probability vector corresponding to each class or a simple classification. For instance, consider a classifier that detects particular emotions in human faces. This classifier will return either a vector specifying the likelihood of each emotion or simply the most likely emotion as its classification. Importantly, the user will have only black-box query access to the classifier and will not receive any other information about it.

    To perform a model extraction attack, the user first queries the model with random unlabeled data to identify all of the classifications returned by the model. This information can then be used to approximate the target classifier. Specifically, the attacker uses a public data source to obtain synthetic data (i.e., similar data such as a dataset of facial images and emotion classifications) and trains a substitute fixed-architecture model on that data.

    If successful, such an attack could have drastic consequences, especially for services that hide their actual models behind a paid API; for example, an attacker, for a low cost, could approximate a service’s model with a small decrease in accuracy, thereby gaining a massive economic advantage over the victim service. Since PrivacyRaven operates under the most restrictive threat model, simulating attacks in which the user has only query access, model extraction is also a critical component of PrivacyRaven’s other attacks (i.e., membership inference attacks).

    I worked on implementing support for a model inversion attack aimed at recovering private data used to train a target model. The model inversion attack uses model extraction to secure white-box access to a substitute classifier that faithfully approximates the target.

    What is model inversion?

    In a model inversion attack, a malicious user targets a classifier that predicts a vector of confidence values for each class; the user then attempts to recover the classifier’s training data, thereby compromising the privacy of that data.

    It turns out that a user with background knowledge about the model may actually be able to obtain a reasonable approximation of the target model’s training data. (For an example of such an approximation, see the image recovered from a facial recognition system in the figure below.) This is the core idea of several papers, including “Adversarial Neural Network Inversion via Auxiliary Knowledge Alignment,” by Ziqi Yang et al., which formed the basis of my implementation of model inversion.

    Model inversion attacks based on background knowledge alignment have plenty of use cases and consequences in the real world. Imagine that a malicious user targeted the aforementioned facial emotion classifier. The user could construct his or her own dataset by scraping relevant images from a search engine, run the images through the classifier to obtain their corresponding confidence values, and construct probability vectors from the values; the user could then train an inversion model capable of reconstructing approximations of the images from the given vectors.

    An inversion model is designed to be the inverse of the target model (hence the use of “inversion”). Instead of inputting images and receiving an emotion classification, the attacker can supply arbitrary emotion-prediction vectors and obtain reconstructed images from the training set.

    Using the above example, let’s walk through how the user would run a model inversion attack in greater detail. Assume that the user has access to an emotion classifier that outputs confidence values for some of its classes and knows that the classifier was trained on images of faces; the user therefore has background knowledge on the classifier.

    The user, via this background knowledge on the model, creates an auxiliary dataset by scraping images of faces from public sites and processing them. The user also chooses an inversion model architecture capable of upscaling the constructed prediction vectors to a “reconstructed” image.

    To train the inversion model, the user queries the classifier with each image in the auxiliary dataset to obtain the classifier’s confidence values for the images. That information is used to construct a prediction vector. Since this classifier might output only the top confidence values for the input, the inversion process assumes that it does in fact truncate the prediction vectors and that the rest of the entries are zeroed out. For example, if the classifier is trained on 5 emotions but outputs only the top 2, with confidence values of 0.5 and 0.3, the user will be able to construct the vector (0.5, 0, 0.3, 0, 0) from those values.
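
    As a concrete illustration, here is a minimal Python sketch of how such a zero-padded prediction vector could be assembled; build_prediction_vector is a hypothetical helper written for this post, not part of PrivacyRaven’s API:

    # Hypothetical helper (not PrivacyRaven's API): expand a truncated top-k
    # response from the target classifier into a full, zero-padded prediction vector.
    def build_prediction_vector(top_k, num_classes):
        """top_k maps class index -> confidence, e.g., {0: 0.5, 2: 0.3}."""
        vector = [0.0] * num_classes
        for class_index, confidence in top_k.items():
            vector[class_index] = confidence
        return vector
    
    # A classifier trained on 5 emotions that reports only its top 2 scores:
    print(build_prediction_vector({0: 0.5, 2: 0.3}, 5))  # [0.5, 0.0, 0.3, 0.0, 0.0]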

    The user can then input the prediction vector into the inversion model, which will upscale the vector to an image. As a training objective, the user would like to minimize the inversion model’s mean squared error (MSE) loss function, which is calculated pixelwise between images in the auxiliary set and their reconstructions outputted by the model; the user then repeats this training process for many epochs.

    An MSE close to 0 means that the reconstructed image is a sound approximation of the ground truth, and an MSE of 0 means that the reconstructed and ground truth images are identical. Once the model has been sufficiently trained, the user can feed prediction vectors into the trained inversion model to obtain a reconstructed data point representative of each class.
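
    A minimal PyTorch-style sketch of this training loop is shown below. It is only an outline of the procedure described above, not PrivacyRaven’s implementation; inversion_model, classifier, and aux_loader are assumed stand-ins for the attacker’s upscaling network, query access to the target, and a DataLoader over the auxiliary images:

    import torch
    
    # Assumed to exist (not PrivacyRaven API): inversion_model, classifier, aux_loader
    criterion = torch.nn.MSELoss()  # pixelwise reconstruction loss
    optimizer = torch.optim.Adam(inversion_model.parameters(), lr=1e-3)
    
    for epoch in range(300):
        for images, _ in aux_loader:
            with torch.no_grad():
                # Query the target; in the black-box setting these would be the
                # zero-padded prediction vectors built from truncated responses
                prediction_vectors = classifier(images)
    
            reconstructions = inversion_model(prediction_vectors)
            loss = criterion(reconstructions, images)  # compare to ground truth
    
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()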

    Note that the model inversion architecture itself is similar to an autoencoder, as mentioned in the paper. Specifically, the classifier may be thought of as the encoder, and the inversion network, as the decoder, with the prediction vector belonging to the latent space. The key differences are that, in model inversion, the classifier is given and fixed, and the training data of the classifier is not available to train the inversion network.

    My other contributions to PrivacyRaven

    While I focused on implementing model inversion support in PrivacyRaven, in the first few weeks of my internship, I helped improve PrivacyRaven’s documentation and versatility. To better demonstrate some of PrivacyRaven’s capabilities, I added detailed example Python scripts; these include scripts showcasing how to mount attacks and register custom callbacks to obtain more thorough information during attack runs. I also added Docker and Google Colab support, which allows users to containerize and ship attack setups as well as to coordinate and speed up an attack’s development.

    Caveats and room to grow

    PrivacyRaven is designed around usability and is intended to abstract away as much of the tedium of data and neural network engineering as possible. However, model inversion is a relatively fragile attack that depends on fine-tuning certain parameters, including the output dimensionality of both the classifier and inversion model as well as the inversion model’s architecture. Thus, striking a balance between usability and model inversion fidelity proved to be a challenge.

    Another difficulty stemmed from the numerous assumptions that need to be satisfied for a model inversion attack to produce satisfactory results. One assumption is that the user will be able to recover the number of classes that the target classifier was trained on by querying it with numerous data points.

    For example, if the user were trying to determine the number of classes that an object recognition classifier was trained on, the user could query the classifier with a large number of images of random objects and add the classes identified in that process to a set. However, this might not always work if the number of training classes is large; if the user isn’t able to recover all of the classes, the quality of the inversion will likely suffer, though the paper by Yang et al. doesn’t analyze the extent of the impact on inversion quality.
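
    In code, that probing step could look roughly like the following sketch, where classifier and random_probe_images are hypothetical stand-ins for the target model and the attacker’s probe inputs:

    # Hypothetical sketch: estimate how many classes a black-box classifier was
    # trained on by recording the distinct labels it returns for random probes.
    observed_classes = set()
    for image in random_probe_images:
        label = classifier(image).argmax().item()  # top predicted class index
        observed_classes.add(label)
    
    print(f"Recovered {len(observed_classes)} candidate classes")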

    In addition, Yang et al. were not explicit about the reasoning behind the design of their classifier and inversion model architectures. When conducting their experiments, the authors used the CelebA and MNIST datasets and resized the images in them. They also used two separate inversion architectures for the datasets, with the CelebA inversion architecture upscaling from a prediction vector of length 530 to a 64 x 64 image and the MNIST inversion architecture upscaling from a prediction vector of length 10 to a 32 x 32 image. As you can imagine, generalizing this attack such that it can be used against arbitrary classifiers is difficult, as the optimal inversion architecture changes for each classifier.

    Finally, the authors focused on model inversion in a white-box scenario, which isn’t directly adaptable to PrivacyRaven’s black-box-only threat model. As previously mentioned, PrivacyRaven assumes that the user has no knowledge about the classifier beyond its output; while the general model inversion process remains largely the same, a black-box scenario requires the user to make many more assumptions, particularly on the dimensions of the training data and the classifier’s output. Each additional assumption about dimensionality needs to be considered and addressed, and this inherent need for customization makes designing a one-size-fits-all API for model inversion very difficult.

    Next steps

    PrivacyRaven does not yet have a fully stable API for model inversion, but I have completed a proof-of-concept implementation of the paper. Some design decisions for the model inversion’s API still need to mature, but the plan is for the API to support both white-box and black-box model inversion attacks and to make the model inversion and extraction parameters as customizable as possible without sacrificing usability. I believe that, with this working proof of concept of model inversion, the development of the model inversion API should be a relatively smooth process.

    Inversion results

    The inversion results produced by the proof of concept are displayed below. This attack queries an MNIST-trained victim classifier with extended-MNIST (EMNIST) data to train a substitute model that can then be used to perform a white-box inversion attack. The model inversion-training process was run for 300 epochs with batch sizes of 100, producing a final MSE loss of 0.706. The inversion quality changed substantially depending on the orientation of the auxiliary set images. The images on the left are samples of the auxiliary set images taken from the MNIST set, with their labels in parentheses; their reconstructions are on the right.

    Images of a 0, for example, tended to have fairly accurate reconstructions:

    Other auxiliary set images had reconstructions that looked similar to them but contradicted their labels. For example, the below images appear to depict a 4 but in reality depict a 2 rotated by 90 degrees.

    Other images also had poor or ambiguous reconstructions, such as the following rotated images of an 8 and 9.

    Overall, these results demonstrate that model inversion is a fragile attack that doesn’t always produce high-quality reconstructions. However, one must also consider that the above inversion attack was conducted using only black-box queries to the classifier. Recall that in PrivacyRaven’s current model inversion pipeline, a model extraction attack is first executed to grant the user access to a white-box approximation of the target classifier. Since information is lost during both model extraction and inversion, the reconstruction quality of a black-box model inversion attack is likely to be significantly worse than that of its white-box counterpart. As such, the inversion model’s ability to produce faithful reconstructions for some images even under the most restrictive assumptions does raise significant privacy concerns for the training data of deep-learning classifiers.

    Takeaways

    I thoroughly enjoyed working on PrivacyRaven, and I appreciate the support and advice that Jim and Suha offered me to get me up to speed. I am also grateful for the opportunity to learn about the intersectionality of machine learning and security, particularly in privacy assurance, and to gain valuable experience with deep-learning frameworks including PyTorch. My experience working in machine-learning assurance kindled within me a newfound interest in the field, and I will definitely be delving further into deep learning and privacy in the future.

    Write Rust lints without forking Clippy

    By Samuel Moelius, Staff Engineer
    Originally published May 20, 2021

    This blog post introduces Dylint, a tool for loading Rust linting rules (or “lints”) from dynamic libraries. Dylint makes it easy for developers to maintain their own personal lint collections.

    Previously, the simplest way to write a new Rust lint was to fork Clippy, Rust’s de facto linting tool. But this approach has drawbacks in how one runs and maintains new lints. Dylint minimizes these distractions so that developers can focus on actually writing lints.

    First, we’ll go over the current state of Rust linting and the workings of Clippy. Then, we’ll explain how Dylint improves upon the status quo and offer some tips on how to begin using it. Skip to the last section if you want to get straight to writing lints.

    Rust linting and Clippy

    Tools like Clippy take advantage of the Rust compiler’s dedicated support for linting. A Rust linter’s core component, called a “driver,” links against an appropriately named library, rustc_driver. By doing so, the driver essentially becomes a wrapper around the Rust compiler.

    To run the linter, one sets the RUSTC_WORKSPACE_WRAPPER environment variable to point to the driver and runs cargo check. Cargo notices that the environment variable has been set and calls the driver instead of calling rustc. When the driver is called, it sets a callback in the Rust compiler’s Config struct. The callback registers some number of lints, which the Rust compiler then runs alongside its built-in lints.

    Clippy performs a few checks to ensure it is enabled but otherwise works in the above manner. (See Figure 1 for Clippy’s architecture.) Although it may not be immediately clear upon installation, Clippy is actually two binaries: a Cargo command and a rustc driver. You can verify this by typing the following:

    which cargo-clippy
    which clippy-driver
    

    Now suppose you want to write your own lints. What should you do? Well, you’ll need a driver to run them, and Clippy has a driver, so forking Clippy seems like a reasonable step to take. But there are drawbacks to this solution, namely in running and maintaining the lints that you’ll develop.

    First, your fork will have its own copies of the two binaries, and it’s a hassle to ensure that they can be found. You’ll have to make sure that at least the Cargo command is in your PATH, and you’ll probably have to rename the binaries so that they won’t interfere with Clippy. Clearly, these steps don’t pose insurmountable problems, but you would probably rather avoid them.

    Second, all lints (including Clippy’s) are built upon unstable compiler APIs. Lints compiled together must use the same version of those APIs. To understand why this is an issue, we’ll refer to clippy_utils—a collection of utilities that the Clippy authors have generously made public. Note that clippy_utils uses the same compiler APIs that lints do, and similarly provides no stability guarantees. (See below.)

    Suppose you have a fork of Clippy to which you want to add a new lint. Clearly, you’ll want your new lint to use the most recent version of clippy_utils. But suppose that version uses compiler version B, while your fork of Clippy uses compiler version A. Then you’ll be faced with a dilemma: Should you use an older version of clippy_utils (one that uses compiler version A), or should you upgrade all of the lints in your fork to use compiler version B? Neither is a desirable choice.

    Dylint addresses both of these problems. First, it provides a single Cargo command, saving you from having to manage multiple such commands. Second, for Dylint, lints are compiled together to produce dynamic libraries. So in the above situation, you could simply store your new lint in a new dynamic library that uses compiler version B. You could use this new library alongside your existing libraries for as long as you’d like and upgrade your existing libraries to the newer compiler version if you so choose.

    Dylint provides an additional benefit related to reusing intermediate compilation results. To understand it, we need to examine how Dylint works.

    How Dylint works

    Like Clippy, Dylint provides a Cargo command. The user specifies to that command the dynamic libraries from which the user wants to load lints. Dylint runs cargo check in a way that ensures the lints are registered before control is handed over to the Rust compiler.

    The lint-registration process is more complicated for Dylint than for Clippy, however. All of Clippy’s lints use the same compiler version, so only one driver is needed. But a Dylint user could choose to load lints from libraries that use different compiler versions.

    Dylint handles such situations by building new drivers on-the-fly as needed. In other words, if a user wants to load lints from a library that uses compiler version A and no driver can be found for compiler version A, Dylint will build a new one. Drivers are cached in the user’s home directory, so they are rebuilt only when necessary.

    This brings us to the additional benefit alluded to in the previous section. Dylint groups libraries by the compiler version they use. Libraries that use the same compiler version are loaded together, and their lints are run together. This allows intermediate compilation results (e.g., symbol resolution, type checking, trait solving, etc.) to be shared among the lints.

    For example, in Figure 2, if libraries U and V both used compiler version A, the libraries would be grouped together. The driver for compiler version A would be invoked only once. The driver would register the lints in libraries U and V before handing control over to the Rust compiler.

    To understand why this approach is beneficial, consider the following. Suppose that lints were stored directly in compiler drivers rather than dynamic libraries, and recall that a driver is essentially a wrapper around the Rust compiler. So if one had two lints in two compiler drivers that used the same compiler version, running those two drivers on the same code would amount to compiling that code twice. By storing lints in dynamic libraries and grouping them by compiler version, Dylint avoids these inefficiencies.

    An application: Project-specific lints

    Did you know that Clippy contains lints whose sole purpose is to lint Clippy’s code? It’s true. Clippy contains lints to check, for example, that every lint has an associated LintPass, that certain Clippy wrapper functions are used instead of the functions they wrap, and that every lint has a non-default description. It wouldn’t make sense to apply these lints to code other than Clippy’s. But there’s no rule that all lints must be general purpose, and Clippy takes advantage of this liberty.

    Dylint similarly includes lints whose primary purpose is to lint Dylint’s code. For example, while developing Dylint, we found ourselves writing code like the following:

    let rustup_toolchain = std::env::var("RUSTUP_TOOLCHAIN")?;
    ...
    std::env::remove_var("RUSTUP_TOOLCHAIN");

    This was bad practice. Why? Because it was only a matter of time until we fat-fingered the string literal:

    std::env::remove_var("RUSTUP_TOOLCHIAN"); // Oops

    A better approach is to use a constant instead of a string literal, as in the below code:

    const RUSTUP_TOOLCHAIN: &str = "RUSTUP_TOOLCHAIN";
    ...
    std::env::remove_var(RUSTUP_TOOLCHAIN);
    

    So while working on Dylint, we wrote a lint to check for this bad practice and to make an appropriate suggestion. We applied (and still apply) that lint to the Dylint source code. The lint is called env_literal and the core of its current implementation is as follows:

    impl<'tcx> LateLintPass<'tcx> for EnvLiteral {
       fn check_expr(&mut self, cx: &LateContext<'tcx>, expr: &Expr<'_>) {
          if_chain! {
             if let ExprKind::Call(callee, args) = expr.kind;
             if is_expr_path_def_path(cx, callee, &REMOVE_VAR)
                || is_expr_path_def_path(cx, callee, &SET_VAR)
                || is_expr_path_def_path(cx, callee, &VAR);
             if !args.is_empty();
             if let ExprKind::Lit(lit) = &args[0].kind;
             if let LitKind::Str(symbol, _) = lit.node;
             let ident = symbol.to_ident_string();
             if is_upper_snake_case(&ident);
             then {
                span_lint_and_help(
                   cx,
                   ENV_LITERAL,
                   args[0].span,
                   "referring to an environment variable with a string literal is error prone",
                   None,
                   &format!("define a constant `{}` and use that instead", ident),
                );
             }
          }
       }
    }

    Here is an example of a warning it could produce:

    warning: referring to an environment variable with a string literal is error prone
     --> src/main.rs:2:27
      |
    2 |     let _ = std::env::var("RUSTFLAGS");
      |                           ^^^^^^^^^^^
      |
      = note: `#[warn(env_literal)]` on by default
      = help: define a constant `RUSTFLAGS` and use that instead
    

    Recall that neither the compiler nor clippy_utils provides stability guarantees for its APIs, so future versions of env_literal may look slightly different. (In fact, a change to clippy_utils’ APIs resulted in a change to env_literal’s implementation while this article was being written!) The current version of env_literal can always be found in the examples directory of the Dylint repository.

    Clippy “lints itself” in a slightly different way than Dylint, however. Clippy’s internal lints are compiled into a version of Clippy with a particular feature enabled. But for Dylint, the env_literal lint is compiled into a dynamic library. Thus, env_literal is not a part of Dylint. It’s essentially input.

    Why is this important? Because you can write custom lints for your project and use Dylint to run them just as Dylint runs its own lints on itself. There’s nothing significant about the source of the lints that Dylint runs on the Dylint repository. Dylint would just as readily run your repository’s lints on your repository.

    The bottom line is this: If you find yourself writing code you do not like and you can detect that code with a lint, Dylint can help you weed out that code and prevent its reintroduction.

    Get to linting

    Install Dylint with the following command:

    cargo install cargo-dylint

    We also recommend installing the dylint-link tool to facilitate linking:

    cargo install dylint-link

    The easiest way to write a Dylint library is to fork the dylint-template repository. The repository produces a loadable library right out of the box. You can verify this as follows:

    git clone https://github.com/trailofbits/dylint-template
    cd dylint-template
    cargo build
    DYLINT_LIBRARY_PATH=$PWD/target/debug cargo dylint fill_me_in --list

    All you have to do is implement the LateLintPass trait and accommodate the symbols asking to be filled in.

    Helpful resources for writing lints include the following:

    • Adding a new lint (targeted at Clippy but still useful)

    Also consider using the clippy_utils crate mentioned above. It includes functions for many low-level tasks, such as looking up symbols and printing diagnostic messages, and makes writing lints significantly easier.

    We owe a sincere thanks to the Clippy authors for making the clippy_utils crate available to the Rust community. We would also like to thank Philipp Krones for providing helpful comments on an earlier version of this post.

    Discovering goroutine leaks with Semgrep

    By Alex Useche, Security Engineer
    Originally published May 10, 2021

    While learning how to write multithreaded code in Java or C++ can make computer science students reconsider their career choices, calling a function asynchronously in Go is just a matter of prefixing a function call with the go keyword. However, writing concurrent Go code can also be risky, as vicious concurrency bugs can slowly sneak into your application. Before you know it, there could be thousands of hanging goroutines slowing down your application, ultimately causing it to crash. This blog post provides a Semgrep rule that can be used in a bug-hunting quest and includes a link to a repository of specialized Semgrep rules that we use in our audits. It also explains how to use one of those rules to find a particularly pesky type of bug in Go: goroutine leaks.

    The technique described in this post is inspired by GCatch, a tool that uses interprocedural analysis and the Z3 solver to detect misuse-of-channel bugs that may lead to hanging goroutines. The technique and development of the tool are particularly exciting because of the lack of research on concurrency bugs caused by the incorrect use of Go-specific structures such as channels.

    Although the process of setting up this sort of tool, running it, and using it in a practical context is inherently complex, it is worthwhile. When we closely analyzed confirmed bugs reported by GCatch, we noticed patterns in their origins. We were then able to use those patterns to discover alternative ways of identifying instances of these bugs. Semgrep, as we will see, is a good tool for this job, given its speed and the ability to easily tweak Semgrep rules.

    Goroutine leaks explained

    Perhaps the best-known concurrency bugs in Go are race conditions, which often result from improper memory aliasing when working with goroutines inside of loops. Goroutine leaks, on the other hand, are also common concurrency bugs but are seldom discussed. This is partially because the consequences of a goroutine leak only become apparent after several of them occur; the leaks begin to affect performance and reliability in a noticeable way.

    Goroutine leaks typically result from the incorrect use of channels to synchronize a message passed between goroutines. This problem often occurs when unbuffered channels are used for logic in cases when buffered channels should be used. This type of bug may cause goroutines to hang in memory and eventually exhaust a system’s resources, resulting in a system crash or a denial-of-service condition.

    Let’s look at a practical example:

    package main
    
    import (
      "fmt"
      "runtime"
      "time"
    )
    
    func main() {
      requestData(1)
      time.Sleep(time.Second * 1)
      fmt.Printf("Number of hanging goroutines: %d", runtime.NumGoroutine() - 1)
    }
    
    func requestData(timeout time.Duration) string {
     dataChan := make(chan string)
    
     go func() {
         newData := requestFromSlowServer()
         dataChan <- newData // block
     }()
     select {
     case result := <- dataChan:
         fmt.Printf("[+] request returned: %s", result)
         return result
     case <- time.After(timeout):
         fmt.Println("[!] request timeout!")
             return ""
     }
    }
    
    func requestFromSlowServer() string {
     time.Sleep(time.Second * 1)
     return "very important data"
    }
    

    In the above code, the channel write operation dataChan <- newData blocks the anonymous goroutine that encloses it; the goroutine remains blocked until a read operation occurs on dataChan. This is because read and write operations on unbuffered channels block goroutines, and every write operation must have a corresponding read operation.

    There are two scenarios that cause anonymous goroutine leaks:

    • If the second case, case <- time.After(timeout), is selected before the read operation in the first case (case result := <- dataChan) occurs, the requestData function will exit, and the anonymous goroutine inside of it will be leaked.
    • If both cases are triggered at the same time, the scheduler will randomly select one of the two cases. If the second case is selected, the anonymous goroutine will be leaked.

    When running the code, you’ll get the following output:

    [!] request timeout!
    Number of hanging goroutines: 1
    Program exited.

    The hanging goroutine is the anonymous goroutine declared inside requestData.

    Using buffered channels would fix the above issue. While reading or writing to an unbuffered channel results in a goroutine block, executing a send (a write) to a buffered channel results in a block only when the channel buffer is full. Similarly, a receive operation will cause a block only when the channel buffer is empty.

    To prevent a goroutine leak, all we need to do is give dataChan a capacity when it is created, which gives us the following:

    func requestData(timeout time.Duration) string {
     dataChan := make(chan string, 1)
    
     go func() {
         newData := requestFromSlowServer()
         dataChan <- newData // block
     }()
    

    After running the updated program, we can confirm that there are no more hanging goroutines.

    [!] request timeout!
    Number of hanging goroutines: 0
    Program exited.
    

    This type of bug may seem minor, but in certain situations it can have serious consequences. For an example, see this PR in the Kubernetes repository, in which the author of the patch experienced an API server crash resulting from a goroutine leak while 1,496 goroutines were running.

    Finding the bug

    The process of debugging concurrency issues is so complex that a tool like Semgrep may seem ill-equipped for it. However, when we closely examined common Go concurrency bugs found in the wild, we identified patterns that we could easily leverage to create Semgrep rules. Those rules enabled us to find even complex bugs of this kind, largely because Go concurrency bugs can often be described by a few sets of simple patterns.

    Before using Semgrep, it is important to recognize the limitations on the types of issues that it can solve. When searching for concurrency bugs, the most significant limitation is Semgrep’s inability to conduct interprocedural analysis. This means that we’ll need to target bugs that are contained within individual functions. This is a manageable problem when working in Go and won’t prevent us from using Semgrep, since Go programmers often rely on anonymous goroutines defined within individual functions.

    Now we can begin to construct our Semgrep rule, basing it on the following typical manifestation of a goroutine leak:

    1. An unbuffered channel, C, of type T is declared.
    2. A write/send operation to channel C is executed in an anonymous goroutine, G.
    3. C is read/received in a select block (or another location outside of G).
    4. The program follows an execution path in which the read operation of C does not occur before the enclosing function is terminated.

    It is the last step that generally causes a goroutine leak.

    Bugs that result from the above conditions tend to follow recognizable patterns in the code, which we can detect using Semgrep. Regardless of the forms that these patterns take, there will be an unbuffered channel declared in the program, which we’ll want to analyze:

    - pattern-inside: |
           $CHANNEL := make(...)
           ...
    

    We’ll also need to exclude instances in which the channel is declared as a buffered channel:

    - pattern-not-inside: |
           $CHANNEL := make(..., $T)
           ...
    

    To detect the goroutine leak from our example, we can use the following pattern:

    - pattern: |
             go func(...){
               ...
               $CHANNEL <- $X
               ...
             }(...)
             ...
             select {
             case ...
             case $Y := <- $CHANNEL:
             ...
             }
    

    This code tells Semgrep to look for a send operation to the unbuffered channel, $CHANNEL, executed inside an anonymous goroutine, as well as a subsequent receive operation inside of a select block. Slight variations in this pattern may occur, so we will need to account for those in our Semgrep rule.

    For instance, the receive expression could use the assignment (=) operator rather than the declaration (:=) operator, which would require a new pattern. We won't go over every possible variation here, but you can skip ahead and view the completed rule if you’d like. The finished rule also includes cases in which there could be more send operations than receive operations for an unbuffered channel, $CHANNEL.

    We should also exclude instances of false positives, such as that from the Moby repository, in which a read operation on the blocking channel prevents the channel from causing a block before exiting the function.

    - pattern-not-inside: |
           ...
           select {
           case ...
           case ...:
             ...
             ... =<- $CHANNEL
             ...
           }
    - pattern-not-inside: |
           ...
           select {
           case ...
           case ...:
             ...
             <-$CHANNEL
             ...
           }
    

    Once we have completed our pattern, we can run it against the code. (Try it out using the Semgrep playground.) Running the pattern from the command line returns the following output:

    $ semgrep --config ./hanging-goroutine.yml
    
    running 1 rules...
    test.go
    severity:warning rule:semgrep.go.hanging-goroutine: Potential goroutine leak due to unbuffered channel send inside loop or unbuffered channel read in select block.
    
    18: go func() {
    19:     newData := requestFromSlowServer()
    20:     dataChan <- newData // block
    21: }()
    22: select {
    23: case result := <-dataChan:
    24:     fmt.Printf("[+] request returned: %s", result)
    25:     return result
    26: case <-time.After(timeout):
    27:     fmt.Println("[!] request timeout!")
    28:     return ""
    

    We ran this pattern against an unpatched release of the Docker codebase and compared the matches with those reported by GCatch and documented on its repository. Our Semgrep rule missed only 5 of the goroutine leak bugs found by GCatch that were reported to the Docker team via PRs. We also used this rule to find bugs in the Kubernetes and Minikube repositories, in amazon-ecs-agent (Amazon Elastic Container Service Agent), and in two open-source Microsoft projects (azure-container-networking and hcsshim), and we submitted patches for them as PRs.

    GCatch uses a technique that is smarter and more sophisticated than the one above. As a result, it can analyze multiple execution paths to find more complex instances of this bug. However, there are advantages to using Semgrep instead of a complex tool:

    • Semgrep can analyze code more quickly, because it focuses on discovering pattern matches rather than conducting taint and data flow analysis.
    • Semgrep rules are very easy to understand and update.
    • The setup is more straightforward and reliable.

    Of course, the drawback is that we miss complex issues, such as cases in which the send operation occurs inside a separate named function (as opposed to an anonymous function). However, Semgrep has experimental support for data flow analysis, taint tracking, and basic cross-function analysis, and we look forward to testing and developing more complicated rules as support for those features continues to mature.

    Finding other types of concurrency bugs

    Our new semgrep-rules repository contains the Semgrep rule for the above bug, as well as other rules we developed and use in our code audits to find Go concurrency bugs. These include Semgrep rules used to catch race conditions in anonymous goroutines and unsafe uses of sleep functions for synchronization.

    Like the goroutine leak we examined, these types of bugs are manifested in repeatable patterns, making them detectable by Semgrep. Keep an eye on the repository, as we will be adding and refining Semgrep rules for Go and other languages. We will also continue researching ways to debug concurrency bugs in Go and will look forward to sharing more findings in the future.

    Solar: Context-free, interactive analysis for Solidity

    By Aaron Yoo, University of California, Los Angeles

    As an intern at Trail of Bits, I worked on Solar, a proof-of-concept static analysis framework. Solar is unique because it enables context-free interactive analysis of Solidity smart contracts. A user can direct Solar to explore program paths (e.g., to expand function calls or follow if statements) and assign constraints or values to variables, all without reference to a concrete execution. As a complement to Solar, I created an integrated development environment (IDE)-like tool that illustrates the types of interactivity and analysis that Solar can provide.

    Solar user interface.

    The Solar UI has two main panels, “source” and “IR” (intermediate representation). The source panel displays the source code of the functions being analyzed and the IR panel translates those functions into an enhanced variant of SlithIR, our open-source IR. The green highlights show the lines of the function that are being analyzed. The two panes display similar information intended for different uses: The source pane serves as a map, aiding in navigation and providing context. The IR pane enables interactivity as well as visualization of the information deduced by Solar.

    Analyses built on top of Solar are called “strategies.” Strategies are modular, and each defines a different mode of operation for Solar. Three strategies are shown in the above screenshot: concrete, symbolic, and constraint. Although each analysis has its own workflow, Solar needs all three to support the simplify, downcall, and upcall primitives, which are explained below.

    Solar primitives

    Simplify

    The simplify primitive instructs the framework to make as many inferences as possible based on the current information. Solar can receive new information either through user input or by deriving it as an axiom from the program. In the screencast below, Solar axiomatically deduces the value of the constant, 2. If the user tells Solar to assume that the value of x is 5 and then clicks the “simplify” button, Solar will deduce the return value.

    Downcall

    The downcall primitive instructs the framework to inline a function call. Function calls are inlined so that the user can see the entire path in one pane. In a traditional IDE, the user would have to navigate to the function in question. In our framework, the downcalled function is inlined directly into the IR pane, and its source is brought into the source pane.

    Upcall

    The upcall primitive inlines the current context into a calling function. In other words, upcalling implies traversal up the call stack and is generally the opposite of downcalling. By upcalling, Solar can traverse program paths up through function calls, as demonstrated below. (Pay special attention to the green-highlighted lines.)

    Together, these three primitives give Solar its defining properties: context insensitivity and interactivity. Solar is context-free (context-insensitive) because the user can start analysis from any function. It is interactive because the exact program path is determined by the upcalls and downcalls—which are chosen by the user.

    Demonstration for reasoning about integer overflow

    Solar can help a user reason about nontrivial program properties such as integer overflow. Consider the following program:

    contract Overflow {
        function z(uint32 x, uint32 y) public returns (uint32) {
            uint32 ret = 0;
            if (x < 1000 && y < 1000) {
                will_it_overflow(x, y);
            }
            return ret;
        }
    
        function will_it_overflow(uint32 x, uint32 y) public returns (uint32) {
            return x * y;
        }
    }

    Here, we want to find out whether will_it_overflow will ever cause an integer overflow. Integer overflow occurs when the mathematical result of an arithmetic operation cannot fit into the physical space allocated for its storage.

    Looking at the will_it_overflow function, it’s clear that integer overflow may be possible, as two 32-bit numbers are multiplied and placed into a 32-bit result. However, based on the call sites of will_it_overflow, if z calls will_it_overflow, there can never be an integer overflow; this is because z verifies that arguments to will_it_overflow are small. Let’s see how Solar would reach this conclusion.

    Performing this analysis with Solar requires use of the constraint strategy, which works by attempting to find a single valid execution of the program. The user can constrain the execution with arbitrary polynomial constraints. To start the analysis, we select the will_it_overflow function from the left pane to indicate it as the desired starting point. Here is the initial analysis view:

    Solar provides one possible execution that evaluates all values to zero. The next step is constraining the values of x and y. We can provide the following constraints (in terms of IR variables, not source variables) to the constraint strategy:

    1.   x_1 < (2 ** 32)
    2.   y_1 < (2 ** 32)
    3.   x_1 * y_1 >= (2 ** 32)

    The first two constraints bind x_1 and y_1 to 32-bit integers. The third causes the solver to try to find an execution in which x_1 * y_1 overflows. It is common practice to prove properties with a SAT/SMT solver by showing that the negation of the property is unsatisfiable. With that in mind, we are going to use Solar to show that there are no executions in which x_1 * y_1 overflows, thereby implying that will_it_overflow does not overflow. After simplifying under these constraints, we get the following:

    Again, Solar provides a single possible execution based on the constraints. Upcalling once puts us at line 5:

    Because the green-highlighted lines do not form a logical contradiction, Solar can still return a valid program execution. However, upcalling again returns a different result:

    The question marks on the right indicate that the solver cannot find any possible execution given the program path. This is because line 4 contradicts the inequality we wrote earlier: x_1 * y_1 >= (2 ** 32). The above steps constitute informal proof that overflow is not possible if will_it_overflow is called from z.
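
    To reproduce the underlying reasoning outside of Solar, here is a minimal sketch using the Z3 SMT solver's Python bindings (the z3-solver package). The variable names mirror the IR variables above, but nothing here is Solar code; it only illustrates the unsatisfiability argument.

    # Assume the guard from z() holds, then ask Z3 for an execution in which the
    # 32-bit product of x_1 and y_1 overflows. An unsat answer means no such
    # execution exists. (Illustrative only; not part of Solar.)
    from z3 import BitVec, Solver, UGE, ULT, ZeroExt, unsat

    x_1 = BitVec("x_1", 32)
    y_1 = BitVec("y_1", 32)

    s = Solver()
    s.add(ULT(x_1, 1000), ULT(y_1, 1000))  # the if-condition in z()
    # Overflow means the exact (zero-extended, 64-bit) product needs more than 32 bits.
    s.add(UGE(ZeroExt(32, x_1) * ZeroExt(32, y_1), 1 << 32))

    print(s.check() == unsat)  # True: will_it_overflow cannot overflow when called from z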

    Conclusion

    I am proud of what Solar became. Although it is a prototype, Solar represents a novel type of analysis platform that prioritizes interactivity. Given the potential applications for code auditing and IDE-style semantic checking, I am excited to see what the future holds for Solar and its core ideas. I would like to give a big thank-you to my mentor, Peter Goodman, for making this internship fun and fulfilling. Peter accomplished perhaps the most challenging task for a mentor: striking the delicate balance between providing me guidance and freedom of thought. I would also like to extend thanks to Trail of Bits for hosting the internship. I look forward to seeing the exciting projects that future interns create!

    A Year in the Life of a Compiler Fuzzing Campaign

    By Alex Groce, Northern Arizona University

    In the summer of 2020, we described our work fuzzing the Solidity compiler, solc. So now we’d like to revisit this project, since fuzzing campaigns tend to “saturate,” finding fewer new results over time. Did Solidity fuzzing run out of gas? Is fuzzing a high-stakes project worthwhile, especially if it has its own active and effective fuzzing effort?

    The first bugs from that fuzzing campaign were submitted in February of 2020 using an afl variant. Since then, we’ve submitted 74 reports. Sixty-seven have been confirmed as bugs, and 66 of those have been fixed. Seven were duplicates or not considered to be true bugs.


    Given that 21 of these bugs were submitted since last December, it’s fair to say that our fuzzing campaign makes a strong case for why independent fuzzing is still important and can find different bugs, even when OSSFuzz-based testing is also involved.

    Why is it useful to keep fuzzing such a project, perhaps indefinitely? The answer has three parts. First, more fuzzing covers more code execution paths, and long fuzzing runs are especially helpful. It would be hard to get anywhere near path-coverage saturation with any fuzzer we know of. Even when running afl for 30 or more days, our tests still find new paths roughly every one to two hours and sometimes find new edges. Some of our reported bugs were discovered only after a month of fuzzing.

    A production compiler is fantastically complex, so in-depth fuzzing is useful. It takes time to generate inputs like the one that triggered the most recent bug we found while fuzzing the Solidity level of solc:

    pragma experimental SMTChecker;
    contract A {
        function f() internal virtual {
            v();
        }
        function v() internal virtual {
        }
    }
    
    contract B is A {
        function f() internal virtual override {
            super.f();
        }
    }
    
    contract C is B {
        function v() internal override {
            if (0==1)
                f();
        }
    }
    
    

    This code should compile without error, but it didn’t. Until the bug was fixed, it caused the SMT Checker to crash, throwing a fatal “Super contract not available” error, due to incorrect contract context used for variable access in virtual calls inside branches.

    Compilers should undergo fuzz testing for long stretches of time because of their complexity and number of possible execution paths. A rule of thumb is that afl hasn’t really started in earnest on any non-trivial target until it hits one million executions, and compilers likely require much more than this. In our experience, compiler runs vary anywhere from less than one execution per second to as many as 40 executions per second. Just getting to one million executions can take a few days!
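
    For a rough sense of scale, here is the arithmetic implied by the throughput figures quoted above (nothing below is measured; it simply restates those numbers):

    # Time to reach one million executions at the throughputs quoted above.
    for execs_per_sec in (1, 10, 40):
        days = 1_000_000 / execs_per_sec / 86_400
        print(f"{execs_per_sec:>2} execs/sec -> {days:4.1f} days")
    # Roughly 11.6 days, 1.2 days, and 0.3 days, respectively.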

    The second reason we want to fuzz independently alongside OSSFuzz is to approach the target from a different angle. Instead of strictly using a dictionary- or grammar-based approach with traditional fuzzer mutation operators, we use ideas from any-language mutation testing to add “code-specific” mutation operators to afl, and we rely mostly (but not exclusively) on those rather than on afl’s even more generic mutations, which are geared toward binary formats. Doing something different from everyone else is a good antidote to fuzzer saturation.

    Finally, we keep grabbing the latest code and starting fresh fuzzing runs on new versions of solc. Since the OSSFuzz continuous integration doesn’t include our techniques, bugs that are hard for other fuzzers but easy for our code-mutation approach will sometimes appear, and our fuzzer will find them almost immediately.

    But we don’t grab every new release and start over, because we don’t want to lose the ground gained by our long fuzzing campaigns. Nor do we always pick up exactly where we left off, since the tens of thousands of test corpora that afl can generate are likely full of uninteresting paths that would make it harder to find bugs in the new code. We sometimes resume from an existing run, but only infrequently.

    Finding bugs in a heavily fuzzed program like solc is not easy. The next best independent fuzzing effort, that of Charalambos Mitropoulos, also mentioned by the solc team in their post on OSSFuzz fuzzing, has discovered only eight bugs, even though it has been ongoing since October 2019.

    Other Languages, Other Compilers

    Our success with solc inspired us to fuzz other compilers. First, we tried fuzzing the Vyper compiler—a language intended to provide a safer, Python-like alternative to Solidity for writing Ethereum blockchain smart contracts. Our previous Vyper fuzzing uncovered some interesting bugs using essentially a grammar-based approach with the TSTL (Template Scripting Testing Language) Python library via python-afl. We found a few bugs during this campaign but chose not to go to extremes, because of the poor speed and throughput of the instrumented Python testing.

    In contrast, my collaborator, Rijnard van Tonder at Sourcegraph, had much greater success fuzzing the Diem project’s Move language—the language for the blockchain formerly known as Facebook’s Libra. Here, the compiler is fast and the instrumentation is cheap. Rijnard has reported 14 bugs in the compiler, so far, all of which have been confirmed and assigned, and 11 of which have been fixed. Given that the fuzzing began just two months ago, this is an impressive bug haul!

    Using Rijnard’s notes on fuzzing Rust code using afl.rs, I tried our tools on Fe, a new smart contract language supported by the Ethereum foundation. Fe is, in a sense, a successor to Vyper, but with more inspiration from Rust and a much faster compiler. I began fuzzing Fe on the date of its first alpha release and submitted my first issue nine days later.

    To support my fuzzing campaign, the Fe team changed failures in the Yul backend, which uses solc to compile Yul, to produce Rust panics visible to afl, and we were off to the races. So far, this effort has produced 31 issues, slightly over 18% of all GitHub issues for Fe, including feature requests. Of these, 14 have been confirmed as bugs, and ten of those have been fixed; the remaining bugs are still under review.

    We didn’t just fuzz smart contract languages. Rijnard also fuzzed the Zig compiler, a new systems programming language that aims at simplicity and transparency, and found two bugs (confirmed, but not yet fixed).

    The Future of Our Fuzzing Campaigns

    In total, our afl compiler fuzzing campaigns have uncovered 88 bugs that have since been fixed, plus an additional 14 bugs that are confirmed but not yet fixed.

    What’s interesting is that the fuzzers aren’t using dictionaries or grammars. They know nothing about any of these languages beyond what is expressed by a modest corpus of example programs from the test cases. So how can we fuzz compilers this effectively?

    The fuzzers operate at a regular-expression level; they don’t even use context-free language information. Most of the fuzzing has used fast C string-based heuristics to make “code-like” changes, such as removing code between brackets, changing arithmetic or logical operators, swapping lines of code, changing if statements to while, and removing function arguments. In other words, they apply the kinds of changes a mutation testing tool would. This approach works well even though Vyper and Fe aren’t very C-like, and the only Python-specific handling covers whitespace, commas, and parentheses.
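
    To make this concrete, here is a toy Python sketch of such code-like mutation operators. The real operators live in the afl-compiler-fuzzer tool and are implemented as fast C string heuristics; everything below is a simplified illustration rather than the actual implementation.

    # Toy "code-like" mutators: swap an operator, empty a bracketed block, or
    # swap two adjacent lines. No grammar or dictionary is involved.
    import random
    import re

    OPERATOR_SWAPS = [("+", "-"), ("*", "+"), ("<", "<="), ("==", "!="), ("&&", "||")]

    def mutate(source: str) -> str:
        choice = random.randrange(3)
        if choice == 0:
            old, new = random.choice(OPERATOR_SWAPS)
            return source.replace(old, new, 1)
        if choice == 1:
            # Delete the contents of one innermost bracketed block.
            return re.sub(r"\{[^{}]*\}", "{}", source, count=1)
        lines = source.splitlines()
        if len(lines) > 1:
            i = random.randrange(len(lines) - 1)
            lines[i], lines[i + 1] = lines[i + 1], lines[i]
        return "\n".join(lines)

    print(mutate("contract C { function f() public { uint a = 1 + 2; } }"))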

    Custom dictionaries and language-aware mutation rules may be more effective, but the goal is to provide compiler projects with effective fuzzing without requiring many resources. We also want to see the impact that a good fuzzing strategy can have on a project during the early stages of development, as with the Fe language. Some of the bugs we’ve reported highlighted tricky corner cases for developers much earlier than might have otherwise been the case. We hope that discussions such as this one will help produce a more robust language and compiler with fewer hacks made to accommodate design flaws detected too late to easily change.

    We plan to keep fuzzing most of these compilers since the solc effort has shown that a fuzzing campaign can remain viable for a long time, even if there are other fuzzing efforts targeting the same compiler.

    Compilers are complex and most are also rapidly changing. For example, Fe is a brand-new language that isn’t really fully designed, and Solidity is well-known for making dramatic changes to both user-facing syntax and compiler internals.

    We’re also talking to Bhargava Shastry, who leads the internal fuzzing effort for Solidity, and we’re adopting in our own fuzzing some of the semantic checks they use in their protobuf-based fuzzing of the Yul optimization level. We started directly fuzzing Yul via solc’s strict-assembly option and already found one amusing bug that was quickly fixed and incited quite a bit of discussion! We hope that the ability to find more than just inputs that crash solc will take this fuzzing to the next level.

    The larger issue is whether fuzzing is limited in the bugs it can find by a compiler’s inability to detect many wrong-code errors on its own. Differential comparison of two compilers, or of a compiler with its own output when optimizations are turned off, usually requires a much more restricted form of test program, which limits the bugs you can find, since programs must be compiled and executed to compare results.

    One way to get around this problem is to make the compiler crash more often. We imagine a world where compilers include something like a testing option that enables aggressive and expensive checks that wouldn’t be practical in normal runs, such as sanity checks on register allocation. Although these checks would likely be too expensive for everyday compilation, they could be turned on both for some fuzzing runs, since the programs compiled are usually small, and, perhaps even more importantly, for final production compilation of extremely critical code (Mars rover code, nuclear-reactor control code, or high-value smart contracts) to make sure no wrong-code bugs creep into such systems.

    Finally, we want to show compiler developers, and developers of other tools that take source code as input, that effective fuzzing doesn’t have to be a high-cost effort requiring significant developer time. Finding crashing inputs for a compiler is often easy, using nothing more than some spare CPU cycles, a decent set of source code examples in the language, and the afl-compiler-fuzzer tool!

    We hope you enjoyed learning about our long-term compiler fuzzing project, and we’d love to hear about your own fuzzing experiences on Twitter @trailofbits.

    Un-bee-lievable Performance: Fast Coverage-guided Fuzzing with Honeybee and Intel Processor Trace

    By Allison Husain, UC Berkeley

    Today, we are releasing an experimental coverage-guided fuzzer called Honeybee that records program control flow using Intel Processor Trace (IPT) technology. Previously, IPT has been criticized for severe underperformance due to issues with capture systems and inefficient trace analyses. My winter internship focused on working through these challenges to make IPT-based fuzzing practical and efficient.

    IPT is a hardware feature that asynchronously records program control flow, costing a mere 8-15% overhead at record time. However, applying IPT as a fuzzing coverage mechanism hasn’t been practical outside of highly experimental binary-only coverage solutions, since source code instrumentation typically provides far better performance. Honeybee addresses this limitation and makes IPT significantly faster to capture and hundreds of times faster to analyze. So now we have coverage-guided fuzzing—even if source code is unavailable—at performance levels competitive with, and sometimes better than, source-level coverage instrumentation. Here, I will describe the development process behind Honeybee and give a general overview of its design.

    How it started…

    IPT is an Intel-specific processor feature that can be used to record the full control flow history of any process for minutes at a time with a minimal performance penalty. IPT drastically improves on Intel’s older hardware tracing systems such as Branch Trace Store, which can have a performance penalty exceeding 100%. Even better, IPT supports the granular selection of the trace target by a specific process or range of virtual addresses.

    A hardware mechanism like IPT is especially alluring for security researchers because it can provide code coverage information for closed-source, unmodified target binaries. A less appreciated fact is that not only can IPT be much faster than any existing black-box approaches (like QEMU, interrupts, or binary lifting), IPT should also be faster than inserting source-level instrumentation.

    Coverage instrumentation inhibits run-time performance by thrashing various CPU caches via frequent, random writes into large coverage buffers. Source-level instrumentation also inhibits many important compile-time optimizations, like automatic vectorization. Further, due to the multi-threaded nature of most programs, instrumentation code needs to operate on bitmaps atomically, which significantly limits pipeline throughput.

    IPT should work on both closed source and open source software with no change in coverage generation strategy and incur only a minimal 8-15% performance penalty. Jaw-dropping, I tell you!

    How it’s going…

    Unfortunately, the 8-15% performance penalty doesn’t tell the whole story. While IPT has low capture overhead, it does not have a low analysis overhead. To capture long traces on commodity hardware, IPT uses various techniques to minimize the amount of data stored per trace. One technique is to only record control flow information not readily available from static analysis of the underlying binary, such as taken/not-taken branches and indirect branch targets. While this optimization assumes the IPT decoder has access to the underlying program binary, this assumption is often correct. (See Figure 1 for example.)

    IPT is a very dense binary format. To showcase what information is stored, I’ve converted it to a more readable format in Figure 1. The packet type is in the left column and the packet payload is on the right.

    Example IPT trace showing how IPT minimizes data stored during program execution.

      1. Tracing starts while the program executes at 0x7ffff7f427ef.
      2. The program hits a conditional branch and accepts it. (The first ! in line 2.)
      3. The program hits two conditional branches and does not accept them. (The . . in line 2.)
      4. The program hits a conditional branch and accepts it. (The last ! in line 2.)
      5. The program hits a conditional branch and accepts it. (Line 4.)
      6. The program hits an indirect branch, at which point it jumps to the last instruction pointer with the lower two bytes replaced with 0x3301.
      7. The program hits a conditional branch and accepts it.
      8. The program continues with no other conditional/indirect branches until the last four bytes of the instruction pointer are 0xf7c189a0, at which point tracing stops because the program has either exited or begun executing code that does not match the filters.

    Despite all the trace data provided, there is still a surprising amount of information that is omitted. The trace provides its beginning virtual address, eventual conditional branches, and whether they are taken. However, unconditional control flow transfers with fixed targets (i.e., direct call and jmp instructions) and conditional branch destinations are not provided. This reduces the trace size, because 1) non-branching code is never recorded, 2) conditional branches are represented as a single bit, and 3) indirect branches are only represented by changes to the instruction pointer.

    So how is the real control flow reconstructed from this partial data? An IPT decoder can pair trace data with the underlying binary and “walk” through the binary from the trace start address. When the decoder encounters a control flow transfer that can’t be trivially determined, like a conditional branch, it consults the trace. Data in the trace indicates which branches were taken/not taken and the result of indirect control flow transfers. By walking the binary and trace until the trace ends, a decoder can reconstruct the full flow.
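
    Here is a toy, self-contained sketch of that walk. Every structure below (the cfg dictionary, the taken/not-taken bits, the indirect-target list) is invented for illustration; real decoders such as libipt, libxdc, and Honeybee operate on packetized binary trace data and a disassembled binary, not on these stand-ins.

    # Pair a static view of the program (which block ends how) with the dynamic
    # facts from the trace (one bit per conditional branch, one IP per indirect
    # branch) to reconstruct the full control flow path.
    cfg = {
        0x1000: {"kind": "cond", "taken": 0x1020, "fallthrough": 0x1010},
        0x1010: {"kind": "direct", "target": 0x1030},
        0x1020: {"kind": "indirect"},  # e.g., a call through a pointer
        0x1030: {"kind": "stop"},
    }
    tnt_bits = iter([True])            # the conditional at 0x1000 was taken
    indirect_targets = iter([0x1030])  # the indirect branch at 0x1020 landed here

    def decode(start):
        ip, path = start, [start]
        while cfg[ip]["kind"] != "stop":
            block = cfg[ip]
            if block["kind"] == "cond":
                ip = block["taken"] if next(tnt_bits) else block["fallthrough"]
            elif block["kind"] == "indirect":
                ip = next(indirect_targets)
            else:  # direct, unconditional: never recorded in the trace
                ip = block["target"]
            path.append(ip)
        return path

    print([hex(a) for a in decode(0x1000)])  # ['0x1000', '0x1020', '0x1030']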

    But herein lies the gotcha of IPT: although capture is fast, walking through the code is decidedly not, because the decoder must disassemble and analyze multiple x86-64 instructions at every decoding step. While the overhead of disassembly and analysis isn’t a problem for debugging and profiling scenarios, it severely hampers fuzzing throughput. Unfortunately, such expensive analysis is fundamentally unavoidable, as traces cannot be decoded without analyzing the original program.

    But…is this the end? Was it the beautiful promise of IPT fuzzing just…a dream? A mere illusion? Say it ain’t so!

    Making IPT faster!

    While profiling Intel’s reference IPT decoder, libipt, I noticed that over 85% of the CPU time was spent decoding instructions during trace analysis. This is not surprising, given that IPT data must be decoded by walking through a binary looking for control flow transfers. That an enormous amount of time is spent on instruction decoding, however, is actually good news.

    Why? A fuzzer needs to decode a multitude of traces against the same binary. It may be reasonable to continuously analyze instructions for a single trace of one binary, but re-analyzing the same instructions millions of times for the same binary is extraordinarily wasteful. If your “hey we should probably use a cache” sense is tingling, you’re totally right! Of course, the importance of instruction decode caches is not a novel realization.

    An open-source decoder named libxdc, which claims to be the fastest IPT decoder available (more on that later), tries to solve this issue using a fast runtime cache. Using a runtime cache and other performance programming techniques, libxdc operates 10 to 40 times faster than Intel’s reference decoder, which demonstrates that caching is very important.

    I thought I could do better though. My critique of libxdc was that its dynamic instruction cache introduced unnecessary and expensive overhead in two ways. First, a dynamic cache typically has expensive lookups, because it needs to calculate the target’s hash and ensure that the cache is actually a hit. This introduces more overhead and complexity to one of the hottest parts of the entire algorithm and cannot be overlooked.

    Second, and frankly much worse, is that a dynamic cache is typically expected to fill up and evict older results. Even the best cache eviction strategy will cause future work: any evicted instructions will eventually need to be re-decoded, because a fuzzer never targets code just once. This creates duplicated effort and decoder performance penalties with every single cache eviction.

    My idea was to introduce a static, ahead-of-time generated cache that holds all the data that IPT decoding could conceivably require. The cache would be shared between multiple threads without penalty and could be accessed without expensive hashing or locking. By entirely eliminating binary analysis and cache access overhead, I could decode traces significantly faster than libxdc, because my decoder would simply be doing less work.

    The Honeybee architecture.

    Keeping with the theme of Honeybee, I named these static, ahead-of-time generated caches “hives” since they require work to create but only need to be made once. To make the hives, I created hive_generator, which consumes an ELF executable and captures information that may be needed to decode a trace. The hive_generator searches for all control flow instructions and generates an extremely compact encoding of all basic blocks and where code execution could continue. There are two important features of this new design worth discussing. (Full details are available on Github.)

    First, this encoding is data cache-friendly: not only are blocks the size of cache lines, but encoded blocks are also stored in the same order as in the original binary, which is a small but important detail. It means that Honeybee’s decoder can take full advantage of the original binary’s cache locality optimizations, since compilers generally put related basic blocks close to each other. This is generally not possible with dynamic caches like libxdc’s, since the cache’s hash function, by design, sends neighboring blocks to random locations. That is harmful to performance because it evicts meaningful data from the CPU’s data cache.

    The other important feature is that blocks are encoded in a bitwise-friendly format, so Honeybee can process the compacted blocks using exclusively high-throughput ALU operations. This design makes several critical operations completely branchless — like determining whether a block ends in a direct, indirect, or conditional branch. Combining this with high-throughput ALU operations avoids many costly branch mispredictions and pipeline purges.

    These changes might seem relatively trivial, but I hoped they would add up to a respectable performance boost over the current state of the art, libxdc.

    Honeybee decoder benchmarks

    To compare the performance of Honeybee’s decoder with Intel’s reference decoder, I decoded traces ranging from tens of kilobytes up to 1.25 GB, generated from binaries ranging in size from 100 KB to 45 MB. Each test was performed 25 times, and I verified that both decoders traveled identical control flow paths for the same trace files.

    Comparing Honeybee’s decoding speed to libipt.

    These tests showed promising results (Figure 3). On large programs like clang, Honeybee outperformed Intel’s reference decoder and libxdc by an order of magnitude (and by two orders of magnitude in one case).

    For context, the largest trace in this test suite, “honey_mirror_1/clang_huge.pt,” is 1.25GB and originates from a trace of a complicated static analysis program that disassembled the entire 35MB clang binary.

    Honeybee takes only 3.5 seconds to do what Intel’s reference decoder does in two-and-a-half minutes, which is a 44x improvement! This is the difference between stepping away while the trace decodes and being able to take a sip of water while you wait.

    This difference is even more pronounced on small traces, which are more similar to fuzzing workloads, like “html_fast_parse/6_txt.pt.” In this case, Honeybee needed only 6.6 microseconds to finish what Intel’s reference decoder took 451 microseconds to do: a roughly 68x improvement!

    Integrating Honeybee with honggfuzz.

    Now to actually integrate this new coverage mechanism into a fuzzer. I chose Google’s honggfuzz since it’s modular and, notably, because it actually already has another slow and partially broken version of IPT-based coverage that uses Intel’s reference decoder. My plan was to simply rip out Intel’s decoder, bolt Honeybee in place, and get a wonderful speedup. However, this was more complicated than I expected.

    The challenge lies in how Linux collects IPT data. This is meant to be fairly simple, since the mainline kernel has support for IPT built right into perf. But I discovered that the complex and aggressive filtering mechanisms Honeybee needs to clean up IPT data expose stability and performance issues in perf.

    This was problematic. Not only was perf not terribly fast to begin with, but it was highly unstable. Complex configurations used by Honeybee triggered serious bugs in perf which could cause CPUs to be misconfigured and require a full system reboot to recover from lockup. Understandably, both of these issues ultimately made perf unusable for capturing IPT data for any Honeybee-related fuzzing tasks.

    But, as my mother says, if you want something done right, sometimes you just need to do it yourself. Following in her footsteps, I wrote a small kernel module for IPT named “honey_driver” that was specifically optimized for fuzzing. While this new kernel module is certainly less featureful than perf and a likely security hazard, honey_driver is extremely fast and enables a user-space client to rapidly reconfigure tracing and analyze the results with little overhead.

    And so, with this small constellation of custom code, honggfuzz was ready to roll with IPT data from Honeybee!

    Fuzzer benchmarks

    Measuring fuzzer performance is complex, and there are certainly more reliable and definitive ways to measure it than what I did here. As a rough benchmark, I persistently fuzzed a small HTML parser using four different coverage strategies. For each strategy, I allowed honggfuzz to fuzz the binary and recorded the average number of executions per second over the test period.

    Comparing Honeybee to source-level instrumentation.

    The first contender in the experiment was no coverage whatsoever. I considered this to be the baseline since it’s essentially as fast as honggfuzz can run on this binary by feeding random input into the test program. In this configuration, honggfuzz achieved an average of 239K executions per second. In the context of my system, this is decently fast but is still certainly limited by the fuzzing target’s CPU performance.

    Next, I tested honggfuzz’s source-level software instrumentation by compiling my target with the instrumenting compiler and no other features enabled. This led to an average of 98K executions per second, or about 41% of the no-coverage baseline (a 59% drop in throughput). A penalty of this magnitude is generally accepted and expected when fuzzing, due to missed compiler optimizations, many extra function calls, expensive locking, and cache thrashing from essentially random writes into coverage bitmaps.

    After software instrumentation, we get into the more interesting coverage techniques. As mentioned earlier, honggfuzz has support for processor trace using libipt for analysis and unfiltered perf data for IPT capture. However, honggfuzz’s existing IPT support does not generate full or even necessarily correct coverage information, because honggfuzz only extracts indirect branch IPs from the trace and completely ignores conditional branches. Additionally, since no filtering is applied in perf, honggfuzz generates coverage for every piece of code in the process, including uninteresting libraries like libc, which leads to bitmap pollution.

    Even with these shortcuts, honggfuzz’s existing IPT support achieves an average of only 1.1K executions per second (roughly half a percent of the theoretical max). Because the coverage data is inaccurate, a true comparison to software instrumentation cannot be made; it is possible that one configuration found a more difficult path sooner. Realistically, however, the gap is so enormous that such an effect is unlikely to account for most of the overhead, given the previously established performance issues of both perf and libipt.

    Lastly, we have Honeybee with its custom decoder and capture system. Unlike the existing honggfuzz implementation, Honeybee decodes the entire trace and so is able to generate full, correct basic block and edge coverage information. Honeybee achieved an average of 171K executions per second, which is only a 28% performance dip compared to the baseline.

    This shouldn’t come as a shock, since IPT only has an 8-15% record time overhead. This leaves 14-21% of the baseline’s total execution time to process IPT data and generate coverage. Given the incredible performance of Honeybee’s decoder and its ability to quickly decode traces, it is entirely reasonable to assume that the total overhead of Honeybee’s data processing could add up to a 29% performance penalty.

    I analyzed Honeybee’s coverage and confirmed that it was operating normally and processing all the data correctly. As such, I’m happy to say that Honeybee is (at least in this case) able to fuzz both closed and open-source software faster and more efficiently than even conventional software instrumentation methods!

    What’s next?

    While it is very exciting to claim to have dethroned an industry-standard fuzzing method, these results have not been rigorously tested or verified at a large scale or across a large array of fuzzing targets. I can attest, however, that Honeybee has been either faster than or at least able to trade blows with software instrumentation while fuzzing many different large open-source projects like libpcap, libpng, and libjpegturbo.

    If these patterns apply more generally, this could mean a great speedup for those who need to perform source-level fuzzing. More excitingly, however, this is an absolutely wonderful speedup for those who need to perform black-box fuzzing and have been relying on slow and unreliable tools like QEMU instrumentation, Branch Trace Store (BTS), or binary lifting, since it means they can fuzz at equal or greater speeds than if they had source without making any serious compromises. Even outside fuzzing, however, Honeybee is still a proven and extraordinarily fast IPT decoder. This high-performance decoding is useful outside of fuzzing because it enables many novel applications ranging from live control flow integrity enforcement to more advanced debuggers and performance analysis tools.

    Honeybee is a very young project and is still under active development in its home on GitHub. If you’re interested in IPT, fuzzing, or any combination thereof please feel free to reach out to me over on Twitter where I’m @ezhes_ or over email at allison.husain on the berkeley.edu domain!

    Acknowledgments

    As mentioned earlier, IPT has caught many researchers’ eyes and so, naturally, many have tried to use it for fuzzing. Before starting my own project, I studied others’ research to learn from their work and see where I might be able to offer improvements. I’d first like to acknowledge the authors behind PTrix (PTrix: Efficient Hardware-Assisted Fuzzing for COTS Binary) and PTfuzz (PTfuzz: Guided Fuzzing With Processor Trace Feedback) as they provided substantial insight into how a fuzzer could be structured around IPT data. Additionally, I’d like to thank the team behind libxdc as their fast IPT packet decoder forms the basis for the packet decoder in Honeybee. Finally, I’d like to give a big thank you to the team at Trail of Bits, and especially my mentors Artem Dinaburg and Peter Goodman, for their support through this project and for having me on as an intern this winter!

    Never a dill moment: Exploiting machine learning pickle files

    By Evan Sultanik

    Many machine learning (ML) models are Python pickle files under the hood, and it makes sense. The use of pickling conserves memory, enables start-and-stop model training, and makes trained models portable (and, thereby, shareable). Pickling is easy to implement, is built into Python without requiring additional dependencies, and supports serialization of custom objects. There’s little doubt about why choosing pickling for persistence is a popular practice among Python programmers and ML practitioners.

    Pre-trained models are typically treated as “free” byproducts of ML, since they allow the valuable intellectual property that produced them, such as the algorithms and training corpora, to remain private. This gives many people the confidence to share their models over the internet, particularly for reusable computer vision and natural language processing classifiers. Websites like PyTorch Hub facilitate model sharing, and some libraries even provide APIs to download models from GitHub repositories automatically.

    Here, we discuss the underhanded antics that can occur simply from loading an untrusted pickle file or ML model. In the process, we introduce a new tool, Fickling, that can help you reverse engineer, test, and even create malicious pickle files. If you are an ML practitioner, you’ll learn about the security risks inherent in standard ML practices. If you are a security engineer, you’ll learn about a new tool that can help you construct and forensically examine pickle files. Either way, by the end of this article, pickling will hopefully leave a sour taste in your mouth.

    Do you know how pickles are stored? It’s jarring!

    Python pickles are compiled programs run in a unique virtual machine called a Pickle Machine (PM). The PM interprets the pickle file’s sequence of opcodes to construct an arbitrarily complex Python object. Python pickle is also a streaming format, allowing the PM to incrementally build the resulting object as portions of the pickle are downloaded over the network or read from a file.

    The PM uses a Harvard architecture, segregating the program opcodes from writable data memory, thus preventing self-modifying code and memory corruption attacks. It also lacks support for conditionals, looping, or even arithmetic. During unpickling, the PM reads in a pickle program and performs a sequence of instructions. It stops as soon as it reaches the STOP opcode and whatever object is on top of the stack at that point is the final result of unpickling.

    From this description, one might reasonably conclude that the PM is not Turing-complete. How could this format possibly be unsafe? To corrode the words of Mishima’s famous aphorism:

    Computer programs are a medium that reduces reality to abstraction for transmission to our reason, and in their power to corrode reality inevitably lurks the danger of the weird machines.

    The PM contains two opcodes that can execute arbitrary Python code outside of the PM, pushing the result onto the PM’s stack: GLOBAL and REDUCE. GLOBAL is used to import a Python module or class, and REDUCE is used to apply a set of arguments to a callable, typically previously imported through GLOBAL. Even if a pickle file does not use the REDUCE opcode, the act of importing a module alone can and will execute arbitrary code in that module, so GLOBAL alone is dangerous.

    For example, one can use a GLOBAL to import the exec function from __builtins__ and then REDUCE to call exec with an arbitrary string containing Python code to run. Likewise for other sensitive functions like os.system and subprocess.call. Python programs can optionally limit this behavior by defining a custom unpickler; however, none of the ML libraries we inspected do so. Even if they did, these protections can almost always be circumvented; there is no guaranteed way to safely load untrusted pickle files, as is highlighted in this admonition from the official Python 3.9 Pickle documentation:

    Warning The pickle module is not secure. Only unpickle data you trust.
    It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
    Consider signing data with hmac if you need to ensure that it has not been tampered with.
    Safer serialization formats such as JSON may be more appropriate if you are processing untrusted data.

    We are not aware of any ML file formats that include a checksum of the model, either.†

    The dangers of Python pickling have been known to the computer security community for quite some time.
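
    As a concrete (and widely published) demonstration of why: a class’s __reduce__ method lets an attacker hand the pickler exactly the GLOBAL/REDUCE pair described above. The echoed message below is a harmless placeholder.

    # The canonical malicious-pickle example: __reduce__ returns a callable and
    # its arguments, which pickle serializes as GLOBAL + REDUCE opcodes that run
    # during unpickling.
    import os
    import pickle

    class Evil:
        def __reduce__(self):
            return (os.system, ("echo this code runs at load time",))

    payload = pickle.dumps(Evil())
    pickle.loads(payload)  # executes the shell command; never do this with untrusted data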

    Introducing Fickling: A decompiler, static analyzer, and bytecode rewriter for pickle files

    Fickling has its own implementation of the Pickle Machine (PM), and it is safe to run on potentially malicious files because it symbolically executes code rather than overtly executing it.

    Let’s see how Fickling can be used to reverse engineer a pickle file by creating an innocuous pickle containing a serialized list of basic Python types:

    $ python3 -c "import sys, pickle; \
      sys.stdout.buffer.write(pickle.dumps([1, '2', {3: 4}]))" \
      > simple_list.pickle
    $ python3 -m pickle simple_list.pickle
    [1, '2', {3: 4}]
    

    Running fickling on the pickle file will decompile it and produce a human-readable Python program equivalent to the code that the real PM would run during deserialization:

    $ fickling simple_list.pickle
    result = [1, '2', {3: 4}]
    

    In this case, since it’s a simple serialized list, the code is neither surprising nor very interesting. By passing the --trace option to Fickling, we can trace the execution of the PM:

    $ fickling --trace simple_list.pickle
    PROTO
    FRAME
    EMPTY_LIST
        Pushed []
    MEMOIZE
        Memoized 0 -> []
    MARK
        Pushed MARK
    BININT1
        Pushed 1
    SHORT_BINUNICODE
        Pushed '2'
    MEMOIZE
        Memoized 1 -> '2'
    EMPTY_DICT
        Pushed {}
    MEMOIZE
        Memoized 2 -> {}
    BININT1
        Pushed 3
    BININT1
        Pushed 4
    SETITEM
        Popped 4
        Popped 3
        Popped {}
        Pushed {3: 4}
    APPENDS
        Popped {3: 4}
        Popped '2'
        Popped 1
        Popped MARK
    STOP
        result = [1, '2', {3: 4}]
        Popped [1, '2', {3: 4}]
    

    You can run Fickling’s static analyses to detect certain classes of malicious pickles by passing the --check-safety option:

    $ fickling --check-safety simple_list.pickle
    Warning: Fickling failed to detect any overtly unsafe code,
    but the pickle file may still be unsafe.
    Do not unpickle this file if it is from an untrusted source!
    

    What would it look like if the pickle file were malicious? Well, why not make one! We can do that by injecting arbitrary Python code into the pickle file:

    $ fickling --inject 'print("Hello World!")' simple_list.pickle > simple_list.pwn3d
    $ python3 -m pickle simple_list.pwn3d
    Hello World!
    [1, '2', {3: 4}]
    

    It works! So let’s see Fickling’s decompilation:

    $ fickling simple_list.pwn3d
    _var0 = eval('print("Hello World!")')
    result = [1, '2', {3: 4}]
    

    and its analysis:

    $ fickling --check-safety simple_list.pwn3d
    Call to `eval('print("Hello World!")')` is almost certainly
    evidence of a malicious pickle file
    

    Fickling can also be used as a Python library, and has a programmatic interface to decompile, analyze, modify, and synthesize Pickle files. It is open source, and you can install it by running:

    pip3 install fickling
    

    Making Malicious ML Models

    Since the majority of ML models use pickling extensively, there is a potential attack surface for weight/neuron perturbations on models, including fault injections, live trojans, and weight poisoning attacks among others. For example, during deserialization, code injected into the pickle could programmatically make changes to the model depending on the local environment, such as time of day, timezone, hostname, system locale/language, or IP address. These changes could be subtle, like a bitflip attack, or more overt, like injecting arbitrary delays in the deserialization to deny service.

    Fickling has a proof-of-concept based on the official PyTorch tutorial that injects arbitrary code into an existing PyTorch model. This example shows how loading the generated model into PyTorch will automatically list all of the files in the current directory (presumably containing proprietary models and code) and exfiltrate them to a remote server.

    This is concerning for services like Microsoft’s Azure ML, which supports running user-supplied models in their cloud instances. A malicious, “Fickled” model could cause a denial of service, and/or achieve remote code execution in an environment that Microsoft likely assumed would be proprietary. If multiple users’ jobs are not adequately compartmentalized, there is also the potential of exfiltrating other users’ proprietary models.

    How do we dill with it?

    The ideal solution is to avoid pickling altogether. There are several different encodings—JSON, CBOR, ProtoBuf—that are much safer than pickling and are sufficient for encoding these models. In fact, PyTorch already includes state_dict and load_state_dict functions that save and load model weights into a dictionary, which can be easily serialized into a JSON format. In order to fully load the model, the model structure (how many layers, layer types, etc.) is also required. If PyTorch implements serialization/deserialization methods for the model structure, the entire model can be much more safely encoded into JSON files.
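
    As a minimal sketch of that idea (assuming PyTorch is installed; the tiny model and file name are placeholders), the weights in a state_dict are plain tensors that round-trip through JSON without ever touching pickle, although the model’s structure still has to be defined in code:

    # Persist model weights as JSON instead of a pickle.
    import json
    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)

    # Save: tensors -> nested lists -> JSON
    weights = {k: v.tolist() for k, v in model.state_dict().items()}
    with open("weights.json", "w") as f:
        json.dump(weights, f)

    # Load: JSON -> tensors -> load_state_dict (no pickle involved)
    with open("weights.json") as f:
        restored = {k: torch.tensor(v) for k, v in json.load(f).items()}
    model.load_state_dict(restored)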

    Outside of PyTorch, there are other frameworks that avoid using pickle for serialization. For example, the Open Neural Network Exchange (ONNX) aims to provide a universal standard for encoding AI models to improve interoperability. The ONNX specification uses ProtoBuf to encode their model representations.

    ReSpOnSiBlE DiScLoSuRe

    We reported our concerns about sharing ML models to the PyTorch and PyTorch Hub maintainers on January 25th and received a reply two days later. The maintainers said that they will consider adding additional warnings to PyTorch and PyTorch Hub. They also explained that models submitted to PyTorch Hub are vetted for quality and utility, but the maintainers do not perform any background checks on the people publishing the model or carefully audit the code for security before adding a link to the GitHub repository on the PyTorch Hub indexing page. The maintainers do not appear to be following our recommendation to switch to a safer form of serialization; they say that the onus is on the user to ensure the provenance and trustworthiness of third party models.

    We do not believe this is sufficient, particularly in the face of increasingly prevalent typosquatting attacks (see those of pip and npm). Moreover, a supply chain attack could very easily inject malicious code into a legitimate model, even though the associated source code appears benign. The only way to detect such an attack would be to manually inspect the model using a tool like Fickling.

    Conclusions

    As ML continues to grow in popularity and the majority of practitioners rely on generalized frameworks, we must ensure the frameworks are secure. Many users do not have a background in computer science, let alone computer security, and may not understand the dangers of trusting model files of unknown provenance. Moving away from pickling as a form of data serialization is relatively straightforward for most frameworks and is an easy win for security. We relish the thought of a day when pickling will no longer be used to deserialize untrusted files. In the meantime, try out Fickling and let us know how you use it!

    Acknowledgements

    Many thanks go out to our team for their hard work on this effort: Sonya Schriner, Sina Pilehchiha, Jim Miller, Suha S. Hussain, Carson Harmon, Josselin Feist, and Trent Brunson.


    † Some libraries like Tensorflow do have the capability to verify download checksums, however, verification is disabled by default and based upon a checksum embedded in the filename, which can be easily forged.

    The Tao of Continuous Integration

    By Paul Kehrer

    It is a truism in modern software development that a robust continuous integration (CI) system is necessary. But many projects suffer from CI that feels brittle, frustrates developers, and actively impedes development velocity. Why is this? What can you do to avoid the common CI pitfalls?

    Continuous Integration Needs a Purpose

    CI is supposed to provide additional assurance that a project’s code is correct. However, the tests a developer writes to verify expected functionality are at their least useful when they are initially written. This is perhaps counterintuitive, because a developer’s familiarity with the code is greatest when they first write it. They’ve thought about it from numerous angles, considered many of the possible edge cases, and implemented something that works pretty well!

    Unfortunately, writing code is the easiest part of programming. The real challenge is building code that others can read so your project can thrive for many years. Software entropy increases over time. Developers—especially ones not familiar with large, long-term codebases—can’t anticipate how their code may be integrated, refactored, and repurposed to accommodate needs that weren’t originally considered.

    When these sorts of refactors and expansions occur, tests are the only way changes can be made confidently. So why do developers end up with systems that lack quality testing?

    Trivial Testing

    When writing tests, especially for high code-coverage metrics, the most common complaint is that some tests are trivial and exercise nothing interesting or error-prone in the codebase. These complaints are valid when thinking about the code as it exists today, but now consider that the software could be repurposed from its original intention. What once was trivial might now be subtle. Failing to test trivial cases may lead your work into a labyrinth of hidden traps rooted in unobservable behavior.
    Remember these three things:

    1. No test is trivial in the long run.
    2. Tests are documentation of expected behavior.
    3. Untested code is subject to incidental behavioral change.

    Unreliable CI

    Unreliable CI is poison for developers. For internal projects, it saps productivity and makes people hate working on the project. And for open-source projects, it drives away contributors faster than they can arrive.

    Find what’s causing your tests to be unreliable and fix it. Unreliable CI commonly manifests as flaky tests, and tools exist to mark tests as flaky until you can find the root cause. This will allow immediate improvement in your CI without crippling the team.
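
    For example, one such tool is the flaky plugin for pytest (pytest-rerunfailures is a similar option); the test below is just a stand-in for whatever nondeterministic behavior you are chasing:

    # Retry a known-unreliable test up to three times; one pass keeps CI green
    # while the root cause is investigated.
    import random

    from flaky import flaky

    @flaky(max_runs=3, min_passes=1)
    def test_sometimes_slow_service():
        # Stand-in for a nondeterministic dependency (network, timing, etc.).
        assert random.random() > 0.2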

    Slow CI

    You may find yourself with an excessively long CI cycle time. This is problematic because a quality development process requires that all CI jobs pass. If the cycle time is so long and the pipeline so complex that it’s impractical to run locally, developers will create workarounds. These workarounds may take many forms, but the most common is ballooning PR sizes: no one wants to put in a two-line PR, wait an hour for it to merge, and then rebase their 300-line PR on top of it when they could just make a few unrelated changes in a single PR. This causes problems for code reviewers and lowers the quality of the project.

    Developers aren’t wrong to do this; CI has failed them. When building CI systems, it’s important to keep a latency budget in mind, something like: “CI should never be slower than time t, where t is chosen a priori.” If CI becomes slower than that, effort should be spent improving it, even if that encroaches on the development of new features.

    Coverage is difficult

    Part of responsible testing is knowing which lines of code your tests are exercising—a nice, simple number that tells you everything. So why is coverage so commonly ignored?

    First, the technical challenge. Modern software runs against many disparate targets. To be useful, CI systems should run against numerous targets that submit data to a hosted system that can combine coverage. (The frustration of tools like this failing, and how to maintain development velocity despite all software being hot garbage, is another discussion.) Absent such a service, software developers often fail to notice missed coverage as it becomes lost in the noise of “expected” missed lines.

    Now let’s talk about the social challenges. Software is typically written in a way that makes it difficult to test small pieces of functionality. This issue gave rise to the test-driven development (TDD) trend, where tests are written first to help developers factor their code in a testable manner. This is generally a net win in readability and testability but requires discipline and a different approach to development that doesn’t come naturally to many people.

    The perceived drudgery in making more of a codebase testable causes complaints that coverage is an imperfect metric. After all, not all code branches are created equal, and depending on your language, some code paths should never be exercised. These are not good reasons to dismiss coverage as a valuable metric, but on specific occasions, there may exist a compelling reason to not spend the effort to cover something with tests. However, be aware that by failing to cover a certain piece of code with tests, its behavior is no longer part of the contract future developers will uphold during refactoring.

    What do we do?

    So how do we get to CI nirvana given all these obstacles? Incrementally. An existing project is a valuable asset, and we want to preserve what we have while increasing our ability to improve it in the future. (Rewrites are almost universally a bad idea.) This necessitates a graduated approach that, while specifically customized to a given project, has a broad recipe:

      1. Make CI reliable
      2. Speed up CI
      3. Improve test quality
      4. Improve coverage

    We should all spend time investing in the longevity of our projects. This sort of foundational effort pays rapid dividends and ensures that your software projects can be world-class.