All your tracing are belong to BPF

By Alessandro Gario, Senior Software Engineer
Originally published August 11, 2021

TL;DR: These simpler, step-by-step methods equip you to apply BPF tracing technology to real-world problems, with no specialized tools or libraries required.

BPF, a Linux kernel technology originally designed for network packet filtering, has become popular recently thanks to new extensions that enable novel use cases outside of BPF’s original scope. Today it can be used to implement program performance analysis tools, system and program dynamic tracing utilities, and much more.

In this blog post we’ll show you how to use the Linux implementation of BPF to write tools that access system and program events. The excellent tools from IO Visor make it possible for users to easily harness BPF technology without the considerable time investment of writing specialized tools in native code languages.

What the BPF?

BPF itself is just a way to express a program, and a runtime interpreter for executing that program “safely.” It’s a set of specifications for a virtual architecture, detailing how virtual machines dedicated to running its code should behave. The latest extensions to BPF have introduced not only new, really useful helper functions (such as reading a process’s memory), but also new registers and more stack space for the BPF bytecode.

Our main goal is to help you to take advantage of BPF and apply it to real-world problems without depending on external tools or libraries that may have been written with different goals and requirements in mind.

You can find the examples in this post in our repository. Please note that the code is simplified to focus on the concepts. This means that, where possible, we skip error checking and proper resource cleanup.

BPF program limitations

Even though we won’t be handwriting BPF assembly, it’s useful to know the code limitations since the in-kernel verifier will reject our instructions if we break its rules.

BPF programs are extremely simple, being made of only a single function. Instructions are sent to the kernel as an array of opcodes, meaning there’s no executable file format involved. Without sections, it’s not possible to have things like global variables or string literals; everything has to live on the stack, which can only hold up to 512 bytes. Branches are allowed, but it is only since kernel version 5.3 that jump opcodes can go backward—provided the verifier can prove the code will not execute forever.

The only other way to use loops without requiring recent kernel versions is to unroll them, but this will potentially use a lot of instructions, and older Linux versions will not load any program that exceeds the 4096 opcode count limit (see BPF_MAXINSNS under linux/bpf_common.h). Error handling in some cases is mandatory, and the verifier will prevent you from using resources that may fail initialization by rejecting the program.
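To make the "array of opcodes" idea concrete, here is a minimal sketch of the smallest valid program (set the return value, then exit) written as raw instructions. The Instruction type is a hand-written mirror of struct bpf_insn, and makeEmptyProgram is a hypothetical helper name; neither comes from the companion repository.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirror of struct bpf_insn from linux/bpf.h, redeclared here so the sketch
// builds without kernel headers
struct Instruction final {
  std::uint8_t code;        // Opcode
  std::uint8_t dst_reg : 4; // Destination register
  std::uint8_t src_reg : 4; // Source register
  std::int16_t off;         // Signed offset
  std::int32_t imm;         // Signed immediate constant
};

static_assert(sizeof(Instruction) == 8U, "BPF instructions are 8 bytes wide");

// Opcode values assembled from the constants in linux/bpf.h and
// linux/bpf_common.h
constexpr std::uint8_t kMov64Imm{0xb7}; // BPF_ALU64 | BPF_MOV | BPF_K
constexpr std::uint8_t kExit{0x95};     // BPF_JMP | BPF_EXIT

// Hypothetical helper: builds "r0 = 0; exit"
std::vector<Instruction> makeEmptyProgram() {
  return {
      {kMov64Imm, 0, 0, 0, 0}, // r0 = 0 (the return value)
      {kExit, 0, 0, 0, 0},     // return r0
  };
}
```

The whole program is 16 bytes: two fixed-size instructions with no headers, sections, or symbols around them.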

These limitations are extremely important since these programs can get hooked on kernel code. Because the verifier checks the correctness of the code before it is loaded, the kernel can prevent the system crashes or slowdowns that malformed code would otherwise cause.

External resources

To make BPF programs truly useful, they need ways to communicate with a user mode process and manage long-term data, i.e., via maps and perf event outputs.

Although many map types exist, they all essentially behave like key-value databases, and are commonly used to share data with user mode and/or other BPF programs. Some of these types store data in per-CPU storage, making it easy to save and retrieve state when the same BPF program is run concurrently from different CPU cores.

Perf event outputs are generally used to send data to user mode programs and services, and are implemented as circular buffers.

Event sources

Without some data to process, our programs will just sit around doing nothing. BPF probes on Linux can be attached to several different event sources. For our purpose, we’re mainly interested in function tracing events.

Dynamic instrumentation

Similar to code hooking, BPF programs can be attached to any function. The probe type depends on where the target code lives. Kprobes are used when tracing kernel functions, while Uprobes are used when working with user mode libraries or binaries.

While Kprobe and Uprobe events are emitted when entering the monitored functions, Kretprobe and Uretprobe events are generated whenever the function returns. This works correctly even if the function being traced has multiple exit points. This kind of event does not forward typed syscall parameters and only comes with a pt_regs structure that contains the register values at the time of the call. Knowledge about the function prototype and system ABI is required to map back the function arguments to the right register.
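As an illustration of that last point, the sketch below maps an argument index to the register that carries it. It assumes an x86-64 Linux target; other architectures use different registers. The argumentRegister helper is a hypothetical name introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Integer/pointer argument registers for the x86-64 System V calling
// convention (regular function calls)
static const char *const kFunctionArgRegisters[] = {"rdi", "rsi", "rdx",
                                                    "rcx", "r8",  "r9"};

// The x86-64 Linux system call convention uses r10 in place of rcx
static const char *const kSyscallArgRegisters[] = {"rdi", "rsi", "rdx",
                                                   "r10", "r8",  "r9"};

// Hypothetical helper: returns the register holding the given argument, or
// nullptr when the argument is not passed in one of the six registers
const char *argumentRegister(std::size_t index, bool is_syscall) {
  if (index >= 6U) {
    return nullptr;
  }
  return is_syscall ? kSyscallArgRegisters[index]
                    : kFunctionArgRegisters[index];
}
```

For example, the third argument of a traced function would be read from the rdx value saved in pt_regs, while the fourth argument of a system call would come from r10.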

Static instrumentation

It’s not always ideal to rely on function hooking when writing a tool, because the risk of breakage increases as the kernel or software gets updated. In most cases, it’s best to use a more stable event source such as a tracepoint.

There are two types of tracepoints:

  • One for user mode code (USDT, a.k.a. User-Level Statically Defined Tracepoints)
  • One for kernel mode code (interestingly, they are referred to as just “tracepoints”).

Both types of tracepoints are defined in the source code by the programmer, essentially defining a stable interface that shouldn’t change unless strictly necessary.

If DebugFS has been enabled and mounted, registered tracepoints will all appear under the /sys/kernel/debug/tracing folder. Similar to Kprobes and Kretprobes, each system call defined in the Linux kernel comes with two different tracepoints. The first one, sys_enter, is activated whenever a program in the system transitions to a syscall handler inside the kernel, and carries information about the parameters that have been received. The second (and last) one, sys_exit, only contains the exit code of the function and is invoked whenever the syscall function terminates.
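The naming scheme can be sketched as follows. This assumes the per-syscall tracepoints live in the "syscalls" category (as with the syscalls/sys_enter_fchmodat tracepoint used later in this post) and that DebugFS is mounted at /sys/kernel/debug; the helper names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Names of the two tracepoints associated with a single system call
struct SyscallTracepoints final {
  std::string enter;
  std::string exit;
};

// Hypothetical helper: builds the tracepoint names for a syscall such as
// "fchmodat"
SyscallTracepoints syscallTracepoints(const std::string &syscall_name) {
  return {"syscalls/sys_enter_" + syscall_name,
          "syscalls/sys_exit_" + syscall_name};
}

// Hypothetical helper: path of the file that holds the numeric tracepoint id
// (assumes DebugFS is mounted at /sys/kernel/debug)
std::string tracepointIdPath(const std::string &event_name) {
  return "/sys/kernel/debug/tracing/events/" + event_name + "/id";
}
```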

BPF development prerequisites

Even though there’s no plan to use external libraries, we still have a few dependencies. The most important thing is to have access to a recent LLVM toolchain compiled with BPF support. If your system does not satisfy this requirement, it is possible—and actually encouraged—to make use of the osquery toolchain. You’ll also need CMake, as that’s what I use for the sample code.

When running inside the BPF environment, our programs make use of special helper functions that require kernel version 4.18 or newer. While it’s possible to avoid using them, it would severely limit what we can do from our code.

Using Ubuntu 20.04 or equivalent is a good bet, as it comes with both a good kernel version and an up-to-date LLVM toolchain with BPF support.
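If you want your tool to check the kernel version at startup, something along these lines should work. This is only a sketch: the helper names are made up for this example, and the minimum version to enforce is left to the caller.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <sys/utsname.h>

struct KernelVersion final {
  int major_number{0};
  int minor_number{0};
};

// Parses the leading "major.minor" pair of a release string such as
// "5.4.0-42-generic"
bool parseKernelRelease(KernelVersion &version, const std::string &release) {
  return std::sscanf(release.c_str(), "%d.%d", &version.major_number,
                     &version.minor_number) == 2;
}

// Returns true if version is greater than or equal to major.minor
bool isAtLeast(const KernelVersion &version, int major_number,
               int minor_number) {
  return version.major_number > major_number ||
         (version.major_number == major_number &&
          version.minor_number >= minor_number);
}

// Queries the running kernel through uname(2)
bool currentKernelIsAtLeast(int major_number, int minor_number) {
  struct utsname info {};
  if (uname(&info) != 0) {
    return false;
  }
  KernelVersion version;
  return parseKernelRelease(version, info.release) &&
         isAtLeast(version, major_number, minor_number);
}
```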

Some LLVM knowledge is useful, but the code doesn’t require any advanced LLVM expertise. The Kaleidoscope language tutorial on the official site is a great introduction if needed.

Writing our first program

There are many new concepts to introduce, so we’ll start simple: our first example loads a program that returns without doing anything.

First, we create a new LLVM module and a function that contains our logic:

std::unique_ptr<llvm::Module> createBPFModule(llvm::LLVMContext &context) {
  auto module = std::make_unique<llvm::Module>("BPFModule", context);
  // Target the BPF backend (the exact triple used here is an assumption)
  module->setTargetTriple("bpf-pc-linux");
  return module;
}

std::unique_ptr<llvm::Module> generateBPFModule(llvm::LLVMContext &context) {
  // Create the LLVM module for the BPF program
  auto module = createBPFModule(context);
  // BPF programs are made of a single function; we don't care about parameters
  // for the time being
  llvm::IRBuilder<> builder(context);
  auto function_type = llvm::FunctionType::get(builder.getInt64Ty(), {}, false);
  auto function = llvm::Function::Create(
      function_type, llvm::Function::ExternalLinkage, "main", module.get());
  // Ask LLVM to put this function in its own section, so we can later find it
  // more easily after we have compiled it to BPF code
  function->setSection("bpf_main_section");
  // Create the entry basic block and emit the return instruction
  auto entry_bb = llvm::BasicBlock::Create(context, "entry", function);
  builder.SetInsertPoint(entry_bb);
  builder.CreateRet(builder.getInt64(0));
  return module;
}

Since we’re not going to handle event arguments, the function we created does not accept any parameters. Not much else is happening here except the return instruction. Remember, each BPF program has exactly one function, so it’s best to ask LLVM to store them in separate sections. This makes it easier to retrieve them once the module is compiled.

We can now JIT our module to BPF bytecode using the ExecutionEngine class from LLVM:

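SectionMap compileModule(std::unique_ptr<llvm::Module> module) {
  // Create a new execution engine builder and configure it
  auto exec_engine_builder =
      std::make_unique<llvm::EngineBuilder>(std::move(module));
  exec_engine_builder->setMArch("bpf");
  SectionMap section_map;
  exec_engine_builder->setMCJITMemoryManager(
      std::make_unique<SectionMemoryManager>(section_map));
  // Create the execution engine and build the given module
  std::unique_ptr<llvm::ExecutionEngine> execution_engine(
      exec_engine_builder->create());
  execution_engine->finalizeObject();
  return section_map;
}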

Our custom SectionMemoryManager class mostly acts as a passthrough to the original SectionMemoryManager class from LLVM—it’s only there to keep track of the sections that the ExecutionEngine object creates when compiling our IR.

Once the code is built, we get back a vector of bytes for each function that was created inside the module:

int loadProgram(const std::vector<std::uint8_t> &program) {
  // The program needs to be aware how it is going to be used. We are
  // only interested in tracepoints, so we'll hardcode this value
  union bpf_attr attr = {};
  attr.prog_type = BPF_PROG_TYPE_TRACEPOINT;
  attr.log_level = 1U;
  // This is the array of (struct bpf_insn) instructions we have received
  // from the ExecutionEngine (see the compileModule() function for more
  // information)
  auto instruction_buffer_ptr = program.data();
  std::memcpy(&attr.insns, &instruction_buffer_ptr, sizeof(attr.insns));
  attr.insn_cnt =
      static_cast<__u32>(program.size() / sizeof(struct bpf_insn));
  // The license is important because we will not be able to call certain
  // helpers within the BPF VM if it is not compatible
  static const std::string kProgramLicense{"GPL"};
  auto license_ptr = kProgramLicense.c_str();
  std::memcpy(&attr.license, &license_ptr, sizeof(attr.license));
  // The verifier will provide a text disasm of our BPF program in here.
  // If there is anything wrong with our code, we'll also find some
  // diagnostic output
  std::vector<char> log_buffer(4096, 0);
  attr.log_size = static_cast<__u32>(log_buffer.size());
  auto log_buffer_ptr = log_buffer.data();
  std::memcpy(&attr.log_buf, &log_buffer_ptr, sizeof(attr.log_buf));
  auto program_fd =
      static_cast<int>(::syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)));
  if (program_fd < 0) {
    std::cerr << "Failed to load the program: " << log_buffer.data() << "\n";
  }
  return program_fd;
}

Loading the program is not hard, but as you may have noticed, there is no wrapper function for the bpf() system call we’re using, so we invoke it directly through syscall(). The tracepoint is the easiest event type to set up, and it’s what we’re using for the time being.

Once the BPF_PROG_LOAD command is issued, the in-kernel verifier will validate our program and also provide a disassembly of it inside the log buffer we’ve provided. The operation will fail if kernel output is longer than the bytes available, so only provide a log buffer in production code if the load has already failed.

Another important field in the attr union is the program license; specifying any value other than GPL may disable some of the features that are exposed to BPF. I’m not a licensing expert, but it should be possible to use different licenses for the generator and the generated code (but please speak to a lawyer and/or your employer first!).

We can now assemble the main() function using the helpers we built:

int main() {
  // Generate our BPF program
  llvm::LLVMContext context;
  auto module = generateBPFModule(context);
  // JIT the module to BPF code using the execution engine
  auto section_map = compileModule(std::move(module));
  if (section_map.size() != 1U) {
    std::cerr << "Unexpected section count\n";
    return 1;
  }
  // We have previously asked LLVM to create our function inside a specific
  // section; get our code back from it and load it
  const auto &main_program = section_map.at("bpf_main_section");
  auto program_fd = loadProgram(main_program);
  if (program_fd < 0) {
    return 1;
  }
  return 0;
}

If everything works correctly, no error is printed when the binary is run as the root user. You can find the source code for the empty program in the 00-empty folder of the companion code repository.

But…this program isn’t very exciting, since it doesn’t do anything! Now we’ll update it so we can execute it when a certain system event happens.

Creating our first useful program

In order to actually execute our BPF programs, we have to attach them to an event source.

Creating a new tracepoint event is easy; it only involves reading and writing some files from under the debugfs folder:

int createTracepointEvent(const std::string &event_name) {
  const std::string kBaseEventPath = "/sys/kernel/debug/tracing/events/";
  // This special file contains the id of the tracepoint, which is
  // required to initialize the event with perf_event_open
  std::string event_id_path = kBaseEventPath + event_name + "/id";
  // Read the tracepoint id and convert it to an integer
  auto event_file = std::fstream(event_id_path, std::ios::in);
  if (!event_file) {
    return -1;
  }
  std::stringstream buffer;
  buffer << event_file.rdbuf();
  auto str_event_id = buffer.str();
  auto event_identifier = static_cast<std::uint64_t>(
      std::strtol(str_event_id.c_str(), nullptr, 10));
  // Create the event
  struct perf_event_attr perf_attr = {};
  perf_attr.type = PERF_TYPE_TRACEPOINT;
  perf_attr.size = sizeof(struct perf_event_attr);
  perf_attr.config = event_identifier;
  perf_attr.sample_period = 1;
  perf_attr.sample_type = PERF_SAMPLE_RAW;
  perf_attr.wakeup_events = 1;
  perf_attr.disabled = 1;
  int process_id{-1};
  int cpu_index{0};
  auto event_fd =
      static_cast<int>(::syscall(__NR_perf_event_open, &perf_attr, process_id,
                                 cpu_index, -1, PERF_FLAG_FD_CLOEXEC));
  return event_fd;
}

To create the event file descriptor, we have to find the tracepoint identifier, which is in a special file called (unsurprisingly) “id.”
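Since the sample code skips error checking, here is a hedged sketch of what a stricter version of the id parsing step could look like. The parseTracepointId name and the -1 error convention are choices made for this example only.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdlib>
#include <string>

// Parses the contents of a tracepoint "id" file (a decimal number usually
// followed by a newline). Returns -1 if the text is not a valid id
long parseTracepointId(const std::string &text) {
  errno = 0;
  char *end{nullptr};
  auto value = std::strtol(text.c_str(), &end, 10);
  // Reject empty input, out-of-range values and negative numbers
  if (end == text.c_str() || errno == ERANGE || value < 0) {
    return -1;
  }
  // Only whitespace may follow the number
  while (std::isspace(static_cast<unsigned char>(*end)) != 0) {
    ++end;
  }
  return (*end == '\0') ? value : -1;
}
```

Checking the result before calling perf_event_open avoids silently attaching to tracepoint 0 when the file is missing or malformed.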

For our last step, we attach the program to the tracepoint event we just created. This is trivial and can be done with a couple of ioctl calls on the event’s file descriptor:

bool attachProgramToEvent(int event_fd, int program_fd) {
  if (ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, program_fd) < 0) {
    return false;
  }

  if (ioctl(event_fd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
    return false;
  }

  return true;
}

Our program should finally succeed in running our BPF code, but no output is generated yet since our module only really contained a return opcode. The easiest way to generate some output is to use the bpf_trace_printk helper to print a fixed string:

void generatePrintk(llvm::IRBuilder<> &builder) {
  // The bpf_trace_printk() function prototype can be found inside
  // the /usr/include/linux/bpf.h header file
  std::vector<llvm::Type *> argument_type_list = {builder.getInt8PtrTy(),
                                                  builder.getInt32Ty()};

  auto function_type =
      llvm::FunctionType::get(builder.getInt64Ty(), argument_type_list, true);

  // The helper function ID is used as the call destination
  auto function =
      builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_trace_printk),
                             llvm::PointerType::getUnqual(function_type));

  // Allocate 8 bytes on the stack
  auto buffer = builder.CreateAlloca(builder.getInt64Ty());

  // Copy the string characters to the 64-bit integer
  static const std::string kMessage{"Hello!!"};

  std::uint64_t message{0U};
  std::memcpy(&message, kMessage.c_str(), sizeof(message));

  // Store the characters inside the buffer we allocated on the stack
  builder.CreateStore(builder.getInt64(message), buffer);

  // Print the characters
  auto buffer_ptr = builder.CreateBitCast(buffer, builder.getInt8PtrTy());

  // Newer LLVM releases expect a FunctionCallee here; older ones accepted the
  // callee value directly
  auto function_callee = llvm::FunctionCallee(function_type, function);

  builder.CreateCall(function_callee, {buffer_ptr, builder.getInt32(8U)});
}

Importing new helper functions from BPF is quite easy. The first thing we need is the prototype, which can be taken from the linux/bpf.h include header. The one for bpf_trace_printk reads as follows:

 * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
 * 	Description
 * 		This helper is a "printk()-like" facility for debugging. It
 * 		prints a message defined by format *fmt* (of size *fmt_size*)
 * 		to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
 * 		available. It can take up to three additional **u64**
 * 		arguments (as an eBPF helpers, the total number of arguments is
 * 		limited to five).

Once the function type matches, we only have to assemble a call that uses the helper function ID as the destination address: BPF_FUNC_trace_printk. The generatePrintk function can now be added to our program right before we create the return instruction inside generateBPFModule.

The full source code for this program can be found in the 01-hello_open folder.

Running the program again will show the “Hello!!” string inside the /sys/kernel/debug/tracing/trace_pipe file every time the tracepoint event is emitted. Text output can be useful, but due to the BPF VM limitations the printk helper is not as useful as printf can be in a standard C program.
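The "Hello!!" message was chosen because it fits, null terminator included, in a single 64-bit store. A longer message has to be split into several 8-byte words, each stored in its own stack slot. The sketch below shows one way to do the splitting on the generator side; packMessage is a hypothetical helper and is not part of the sample code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Splits a message (plus its null terminator) into zero-padded 64-bit words,
// so that each word can be emitted with a single 64-bit store instruction
std::vector<std::uint64_t> packMessage(const std::string &message) {
  // Make room for the null terminator, then round up to whole 8-byte words
  auto byte_count = message.size() + 1U;
  auto word_count = (byte_count + 7U) / 8U;

  std::vector<std::uint64_t> word_list(word_count, 0U);
  std::memcpy(word_list.data(), message.c_str(), byte_count);

  return word_list;
}
```

Keep in mind that every word consumes part of the 512-byte stack, so long format strings get expensive quickly.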

In the next section, we’ll take a look at maps and how to use them as data storage for our programs.

Profiling system calls

Using maps to store data

Maps are a major component in most programs, and can be used in a number of different ways. Since they’re accessible from both kernel and user mode, they can be useful in storing data for later processing either from additional probes or user programs. Given the limitations that BPF imposes, they’re also commonly used to provide scratch space for handling temporary data that does not fit on the stack.

There are many map types; some are specialized for certain uses, such as storing stack traces. Others are more generic, and suitable for use as custom data containers.

Concurrency and thread safety are not just user mode problems, and BPF comes with two really useful special map types that have dedicated storage for storing values in CPU scope. These maps are commonly used to replace the stack, as a per-CPU map can be easily referenced by programs without having to worry about synchronization.

It’s rather simple to create and use maps since they all share the same interface, regardless of type. The following table, taken from the BPF header file comments, documents the most common operations:

  • BPF_MAP_CREATE: Create a map and return a file descriptor that refers to the map. The close-on-exec file descriptor flag (see fcntl(2)) is automatically enabled for the new file descriptor.
  • BPF_MAP_LOOKUP_ELEM: Look up an element by key in a specified map and return its value.
  • BPF_MAP_UPDATE_ELEM: Create or update an element (key/value pair) in a specified map.
  • BPF_MAP_DELETE_ELEM: Look up and delete an element by key in a specified map.

    The only important thing to remember is that when operating on per-CPU maps the value is not just a single entry, but an array of values that has as many items as CPU cores.
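In practice, this means user mode buffers for per-CPU maps must be sized accordingly. The sketch below computes the required size, relying on the fact that the kernel pads each per-CPU value to a multiple of 8 bytes; perCpuValueBufferSize is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Size of the buffer that user mode must provide when reading or writing a
// value in a per-CPU map: one slot per possible CPU, with each slot padded to
// a multiple of 8 bytes
std::size_t perCpuValueBufferSize(std::size_t value_size,
                                  std::size_t possible_cpu_count) {
  auto padded_value_size = (value_size + 7U) & ~static_cast<std::size_t>(7U);
  return padded_value_size * possible_cpu_count;
}
```

For instance, a map with 4-byte values on a machine with 4 possible CPUs needs a 32-byte buffer, not a 16-byte one.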

    Creating a map

    Before we can create our map, we have to determine which type we want to use. The following enum declaration has been taken from the linux/bpf.h header file:

    enum bpf_map_type {
      BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
      /* ... remaining map types omitted ... */
    };

    Most of the time we’ll use hash maps and arrays. To create a map, we fill in a bpf_attr union with the key and value sizes as well as the maximum number of entries the map can hold.

    int createMap(bpf_map_type type, std::uint32_t key_size,
                  std::uint32_t value_size, std::uint32_t key_count) {
      union bpf_attr attr = {};
      attr.map_type = type;
      attr.key_size = key_size;
      attr.value_size = value_size;
      attr.max_entries = key_count;
      return static_cast<int>(
          syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)));
    }

    Not every available operation always makes sense for all map types. For example, it’s not possible to delete entries when working with an array. Lookup operations are also going to behave differently, as they will only fail when the specified index is beyond the last element.

    Here’s the code to read a value from a map:

    // Error codes for map operations; depending on the map type, reads may
    // return NotFound if the specified key is not present
    enum class ReadMapError { Succeeded, NotFound, Failed };

    // Attempts to read a key from the specified map. Values in per-CPU maps
    // actually have multiple entries (one per CPU)
    ReadMapError readMapKey(std::vector<std::uint8_t> &value, int map_fd,
                            const void *key) {
      union bpf_attr attr = {};
      attr.map_fd = static_cast<__u32>(map_fd);
      // Use memcpy to avoid strict aliasing issues
      std::memcpy(&attr.key, &key, sizeof(attr.key));
      auto value_ptr = value.data();
      std::memcpy(&attr.value, &value_ptr, sizeof(attr.value));
      auto err =
          ::syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(union bpf_attr));
      if (err >= 0) {
        return ReadMapError::Succeeded;
      }
      if (errno == ENOENT) {
        return ReadMapError::NotFound;
      } else {
        return ReadMapError::Failed;
      }
    }

    Writing a BPF program to count syscall invocations

    In this example we’ll build a probe that counts how many times the tracepoint we’re tracing gets called. We’ll create a counter for each processor core, using a per-CPU array map that only contains a single item.

    auto map_fd = createMap(BPF_MAP_TYPE_PERCPU_ARRAY, 4U, 8U, 1U);
    if (map_fd < 0) {
      return 1;
    }

    Referencing this map from the BPF code is not too hard but requires some additional operations:

    1. Convert the map file descriptor to a map address
    2. Use the bpf_map_lookup_elem helper function to retrieve the pointer to the desired map entry
    3. Check the returned pointer to make sure the operation has succeeded (the verifier will reject our program otherwise)
    4. Update the counter value

    The map address can be obtained through a special LLVM intrinsic called “pseudo” (llvm.bpf.pseudo).

    // Returns the pseudo intrinsic, useful to convert file descriptors (like maps
    // and perf event outputs) to map addresses so they can be used from the BPF VM
    llvm::Function *getPseudoFunction(llvm::IRBuilder<> &builder) {
      auto &insert_block = *builder.GetInsertBlock();
      auto &module = *insert_block.getModule();
      auto pseudo_function = module.getFunction("llvm.bpf.pseudo");
      if (pseudo_function == nullptr) {
        // clang-format off
        auto pseudo_function_type = llvm::FunctionType::get(
          builder.getInt64Ty(),
          { builder.getInt64Ty(), builder.getInt64Ty() },
          false
        );
        // clang-format on
        pseudo_function = llvm::Function::Create(pseudo_function_type,
                                                 llvm::GlobalValue::ExternalLinkage,
                                                 "llvm.bpf.pseudo", module);
      }
      return pseudo_function;
    }

    // Converts the given (map or perf event output) file descriptor to a map
    // address
    llvm::Value *mapAddressFromFileDescriptor(int fd, llvm::IRBuilder<> &builder) {
      auto pseudo_function = getPseudoFunction(builder);
      // clang-format off
      auto map_integer_address_value = builder.CreateCall(
        pseudo_function,
        {
          builder.getInt64(BPF_PSEUDO_MAP_FD),
          builder.getInt64(static_cast<std::uint64_t>(fd))
        }
      );
      // clang-format on
      return builder.CreateIntToPtr(map_integer_address_value,
                                    builder.getInt8PtrTy());
    }

    Importing the bpf_map_lookup_elem helper function follows the same procedure we used to import the bpf_trace_printk one. Looking at the linux/bpf.h, the prototype reads:

     * void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
     * 	Description
     * 		Perform a lookup in *map* for an entry associated to *key*.
     * 	Return
     * 		Map value associated to *key*, or **NULL** if no entry was
     * 		found.

    Notice how the key parameter is passed by pointer and not by value. We’ll have to allocate the actual key on the stack using CreateAlloca. Since allocations should always happen in the first (entry) basic block, our function will accept a pre-filled buffer as key. The return type is a void pointer, but we can save work if we directly declare the function with the correct value type.

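    // Attempts to retrieve a pointer to the specified key inside the map_fd map
    llvm::Value *bpfMapLookupElem(llvm::IRBuilder<> &builder, llvm::Value *key,
                                  llvm::Type *value_type, int map_fd) {
      // Parameters: the map address and a pointer to the (32-bit) key
      std::vector<llvm::Type *> argument_type_list = {
          builder.getInt8PtrTy(), builder.getInt32Ty()->getPointerTo()};
      auto function_type = llvm::FunctionType::get(value_type->getPointerTo(),
                                                   argument_type_list, false);
      // The helper function ID is used as the call destination
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_map_lookup_elem),
                                 llvm::PointerType::getUnqual(function_type));
      auto map_address = mapAddressFromFileDescriptor(map_fd, builder);
      // Newer LLVM releases expect a FunctionCallee here
      auto function_callee = llvm::FunctionCallee(function_type, function);
      return builder.CreateCall(function_callee, {map_address, key});
    }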

    Back to the BPF program generator, we can now call the new bpfMapLookupElem to retrieve the first value in our array map:

    auto map_key_buffer = builder.CreateAlloca(builder.getInt32Ty());
    builder.CreateStore(builder.getInt32(0U), map_key_buffer);
    auto counter_ptr =
        bpfMapLookupElem(builder, map_key_buffer, builder.getInt32Ty(), map_fd);

    Since we are using a per-CPU array map, the pointer that returns from this function references a private array entry for the core we’re running on. Before we can use it, however, we have to test whether the function has succeeded; otherwise, the verifier will reject the program. This is trivial and can be done with a comparison instruction and a new basic block.

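    auto null_ptr = llvm::Constant::getNullValue(counter_ptr->getType());
    auto cond = builder.CreateICmpEQ(null_ptr, counter_ptr);
    auto error_bb = llvm::BasicBlock::Create(context, "error", function);
    auto continue_bb = llvm::BasicBlock::Create(context, "continue", function);
    builder.CreateCondBr(cond, error_bb, continue_bb);
    // If the lookup failed, return right away
    builder.SetInsertPoint(error_bb);
    builder.CreateRet(builder.getInt64(0));
    // Otherwise, keep emitting code inside the "continue" block
    builder.SetInsertPoint(continue_bb);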

    The pointer to the counter value can now be dereferenced without causing a validation error from the verifier.

    auto counter = builder.CreateLoad(counter_ptr);
    auto new_value = builder.CreateAdd(counter, builder.getInt32(1));
    builder.CreateStore(new_value, counter_ptr);

    There is no need to import and use the bpf_map_update_elem() helper function since we can directly increment the value from the pointer we received. We only have to load the value from the pointer, increment it, and then store it back where it was.

    Once we have finished with our tracer, we can retrieve the counters and inspect them:

    auto processor_count = getProcessorCount();
    std::vector<std::uint8_t> value(processor_count * sizeof(std::uint64_t));
    std::uint32_t key{0U};
    auto map_error = readMapKey(value, map_fd, &key);
    if (map_error != ReadMapError::Succeeded) {
      std::cerr << "Failed to read from the map\n";
      return 1;
    }
    std::vector<std::uint64_t> per_cpu_counters(processor_count);
    std::memcpy(per_cpu_counters.data(), value.data(), value.size());

    When dealing with per-CPU maps, it is important to not rely on get_nprocs_conf and use /sys/devices/system/cpu/possible instead. On VMware Fusion for example, the vcpu.hotadd setting will cause Linux to report 128 possible CPUs when enabled, regardless of how many cores have been actually assigned to the virtual machine.
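The /sys/devices/system/cpu/possible file contains a CPU list such as "0-127" or "0,2-5". As a rough sketch, the entries can be counted like this; countCpusInList is a hypothetical name, and it may differ from how getProcessorCount is implemented in the sample code.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>

// Counts the CPUs described by a kernel CPU list such as "0-3" or "0,2-5".
// Returns 0 if the list is empty or malformed
std::size_t countCpusInList(const std::string &cpu_list) {
  std::size_t count{0U};
  std::stringstream stream(cpu_list);
  std::string range;

  try {
    // Each comma-separated item is either a single index or a "first-last" range
    while (std::getline(stream, range, ',')) {
      auto dash_index = range.find('-');
      auto first = std::stoul(range.substr(0U, dash_index));
      auto last = first;
      if (dash_index != std::string::npos) {
        last = std::stoul(range.substr(dash_index + 1U));
      }
      if (last < first) {
        return 0U;
      }
      count += (last - first) + 1U;
    }
  } catch (const std::exception &) {
    return 0U;
  }

  return count;
}
```

The string passed to this function is expected to be the full contents of the file, trailing newline included.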

    The full sample code can be found in the 02-syscall_counter folder.

    One interesting experiment is to attach this program to the system call tracepoint used by the chmod command line tool to update file modes. The strace debugging utility can help determine which syscall is being used. In this case we are going to be monitoring the following tracepoint: syscalls/sys_enter_fchmodat.

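    The taskset command can be used to force the fchmodat syscall to be called from a specific processor. Note that, without the -c option, its first argument is a hexadecimal CPU bitmask rather than a CPU index:

    taskset 1 chmod <mode> /path/to/file # mask 0x1: first CPU (CPU 0)
    taskset 2 chmod <mode> /path/to/file # mask 0x2: second CPU (CPU 1)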
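When taskset is invoked without the -c option, it interprets its first argument as a hexadecimal bitmask where bit N selects CPU N. A small helper (hypothetical, for illustration only) makes the conversion from a CPU index explicit:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Returns the taskset bitmask that selects a single CPU, or an empty string
// if the index does not fit in a 64-bit mask
std::string tasksetMask(unsigned int cpu_index) {
  if (cpu_index >= 64U) {
    return {};
  }

  std::stringstream buffer;
  buffer << "0x" << std::hex << (std::uint64_t{1U} << cpu_index);
  return buffer.str();
}
```

Alternatively, taskset -c accepts plain CPU indexes, which is often less error-prone.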

    Using perf event outputs

    Maps can be a really powerful way to store data for later processing, but it’s impossible for user mode programs to know when and where new data is available for reading.

    Perf event outputs can help solve this problem, since they enable the program to be notified whenever new data is available. Additionally, since they behave like a circular buffer, we do not have the same size limitations we have when setting map values.
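To give an idea of what "circular buffer" means for the reader side, here is a simplified model: data is addressed by an ever-increasing offset, and reads wrap around when they reach the end of the storage. The real perf buffer also has record headers and head/tail counters kept in a metadata page, which this sketch leaves out; readFromRing is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies `size` bytes starting at logical position `offset` out of a ring
// buffer, wrapping around the end of the storage when needed. Returns an
// empty vector if the request cannot be satisfied
std::vector<std::uint8_t> readFromRing(const std::vector<std::uint8_t> &ring,
                                       std::uint64_t offset, std::size_t size) {
  std::vector<std::uint8_t> output;
  if (ring.empty() || size > ring.size()) {
    return output;
  }

  output.reserve(size);
  for (std::size_t i = 0U; i < size; ++i) {
    output.push_back(ring[(offset + i) % ring.size()]);
  }

  return output;
}
```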

    In this section, we’ll build an application that can measure how much time it takes to handle a system call. To make this work, we’ll attach a program to both the entry and exit points of a tracepoint to gather timestamps.
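Conceptually, the measurement boils down to remembering the entry timestamp for each thread and subtracting it from the exit timestamp. The class below is a plain user mode model of that idea, written under the assumption that events are keyed by thread ID and carry a monotonic timestamp (inside a BPF program, one could come from the bpf_ktime_get_ns helper). SyscallTimer is a hypothetical name, and this is not necessarily how the sample code pairs the events.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Pairs syscall entry and exit events by thread ID to compute durations
class SyscallTimer final {
public:
  // Records the timestamp at which the given thread entered the syscall
  void onEnter(std::uint64_t thread_id, std::uint64_t timestamp) {
    pending_[thread_id] = timestamp;
  }

  // Returns true and sets `duration` if a matching entry event was recorded
  bool onExit(std::uint64_t &duration, std::uint64_t thread_id,
              std::uint64_t timestamp) {
    auto it = pending_.find(thread_id);
    if (it == pending_.end()) {
      return false;
    }

    auto start_time = it->second;
    pending_.erase(it);

    if (timestamp < start_time) {
      return false;
    }

    duration = timestamp - start_time;
    return true;
  }

private:
  std::unordered_map<std::uint64_t, std::uint64_t> pending_;
};
```

Exit events without a matching entry (for example, syscalls that were already in progress when tracing started) are simply ignored.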


    Before we start creating our perf output, we have to create a structure to hold our resources. In total, we’ll have a file descriptor for the map and then a perf output per processor, along with its own memory mapping.

    struct PerfEventArray final {
      int fd;
      std::vector<int> output_fd_list;
      std::vector<void *> mapped_memory_pointers;
    };

    To initialize it, we have to create a BPF map of the type PERF_EVENT_ARRAY first. This special data structure maps a specific CPU index to a private perf event output specified as a file descriptor. For it to function properly, we must use the following parameters when creating the map:

    1. Key size must be set to 4 bytes (CPU index).
    2. Value size must be set to 4 bytes (size of a file descriptor specified with an int).
    3. Entry count must be set to a value greater than or equal to the number of processors.

    auto processor_count = getProcessorCount();
    // Create the perf event array map
    obj.fd = createMap(BPF_MAP_TYPE_PERF_EVENT_ARRAY, 4U, 4U, processor_count);
    if (obj.fd < 0) {
      return false;
    }

    When we looked at maps in the previous sections, we only focused on reading. For the next steps we also need to write new values, so let’s take a look at how to set keys.

    ReadMapError setMapKey(std::vector<std::uint8_t> &value, int map_fd,
                           const void *key) {
      union bpf_attr attr = {};
      attr.flags = BPF_ANY; // Always set the value
      attr.map_fd = static_cast<__u32>(map_fd);
      // Use memcpy to avoid strict aliasing issues
      std::memcpy(&attr.key, &key, sizeof(attr.key));
      auto value_ptr = value.data();
      std::memcpy(&attr.value, &value_ptr, sizeof(attr.value));
      auto err = ::syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
      if (err < 0) {
        return ReadMapError::Failed;
      }
      return ReadMapError::Succeeded;
    }

    This is not too different from how we read map values, but this time we don’t have to deal with the chance that the key may not be present. As always when dealing with per-CPU maps, the data pointer should be considered as an array containing one value per CPU.

    The next step is to create a perf event output for each online processor with the perf_event_open system call, using the special PERF_COUNT_SW_BPF_OUTPUT config value.

    struct perf_event_attr attr {};
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_BPF_OUTPUT;
    attr.sample_period = 1;
    attr.sample_type = PERF_SAMPLE_RAW;
    attr.wakeup_events = 1;
    std::uint32_t processor_index;
    for (processor_index = 0U; processor_index < processor_count;
         ++processor_index) {
      // clang-format off
      auto perf_event_fd = ::syscall(
        __NR_perf_event_open,
        &attr,
        -1,               // Process ID (unused)
        processor_index,  // 0 -> getProcessorCount()
        -1,               // Group ID (unused)
        0                 // Flags (unused)
      );
      // clang-format on
      if (perf_event_fd == -1) {
        return false;
      }
      // Keep track of the new perf event output
      obj.output_fd_list.push_back(static_cast<int>(perf_event_fd));
    }

    Now that we have the file descriptors, we can populate the perf event array map we created:

    // Set the perf event output file descriptors inside the map
    processor_index = 0U;
    for (auto perf_event_fd : obj.output_fd_list) {
      std::vector<std::uint8_t> value(4);
      std::memcpy(value.data(), &perf_event_fd, sizeof(perf_event_fd));
      auto err = setMapKey(value, obj.fd, &processor_index);
      if (err != ReadMapError::Succeeded) {
        return false;
      }
      // Each output is stored under the index of the CPU it belongs to
      ++processor_index;
    }

    Finally, we create a memory mapping for each perf output:

    // Create a memory mapping for each output
    auto size = static_cast<std::size_t>(1 + std::pow(2, page_count));
    size *= static_cast<std::size_t>(getpagesize());
    for (auto &perf_event_fd : obj.output_fd_list) {
      auto ptr = mmap(nullptr,                // Desired base address (unused)
                      size,                   // Mapped memory size
                      PROT_READ | PROT_WRITE, // Memory protection
                      MAP_SHARED,             // Flags
                      perf_event_fd,          // The perf output handle
                      0                       // Offset (unused)
      );
      if (ptr == MAP_FAILED) {
        return false;
      }
      // Keep the pointer for later reads; this assumes PerfEventArray stores
      // its mappings in a member named mapped_memory_pointers
      obj.mapped_memory_pointers.push_back(ptr);
    }

    This is the memory we’ll read from when capturing the BPF program output.
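
    The size passed to mmap() follows a fixed pattern: one metadata page followed by a power-of-two number of data pages. Here is a minimal sketch of that arithmetic; the helper name is hypothetical, and it uses an integer shift in place of std::pow to stay clear of floating point conversions.

```cpp
#include <cstddef>

// One metadata page followed by 2^page_count_exponent data pages
std::size_t perfMappingSize(unsigned page_count_exponent,
                            std::size_t page_size) {
  auto data_pages = static_cast<std::size_t>(1U) << page_count_exponent;
  return (1U + data_pages) * page_size;
}
```

    With 4096-byte pages, an exponent of 3 gives 9 pages in total, or 36864 bytes.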

    Writing a BPF program to profile system calls

    Now that we have a file descriptor of the perf event array map, we can use it from within the BPF code to send data with the bpf_perf_event_output helper function. Here’s the prototype from linux/bpf.h:

     * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
     * 	Description
     * 		Write raw *data* blob into a special BPF perf event held by
     * 		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
     * 		event must have the following attributes: **PERF_SAMPLE_RAW**
     * 		as **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
     * 		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
     * 		The *flags* are used to indicate the index in *map* for which
     * 		the value must be put, masked with **BPF_F_INDEX_MASK**.
     * 		Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
     * 		to indicate that the index of the current CPU core should be
     * 		used.
     * 		The value to write, of *size*, is passed through eBPF stack and
     * 		pointed by *data*.
     * 		The context of the program *ctx* needs also be passed to the
     * 		helper.

    The ctx parameter must always be set to the value of the first argument received by the entry point function of the BPF program.

    The map address is obtained with the LLVM pseudo intrinsic that we imported in the previous section. Data and size are self-explanatory, but it is important to remember that the memory pointer must reside inside the BPF program (i.e., we can’t pass a user pointer).

    The flags parameter can be used as a CPU index mask to select the perf event output this data should be sent to. The special value BPF_F_CURRENT_CPU asks the BPF VM to automatically use the index of the processor we’re running on.

    // Sends the specified buffer to the map_fd perf event output
    llvm::Value *bpfPerfEventOutput(llvm::IRBuilder<> &builder, llvm::Value *ctx,
                                    int map_fd, std::uint64_t flags,
                                    llvm::Value *data, llvm::Value *size) {
      // clang-format off
      std::vector<llvm::Type *> argument_type_list = {
        // Context
        ctx->getType(),
        // Map address
        llvm::PointerType::getUnqual(builder.getInt8Ty()),
        // Flags
        builder.getInt64Ty(),
        // Data pointer
        data->getType(),
        // Size
        builder.getInt64Ty()
      };
      // clang-format on
      auto function_type =
          llvm::FunctionType::get(builder.getInt32Ty(), argument_type_list, false);
      // BPF helpers are called through their numeric ID
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_perf_event_output),
                                 llvm::PointerType::getUnqual(function_type));
      auto map_address = mapAddressFromFileDescriptor(map_fd, builder);
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
      return builder.CreateCall(
          function_callee, {ctx, map_address, builder.getInt64(flags), data, size});
    }

    The file descriptor and flags parameters are most likely known at compile time, so we can make the function a little more user friendly by accepting integer types. The buffer size, however, is often determined at runtime, so it’s best to use an llvm::Value pointer.

    While it’s possible to just send the raw timestamps whenever we enter and leave the system call of our choice, it’s much easier and more efficient to compute what we need directly inside the BPF code. To do this we’ll use a hash map shared across two different BPF programs: one for the sys_enter event, and another one for sys_exit.

    From the enter program, we’ll save the system timestamp in the map. When the exit program is invoked, we’ll retrieve it and use it to determine how much time it took. The resulting value is then sent to the user mode program using the perf output.
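
    Before generating any bytecode, it can help to model the two programs in plain C++. In this sketch a std::unordered_map stands in for the BPF hash map; the type and function names are hypothetical, and the timestamps are passed in by the caller instead of coming from a BPF helper.

```cpp
#include <cstdint>
#include <unordered_map>

// Stand-in for the BPF hash map: key -> timestamp saved at sys_enter
using TimestampMap = std::unordered_map<std::uint64_t, std::uint64_t>;

// What the enter program does: remember when this thread entered the syscall
void onSysEnter(TimestampMap &map, std::uint64_t key, std::uint64_t now_ns) {
  map[key] = now_ns;
}

// What the exit program does: look up the saved timestamp and compute how
// much time has passed. Returns false when no timestamp was saved, which
// mirrors the failed lookup path of the BPF program
bool onSysExit(std::uint64_t &elapsed_ns, const TimestampMap &map,
               std::uint64_t key, std::uint64_t now_ns) {
  auto it = map.find(key);
  if (it == map.end()) {
    return false;
  }
  elapsed_ns = now_ns - it->second;
  return true;
}
```

    An exit event without a matching enter event (for example, a system call that was already in progress when tracing started) simply produces no sample.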

    Creating the map is easy, and we can re-use the map helpers we wrote in the previous sections. Both the timestamp and the map key are 64-bit values, so we’ll use 8 bytes for both:

    auto map_fd = createMap(BPF_MAP_TYPE_HASH, 8U, 8U, 100U);
    if (map_fd < 0) {
      std::cerr << "Failed to create the map\n";
      return 1;
    }

    Writing the enter program

    We will need to generate a key for our map. A combination of the process ID and thread ID is a good candidate for this:

     * u64 bpf_get_current_pid_tgid(void)
     * 	Return
     * 		A 64-bit integer containing the current tgid and pid, and
     * 		created as such:
     * 		*current_task*\ **->tgid << 32 \|**
     * 		*current_task*\ **->pid**.
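
    Going the other way, a user mode program that receives this 64-bit value can split it back into its two halves. The function names below are hypothetical. Keep in mind that in kernel terminology the tgid is what user mode calls the process ID, while the pid is the thread ID.

```cpp
#include <cstdint>

// Upper 32 bits: the tgid, which user mode knows as the process ID
std::uint32_t tgidFromPidTgid(std::uint64_t pid_tgid) {
  return static_cast<std::uint32_t>(pid_tgid >> 32U);
}

// Lower 32 bits: the kernel pid, which user mode knows as the thread ID
std::uint32_t pidFromPidTgid(std::uint64_t pid_tgid) {
  return static_cast<std::uint32_t>(pid_tgid & 0xFFFFFFFFULL);
}
```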

    Next, we need the system timestamp. The bpf_ktime_get_ns helper counts time from system boot rather than from the epoch, but that is fine here since we only use it to compute how long the system call took.

     * u64 bpf_ktime_get_ns(void)
     *  Description
     *    Return the time elapsed since system boot, in nanoseconds.
     *  Return
     *    Current *ktime*.
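
    For comparison, the closest user mode clock is CLOCK_MONOTONIC. The following sketch (with hypothetical function names) reads it with clock_gettime() and shows why the starting point of the clock does not matter when we only need a duration.

```cpp
#include <cstdint>
#include <time.h>

// Nanoseconds read from CLOCK_MONOTONIC
std::uint64_t monotonicNowNs() {
  struct timespec ts = {};
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (static_cast<std::uint64_t>(ts.tv_sec) * 1000000000ULL) +
         static_cast<std::uint64_t>(ts.tv_nsec);
}

// Duration between two readings of the same clock; whatever the clock
// started counting from cancels out in the subtraction
std::uint64_t elapsedNs(std::uint64_t enter_ns, std::uint64_t exit_ns) {
  return exit_ns - enter_ns;
}
```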

    By now you should be well versed in importing helper functions, so here are the two definitions:

    // Returns a 64-bit integer that contains both the process and thread id
    llvm::Value *bpfGetCurrentPidTgid(llvm::IRBuilder<> &builder) {
      auto function_type = llvm::FunctionType::get(builder.getInt64Ty(), {}, false);
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_get_current_pid_tgid),
                                 llvm::PointerType::getUnqual(function_type));
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
      return builder.CreateCall(function_callee, {});
    }
    // Returns the number of nanoseconds elapsed since system boot
    llvm::Value *bpfKtimeGetNs(llvm::IRBuilder<> &builder) {
      auto function_type = llvm::FunctionType::get(builder.getInt64Ty(), {}, false);
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_ktime_get_ns),
                                 llvm::PointerType::getUnqual(function_type));
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
      return builder.CreateCall(function_callee, {});
    }

    We can now use the newly defined functions to generate a map key and acquire the system timestamp:

    // Map keys and values are passed by pointer; create two buffers on the
    // stack and initialize them
    auto map_key_buffer = builder.CreateAlloca(builder.getInt64Ty());
    auto timestamp_buffer = builder.CreateAlloca(builder.getInt64Ty());
    auto current_pid_tgid = bpfGetCurrentPidTgid(builder);
    builder.CreateStore(current_pid_tgid, map_key_buffer);
    auto timestamp = bpfKtimeGetNs(builder);
    builder.CreateStore(timestamp, timestamp_buffer);

    For this program we have replaced the array map we used in the previous sections with a hash map. Unlike array elements, hash map entries do not exist until they are inserted, so we can no longer rely on the bpf_map_lookup_elem() helper alone: a lookup for a key that has not been stored yet fails (with ENOENT when going through the system call).

    To fix this, we have to import a new helper named bpf_map_update_elem():

    * int bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
    * 	Description
    * 		Add or update the value of the entry associated to *key* in
    * 		*map* with *value*. *flags* is one of:
    * 		**BPF_NOEXIST**
    * 			The entry for *key* must not exist in the map.
    * 		**BPF_EXIST**
    * 			The entry for *key* must already exist in the map.
    * 		**BPF_ANY**
    * 			No condition on the existence of the entry for *key*.
    * 		Flag value **BPF_NOEXIST** cannot be used for maps of types
    * 		**BPF_MAP_TYPE_ARRAY** or **BPF_MAP_TYPE_PERCPU_ARRAY**  (all
    * 		elements always exist), the helper would return an error.
    * 	Return
    * 		0 on success, or a negative error in case of failure.
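
    The three flag values are easier to remember with a small model. The sketch below imitates their documented behavior on top of a std::unordered_map. It is only an illustration: the function and constant names are made up, other flags such as BPF_F_LOCK are not modeled, and it says nothing about how the kernel implements the helper.

```cpp
#include <cerrno>
#include <cstdint>
#include <unordered_map>

// Same numeric values as BPF_ANY, BPF_NOEXIST and BPF_EXIST in linux/bpf.h
constexpr std::uint64_t kBpfAny = 0U;
constexpr std::uint64_t kBpfNoExist = 1U;
constexpr std::uint64_t kBpfExist = 2U;

using ModelMap = std::unordered_map<std::uint64_t, std::uint64_t>;

// Returns 0 on success or a negative errno value, like the real helper
int modelMapUpdateElem(ModelMap &map, std::uint64_t key, std::uint64_t value,
                       std::uint64_t flags) {
  if (flags > kBpfExist) {
    return -EINVAL; // Unknown flag for this simplified model
  }
  auto exists = (map.count(key) != 0U);
  if (flags == kBpfNoExist && exists) {
    return -EEXIST; // The entry must not exist, but it does
  }
  if (flags == kBpfExist && !exists) {
    return -ENOENT; // The entry must exist, but it does not
  }
  map[key] = value;
  return 0;
}
```

    Since the enter program has to work both the first time a thread is seen and every time after that, BPF_ANY is the natural choice for our use case.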

    We’ll keep the map file descriptor and flag values as integers, since we know their values before the module is compiled.

    // Updates the value of the specified key inside the map_fd BPF map
    llvm::Value *bpfMapUpdateElem(llvm::IRBuilder<> &builder, int map_fd,
                                  llvm::Value *key, llvm::Value *value,
                                  std::uint64_t flags) {
      // clang-format off
      std::vector<llvm::Type *> argument_type_list = {
        // Map address
        llvm::PointerType::getUnqual(builder.getInt8Ty()),
        // Key
        key->getType(),
        // Value
        value->getType(),
        // Flags
        builder.getInt64Ty()
      };
      // clang-format on
      auto function_type =
          llvm::FunctionType::get(builder.getInt64Ty(), argument_type_list, false);
      // BPF helpers are called through their numeric ID
      auto function =
          builder.CreateIntToPtr(builder.getInt64(BPF_FUNC_map_update_elem),
                                 llvm::PointerType::getUnqual(function_type));
      auto map_address = mapAddressFromFileDescriptor(map_fd, builder);
    #if LLVM_VERSION_MAJOR < 11
      auto function_callee = function;
    #else
      auto function_callee = llvm::FunctionCallee(function_type, function);
    #endif
      return builder.CreateCall(function_callee,
                                {map_address, key, value, builder.getInt64(flags)});
    }

    We can now store the timestamp inside the map and close the enter program:

    // Save the timestamp inside the map
    bpfMapUpdateElem(builder, map_fd, map_key_buffer, timestamp_buffer, BPF_ANY);

    Writing the exit program

    In this program, we’ll retrieve the timestamp we stored and use it to measure how much time we’ve spent inside the system call. Once we have the result, we’ll send it to user mode using the perf output.

    When creating the llvm::Function for this program, we must define at least one argument. This value will be required later for the ctx parameter that we have to pass to the bpf_perf_event_output() helper.

    First, we have to acquire the map entry; as always, we must check for any possible error or the verifier will not let us load our program.

    // Create the entry basic block
    auto entry_bb = llvm::BasicBlock::Create(context, "entry", function);
    builder.SetInsertPoint(entry_bb);
    // Map keys are passed by pointer; create a buffer on the stack and initialize
    // it
    auto map_key_buffer = builder.CreateAlloca(builder.getInt64Ty());
    auto current_pid_tgid = bpfGetCurrentPidTgid(builder);
    builder.CreateStore(current_pid_tgid, map_key_buffer);
    // Check the pointer and make sure the lookup has succeeded; this is
    // mandatory, or the BPF verifier will refuse to load our program
    auto timestamp_ptr =
        bpfMapLookupElem(builder, map_key_buffer, builder.getInt64Ty(), map_fd);
    auto null_ptr = llvm::Constant::getNullValue(timestamp_ptr->getType());
    auto cond = builder.CreateICmpEQ(null_ptr, timestamp_ptr);
    auto error_bb = llvm::BasicBlock::Create(context, "error", function);
    auto continue_bb = llvm::BasicBlock::Create(context, "continue", function);
    builder.CreateCondBr(cond, error_bb, continue_bb);
    // Terminate the program if the pointer is not valid
    builder.SetInsertPoint(error_bb);
    builder.CreateRet(builder.getInt64(0));
    // In this new basic block, the pointer is valid
    builder.SetInsertPoint(continue_bb);

    Next, we want to read our previous timestamp and subtract it from the current time:

    // Read back the old timestamp and obtain the current one
    auto enter_timestamp = builder.CreateLoad(timestamp_ptr);
    auto exit_timestamp = bpfKtimeGetNs(builder);
    // Measure how much it took to go from the first instruction to the return
    auto time_consumed = builder.CreateSub(exit_timestamp, enter_timestamp);

    The bpf_perf_event_output() helper expects a buffer, so we have to store our result somewhere in memory. We can re-use the map value address so we don’t have to allocate more of the very limited stack space:

    builder.CreateStore(time_consumed, timestamp_ptr);

    Remember, we have to pass the first program argument to the ctx parameter; the arg_begin method of an llvm::Function returns exactly that:

    // Send the result to the perf event array
    auto ctx = function->arg_begin();
    bpfPerfEventOutput(builder, ctx, perf_fd, static_cast<std::uint32_t>(-1UL),
                       timestamp_ptr, builder.getInt64(8U));

    Truncated to 32 bits, -1UL becomes 0xFFFFFFFF, which is the value of BPF_F_CURRENT_CPU. Using it as the flag value means that BPF will automatically send this data to the perf event output associated with the CPU we’re running on.
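
    The 32-bit truncation matters because the helper rejects flags that use bits outside the index mask. The sketch below spells this out; the constants mirror the values of BPF_F_INDEX_MASK and BPF_F_CURRENT_CPU from linux/bpf.h, while the names themselves are hypothetical.

```cpp
#include <cstdint>

// Same values as BPF_F_INDEX_MASK and BPF_F_CURRENT_CPU in linux/bpf.h
constexpr std::uint64_t kIndexMask = 0xFFFFFFFFULL;
constexpr std::uint64_t kCurrentCpu = kIndexMask;

// Builds the flags value that selects an explicit perf output index
std::uint64_t flagsForCpuIndex(std::uint32_t cpu_index) {
  return static_cast<std::uint64_t>(cpu_index) & kIndexMask;
}

// True when the flags value only uses the bits covered by the index mask
bool onlyUsesIndexBits(std::uint64_t flags) {
  return (flags & ~kIndexMask) == 0U;
}
```

    A full 64-bit value with all bits set would not pass this check, which is why the cast to a 32-bit type comes before the value is widened again.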

    Reading data from the perf outputs

    In our user mode program, we can access the perf buffers through the memory mappings we created. The list of perf event output descriptors can be used together with the poll() function using an array of pollfd structures. When one of the file descriptors we have set becomes readable, the corresponding memory mapping will contain the data sent by the BPF program.

    // Uses poll() to wait for the next event happening on the perf event output
    bool waitForPerfData(std::vector<std::size_t> &readable_outputs,
                         const PerfEventArray &obj, int timeout) {
      readable_outputs = {};
      // Collect all the perf event output file descriptors inside a
      // pollfd structure
      std::vector<struct pollfd> poll_fd_list;
      for (auto fd : obj.output_fd_list) {
        struct pollfd poll_fd = {};
        poll_fd.fd = fd;
        poll_fd.events = POLLIN;
        poll_fd_list.push_back(poll_fd);
      }
      // Use poll() to determine which outputs are readable
      auto err = ::poll(poll_fd_list.data(), poll_fd_list.size(), timeout);
      if (err < 0) {
        if (errno == EINTR) {
          return true;
        }
        return false;
      } else if (err == 0) {
        return true;
      }
      // Save the index of the outputs that can be read inside the vector;
      // poll() reports readiness in revents, not in events
      for (auto it = poll_fd_list.begin(); it != poll_fd_list.end(); ++it) {
        auto ready = ((it->revents & POLLIN) != 0);
        if (ready) {
          auto index = static_cast<std::size_t>(it - poll_fd_list.begin());
          readable_outputs.push_back(index);
        }
      }
      return true;
    }
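
    The same polling pattern can be tried without any BPF program by using ordinary pipes in place of the perf event outputs. This is a minimal, self-contained sketch with a hypothetical function name; it reads readiness from the revents field, which is the one the kernel fills in when poll() returns.

```cpp
#include <cstddef>
#include <vector>

#include <poll.h>
#include <unistd.h>

// Returns the indexes of the descriptors that poll() reported as readable;
// the list is empty on error or timeout
std::vector<std::size_t> readableIndexes(const std::vector<int> &fd_list,
                                         int timeout_ms) {
  std::vector<struct pollfd> poll_fd_list;
  for (auto fd : fd_list) {
    struct pollfd poll_fd = {};
    poll_fd.fd = fd;
    poll_fd.events = POLLIN;
    poll_fd_list.push_back(poll_fd);
  }

  std::vector<std::size_t> ready;
  if (::poll(poll_fd_list.data(), poll_fd_list.size(), timeout_ms) <= 0) {
    return ready;
  }

  for (std::size_t i = 0U; i < poll_fd_list.size(); ++i) {
    // events only echoes our request; revents holds the result
    if ((poll_fd_list[i].revents & POLLIN) != 0) {
      ready.push_back(i);
    }
  }

  return ready;
}
```

    Given two pipes where only the second one has pending data, the function returns a list containing just the index 1.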

    Inside the memory we have mapped, the perf_event_mmap_page header will describe the properties and boundaries of the allocated circular buffer.

    The structure is too large to reproduce here, but the most important fields are:

    __u64  data_head;   /* head in the data section */
    __u64  data_tail;   /* user-space written tail */
    __u64  data_offset; /* where the buffer starts */
    __u64  data_size;   /* data buffer size */

    The base of the data allocation is located at the offset data_offset; to find the start of our buffer, however, we have to add it to the data_tail value, making sure to wrap around whenever we exceed the data allocation size specified by the data_size field:

    buffer_start = mapped_memory + data_offset + (data_tail % data_size)

    Similarly, the data_head field can be used to find the end of the buffer:

    buffer_end = mapped_memory + data_offset + (data_head % data_size)

    If the end of the buffer is at a lower offset compared to the start, then data is wrapping at the data_size edge and the read has to happen with two operations.

    When extracting data, the program is expected to confirm the read by updating the data_tail value and adding the number of bytes processed, while the kernel will advance the data_head field automatically as new bytes are received. Data is lost when the data_head offset wraps around and crosses data_tail; a special structure inside this buffer will warn the program if this happens.
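
    The wrap-around logic can be exercised on its own with a plain byte array. The following is a minimal sketch, not the code used in the sample: the function name is hypothetical, it assumes that no more than data_size bytes are pending, and it leaves out the memory barriers that a production reader would want around the data_head and data_tail accesses.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies the unread bytes between data_tail and data_head out of a circular
// data area of data_size bytes, splitting the copy in two when it wraps
std::vector<std::uint8_t> copyUnreadBytes(const std::uint8_t *data_area,
                                          std::uint64_t data_size,
                                          std::uint64_t data_head,
                                          std::uint64_t data_tail) {
  auto byte_count = static_cast<std::size_t>(data_head - data_tail);
  std::vector<std::uint8_t> output(byte_count);
  if (byte_count == 0U) {
    return output;
  }

  auto start = static_cast<std::size_t>(data_tail % data_size);
  auto bytes_until_edge = static_cast<std::size_t>(data_size) - start;

  if (byte_count <= bytes_until_edge) {
    // The unread region is contiguous
    std::memcpy(output.data(), data_area + start, byte_count);
  } else {
    // First copy up to the edge, then continue from the base of the area
    std::memcpy(output.data(), data_area + start, bytes_until_edge);
    std::memcpy(output.data() + bytes_until_edge, data_area,
                byte_count - bytes_until_edge);
  }

  return output;
}
```

    With an 8-byte data area, a tail of 6 and a head of 10, the function returns the bytes stored at offsets 6, 7, 0 and 1, in that order.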

    Program data is packaged inside the data we have just extracted, preceded by two headers. The first one is the perf_event_header structure:

    struct perf_event_header {
      u32 type;
      u16 misc;
      u16 size;
    };

    The second one is an additional 32-bit size field that accounts for itself and the data that follows. Multiple consecutive writes from the BPF program may be added under the same object. Data is, however, grouped by type, which can be used to determine what kind of data to expect after the header. When using BPF, we’ll only have to deal with either our data or a notification of type PERF_RECORD_LOST, which is used to inform the program that a bpf_perf_event_output() call has overwritten data in the ring buffer before we could have a chance to read it.
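
    To show how the two headers are peeled off, here is a small sketch that classifies a single record stored in a byte vector. The structure mirrors perf_event_header and the constants mirror PERF_RECORD_LOST and PERF_RECORD_SAMPLE, but all the names are hypothetical. The sketch returns everything that follows the 32-bit size field as the payload; since that region may include alignment padding, callers are assumed to know how many bytes they actually wrote (8 in our case).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Mirrors struct perf_event_header from linux/perf_event.h
struct RecordHeader {
  std::uint32_t type;
  std::uint16_t misc;
  std::uint16_t size;
};

// Same values as PERF_RECORD_LOST and PERF_RECORD_SAMPLE
constexpr std::uint32_t kRecordLost = 2U;
constexpr std::uint32_t kRecordSample = 9U;

enum class RecordKind { Incomplete, Lost, Sample, Other };

// Classifies the record at the start of bytes; for samples, payload receives
// the bytes that follow the 32-bit size field
RecordKind classifyRecord(const std::vector<std::uint8_t> &bytes,
                          std::vector<std::uint8_t> &payload) {
  payload.clear();

  RecordHeader header{};
  if (bytes.size() < sizeof(header)) {
    return RecordKind::Incomplete;
  }

  std::memcpy(&header, bytes.data(), sizeof(header));
  if (header.size < sizeof(header) || header.size > bytes.size()) {
    return RecordKind::Incomplete;
  }

  if (header.type == kRecordLost) {
    return RecordKind::Lost;
  }
  if (header.type != kRecordSample) {
    return RecordKind::Other;
  }

  // Skip the base header and the 32-bit size field
  auto payload_offset = sizeof(header) + sizeof(std::uint32_t);
  if (header.size < payload_offset) {
    return RecordKind::Incomplete;
  }

  payload.assign(bytes.begin() + static_cast<std::ptrdiff_t>(payload_offset),
                 bytes.begin() + header.size);
  return RecordKind::Sample;
}
```

    A return value of Incomplete means that more bytes have to be collected from the ring buffer before the record can be processed.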

    Here’s some annotated code that shows how the whole procedure works:

    using PerfBuffer = std::vector<std::uint8_t>;
    using PerfBufferList = std::vector<PerfBuffer>;
    // Reads from the specified perf event array, appending new bytes to the
    // perf_buffer_context. When a new complete buffer is found, it is moved
    // inside the 'data' vector
    bool readPerfEventArray(PerfBufferList &data,
                            PerfBufferList &perf_buffer_context,
                            const PerfEventArray &obj, int timeout) {
      // Keep track of the offsets we are interested in to avoid
      // strict aliasing issues
      static const auto kDataOffsetPos{
          offsetof(struct perf_event_mmap_page, data_offset)};
      static const auto kDataSizePos{
          offsetof(struct perf_event_mmap_page, data_size)};
      static const auto kDataTailPos{
          offsetof(struct perf_event_mmap_page, data_tail)};
      static const auto kDataHeadPos{
          offsetof(struct perf_event_mmap_page, data_head)};
      data = {};
      if (perf_buffer_context.empty()) {
        auto processor_count = getProcessorCount();
        perf_buffer_context.resize(processor_count);
      }
      // Use poll() to determine which perf event outputs are readable
      std::vector<std::size_t> readable_outputs;
      if (!waitForPerfData(readable_outputs, obj, timeout)) {
        return false;
      }
      for (auto perf_output_index : readable_outputs) {
        // Read the static header fields; this assumes PerfEventArray stores
        // its mappings in a member named mapped_memory_pointers
        auto perf_memory = static_cast<std::uint8_t *>(
            obj.mapped_memory_pointers.at(perf_output_index));
        std::uint64_t data_offset{};
        std::memcpy(&data_offset, perf_memory + kDataOffsetPos, 8U);
        std::uint64_t data_size{};
        std::memcpy(&data_size, perf_memory + kDataSizePos, 8U);
        auto edge = perf_memory + data_offset + data_size;
        for (;;) {
          // Read the dynamic header fields
          std::uint64_t data_head{};
          std::memcpy(&data_head, perf_memory + kDataHeadPos, 8U);
          std::uint64_t data_tail{};
          std::memcpy(&data_tail, perf_memory + kDataTailPos, 8U);
          if (data_head == data_tail) {
            break;
          }
          // Determine where the buffer starts and where it ends, taking into
          // account the fact that it may wrap around
          auto start = perf_memory + data_offset + (data_tail % data_size);
          auto end = perf_memory + data_offset + (data_head % data_size);
          auto byte_count = data_head - data_tail;
          auto read_buffer = PerfBuffer(byte_count);
          if (end < start) {
            auto bytes_until_wrap = static_cast<std::size_t>(edge - start);
            std::memcpy(read_buffer.data(), start, bytes_until_wrap);
            auto remaining_bytes =
                static_cast<std::size_t>(end - (perf_memory + data_offset));
            std::memcpy(read_buffer.data() + bytes_until_wrap,
                        perf_memory + data_offset, remaining_bytes);
          } else {
            std::memcpy(read_buffer.data(), start, byte_count);
          }
          // Append the new data to our perf buffer
          auto &perf_buffer = perf_buffer_context[perf_output_index];
          auto insert_point = perf_buffer.size();
          perf_buffer.resize(insert_point + read_buffer.size());
          std::memcpy(&perf_buffer[insert_point], read_buffer.data(),
                      read_buffer.size());
          // Confirm the read
          std::memcpy(perf_memory + kDataTailPos, &data_head, 8U);
        }
      }
      // Extract the data from the buffers we have collected
      for (auto &perf_buffer : perf_buffer_context) {
        // Get the base header
        struct perf_event_header header = {};
        if (perf_buffer.size() < sizeof(header)) {
          continue;
        }
        std::memcpy(&header, perf_buffer.data(), sizeof(header));
        if (header.size > perf_buffer.size()) {
          continue;
        }
        if (header.type == PERF_RECORD_LOST) {
          std::cout << "One or more records have been lost\n";
        } else {
          // Determine the buffer boundaries
          auto buffer_ptr = perf_buffer.data() + sizeof(header);
          auto buffer_end = perf_buffer.data() + header.size;
          for (;;) {
            if (buffer_ptr + 4U >= buffer_end) {
              break;
            }
            // Note: this is data_size itself + bytes used for the data
            std::uint32_t data_size = {};
            std::memcpy(&data_size, buffer_ptr, 4U);
            buffer_ptr += 4U;
            data_size -= 4U;
            if (buffer_ptr + data_size >= buffer_end) {
              break;
            }
            auto program_data = PerfBuffer(data_size);
            std::memcpy(program_data.data(), buffer_ptr, data_size);
            data.push_back(std::move(program_data));
            // Skip the bytes we have just copied
            buffer_ptr += data_size;
          }
        }
        // Erase the chunk we consumed from the buffer
        perf_buffer.erase(perf_buffer.begin(), perf_buffer.begin() + header.size);
      }
      return true;
    }

    Writing the main function

    While it is entirely possible (and sometimes useful, in order to share types) to use a single LLVM module and context for both the enter and exit programs, we will create two different modules to avoid changing the previous sample code we’ve built.

    The program generation goes through the usual steps, but now we are loading two instead of one, so the previous code has been changed to reflect that.

    The new and interesting part is the main loop where the perf event output data is read and processed:

    // Incoming data is appended here
    PerfBufferList perf_buffer;
    std::uint64_t total_time_used{};
    std::uint64_t sample_count{};
    std::cout << "Tracing average time used to service the following syscall: "
              << kSyscallName << "\n";
    std::cout << "Collecting samples for 10 seconds...\n";
    auto start_time = std::chrono::system_clock::now();
    for (;;) {
      // Data that is ready for processing is moved inside here
      PerfBufferList data;
      if (!readPerfEventArray(data, perf_buffer, perf_event_array, 1)) {
        std::cerr << "Failed to read from the perf event array\n";
        return 1;
      }
      // Inspect the buffers we have received
      for (const auto &buffer : data) {
        if (buffer.size() != 8U) {
          std::cout << "Unexpected buffer size: " << buffer.size() << "\n";
          continue;
        }
        // Read each sample and update the counters; use memcpy to avoid
        // strict aliasing issues
        std::uint64_t time_used{};
        std::memcpy(&time_used, buffer.data(), 8U);
        total_time_used += time_used;
        ++sample_count;
        std::cout << time_used << "ns\n";
      }
      // Exit after 10 seconds
      auto elapsed_msecs = std::chrono::duration_cast<std::chrono::milliseconds>(
                               std::chrono::system_clock::now() - start_time)
                               .count();
      if (elapsed_msecs > 10000) {
        break;
      }
    }
    // Print a summary of the data we have collected
    std::cout << "Total time used: " << total_time_used << " nsecs\n";
    std::cout << "Sample count: " << sample_count << "\n";
    // Avoid dividing by zero when no sample has been captured
    if (sample_count != 0U) {
      std::cout << "Average: " << (total_time_used / sample_count) << " nsecs\n";
    }

    The full source code can be found in the 03-syscall_profiler folder.

    Running the sample program as root should print something similar to the following output:

    Tracing average time used to service the following syscall: fchmodat
    Collecting samples for 10 seconds...
    Total time used: 953357 nsecs
    Sample count: 9
    Average: 105928 nsecs

    Writing a BPF program to do ANYTHING

    BPF is in active development and is becoming more and more useful with each update, enabling new use cases that extend the original vision. Recently, newly added BPF functionality allowed us to write a simple system-wide syscall fault injector using nothing but BPF and a compatible kernel that supported the required bpf_override_return functionality.

    If you want to keep up with how this technology evolves, one of the best places to start is Brendan Gregg’s blog. The IO Visor Project repository also contains a ton of code and documentation that is extremely useful if you plan on writing your own BPF-powered tools.

    Want to integrate BPF into your products? We can help! Contact us today, and check out our ebpfpub library.
