How to write a rootkit without really trying

We open-sourced a fault injection tool, KRF, that uses kernel-space syscall interception. You can use it today to find faulty assumptions (and resultant bugs) in your programs. Check it out!


This post covers intercepting system calls from within the Linux kernel, via a plain old kernel module.

We’ll go through a quick refresher on syscalls and why we might want to intercept them and then demonstrate a bare-bones module that intercepts the read(2) syscall.

But first, you might be wondering:

What makes this any different from $other_fault_injection_strategy?

Other fault injection tools rely on a few different techniques:

  • There’s the well-known LD_PRELOAD trick, which really intercepts the syscall wrapper exposed by libc (or your language runtime of choice). This often works (and can be extremely useful for e.g. spoofing the system time within a program or using SOCKS proxies transparently), but comes with some major downsides:
    • LD_PRELOAD only works when libc (or the target library of choice) has been dynamically linked, but newer languages (read: Go) and deployment trends (read: fully static builds and non-glibc Linux containers) have made dynamic linkage less popular.
    • Syscall wrappers frequently deviate significantly from their underlying syscalls: depending on your versions of Linux and glibc, open() may call openat(2), fork() may call clone(2), and other calls may modify their flags or default behavior for POSIX compliance. As a result, it can be difficult to reliably predict whether a given syscall wrapper invokes its syscall namesake.
  • Dynamic instrumentation frameworks like DynamoRIO or Intel PIN can be used to identify system calls at either the function or machine-code level and instrument their calls and/or returns. While this grants us fine-grained access to individual calls, it usually comes with substantial runtime overhead.

Injecting faults within kernelspace sidesteps the downsides of both of these approaches: it rewrites the actual syscalls directly instead of relying on the dynamic loader, and it adds virtually no runtime overhead (beyond checking to see whether a given syscall is one we’d like to fault).
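To make the wrapper problem concrete, here is a minimal user-space sketch (not part of KRF; the function names wrapped_read and raw_read are made up for illustration). Both functions end up in the kernel's read handler, but only the first goes through libc's read() symbol, so an LD_PRELOAD shim that overrides read() would only ever see that one.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read through the libc wrapper. An LD_PRELOAD shim
 * that defines read() can see (and fault) this call, provided libc is
 * dynamically linked. */
ssize_t wrapped_read(int fd, void *buf, size_t count) {
  return read(fd, buf, count);
}

/* Hypothetical helper: issue the same request via syscall(2). libc's read()
 * symbol is never called, so a shim that only overrides read() misses it. */
ssize_t raw_read(int fd, void *buf, size_t count) {
  return (ssize_t)syscall(SYS_read, fd, buf, count);
}
```

An interceptor that sits in the kernel sees both calls, because both arrive at the same syscall table entry.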

What makes this any different from $other_blog_post_on_syscall_interception?

Other blog posts address the interception of syscalls, but many:

  • Grab the syscall table by parsing their kernel’s System.map, which can be unreliable (and is slower than the approach we give below).
  • Assume that the kernel exports sys_call_table and that extern void *sys_call_table will work (not true on Linux 2.6+).
  • Involve prodding large ranges of kernel memory, which is slow and probably dangerous.

Basically, we couldn’t find a recent (>2015) blog post that described a syscall interception process that we liked. So we developed our own.

Why not just use eBPF or kprobes?

eBPF can’t intercept syscalls. It can only record their parameters and return values.

The kprobes API might be able to perform interception from within a kernel module, although I haven’t come across a really good source of information about it online. In any case, the point here is to do it ourselves!

Will this work on $architecture?

For the most part, yes. You’ll need to make some adjustments to the write-unlocking macro for non-x86 platforms.

What’s a syscall?

A syscall, or system call, is a function[1] that exposes some kernel-managed resource (I/O, process control, networking, peripherals) to user-space processes. Any program that takes user input, communicates with other programs, changes files on disk, uses the system time, or contacts another device over a network (usually) does so via syscalls.[2]

The core UNIX-y syscalls are fairly primitive: open(2), close(2), read(2), and write(2) for the vast majority of I/O; fork(2), kill(2), signal(2), exit(2), and wait(2) for process management; and so forth.

The socket management syscalls are mostly bolted on to the UNIX model: send(2) and recv(2) behave much like read(2) and write(2), but with additional transmission flags. ioctl(2) is the kernel’s garbage dump, overloaded to perform every conceivable operation on a file descriptor where no simpler means exists. Despite these additional complexities in usage, the underlying principle behind their usage (and interception) remains the same. If you’d like to dive all the way in, Filippo Valsorda maintains an excellent Linux syscall reference for x86 and x86_64.

Unlike regular function calls in user-space, syscalls are extraordinarily expensive: on x86 architectures, int 80h (or the more modern sysenter/syscall instructions) causes both the CPU and the kernel to execute slow interrupt-handling code paths as well as perform a privilege-context switch.[3]

Why intercept syscalls?

For a few different reasons:

  • We’re interested in gathering statistics about a given syscall’s usage, beyond what eBPF or another instrumentation API could (easily) provide.
  • We’re interested in fault injection that can’t be avoided by static linking or manual syscall(2) invocations (our use case).
  • We’re feeling malicious, and we want to write a rootkit that’s hard to remove from user-space (and possibly even kernel-space, with a few tricks).[4]

Why do I need fault injection?

Fault injection finds bugs in places that fuzzing and conventional unit testing often won’t:

  • NULL dereferences caused by assuming that particular functions never fail (are you sure you always check whether getcwd(2) succeeds? Are you sure that you’re doing better than systemd?)
  • Memory corruption caused by unexpectedly small buffers, or disclosure caused by unexpectedly large buffers
  • Integer over/underflow caused by invalid or unexpected values (are you sure you’re not making incorrect assumptions about stat(2)’s atime/mtime/ctime fields?)
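As a small illustration of the first bullet, here is roughly what "always checking" looks like for getcwd. This is a sketch with a made-up helper name (current_dir), not code from any particular project. The interesting part is the failure branch, which ordinary testing rarely exercises and which a faulted syscall forces your program to take.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: copy the current working directory into buf.
 * Returns 0 on success, or -1 with errno left as getcwd() set it
 * (e.g. ERANGE when the buffer is too small). */
int current_dir(char *buf, size_t size) {
  if (getcwd(buf, size) == NULL) {
    /* buf is not usable here; callers must not treat it as a string. */
    return -1;
  }
  return 0;
}
```

Code that skips the NULL check and passes the result straight to something like strlen() works fine until the day the call fails, which is exactly the day a fault injector is trying to simulate.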

Getting started: Finding the syscall table

Internally, the Linux kernel stores syscalls within the syscall table, an array of __NR_syscalls pointers. This table is defined as sys_call_table, but has not been directly exposed as a symbol (to kernel modules) since Linux 2.5.
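If the "array of pointers" idea feels abstract, the following toy model may help. It runs entirely in user-space, every name in it (toy_call_table, toy_syscall, toy_hook, and so on) is invented for this sketch, and it never touches the real kernel table. It only shows the shape of the data structure: a number indexes into an array of function pointers, which is also why replacing a single slot is enough to intercept a call.

```c
#include <errno.h>
#include <stddef.h>

/* Toy user-space model of a syscall table. All names are hypothetical. */
typedef long (*toy_syscall_fn)(long arg);

enum { TOY_NR_double, TOY_NR_negate, TOY_NR_syscalls };

long toy_double(long arg) { return arg * 2; }
long toy_negate(long arg) { return -arg; }

/* A replacement handler that always fails, like a faulted syscall. */
long toy_fail(long arg) { (void)arg; return -ENOSYS; }

/* The table maps a call number to the function that services it. */
static toy_syscall_fn toy_call_table[TOY_NR_syscalls] = {
  [TOY_NR_double] = toy_double,
  [TOY_NR_negate] = toy_negate,
};

/* Dispatch by number, rejecting numbers outside the table. */
long toy_syscall(long nr, long arg) {
  if (nr < 0 || nr >= TOY_NR_syscalls) {
    return -1;
  }
  return toy_call_table[nr](arg);
}

/* Swap one slot and return the previous handler so it can be restored.
 * Returns NULL if nr is out of range. */
toy_syscall_fn toy_hook(long nr, toy_syscall_fn replacement) {
  toy_syscall_fn previous;

  if (nr < 0 || nr >= TOY_NR_syscalls) {
    return NULL;
  }
  previous = toy_call_table[nr];
  toy_call_table[nr] = replacement;
  return previous;
}
```

Calling toy_hook(TOY_NR_double, toy_fail) makes every later toy_syscall(TOY_NR_double, ...) fail, and passing the returned pointer back to toy_hook undoes it. The real thing differs mainly in that the table is hidden and write-protected.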

First, we need to get the syscall table’s address, ideally without using the System.map file or scanning kernel memory for well-known addresses. Luckily for us, Linux provides a better interface than either of these: kallsyms_lookup_name.

This makes retrieving the syscall table as easy as:

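#include <linux/kallsyms.h>
#include <linux/kernel.h>
#include <linux/module.h>

/* kallsyms_lookup_name is exported to GPL-licensed modules only. */
MODULE_LICENSE("GPL");

static unsigned long *sys_call_table;

int init_module(void) {
  sys_call_table = (void *)kallsyms_lookup_name("sys_call_table");

  if (sys_call_table == NULL) {
    printk(KERN_ERR "Couldn't look up sys_call_table\n");
    return -ENOENT;
  }

  return 0;
}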

Of course, this only works if your Linux kernel was compiled with CONFIG_KALLSYMS=y and still exports kallsyms_lookup_name to modules (the export was removed in Linux 5.7). Debian and Ubuntu enable kallsyms, but you may need to test other distros. If your distro doesn’t enable kallsyms by default, consider using a VM for one that does (you weren’t going to test this code on your host, were you?).

Injecting our replacement syscalls

Now that we have the kernel’s syscall table, injecting our replacement should be as easy as:

#include <linux/kallsyms.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/syscalls.h>

MODULE_LICENSE("GPL");

static unsigned long *sys_call_table;
static typeof(sys_read) *orig_read;

/* asmlinkage is important here -- the kernel expects syscall parameters to be
 * on the stack at this point, not inside registers.
 */
asmlinkage long phony_read(int fd, char __user *buf, size_t count) {
  printk(KERN_INFO "Intercepted read of fd=%d, %zu bytes\n", fd, count);

  return orig_read(fd, buf, count);
}

int init_module(void) {
  sys_call_table = (void *)kallsyms_lookup_name("sys_call_table");

  if (sys_call_table == NULL) {
    printk(KERN_ERR "Couldn't look up sys_call_table\n");
    return -ENOENT;
  }

  /* Save the original handler before overwriting its slot. */
  orig_read = (typeof(sys_read) *)sys_call_table[__NR_read];
  sys_call_table[__NR_read] = (unsigned long)phony_read;

  return 0;
}

void cleanup_module(void) {
  /* Don't forget to fix the syscall table on module unload, or you'll be in
   * for a nasty surprise!
   */
  sys_call_table[__NR_read] = (unsigned long)orig_read;
}

…but it isn’t that easy, at least not on x86: sys_call_table is write-protected by the CPU itself. Attempting to modify it will cause a page fault (#PF) exception.[5] To get around this, we toggle bit 16 (the WP flag) of the cr0 register, which controls the write-protect state:

#define CR0_WRITE_UNLOCK(x) \
  do { \
    write_cr0(read_cr0() & (~X86_CR0_WP)); \
    x; \
    write_cr0(read_cr0() | X86_CR0_WP); \
  } while (0)

Then, our insertions become a matter of:

CR0_WRITE_UNLOCK({
  sys_call_table[__NR_read] = (unsigned long)phony_read;
});

and:

CR0_WRITE_UNLOCK({
  sys_call_table[__NR_read] = (unsigned long)orig_read;
});

and everything works as expected…almost.

We’ve assumed a single processor; there’s an SMP-related race condition bug in the way we twiddle cr0. If our kernel task were preempted immediately after disabling write-protect and placed onto another core with WP still enabled, we’d get a page fault instead of a successful memory write. The chances of this happening are pretty slim, but it doesn’t hurt to be careful by implementing a guard around the critical section:

#define CR0_WRITE_UNLOCK(x) \
  do { \
    unsigned long __cr0; \
    preempt_disable(); \
    __cr0 = read_cr0() & (~X86_CR0_WP); \
    BUG_ON(unlikely((__cr0 & X86_CR0_WP))); \
    write_cr0(__cr0); \
    x; \
    __cr0 = read_cr0() | X86_CR0_WP; \
    BUG_ON(unlikely(!(__cr0 & X86_CR0_WP))); \
    write_cr0(__cr0); \
    preempt_enable(); \
  } while (0)

(The astute will notice that this is almost identical to the “rare write” mechanism from PaX/grsecurity. This is not a coincidence: it’s based on it!)
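If you’d like to play with the unlock/write/relock pattern without risking a kernel panic, a rough user-space analogue can be built with mprotect(2). This is only a sketch under loud assumptions: the helper names (alloc_protected_page, write_unlocked) are invented, and mprotect changes the permissions of one mapping, whereas CR0.WP is a per-CPU switch, so the two mechanisms are similar in spirit rather than equivalent.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one zero-filled anonymous page and leave it
 * read-only, standing in for a write-protected table.
 * Returns NULL on failure. */
unsigned long *alloc_protected_page(void) {
  long page_size = sysconf(_SC_PAGESIZE);
  void *page;

  if (page_size <= 0) {
    return NULL;
  }
  page = mmap(NULL, (size_t)page_size, PROT_READ,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return page == MAP_FAILED ? NULL : page;
}

/* Hypothetical helper: unlock the page, write one slot, relock.
 * Returns 0 on success and -1 on failure. */
int write_unlocked(unsigned long *page, size_t slot, unsigned long value) {
  size_t page_size = (size_t)sysconf(_SC_PAGESIZE);

  if (slot >= page_size / sizeof(*page)) {
    return -1;
  }
  if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0) {
    return -1;
  }
  page[slot] = value; /* the "critical section" */
  return mprotect(page, page_size, PROT_READ);
}
```

Writing to the page outside of write_unlocked gets the process a SIGSEGV, the user-space cousin of the page fault described above. The window between the two mprotect calls is the same kind of window that the preempt_disable()/preempt_enable() pair guards in the kernel macro.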

What’s next?

The phony_read above just wraps the real sys_read and adds a printk, but we could just as easily have it inject a fault:

asmlinkage long phony_read(int fd, char __user *buf, size_t count) {
  return -ENOSYS;
}

…or a fault for a particular user:

asmlinkage long phony_read(int fd, char __user *buf, size_t count) {
  if (current_uid().val == 1005) {
    return -ENOSYS;
  } else {
    return orig_read(fd, buf, count);
  }
}

…or return bogus data:

asmlinkage long phony_read(int fd, char __user *buf, size_t count) {
  unsigned char kbuf[1024];
  /* Never write more than the caller asked for. */
  size_t len = min(count, sizeof(kbuf));

  memset(kbuf, 'A', len);
  if (copy_to_user(buf, kbuf, len))
    return -EFAULT;

  return len;
}

Syscalls happen under task context within the kernel, meaning that the current task_struct is valid. Opportunities for poking through kernel structures abound!

Wrap up

This post covers the very basics of kernel-space syscall interception. To do anything really interesting (like precise fault injection or statistics beyond those provided by official introspection APIs), you’ll need to read a good kernel module programming guide[6] and do the legwork yourself.

Our new tool, KRF, does everything mentioned above and more: it can intercept and fault syscalls with per-executable precision, operate on an entire syscall “profile” (e.g., all syscalls that touch the filesystem or perform process scheduling), and can fault in real-time without breaking a sweat. Oh, and static linkage doesn’t bother it one bit: if your program makes any syscalls, KRF will happily fault them.

Other work

Outside of kprobes for kernel-space interception and LD_PRELOAD for user-space interception of wrappers, there are a few other clever tricks out there:

  • syscall_intercept is loaded through LD_PRELOAD like a normal wrapper interceptor, but actually uses capstone internally to disassemble (g)libc and instrument the syscalls that it makes. This only works on syscalls made by the libc wrappers, but it’s still pretty cool.
  • ptrace(2) can be used to instrument syscalls made by a child process, all within user-space. It comes with two considerable downsides, though: it can’t be used in conjunction with a debugger, and it returns (PTRACE_GETREGS) architecture-specific state on each syscall entry and exit. It’s also slow. Chris Wellons’s awesome blog post covers ptrace(2)’s many abilities.

  1. More of a “service request” than a “function” in the ABI sense, but thinking about syscalls as a special class of functions is a serviceable-enough fabrication.
  2. The number of exceptions to this continues to grow, including user-space networking stacks and the Linux kernel’s vDSO for many frequently called syscalls, like time(2).
  3. No process context switch is necessary. Linux executes syscalls within the same underlying kernel task that the process belongs to. But a processor context switch does occur.
  4. I won’t detail this because it’s outside of this post’s scope, but consider that init_module(2) and delete_module(2) are just normal syscalls.
  5. Sidenote: this is actually how CoW works on Linux. fork(2) write-protects the pre-duplicated process space, and the kernel waits for the corresponding page fault to tell it to copy a page to the child.
  6. This one’s over a decade old, but it covers the basics well. If you run into missing symbols or changed signatures, you should find the current equivalents with a quick search.
