Holy Macroni! A recipe for progressive language enhancement

September 11, 2023

compilers, linux, llvm, mlir, static-analysis, vast

Despite its use for refactoring and static analysis tooling, Clang has a massive shortcoming: the Clang AST does not provide provenance information about which CPP macro expansions a given AST node is expanded from; nor does it lower macro expansions down to LLVM Intermediate Representation (IR) code. This makes the construction of macro-aware static analyses and transformations exceedingly difficult and an ongoing area of research.¹ Struggle no more, however, because this summer at Trail of Bits, I created Macroni to make it easy to create macro-aware static analyses.

Macroni allows developers to define the syntax for new C language constructs with macros and provide the semantics for these constructs with MLIR. Macroni uses VAST to lower C code down to MLIR and uses PASTA to obtain macro-to-AST provenance information to lower macros down to MLIR as well. Developers can then define custom MLIR converters to transform Macroni’s output into domain-specific MLIR dialects for more nuanced analyses. In this post, I will present several examples of how to use Macroni to augment C with safer language constructs and build C safety analyses.

Stronger typedefs

C typedefs are useful for giving semantically meaningful names to lower-level types; however, C compilers don’t use these names during type checking and perform their type checking only on the lower-level types instead. This can manifest in a simple form of type-confusion bug when the semantic types represent different formats or measures, such as the following:

typedef double fahrenheit;
typedef double celsius;
fahrenheit F;
celsius C;
F = C; // No compiler error or warning

Figure 1: C type checking considers only typedef’s underlying types.

The above code successfully type checks, but the semantic difference between the types fahrenheit and celsius should not be ignored, as they represent values in different temperature scales. There is no way to enforce this sort of strong typing using C typedefs alone.
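One way to see this for yourself is with C11's _Generic, which selects a branch based on the static type of its operand. The following is a minimal sketch; the IS_CELSIUS macro and the fahrenheit_is_celsius() function are names invented for illustration and have nothing to do with Macroni itself.

```c
#include <assert.h>

typedef double fahrenheit;
typedef double celsius;

/* _Generic picks a branch by the static type of its operand. The celsius
   association matches any double, because celsius is only an alias. */
#define IS_CELSIUS(x) _Generic((x), celsius: 1, default: 0)

/* Returns 1: the compiler cannot tell a fahrenheit value from a celsius one. */
int fahrenheit_is_celsius(void) {
    fahrenheit f = 98.6;
    int matches = IS_CELSIUS(f);
    assert(matches == IS_CELSIUS((double)f)); /* same answer for a plain double */
    return matches;
}
```

As far as the compiler is concerned, both names denote the same type, so there is nothing for it to check.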

With Macroni, we can use macros to define the syntax for strong typedefs and MLIR to implement custom type checking for them. Here’s an example of using macros to define strong typedefs representing temperatures in degrees Fahrenheit and Celsius:

#define STRONG_TYPEDEF(name) name
typedef double STRONG_TYPEDEF(fahrenheit);
typedef double STRONG_TYPEDEF(celsius);

Figure 2: Using macros to define syntax for strong C typedefs

Wrapping a typedef name with the STRONG_TYPEDEF() macro allows Macroni to identify typedefs whose names were expanded from invocations of STRONG_TYPEDEF() and convert them into the types of a custom MLIR dialect (e.g., temp), like so:

%0 = hl.var "F" : !hl.lvalue<!temp.fahrenheit>
%1 = hl.var "C" : !hl.lvalue<!temp.celsius>
%2 = hl.ref %1 : !hl.lvalue<!temp.celsius>
%3 = hl.ref %0 : !hl.lvalue<!temp.fahrenheit>
%4 = hl.implicit_cast %3 LValueToRValue : !hl.lvalue<!temp.fahrenheit> -> !temp.fahrenheit
%5 = hl.assign %4 to %2 : !temp.fahrenheit, !hl.lvalue<!temp.celsius> -> !temp.celsius

Figure 3: Macroni enables us to lower typedefs to MLIR types and enforce strict typing.

By integrating these macro-attributed typedefs into the type system, we can now define custom type-checking rules for them. For instance, we could enforce strict type checking for operations between temperature values so that the above program would fail to type check. We could also add custom type-casting logic for temperature values so that casting a temperature value in one scale to a different scale implicitly inserts instructions to convert between them.
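To make these two rules concrete, here is a toy model of them written in plain C rather than as MLIR passes. Everything in it (the temp_scale enum, assign_type_checks(), and cast_temperature()) is hypothetical and only mirrors the idea; it is not Macroni's API or how a real dialect conversion would be written.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the !temp.fahrenheit and !temp.celsius types. */
enum temp_scale { SCALE_FAHRENHEIT, SCALE_CELSIUS };

/* Strict rule: an assignment type checks only if both sides share a scale. */
bool assign_type_checks(enum temp_scale dst, enum temp_scale src) {
    return dst == src;
}

/* Cast rule: casting between scales converts the value instead of
   silently reinterpreting it. */
double cast_temperature(double value, enum temp_scale from, enum temp_scale to) {
    assert(from == SCALE_FAHRENHEIT || from == SCALE_CELSIUS);
    assert(to == SCALE_FAHRENHEIT || to == SCALE_CELSIUS);
    if (from == to)
        return value;
    if (from == SCALE_CELSIUS)
        return value * 9.0 / 5.0 + 32.0;   /* Celsius to Fahrenheit */
    return (value - 32.0) * 5.0 / 9.0;     /* Fahrenheit to Celsius */
}
```

Under the strict rule, the assignment from Figure 1 would be rejected, while an explicit cast would go through cast_temperature() and pick up the conversion arithmetic.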

We use macros to add the strong typedef syntax because macros are both backwards-compatible and portable. While we could identify our custom types with Clang by annotating our typedefs using GNU’s or Clang’s attribute syntax, we cannot guarantee the availability of the annotate() attribute across platforms and compilers, whereas we can make strong assumptions about the presence of a C preprocessor.

Now, you might be thinking: C already has a form of strong typedef called struct. So we could also enforce stricter type checking by converting our typedef types into structs (e.g., struct fahrenheit { double value; }), but this would alter both the type’s API and ABI, breaking existing client code and backwards-compatibility. If we were to change our typedefs into structs, a compiler may produce completely different assembly code. For example, consider the following function definition:

fahrenheit convert(celsius temp) { return (temp * 9.0 / 5.0) + 32.0; }

Figure 4: A definition for a Celsius-to-Fahrenheit conversion function

If we define our strong typedefs using macro-attributed typedefs, then Clang emits the following LLVM IR for the convert(25) call. The LLVM IR representation of the convert function matches up with its C counterpart, accepting a single double-typed argument and returning a double-typed value.

tail call double @convert(double noundef 2.500000e+01)

Figure 5: LLVM IR for convert(25), with macro-attributed typedefs used to define strong typedefs

Contrast this to the IR that Clang produces when we define our strong typedefs using structs. The function call now accepts four arguments instead of one. That first ptr argument represents the location where convert will store the return value. Imagine what would happen if client code called this new version of convert according to the calling convention of the original.

call void @convert(ptr nonnull sret(%struct.fahrenheit) align 8 %1,
                   i32 undef, i32 inreg 1077477376, i32 inreg 0)

Figure 6: LLVM IR for convert(25), with structs used to define strong typedefs
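The API half of the breakage is visible at the source level, too. The sketch below shows roughly what the struct-based version looks like; the names convert_struct() and convert_25() are made up for this illustration.

```c
#include <assert.h>

struct celsius { double value; };
struct fahrenheit { double value; };

/* Every access now goes through .value, and the result has to be rebuilt
   as a struct: the API changed even though the arithmetic did not. */
struct fahrenheit convert_struct(struct celsius temp) {
    struct fahrenheit result = { temp.value * 9.0 / 5.0 + 32.0 };
    return result;
}

/* A caller written against the typedef API, convert(25), no longer
   compiles. It has to wrap the argument and unwrap the result. */
double convert_25(void) {
    struct celsius c = { 25.0 };
    struct fahrenheit f = convert_struct(c);
    assert(f.value > c.value); /* 25 degrees C is a larger number in degrees F */
    return f.value;
}
```

Assigning a struct celsius to a struct fahrenheit is now a compile-time error, which is the checking we wanted, but every existing caller has to be rewritten to get it.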

Weak typedefs that ought to be strong are pervasive in C codebases, including critical infrastructure like libc and the Linux kernel. Preserving API- and ABI-compatibility is essential if you want to add strong type checking to a standard type such as time_t. If you wrapped time_t in a struct (e.g., struct strict_time_t { time_t t; }) to provide strong type checking, then not only would all APIs accessing time_t-typed values need to change, but so would the ABIs of those usage sites. Clients who were already using bare time_t values would need to painstakingly change all the places in their code that use time_t to instead use your struct to activate stronger type checking. On the other hand, if you used a macro-attributed typedef to alias the original time_t (i.e., typedef time_t STRONG_TYPEDEF(time_t)), then time_t’s API and ABI would remain consistent, and client code using time_t correctly could remain as-is.

Enhancing Sparse in the Linux Kernel

In 2003, Linus Torvalds built a custom preprocessor, C parser, and compiler called Sparse. Sparse performs Linux kernel–specific type checking. Sparse relies on macros, such as __user, sprinkled around in kernel code, that do nothing under normal build configurations but expand to uses of __attribute__((address_space(...))) when the __CHECKER__ macro is defined.
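The gating pattern described above can be sketched as follows. The my_user macro and the is_null_user_ptr() function are hypothetical names chosen so the example does not collide with the kernel's own identifiers, and address space number 1 is an assumption made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical annotation modeled on the kernel's __user. A checker that
   defines __CHECKER__ sees attributes; a regular compiler sees nothing. */
#ifdef __CHECKER__
# define my_user __attribute__((noderef, address_space(1)))
#else
# define my_user
#endif

/* Under a normal build this is an ordinary pointer parameter. */
int is_null_user_ptr(const int my_user *uaddr) {
    return uaddr == NULL;
}
```

Because the annotation vanishes under a normal build, the compiler generates exactly the code it would have generated without it.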

Gatekeeping the macro definitions with __CHECKER__ is necessary because most compilers don’t provide ways to hook into macros or implement custom safety checking … until today. With Macroni, we can hook into macros and perform Sparse-like safety checks and analyses. But where Sparse is limited to C (by virtue of implementing a custom C preprocessor and parser), Macroni applies to any code parseable by Clang (i.e., C, C++, and Objective-C).

The first Sparse macro we’ll hook into is __user. The kernel currently defines __user to an attribute that Sparse recognizes:

# define __user     __attribute__((noderef, address_space(__user)))

Figure 7: The Linux kernel’s __user macro

Sparse hooks into this attribute to find pointers that come from user space, as in the following example. The noderef tells Sparse that these pointers must not be dereferenced (e.g., *uaddr = 1) because their provenance cannot be trusted.

u32 __user *uaddr;

Figure 8: Example of using the __user macro to annotate a variable as coming from user space

Macroni can hook into the macro and expanded attribute to lower the declaration down to MLIR like this:

%0 = hl.var "uaddr" : !hl.lvalue<!sparse.user<!hl.ptr<!hl.elaborated<!hl.typedef<"u32">>>>>

Figure 9: Kernel code after being lowered to MLIR by Macroni

The lowered MLIR code embeds the annotation into the type system by wrapping declarations that come from user space in the type sparse.user. Now we can add custom type-checking logic for user-space variables, similar to how we created strong typedefs previously. We can even hook into the Sparse-specific macro __force to disable strong type checking on an ad hoc basis, as developers do currently:

unsigned long raw_copy_to_user(void __user *to, const void *from, unsigned long len)
{
   return __copy_user((__force void *)to, from, len);
}

Figure 10: Example use of the __force macro to copy a pointer to user space

We can also use Macroni to identify RCU read-side critical sections in the kernel and verify that certain RCU operations appear only within these sections. For instance, consider the following call to rcu_dereference():

rcu_read_lock();
rcu_dereference(sbi->s_group_desc)[i] = bh;
rcu_read_unlock();

Figure 11: A call to rcu_dereference() in an RCU read-side critical section in the Linux kernel

The above code calls rcu_dereference() in a critical section, that is, a region of code beginning with a call to rcu_read_lock() and ending with a call to rcu_read_unlock(). One should call rcu_dereference() only within read-side critical sections; however, there is no way to enforce this constraint.

With Macroni, we can use rcu_read_lock() and rcu_read_unlock() calls to identify critical sections that form implied lexical code regions and then check that calls to rcu_dereference() appear only within these sections:

kernel.rcu.critical_section {
 %1 = macroni.parameter "p" : ...
 %2 = kernel.rcu_dereference rcu_dereference(%1) : ...
}

Figure 12: The result of lowering the RCU-critical section to MLIR, with types omitted for brevity

The above code turns both the RCU-critical sections and calls to rcu_dereference() into MLIR operations. This makes it easy to check that rcu_dereference() appears only within the right regions.
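As a rough picture of what such a check does, here is a toy version in plain C that walks a flat sequence of events instead of nested MLIR regions. The rcu_event enum and the rcu_usage_is_valid() function are hypothetical; a real check would be written against the lowered operations shown above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event stream standing in for the lowered operations. */
enum rcu_event { RCU_READ_LOCK, RCU_READ_UNLOCK, RCU_DEREFERENCE };

/* Returns true if every dereference happens while a read lock is held,
   every unlock has a matching lock, and no lock is left open at the end. */
bool rcu_usage_is_valid(const enum rcu_event *events, size_t count) {
    size_t depth = 0; /* read-side critical sections may nest */
    assert(events != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        switch (events[i]) {
        case RCU_READ_LOCK:
            depth++;
            break;
        case RCU_READ_UNLOCK:
            if (depth == 0)
                return false; /* unlock without a matching lock */
            depth--;
            break;
        case RCU_DEREFERENCE:
            if (depth == 0)
                return false; /* dereference outside a critical section */
            break;
        }
    }
    return depth == 0;
}
```

The sequence lock, dereference, unlock from Figure 11 passes this check, while a lone dereference does not.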

Unfortunately, RCU-critical sections don’t always bound neat lexical code regions, and rcu_dereference() is not always called in such regions, as shown in the following example:

__bpf_kfunc void bpf_rcu_read_lock(void)
{
       rcu_read_lock();
}

Figure 13: Kernel code containing a non-lexical RCU-critical section

static inline struct in_device *__in_dev_get_rcu(const struct net_device *dev)
{
   return rcu_dereference(dev->ip_ptr);
}

Figure 14: Kernel code calling rcu_dereference() outside of an RCU-critical section

We can use the __force macro to permit these sorts of calls to rcu_dereference(), just as we did to escape type checking for user-space pointers.

Rust-like unsafe regions

It’s clear that Macroni can help strengthen type checking and even enable application-specific type-checking rules. However, marking types as strong means committing to that level of strength. In a large codebase, such a commitment might require a massive changeset. To make adapting to a stronger type system more manageable, we can design an “unsafety” mechanism for C akin to that of Rust: within the unsafe region, strong type checking does not apply.

#define unsafe if (0); else

fahrenheit convert(celsius C) {
    fahrenheit F;
    unsafe {
        F = (C * 9.0 / 5.0) + 32.0;
    }
    return F;
}

Figure 15: C code snippet presenting macro-implemented syntax for unsafe regions

This snippet demonstrates our safety API’s syntax: we call the unsafe macro before potentially unsafe regions of code. All code not listed in an unsafe region will be subject to strong type checking, while we can use the unsafe macro to call out regions of lower-level code that we deliberately want to leave as-is. That’s progressive!
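The definition of unsafe as if (0); else may look odd, so here is a small sketch of why it behaves like a block prefix. The body_runs_once() and pick() functions are made up for this illustration.

```c
#include <assert.h>

#define unsafe if (0); else

/* The block after unsafe is the else branch of an if whose condition is
   always false, so it runs exactly once. */
int body_runs_once(void) {
    int runs = 0;
    unsafe {
        runs++;
    }
    assert(runs == 1);
    return runs;
}

/* The macro already supplies its own else, so a following else binds to
   the enclosing if rather than to the macro's hidden if. */
int pick(int flag) {
    int result = 0;
    if (flag)
        unsafe { result = 1; }
    else
        result = 2;
    return result;
}
```

The expansion leaves an if statement in the code, so it stays valid C for any compiler, and the tokens that come from the macro give a tool like Macroni something to recognize.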

The unsafe macro provides only the syntax for our safety API, though, and not the logic. To make this leaky abstraction watertight, we would need to transform the macro-marked if statement into an operation in our theoretical safety dialect:

...
"safety.unsafe"() ({
   ...
}) : () -> ()
...

Figure 16: With Macroni, we can lower our safety API to an MLIR dialect and implement safety-checking logic.

Now we can disable strong type checking on operations nested within the MLIR representation of the unsafe macro.

Safer signal handling

By this point, you may have noticed a pattern for creating safer language constructs: we use macros to define syntax for marking certain types, values, or regions of code as obeying some set of invariants, and then we define logic in MLIR to check that these invariants hold.

We can use Macroni to ensure that signal handlers execute only signal-safe code. For example, consider the following signal handler defined in the Linux kernel:

static void sig_handler(int signo) {
       do_detach(if_idx, if_name);
       perf_buffer__free(pb);
       exit(0);
}

Figure 17: A signal handler defined in the Linux kernel

sig_handler() calls three other functions in its definition, all of which should be safe to call in signal-handling contexts. However, nothing in the above code checks that sig_handler()’s definition calls only signal-safe functions; C compilers don’t have a way of expressing semantic checks that apply to lexical regions.

Using Macroni, we could add macros for marking certain functions as signal handlers and others as signal-safe and then implement logic in MLIR to check that signal handlers call only signal-safe functions, like this:

#define SIG_HANDLER(name) name
#define SIG_SAFE(name) name

int SIG_SAFE(do_detach)(int, const char*);
void SIG_SAFE(perf_buffer__free)(struct perf_buffer*);
void SIG_SAFE(exit)(int);

static void SIG_HANDLER(sig_handler)(int signo) { ... }

Figure 18: Token-based syntax for marking signal handlers and signal-safe functions

The above code marks sig_handler() as a signal handler and the three functions it calls as signal-safe. Each macro invocation expands to a single token—the name of the function we want to mark. With this approach, Macroni hooks into the expanded function name token to determine if the function is a signal handler or signal-safe.

An alternative approach would be to define these macros as magic annotations and then hook into these with Macroni:

#define SIG_HANDLER __attribute__((annotate("macroni.signal_handler")))
#define SIG_SAFE __attribute__((annotate("macroni.signal_safe")))

int SIG_SAFE do_detach(int, const char*);
void SIG_SAFE perf_buffer__free(struct perf_buffer*);
void SIG_SAFE exit(int);

static void SIG_HANDLER sig_handler(int signo) { ... }

Figure 19: Alternative attribute syntax for marking signal handlers and signal-safe functions

With this approach, the macro invocation looks more like a type specifier, which some may find more appealing. The only difference between the token-based syntax and the attribute syntax is that the latter requires compiler support for the annotate() attribute. If this is not an issue, or if __CHECKER__-like gatekeeping is acceptable, then either syntax works fine; the back-end MLIR logic for checking signal safety would be the same regardless of the syntax we choose.
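To give a feel for that back-end logic, here is a toy version of the check in plain C. The function_info struct and the first_unsafe_callee() function are hypothetical; they model the information a front end could record after seeing SIG_HANDLER or SIG_SAFE on a declaration, and they inspect direct calls only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-function record built from the macro markings. */
struct function_info {
    const char *name;
    bool is_signal_handler;
    bool is_signal_safe;
    const struct function_info *const *callees; /* direct calls only */
    size_t callee_count;
};

/* Returns the first callee of a signal handler that is not marked
   signal-safe, or NULL if the function passes the check. */
const struct function_info *first_unsafe_callee(const struct function_info *fn) {
    assert(fn != NULL);
    if (!fn->is_signal_handler)
        return NULL; /* the rule applies only to signal handlers */
    for (size_t i = 0; i < fn->callee_count; i++) {
        if (!fn->callees[i]->is_signal_safe)
            return fn->callees[i];
    }
    return NULL;
}
```

A handler whose callees are all marked signal-safe yields NULL, and any other handler yields the offending callee, which is the sort of result a diagnostic could report.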

Conclusion: Why Macroni?

Macroni lowers C code and macros down to MLIR so that you can avoid basing your analyses on the lackluster Clang AST and instead build them off of a domain-specific IR that has full access to types, control flow, and data flow within VAST’s high-level MLIR dialect. Macroni will lower the domain-relevant macros down to MLIR for you and elide all other macros. This unlocks macro-sensitive static analysis superpowers. You can define custom analyses, transformations, and optimizations, taking macros into account at every step. As this post demonstrates, you can even combine macros and MLIR to define new C syntax and semantics. Macroni is free and open source, so check out its GitHub repo to try it out!

Acknowledgments

I thank Trail of Bits for the opportunity to create Macroni this summer. In particular, I would like to thank my manager and mentor Peter Goodman for the initial idea of lowering macros down to MLIR and for suggestions for potential use cases for Macroni. I would also like to thank Lukas Korencik for reviewing Macroni’s code and for providing advice on how to improve it.

¹ See Understanding code containing preprocessor constructs, SugarC: Scalable Desugaring of Real-World Preprocessor Usage into Pure C, An Empirical Analysis of C Preprocessor Use, A Framework for Preprocessor-Aware C Source Code Analyses, Variability-aware parsing in the presence of lexical macros and conditional compilation, Parsing C/C++ Code without Pre-processing, Folding: an approach to enable program understanding of preprocessed languages, and Challenges of refactoring C programs.