Amarna: Static analysis for Cairo programs

By Filipe Casal

We are open-sourcing Amarna, our new static analyzer and linter for the Cairo programming language. Cairo is a programming language powering several trading exchanges with millions of dollars in assets (such as dYdX, driven by StarkWare) and is the programming language for StarkNet contracts. But, not unlike other languages, it has its share of weird features and footguns. So we will first provide a brief overview of the language, its ecosystem, and some pitfalls that developers should be aware of. We will then present Amarna, discuss how it works and what it finds, and outline what we plan to implement down the line.

Introduction to Cairo

Why do we need Cairo?

The purpose of Cairo, and similar languages such as Noir and Leo, is to write “provable programs,” where one party runs the program and creates a proof that it returns a certain output when given a certain input.

Suppose we want to outsource a program’s computation to some (potentially dishonest) server and need to guarantee that the result is correct. With Cairo, we can obtain a proof that the program output the correct result; we need only to verify the proof rather than recomputing the function ourselves (which would defeat the purpose of outsourcing the computation in the first place).

In summary, we take the following steps:

  • Write the function we want to compute.
  • Run the function on the worker machine with the concrete inputs, obtain the result, and generate a proof of validity for the computation.
  • Validate the computation by validating the proof.

The Cairo programming language

As we just explained, the Cairo programming model involves two key roles: the prover, who runs the program and creates a proof that the program returns a certain output, and the verifier, who verifies the proofs created by the prover.

However, in practice, Cairo programmers will not actually generate or verify the proofs themselves. Instead, the ecosystem includes these three pillars:

  • The SHARP (shared prover), a service that receives program traces, creates the corresponding STARK proofs, and submits them to the proof verifier contract.
  • The proof verifier contract, a smart contract that validates the submitted proofs and, if they are valid, writes the corresponding program facts to the fact registry.
  • The fact registry, a database that stores program facts, or values computed from hashes of programs and of their outputs; creating a program fact is a way to bind a program to its output.

This is the basic workflow in Cairo:

  • A user writes a program and submits its trace to the SHARP (via the Cairo playground or the command cairo-sharp).
  • The SHARP creates a STARK proof for the program trace and submits it to the proof verifier contract.
  • The proof verifier contract validates the proof, and, if valid, writes the program fact to the fact registry.
  • Any other user can now query the fact registry contract to check whether that program fact is valid.

There are two other things to keep in mind:

  • Memory in Cairo is write-once: after a value is written to memory, it cannot be changed.
  • The assert statement assert a = b behaves differently depending on whether a is initialized: if a is uninitialized, the statement assigns b to a; if a is initialized, it asserts that a and b are equal.

Although the details of Cairo’s syntax and keywords are interesting, we will not cover these topics in this post. The official Cairo documentation and Perama’s notes on Cairo are a good starting point for more information.

Setting up and running Cairo code

Now that we’ve briefly outlined the Cairo language in general, let’s discuss how to set up and run Cairo code. Consider the following simple Cairo program, which computes the Pedersen hash of the pair (input, 1) and prints the result to the console:

# validate_hash.cairo
%builtins output pedersen

from starkware.cairo.common.cairo_builtins import HashBuiltin
from starkware.cairo.common.hash import hash2
from starkware.cairo.common.serialize import serialize_word

func main{output_ptr:felt*, pedersen_ptr : HashBuiltin*}():
    alloc_locals
    local input
    %{ ids.input = 4242 %}

    # computes the Pedersen hash of the tuple (input, 1)
    let (hash) = hash2{hash_ptr=pedersen_ptr}(input, 1)

    # prints the computed hash
    serialize_word(hash)

    return ()
end

To set up the Cairo tools, we use a Python virtual environment:

$ mkvirtualenv cairo-venv
(cairo-venv)$ pip3 install cairo-lang

Then, we compile the program:

# compile the validate_hash.cairo file,
# writing the output to compiled.json
$ cairo-compile validate_hash.cairo --output compiled.json

Finally, we run the program, which will output the following value:

# run the program
$ cairo-run --program=compiled.json --print_output --layout small
Program output:
  1524309693207128500197192682807522353121026753660881687114217699526941127707

This value is the field element corresponding to the Pedersen hash of (4242, 1).
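If you have the cairo-lang package installed, you can sanity-check this value off-chain. This is a minimal sketch that assumes the package still exposes a pedersen_hash helper at this import path:

# Off-chain sanity check of the program output above. Assumes the cairo-lang
# package exposes pedersen_hash at this path; adjust the import if it moves.
from starkware.crypto.signature.signature import pedersen_hash

print(pedersen_hash(4242, 1))  # expected to match the field element shown above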

Now, suppose that we change the input from 4242 to some hidden value and instead provide the verifier with the following output:

$ cairo-run --program=compiled.json --print_output --layout small
Program output:
  1134422549749907873058035660235532262290291351787221961833544516346461369884

Why would the verifier believe us? Well, we can prove that we know the hidden value that will make the program return that output!

To generate the proof, we need to compute the hash of the program to generate the program fact. This hash does not depend on the input value, as the assignment is inside a hint (a quirk of Cairo that we’ll discuss later in this post):

# compute the program's hash
$ cairo-hash-program --program compiled.json                                                
0x3c034247e8bf20ce12c878793cd47c5faa6f5470114a33ac62a90b43cfbb494
 
# compute program fact
from web3 import Web3

def compute_fact(program_hash, program_output):
    fact = Web3.solidityKeccak(['uint256', 'bytes32'],
                               [program_hash, Web3.solidityKeccak(['uint256[]'], [program_output])])

    h = hex(int.from_bytes(fact, 'big'))
    return h

# hash and output computed above
program_hash = 0x3c034247e8bf20ce12c878793cd47c5faa6f5470114a33ac62a90b43cfbb494
program_output = [1134422549749907873058035660235532262290291351787221961833544516346461369884]

print(compute_fact(program_hash, program_output))
# 0xe7551a607a2f15b078c9ae76d2641e60ed12f2943e917e0b1d2e84dc320897f3

Then, we can check the validity of the program fact by using the fact registry contract and calling the isValid function with the program fact as input:

The result of calling the isValid function to check the validity of the program fact.
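For reference, this query can also be scripted. The following is a rough sketch using web3.py; the registry address, RPC endpoint, and trimmed single-function ABI are placeholders you would replace with the real values:

# Rough sketch of querying the fact registry's isValid function with web3.py.
# The address, RPC endpoint, and single-entry ABI below are placeholders.
from web3 import Web3

FACT_REGISTRY_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder
FACT_REGISTRY_ABI = [{
    "name": "isValid",
    "type": "function",
    "stateMutability": "view",
    "inputs": [{"name": "fact", "type": "bytes32"}],
    "outputs": [{"name": "", "type": "bool"}],
}]

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # any Ethereum RPC endpoint
registry = w3.eth.contract(address=FACT_REGISTRY_ADDRESS, abi=FACT_REGISTRY_ABI)

# program fact computed with compute_fact above
fact = bytes.fromhex("e7551a607a2f15b078c9ae76d2641e60ed12f2943e917e0b1d2e84dc320897f3")
print(registry.functions.isValid(fact).call())  # True if the fact was registered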

To recap, we ran the program, and the SHARP created a proof that can be queried in the fact registry to check its validity, proving that we actually know the input that would cause the program to output this value.

Now, I can actually tell you that the input I used was 71938042130017, and you can go ahead and check that the result matches.

You can read more about the details of this process in Cairo’s documentation for blockchain developers and more about the fact registry in this article by StarkWare.

Cairo features and footguns

Cairo has several quirks and footguns that can trip up new Cairo programmers. We will describe three Cairo features that are easily misused, leading to security issues: Cairo hints, the interplay between recursion and underconstrained structures, and non-deterministic jumps.

Hints

Hints are special Cairo statements that let the prover embed and execute arbitrary Python code. Yes, the Python code written in a Cairo hint is literally exec’d!

Hints are written inside %{ %}. We already used them in the first example to assign a value to the input variable:

%builtins output

from starkware.cairo.common.serialize import serialize_word

func main{output_ptr:felt*}():

    # arbitrary python code
    %{
       import os
       os.system('whoami')
    %}

    # prints 1
    serialize_word(1)

    return ()
end

$ cairo-compile hints.cairo --output compiled.json               
$ cairo-run --program=compiled.json --print_output --layout small
fcasal
Program output:
  1

Because Cairo can execute arbitrary Python code in hints, you should not run arbitrary Cairo code on your own machine—doing so can grant full control of your machine to the person who wrote the code.

Hints are commonly used to write code that is only executed by the prover. The proof verifier does not even know that a hint exists because hints do not change the program hash. The following function from the Cairo playground computes the square root of a positive integer, n:

func sqrt(n) -> (res):
    alloc_locals
    local res

    # Set the value of res using a python hint.
    %{
        import math

        # Use the ids variable to access the value
        # of a Cairo variable.
        ids.res = int(math.sqrt(ids.n))
    %}

    # The following line guarantees that 
    # `res` is a square root of `n`
    assert n = res * res
    return (res)
end

The program computes the square root of n by using the Python math library inside the hint. But at verification time, this code does not run, and the verifier needs to check that the result is actually a square root. So, the function includes a check to verify that n equals res * res before the function returns the result.

Underconstrained structures

Cairo lacks support for while and for loops, leaving programmers to use good old recursion for iteration. Let’s consider the “Dynamic allocation” challenge from the Cairo playground. The challenge asks us to write a function that, given a list of elements, will square those elements and return a new list containing those squared elements:

%builtins output

from starkware.cairo.common.alloc import alloc
from starkware.cairo.common.serialize import serialize_word

# Fills `new_array` with the squares of the
# first `length` elements in `array`.
func _inner_sqr_array(array : felt*, new_array : felt*,
                                        length : felt):
    # recursion base case
    if length == 0:
        return ()
    end

    # recursive case: the first element of the new_array will
    # be the first element of the array squared
    # recall that the assert will assign to the
    # `new_array` array at position 0
    # since it has not been initialized
    assert [new_array] = [array] * [array]

    # recursively call, advancing the arrays
    # and subtracting 1 to the array length
    _inner_sqr_array(array=array + 1, 
                     new_array=new_array + 1,
                     length=length - 1)
    return ()
end

func sqr_array(array : felt*, length : felt) ->
                                          (new_array : felt*):
    alloc_locals
    # allocates an arbitrary length array
    let (local res_array) = alloc()

    # fills the newly allocated array with the squares
    # of the elements of array
    _inner_sqr_array(array, res_array, length)
    return (res_array)
end

func main{output_ptr : felt*}():
    alloc_locals

    # Allocate a new array.
    let (local array) = alloc()

    # Fill the new array with field elements.
    assert [array] = 1
    assert [array + 1] = 2
    assert [array + 2] = 3
    assert [array + 3] = 4

    let (new_array) = sqr_array(array=array, length=4)

    # prints the array elements
    serialize_word([new_array])
    serialize_word([new_array + 1])
    serialize_word([new_array + 2])
    serialize_word([new_array + 3])

    return ()
end

Running this code will output the numbers 1, 4, 9, and 16 as expected.

But what happens if an error (or an off-by-one bug) occurs and causes the sqr_array function to be called with a zero length?

func main{output_ptr : felt*}():
    alloc_locals
    # Allocate a new array.
    let (local array) = alloc()
    # Fill the new array with field elements.
    assert [array] = 1
    assert [array + 1] = 2
    assert [array + 2] = 3
    assert [array + 3] = 4

    let (new_array) = sqr_array(array=array, length=0)
    serialize_word([new_array])
    serialize_word([new_array + 1])
    serialize_word([new_array + 2])
    serialize_word([new_array + 3])

    return ()
end

Basically, the following happens:

  1. The sqr_array function will allocate res_array and call _inner_sqr_array(array, res_array, 0).
  2. _inner_sqr_array will compare the length with 0 and return immediately.
  3. sqr_array will return the allocated (but never written to) res_array.

So what happens when you call serialize_word on the first element of new_array?
Well, it depends… Running the code as-is will result in an error because the value of new_array is unknown:

The error that occurs after running the above code as-is.

However, remember that usually you won’t be running code; you’ll be verifying proofs that a program outputs some value. And I can actually provide you proof that this program can output any four values that you would like! You can compute all of this yourself to confirm that I’m not cheating:

$ cairo-compile recursion.cairo --output compiled.json    
$ cairo-hash-program --program compiled.json             
0x1eb05e1deb7ea9dd7bd266abf8aa8a07bf9a62146b11c0bd1da8bb844ff2479

The following fact binds this program with the output [1, 3, 3, 7]:

# hash and output computed above
program_hash = 0x01eb05e1deb7ea9dd7bd266abf8aa8a07bf9a62146b11c0bd1da8bb844ff2479
program_output = [1, 3, 3, 7]

print(compute_fact(program_hash, program_output))
# 0x4703704b8f7411d5195e907c2eba54af809cb05eebc65eb9a9423964409a8a4d

This fact is valid according to the fact registry contract:

The fact registry’s verification of the program fact.

So what is happening here?

Well, since the returned array is only allocated and never written to (because its length is 0, the recursion stops as soon as it starts), the prover can write to the array in a hint, and the hint code won’t affect the program’s hash!

The “evil” sqr_array function is actually the following:

func sqr_array(array : felt*, length : felt) -> 
                                           (new_array : felt*):
    alloc_locals
    let (local res_array) = alloc()

    %{  # write on the result array if the length is 0
        if ids.length == 0:
            data = [1, 3, 3, 7]
            for idx, d in enumerate(data):
                memory[ids.res_array + idx] = d
    %}

    _inner_sqr_array(array, res_array, length)
    return (res_array)
end

In a nutshell, if some bug makes the length of the array 0, a malicious prover can produce any result he wants.

You might also ask yourself why, in general, a malicious prover can’t simply add a hint at the end of the program to change the output in any way he wishes. Well, he can, as long as that memory hasn’t been written to before; this is because memory in Cairo is write-once, so you can only write one value to each memory cell.

This pattern of creating the final result array is necessary due to the way memory works in Cairo, but it also carries the risk of a security issue: a simple off-by-one mistake in tracking the length of this array can allow a malicious prover to arbitrarily control the array memory.

Nondeterministic jumps

Nondeterministic jumps are another code pattern that can seem unnatural to a programmer reading Cairo for the first time. They combine hints and conditional jumps to redirect the program’s control flow with some value. This value can be unknown to the verifier, as the prover can set it in a hint.

For example, we can write a program that checks whether two elements, x and y, are equal in the following contrived manner:

func are_equal(x, y) -> (eq):
    # sets the ap register to True or False depending on 
    # the equality of x and y
    %{ memory[ap] = ids.x == ids.y %}
    
    # jump to the label equal if the elements were equal
    jmp equal if [ap] != 0; ap++

    # case x != y
    not_equal:    
    return (0)
    
    # case x == y
    equal:    
    return (1)
end

Running this program will return the expected result (0 for different values and 1 for equal values):

func main{output_ptr : felt*}():

    let (res) = are_equal(1, 2)
    serialize_word(res) # -> 0

    let (res) = are_equal(42, 42)
    serialize_word(res) # -> 1

    return()
end

However, this function is actually vulnerable to a malicious prover. Notice how the jump instruction depends only on the value written in the hint:

    %{ memory[ap] = ids.x == ids.y %}
    jmp equal if [ap] != 0; ap++

And we know that hints are fully controllable by the prover! This means that the prover can write any other code in that hint. In fact, there are no guarantees that the prover actually checked whether x and y are equal, or even that x and y were used in any way. Since there are no other checks in place, the function could return whatever the prover wants it to.

As we saw previously, the program hash does not consider code in hints; therefore, a verifier can’t know whether the correct hint was executed. The malicious prover can provide proofs for any possible output values of the program ((0, 0), (1, 1), (0, 1), or (1, 0)) by changing the hint code and submitting each proof to the SHARP.

So how do we fix it?

Whenever we see nondeterministic jumps, we need to make sure that the jumps are valid, and the verifier needs to validate the jumps in each label:

func are_equal(x, y) -> (eq):
    %{ memory[ap] = ids.x == ids.y %}
    jmp equal if [ap] != 0; ap++

    # case x != y
    not_equal:
    # we are in the not_equal case
    # so we can't have equal x and y
    if x == y:
        # add unsatisfiable assert
        assert x = x + 1
    end
    return (0)
    
    # case x == y
    equal:
    # we are in the equal case
    # so x and y must equal
    assert x = y
    return (1)
end

In this case, the function is simple enough that the code needs only an if statement:

func are_equal(x, y) -> (eq):
    if x == y:
        return (1)
    else:
        return (0)
    end
end

Amarna, our Cairo static analyzer

While auditing Cairo code, we noticed there was essentially no language support of any form, except for syntax highlighting in VSCode. Then, as we found issues in the code, we wanted to make sure that similar patterns were not present elsewhere in the codebase.

We decided to build Amarna, a static analyzer for Cairo, to give us the ability to create our own rules and search for code patterns of interest to us—not necessarily security vulnerabilities, but any security-sensitive operations that need to be analyzed or need greater attention when reviewing code.

Amarna exports its static analysis results to the SARIF format, allowing us to easily integrate them into VSCode with VSCode’s SARIF Viewer extension and to view warnings underlined in the code:

Cairo code with an underlined dead store (left) and the SARIF Viewer extension showing the results from Amarna (right).

How does Amarna work?

The Cairo compiler is written in Python using lark, a parsing toolkit, to define a grammar and to construct its syntax tree. Using the lark library, it is straightforward to build visitors for a program’s abstract syntax tree. From here, writing a rule is a matter of encoding the pattern you want to find in the tree.
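As a toy illustration of that visitor pattern (using a made-up grammar, not the actual Cairo grammar), lark calls methods named after grammar rules as it walks the parsed tree:

# Toy example of lark's visitor pattern: the grammar below is a stand-in,
# not the Cairo grammar that Amarna actually uses.
from lark import Lark, Tree
from lark.visitors import Visitor

GRAMMAR = r"""
    start: expr
    expr: NUMBER "/" NUMBER  -> expr_div
        | NUMBER "*" NUMBER  -> expr_mul
        | NUMBER
    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

class DivisionFinder(Visitor):
    # Called for every tree node produced by the `expr_div` alias.
    def expr_div(self, tree: Tree) -> None:
        print("found a division:", tree.children)

DivisionFinder().visit(Lark(GRAMMAR).parse("10 / 2"))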

The first rule we wrote was to highlight all uses of arithmetic operations +, -, *, and /. Of course, not all uses of division are insecure, but with these operations underlined, the developer is reminded that Cairo arithmetic works over a finite field and that division is not integer division, as it is in other programming languages. Field arithmetic underflows and overflows are other issues that developers need to be aware of. By highlighting all arithmetic expressions, Amarna helps developers and reviewers to quickly zoom in on locations in the codebase that could be problematic in this regard.

The rule to detect all divisions is very simple: it basically just creates the result object with the file position and adds it to the analysis results:

class ArithmeticOperationsRule(GenericRule):
    """
    Check arithmetic operations:
        - reports ALL multiplications and divisions
        - reports ONLY addition and subtraction that
          do not involve a register like [ap - 1]
    """

    RULE_TEXT = "Cairo arithmetic is defined over a finite field and has potential for overflows."
    RULE_PREFIX = "arithmetic-"

    def expr_div(self, tree: Tree) -> None:
        result = create_result(
            self.fname,
            self.RULE_PREFIX + tree.data,
            self.RULE_TEXT,
            getPosition(tree)
        )
        self.results.append(result)

As we looked for more complex code patterns, we developed three classes of rules:

  • Local rules analyze each file independently. The rule described above, to find all arithmetic operations in a file, is an example of a local rule.
  • Gatherer rules analyze each file independently and gather data to be used by post-processing rules. For example, we have rules to gather all declared functions and all called functions.
  • Post-processing rules run after all files are analyzed and use the data gathered by the gatherer rules. For example, after a gatherer rule finds all declared functions and all called functions in a file, a post-processing rule can find all unused functions by identifying functions that are declared but never called, as sketched below.
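For instance, a post-processing rule that flags unused functions could look roughly like the following sketch; the class shape, field names, and data structures are illustrative, not Amarna’s actual rule API:

# Rough sketch of a post-processing rule that reports functions that are
# declared but never called. The inputs come from gatherer rules; the exact
# shapes and names here are illustrative only.
from typing import Dict, List, Set, Tuple


class UnusedFunctionsRule:
    RULE_NAME = "unused-function"
    RULE_TEXT = "This function is never called."

    def run(
        self,
        declared: Dict[str, Tuple[str, int]],  # function name -> (file, line)
        called: Set[str],                       # function names seen at call sites
    ) -> List[dict]:
        results = []
        for name, (fname, line) in declared.items():
            if name not in called:
                results.append({
                    "file": fname,
                    "line": line,
                    "rule": self.RULE_NAME,
                    "text": self.RULE_TEXT,
                })
        return results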

So what does Amarna find?

So far, we have implemented 10 rules, whose impacts range from informational rules that help us audit code (marked as Info) to potentially security-sensitive code patterns (marked as Warning):

 

| # | Rule | What it finds | Impact | Precision |
|---|------|---------------|--------|-----------|
| 1 | Arithmetic operations | All uses of the arithmetic operations +, -, *, and / | Info | High |
| 2 | Unused arguments | Function arguments that are not used in the function they appear in | Warning | High |
| 3 | Unused imports | Unused imports | Info | High |
| 4 | Mistyped decorators | Mistyped code decorators | Info | High |
| 5 | Unused functions | Functions that are never called | Info | Medium |
| 6 | Error codes | Function calls that have a return value that must be checked | Info | High |
| 7 | Inconsistent assert usage | Asserts that use the same constant in different ways (e.g., assert_le(amount, BOUND) and assert_le(amount, BOUND - 1)) | Warning | High |
| 8 | Dead stores | Variables that are assigned values but are not used before a return statement | Info | Medium |
| 9 | Potential unchecked overflows | Function calls that ignore the returned overflow flags (e.g., uint256_add) | Warning | High |
| 10 | Caller address return value | Function calls to the get_caller_address function | Info | High |

While most of these rules fall into the informational category, they can definitely have security implications: for example, failing to check the return code of a function can be quite serious (imagine if the function is a signature verification); the Error codes rule will find some of these instances.

The Unused arguments rule finds function arguments that are not used in the function in which they appear, a common check in linters for general-purpose languages. An unused argument generally indicates that the developer intended to use it but never did, which can also have security implications. This rule would have found a bug in an OpenZeppelin contract a few months ago that was due to an unchecked nonce passed as an argument to the execute function.

Going forward

As Cairo is still a developing ecosystem, enumerating all vulnerable patterns can be difficult. We plan to add more rules moving forward, and in the medium/long term, we plan to add more complex analysis features, such as data-flow analysis.

In the meantime, if you have any ideas for vulnerable code patterns, we are more than happy to review feature requests, new rules, bug fixes, issues, and other contributions from the community.

The Frozen Heart vulnerability in PlonK

By Jim Miller

In part 1 of this blog post, we disclosed critical vulnerabilities that break the soundness of multiple implementations of zero-knowledge proof systems. This class of vulnerability, which we dubbed Frozen Heart, is caused by insecure implementations of the Fiat-Shamir transformation that allow malicious users to forge proofs for random statements. In part 2 and part 3 of the blog post, we demonstrated how to exploit a Frozen Heart vulnerability in two specific proof systems: Girault’s proof of knowledge and Bulletproofs. In this part, we will look at the Frozen Heart vulnerability in PlonK.

Zero-Knowledge Proofs and the Fiat-Shamir Transformation

This post assumes that you possess some familiarity with zero-knowledge proofs. If you would like to read more about them, there are several helpful blog posts and videos available to you, such as Matt Green’s primer. To learn more about the Fiat-Shamir transformation, check out the blog post I wrote explaining it in more detail. You can also check out ZKDocs for more information about both topics.

PlonK

The PlonK proof system is one of the latest ZK-SNARK schemes to be developed in the last few years. SNARKs are essentially a specialized group of zero-knowledge proof systems that have very efficient proof size and verification costs. For more information on SNARKs, I highly recommend reading Zcash’s blog series.

Most SNARKs require a trusted setup ceremony. In this ceremony, a group of parties generates a set of values that are structured in a particular way. Importantly, nobody in this group can know the underlying secrets for this set of values (otherwise the proof system can be broken). This ceremony is not ideal, as users of these proof systems have to trust that some group generates these values honestly and securely, otherwise the entire proof system is insecure.

Older SNARK systems need to perform a separate trusted setup ceremony for each program. PlonK’s universal and updateable trusted setup is much nicer: only one ceremony is required to prove statements about multiple programs.

In the next section, I will walk through the full PlonK protocol to demonstrate how a Frozen Heart vulnerability works. However, PlonK is quite complex, so this section may be difficult to follow. If you’d like to learn more about how PlonK works, I highly recommend reading Vitalik Buterin’s blog post and watching David Wong’s YouTube series.

The Frozen Heart Vulnerability in PlonK

As in Bulletproofs, PlonK contains multiple Fiat-Shamir transformations. Each of these transformations, if implemented incorrectly, could contain a Frozen Heart vulnerability. I will focus on one particular vulnerability that appears to be the most common. This is the vulnerability that affected Dusk Network’s plonk, Iden3’s SnarkJS, and ConsenSys’ gnark.

PlonK, like most SNARKs, is used to prove that a certain computation was performed correctly. In a process called arithmetization, the computation is represented in a format that the proof system can interpret: polynomials and arithmetic circuits.

We can think of the inputs to the computation as input wires in a circuit. For SNARKs, there are public and private inputs. The private inputs are wires in the circuit that only the prover sees. The public inputs are the remaining wires, seen by both the prover and the verifier.

Other public values include the values generated by the trusted setup ceremony used to generate and verify proofs. Additionally, because the prover and verifier need to agree on the computation or program to be proven, a representation of the program’s circuit is shared publicly between both parties.

In part 1 of this series, we introduced a rule of thumb for securely implementing the Fiat-Shamir transformation: the Fiat-Shamir hash computation must include all public values from the zero-knowledge proof statement and all public values computed in the proof (i.e., all random “commitment” values). This means that the public inputs, the values from the trusted setup ceremony, the program’s circuit, and all the public values computed in the proof itself must be included in PlonK’s Fiat-Shamir transformations. The Frozen Heart vulnerability affecting the Dusk Network, Iden3, and ConsenSys codebases stems from their failure to include the public inputs in these computations. Let’s take a closer look.
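To make that rule of thumb concrete, here is a minimal sketch of a transcript-complete challenge derivation. The hash, field modulus, serialization, and value names are placeholders rather than any particular library’s API; dropping public_inputs from this hash is exactly the bug in the affected implementations:

# Minimal sketch of deriving a Fiat-Shamir challenge from the full transcript.
# The hash, serialization, and modulus are placeholders; real implementations
# hash canonical encodings of curve points and field elements.
import hashlib

SCALAR_FIELD = 2**64 - 59  # placeholder prime, not PlonK's actual scalar field

def fiat_shamir_challenge(*transcript) -> int:
    h = hashlib.sha256()
    for value in transcript:
        h.update(repr(value).encode())
    return int.from_bytes(h.digest(), "big") % SCALAR_FIELD

# Everything public must be bound: setup values, the circuit, the public
# inputs, and every commitment produced so far.
srs_digest = "placeholder-srs-digest"
circuit_digest = "placeholder-circuit-digest"
public_inputs = [7, 11]                       # placeholder public wire values
commitments = ["[a]_1", "[b]_1", "[c]_1"]     # placeholder round commitments

zeta = fiat_shamir_challenge(srs_digest, circuit_digest, public_inputs, commitments)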

Overview of the PlonK Protocol

PlonK has undergone a few revisions since it was first published, so a lot of the PlonK implementations are not based on the most recent version on the Cryptology ePrint archive. In this section, I will be focusing on this version posted in March 2020, as this (or a similar version) appears to be what the SnarkJS implementation is based on. Note: this vulnerability still applies to all versions of the paper when the protocol is implemented incorrectly, but for different versions, the exact details of how this issue can be exploited will differ.

(The paragraph above was updated on June 8, 2022.)

Before diving into the details of the protocol, I will describe it at a high level. To produce a proof, the prover first constructs a series of commitments to various polynomials that represent the computation (specifically, they are constraints corresponding to the program’s circuit). After committing to these values, the prover constructs the opening proofs for these polynomials, which the verifier can verify using elliptic curve pairings. This polynomial committing and opening is done using the Kate polynomial commitment scheme. This scheme is complex, but it’s a core part of the PlonK protocol, and understanding it is key to understanding how PlonK works. Check out this blog post for more detail.

From the prover’s perspective, the protocol has five rounds in total, each of which computes a new Fiat-Shamir challenge. The following are all of the public and private inputs for the PlonK protocol:

Public and private inputs for the PlonK proof system (source)

The common preprocessed input contains the values from the trusted setup ceremony (the x·[1]1, …, x^(n+2)·[1]1 values, where x·[1]1 = g1^x and g1 is the generator of an elliptic curve group) and the circuit representation of the program; these are shared by both the prover and verifier. The remaining values are the input wires to the program’s circuit. The public inputs are the public wires, and the prover’s input is both the public input and the prover’s private wires. With that established, we can start the first round of the protocol, in which the prover computes the blinded wire values.

Round 1 for the prover in the PlonK protocol (source)

The prover first computes three polynomials (a, b, and c) and then evaluates them at point x by using the values from the trusted setup ceremony. The underlying x value is not known (assuming the trusted setup ceremony was performed correctly), so the prover is producing a commitment to these polynomials without leaking any information. The prover does the same for the permutation polynomial in round 2:

Round 2 for the prover in the PlonK protocol (source)

We will not describe this round in detail, as it’s especially complex and irrelevant to this vulnerability. In summary, this round is responsible for enforcing the “copy constraints,” which ensure that the values assigned to all wires are consistent (i.e., it prevents a prover from assigning inconsistent values if one gate’s output is the input to another gate). From there, the prover computes the quotient polynomial in round 3. This quotient polynomial is crucial for the Frozen Heart vulnerability, as we will see shortly.

Round 3 for the prover in the PlonK protocol (source)

After round 3, the prover then computes the evaluation challenge, zeta, using the Fiat-Shamir technique. zeta is then used to evaluate all of the polynomials constructed up to this point. This zeta challenge value is the value we will target in the Frozen Heart vulnerability. As you can see, the authors of the paper use the term “transcript,” which they explain earlier in the paper to mean the preprocessed input, the public input, and the proof values generated along the way.

(The paragraph above was updated on April 29, 2022.)

Round 4 for the prover in the PlonK protocol (source)

Finally, in round 5, the prover generates the opening proof polynomials and returns the final proof.

Round 5 for the prover in the PlonK protocol (source)

To verify this proof, the verifier performs the following 12 steps:

Verifier in the PlonK protocol (source)

At a high level, the verifier first verifies that the proof is well formed (steps 1-4), computes a series of values (steps 5-11), and then verifies them by using a single elliptic curve pairing operation (step 12). This check essentially verifies the divisibility of the constructed polynomials and will pass only if the polynomials are structured as expected (unless there’s a Frozen Heart vulnerability, of course).

Exploiting a Frozen Heart Vulnerability in PlonK

Recall that the Frozen Heart vulnerability in question stems from a failure to include the public inputs for the program’s circuit in any of the Fiat-Shamir challenge calculations (importantly, zeta). At a high level, a malicious prover can exploit this vulnerability by picking malicious public input values (that will depend on challenge values like zeta) to trick the verifier into reconstructing a t_bar in step 8 that will pass the final verification check. Let’s look at the details.

Recall that in round 1, the prover generates three polynomials (a, b, and c) based on the prover’s public and private wire values. If we are a malicious prover trying to forge proofs, then we don’t actually have these wire values (because we haven’t actually done any computation). Therefore, in round 1, we’ll generate completely random polynomials a’, b’, and c’ and then output their evaluations, [a’]1, [b’]1, and [c’]1.

Our random a, b, and c polynomials will break the polynomial computed in round 2 because this polynomial enforces the “copy constraints,” which means that wire values are consistent. With completely random polynomials, it’s essentially guaranteed that these constraints won’t hold. Fortunately, we can actually skip these checks entirely by zeroing out the round 2 polynomial and outputting [0]1. Note: if the implementation checks for the point at infinity, the verifier might reject this proof. If this is the case, you can be clever with how you generate a’, b’, and c’ so that the copy constraints hold, but the details of how that would work are a bit gnarly. Since the SnarkJS implementation does not return an error on this zeroed out polynomial, I will continue with this approach.

Now, for round 3, remember that we don’t have the required wire values to correctly compute our polynomial t. Instead, we will generate a random polynomial t’ and output [t’lo]1, [t’mid]1, and [t’hi]1.

In round 4, we will compute the challenge zeta and all of the evaluations as we did before, except we now use a’, b’, c’, and t’. We will then use these evaluations to construct the r polynomial in the expected way and then evaluate r at zeta and output all of the evaluations. Notice that each of the evaluations computed are consistent with the polynomials that we committed to in previous rounds.

In round 5, we will perform the calculations as expected but replace a, b, c, and t with a’, b’, c’, and t’. This is the end of the protocol, but we’re not done. The last step is to compute public input values that will trick the verifier.

Tricking the verifier really comes down to this opening proof polynomial (we can ignore the other opening proof polynomial because we zeroed out polynomial z in round 2):

Computation of opening proof polynomial (source)

Since our evaluations in round 4 corresponded to the polynomials used in round 5, the polynomial inside of the large parentheses above will evaluate to zero when evaluated at zeta; therefore, it is divisible by (X - zeta), and we can compute Wzeta. Now, we need to ensure that the values we compute in the proof are recomputed the same way by the verifier. In steps 9–11 the verifier reconstructs the values in a structure identical to how the prover calculated them. Therefore, all of these values should be valid.

This just leaves us with a final problem to solve in step 8, in which the verifier calculates t_bar. Because we output t’_bar, a random polynomial evaluated at zeta rather than the t_bar that the prover is expected to compute, the verifier’s t_bar will not match our output and will not pass the verification. However, the verifier uses public inputs to compute t_bar, which are not included in the Fiat-Shamir transformation for any of the challenges (namely, zeta). So we can retrofit our public inputs so that they force the verifier to compute t’_bar in step 8.

To do so, we first plug in the t’_bar that we computed in round 4 to the left side of the equation in step 8. Then, we solve this equation for PI(zeta), a sum over the public inputs multiplied by the Lagrangian polynomial. Since it’s a sum, there are multiple combinations that will work. The simplest way is to zero out every public input except the first and solve for the wire value that results in the PI(zeta) value we solved for. We’ve now constructed our public inputs that will trick the verifier into computing t’_bar. The reason this works is because these public input values are not used to compute zeta, so we know zeta before we have to decide on these public input values.

Now, the verifier will reconstruct the same t’_bar value that we computed in round 4, and the final pairing check will pass—we’ve successfully forged a proof.

To be clear, this vulnerability is introduced only through incorrect implementations; this is not a vulnerability in the PlonK paper itself.

(This section was updated on April 29, 2022.)

Frozen Heart’s Impact on PlonK

Let’s take a step back and assess the severity of this vulnerability. First, let’s recall what we are proving. Using PlonK, the prover is proving to the verifier that he has correctly executed a particular (agreed upon) program and that the output he has given to the verifier is correct. In the previous section, we forged a proof by using completely random wire values for the program’s circuit. The verifier accepts this proof of computation as valid, even though the prover didn’t correctly compute any of the circuit’s wire values (i.e., he didn’t actually run the program). It’s worth reiterating that this post focused on only one kind of Frozen Heart vulnerability. Several similar attacks against this proof system are possible if other Fiat-Shamir transformations are done incorrectly.

Whether this is exploitable in practice is determined by how the verifier handles the public input values. Specifically, this is exploitable only if the verifier accepts arbitrary public inputs from the prover (rather than agreeing on them beforehand, for instance). If we look at the example code in the SnarkJS repository, we can see that the public inputs (publicSignals) are output by the prover (using the fullProve function) and blindly accepted by the verifier (this example is for Groth16, but the PlonK API works in the same way). In general, the exploitability of this vulnerability is implementation dependent.

Example code snippet using SnarkJS (source)

You can imagine that in most applications, this type of proof forgery is very severe. The PlonK proof system effectively gives zero guarantees that a program was executed correctly. Keep in mind that our example used random wire values, but this is not a requirement. A more sophisticated attacker could pick more clever wire values (e.g., by choosing an output of the program that benefits the attacker) and still perform this same attack.

This is the final post in our series on the Frozen Heart vulnerability. I hope that, by now, I’ve made it clear that these issues are very severe and, unfortunately, widespread. Our hope is that this series of posts creates better awareness of these issues. If you haven’t already, check out ZKDocs for more guidance, and please reach out to us if you’d like more content to be added!

The Frozen Heart vulnerability in Bulletproofs

By Jim Miller

In part 1 of this series, we disclosed critical vulnerabilities that break the soundness of multiple implementations of zero-knowledge proof systems. This class of vulnerability, which we dubbed Frozen Heart, is caused by insecure implementations of the Fiat-Shamir transformation that allow malicious users to forge proofs for random statements. In part 2, we demonstrated how to exploit a Frozen Heart vulnerability in a specific proof system: Girault’s proof of knowledge. In this post, we will demonstrate such an exploit against another proof system: Bulletproofs.

Zero-Knowledge Proofs and the Fiat-Shamir Transformation

This post assumes that you possess some familiarity with zero-knowledge proofs. If you would like to read more about them, there are several helpful blog posts and videos available to you, such as Matt Green’s primer. To learn more about the Fiat-Shamir transformation, check out the blog post I wrote explaining it in more detail. You can also check out ZKDocs for more information about both topics.

Bulletproofs

Bulletproofs are complex and efficient zero-knowledge range proofs, in which a prover proves that a certain secret value lies within a predefined range without having to reveal the value itself.

Bulletproofs operate over a cryptographic primitive known as a Pedersen commitment, a specific type of commitment scheme. Using a commitment scheme, a party can create a commitment, which binds the party to a secret value but does not reveal any information about this value. Later, this party can decommit or reveal this commitment; if revealed, and if the scheme is secure, the other party can be sure that the revealed value is the same as the original committed value.

To create a Pedersen commitment for value x, you generate a random gamma and then compute the commitment using comm = (g^x)(h^gamma): g and h are different generators of your finite group, and the discrete log of h relative to g is unknown (i.e., it’s infeasible to find a such that g^a = h). Since Pedersen commitments are secure, the commitment does not reveal any information about x, and it’s impossible to equivocate on the commitment; this means you cannot publish different x’ and gamma’ values that produce the same commitment (assuming the discrete log between g and h is unknown).
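As a toy illustration of that formula (with deliberately small, insecure parameters; real schemes use large groups in which nobody knows the discrete log of h relative to g):

# Toy Pedersen commitment comm = (g^x)(h^gamma) over a small prime field.
# These parameters are insecure placeholders, purely to show the structure.
import secrets

p = 2**127 - 1     # toy prime modulus
g, h = 3, 5        # stand-ins for two independent generators

def pedersen_commit(x: int, gamma: int) -> int:
    return (pow(g, x, p) * pow(h, gamma, p)) % p

x = 42                        # the secret value being committed to
gamma = secrets.randbelow(p)  # random blinding factor
print(hex(pedersen_commit(x, gamma)))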

The fact that the Pedersen commitment does not reveal any information can be problematic for complex protocols, such as those that need to guarantee that the secret value falls within a predefined range (e.g., to restrict these values to prevent integer overflows). However, using Bulletproofs, we can prove that our commitment corresponds to a value within a predefined range, such as [0, 2^32), without revealing the specific input value.

Unfortunately, the Bulletproofs protocol is too complex to walk through in detail in this post. To describe the Frozen Heart vulnerability, I will present the Bulletproofs protocol step-by-step, but this will be difficult to follow if you have not seen it before. If you’d like to learn more about how Bulletproofs work, you can find a number of blogs and videos online, such as this write-up.

The Frozen Heart Vulnerability in Bulletproofs

Bulletproofs have several Fiat-Shamir transformations (the exact number depends on the parameters being used), which could be abused in different ways. In this section, I will walk through how to exploit one such vulnerability. In fact, this vulnerability is the result of an error made by the authors of the original Bulletproofs paper, in which they recommend using an insecure Fiat-Shamir implementation. ING Bank’s zkrp, SECBIT Labs’ ckb-zkp, and Adjoint, Inc.’s bulletproofs were all affected by this issue.

Bulletproofs operate over Pedersen commitments, which take the form V = (g^v)(h^gamma). As a reminder, the goal of Bulletproofs is to prove that the secret value, v, lies in a predefined range. For the purposes of this post, we will use [0, 2^32) as our predefined range. Here is the formal zero-knowledge proof statement for Bulletproofs:

Formal proof statement for Bulletproofs (source)

In part 1 of this series, we introduced a rule of thumb for securely implementing the Fiat-Shamir transformation: the Fiat-Shamir hash computation must include all public values from the zero-knowledge proof statement and all public values computed in the proof (i.e., all random “commitment” values). So we want to ensure that all of the public values in this statement (g, h, V, n) are in the Fiat-Shamir calculation, along with the random commitments, which we haven’t covered yet.

As is the case in most cryptography papers, the Bulletproofs algorithm is presented in its interactive version. Here is the prover’s role, as presented in the paper:

Interactive Bulletproofs protocol from the original paper

(To be clear, this is not the full version of the protocol. There is a significant optimization to decrease the size of the proof sent in step (63) that is described later in the Bulletproofs paper, but this is irrelevant for the purposes of this exploit.)

Once the prover sends the final proof in step (63), the verifier will need to verify it by performing the following checks:

The checks performed by the verifier to verify Bulletproofs (source)

In steps (49)/(50) and (55)/(56) of the prover’s protocol, three different Fiat-Shamir challenge values need to be generated: x, y, and z. However, the authors recommend using an insecure computation for these values:

Insecure Fiat-Shamir challenge computation recommended in the original Bulletproofs paper (source)

According to the authors, we should set y = Hash(A,S), z = Hash(A,S,y), and (following their description) x = Hash(A,S,y,z,T1,T2). This violates our rule of thumb: none of the public values from the statement—most importantly, V—are included. This is a Frozen Heart vulnerability! This vulnerability is critical; as you may have guessed, it allows malicious provers to forge proofs for values that actually lie outside of the predefined range.

Now, to actually exploit this, a malicious prover can do the following:

  • Set v equal to any value in the range. So let’s just say v = 3.
  • Pick a random gamma value.
  • Generate aL, aR, A, and S as expected (according to steps (41), (42), (44), and (47), respectively) using these v and gamma values.
  • Compute t1 and t2 as described in the Bulletproofs paper (this is not in the above figure but is described in text in the paper), and then compute values t’1 and t’2 by randomly generating numbers. When computing T1 and T2, replace t1 with t’1 and t2 with t’2. So, to be clear, we set Ti = (g^t’_i)(h^tau_i) (as expected, but with t_i switched for t’_i).
  • Compute the rest of the values (l, r, t_hat, tau_x, mu) according to the protocol.
  • Finally, compute a new V using the same gamma but with a new v’ value. Specifically, set v’ = 3 + (t1 - t’1)(x/z^2) + (t2 - t’2)(x^2/z^2) (You’ll see why this was chosen like this in a second). Here, 3 comes from setting v = 3 in step 1.

Now, recall all of the verification checks from above. The only values we computed differently than expected were T1, T2, and V. Since we computed the other values as expected, all of the checks will pass automatically, with the exception of check (65), which depends on T1, T2, and V. But because x, y, and z are computed independently of V, we can compute a malicious V value that depends on these values and will pass check (65). Let’s see how this works:

First, let’s simplify the left-hand side of check (65):

LHS = (g^t_hat)(h^tau_x)
= (g^[t0 + t1 * x + t2 * x^2])(h^[tau2 * x^2 + tau1 * x + z^2 * gamma])

Now, let’s simplify the right-hand side of check (65):

RHS = (V^(z^2))(g^delta(y,z))(T1^x)(T2^(x^2))
= (V^(z^2))(g^delta(y,z))[(g^t’1)(h^tau1)]^x [(g^t’2)(h^tau2)]^(x^2)

Now, if you look at the v’ value that we picked for V, you’ll see that the T1 and T2 exponents in g will cancel out the exponents in g for V. So the right-hand side is simplified as follows:

= (g^[3z^2 + t1 * x + t2 * x^2 + delta(y,z)])(h^[gamma * z^2 + tau1 * x + tau2 * x^2])

We can see that the exponent in h is identical on both sides. All we have to do now is check that the exponents in g match.

In the g exponent on the left-hand side, we have t0 + t1 * x + t2 * x^2.

In the g exponent on the right-hand side, we have 3z^2 + delta(y,z) + t1 * x + t2 * x^2.

And t0 is defined in the original paper to be:

t0 as defined in the Bulletproofs paper (source)

Since we set v = 3 in step 1, this t0 works out to exactly 3z^2 + delta(y,z), so the two exponents match, meaning we’ve successfully forged a proof.

To be clear, this is a forgery because we just submitted this proof for the newly computed V value, where v was set to be v’ = 3 + (t1 - t’1)(x / z^2) + (t2 - t’2)(x^2 / z^2). Since x and z are random Fiat-Shamir challenges, v’ will end up being a random value in the interval [0, group order). Since the group order is usually much larger than the interval in question (here, 2^32), v’ will be a value outside of the range with overwhelming probability, but the proof will still pass. If, for whatever reason, v’ is not outside of this range, a malicious actor could simply start the same process over with new random values (e.g., new gamma, new t’1, etc.) until the desired v’ value is obtained.

Frozen Heart’s Impact on Bulletproofs

Frozen Heart vulnerabilities are critical because they allow attackers to forge proofs, but their impact on the surrounding application depends on how the application uses the proof system. As we saw for the Schnorr and Girault proof systems in part 2 of this series, there may be contexts in which these vulnerabilities are not critical. However, this is unlikely to be the case for Bulletproofs.

In our example, we were able to produce a forgery for a random value in the group order. In most applications, the predefined range for the range proof is typically much smaller than the size of the group order. This means that, although the specific value cannot be chosen, an attacker could easily produce a proof for values outside of the desired range. In the majority of contexts we’ve seen Bulletproofs being used, this is severe.

Watch out for the final part of this series, in which we will explore the Frozen Heart vulnerability in an even more complex proof system: PlonK.

The Frozen Heart vulnerability in Girault’s proof of knowledge

By Jim Miller

In part 1 of this series, we disclosed critical vulnerabilities that break the soundness of multiple implementations of zero-knowledge proof systems. This class of vulnerability, which we dubbed Frozen Heart, is caused by insecure implementations of the Fiat-Shamir transformation that allow malicious users to forge proofs for random statements. The vulnerability is general and can be applied to any proof system that implements the Fiat-Shamir transformation insecurely. In this post, I will show how to exploit a Frozen Heart vulnerability in Girault’s proof of knowledge.

Zero-Knowledge Proofs and the Fiat-Shamir Transformation

This post assumes that you possess some familiarity with zero-knowledge proofs. If you would like to read more about them, there are several helpful blog posts and videos available to you, such as Matt Green’s primer. To learn more about the Fiat-Shamir transformation, check out the blog post I wrote explaining it in more detail. You can also check out ZKDocs for more information about both topics.

Girault’s Proof of Knowledge

In Girault’s proof of knowledge protocol, the prover proves that he knows the discrete log of a certain value over a composite modulus. In other words, a prover convinces a verifier that he knows some secret value, x, such that h = g^(-x) mod N, where g is a high-order generator and N is some composite modulus (e.g., N = p * q, where p and q are two primes). To learn more about the protocol, check out our description on ZKDocs. If you are familiar with Schnorr proofs, think of Girault’s proof of knowledge as Schnorr proofs over a composite modulus rather than a prime modulus.

The protocol is interactive by default, but it can be made non-interactive using the Fiat-Shamir transformation:

The interactive and non-interactive versions of Girault’s proof of knowledge from ZKDocs (source)

At a high level, the prover first computes the commitment, u, by generating a random value, r, and computing u = g^r. The prover then obtains the random challenge value, e (either from the verifier in the interactive version or from herself by using the Fiat-Shamir transformation in the non-interactive version), and calculates the final proof value, z = r + x * e. This protocol is proven to be secure as long as the discrete log can’t be computed over the group that is used and the Fiat-Shamir challenge is computed correctly.

The Frozen Heart Vulnerability in Girault’s Proof of Knowledge

But what happens if we don’t compute the Fiat-Shamir challenge correctly? Let’s look at an example. In ZenGo’s implementation of Girault’s proof of knowledge (before it was patched), e is computed by hashing g, N, and u, but not h (in this implementation, h is represented by statement.ni). In part 1 of this series, we introduced a rule of thumb for securely implementing the Fiat-Shamir transformation: the Fiat-Shamir hash computation must include all public values from the zero-knowledge proof statement and all public values computed in the proof (i.e., all random “commitment” values). Therefore, failure to include h in this computation introduces a Frozen Heart vulnerability, allowing malicious provers to forge proofs for random h values, even if they do not know its discrete log.

Let’s walk through how to exploit this Frozen Heart vulnerability. We first choose random values for both u and z. Since h is not included in the Fiat-Shamir implementation, we can now compute e = Hash(g,N,u). Next, we need to find an h value that will pass the verification check: u = (g^z)(h^e) mod N. We already know all of the values except for h: we generated u and z, we computed e, and g and N are both public. Therefore, we can solve for h:

u = (g^z)(h^e) mod N
h^e = u * g^(-z) mod N
e_inv = e^(-1) mod phi(N)
h = (u * g^(-z))^e_inv mod N

If we plug in this h value, we know the second verification check will pass because we chose h specifically to pass this check. The only other check performed is the check for the challenge value, e, but this check will pass as well because we’ve computed this value identically. Keep in mind that we simply picked a random u. This means that we don’t actually know the discrete log of u (i.e., it’s infeasible to find a value t such that g^t = u mod N). Since we don’t know the discrete log of u, we don’t know the discrete log of h. However, we have tricked the verifier into accepting this proof as valid, even without knowing this discrete log, which means we have successfully forged a proof.

Note: in order to compute e_inv, the malicious prover needs to be able to compute phi(N), which would require knowing the prime factors of N.
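To see the algebra in action, here is a self-contained toy sketch of the forgery in Python. The parameters are tiny and insecure, the hash-to-integer step is simplified, and (per the note above) the attacker is assumed to know the factorization of N:

# Toy forgery against the broken variant of Girault's proof (the Fiat-Shamir
# hash omits h). Tiny, insecure parameters for illustration only; requires
# Python 3.8+ for pow(x, -1, m) modular inverses.
import hashlib
import math

def hash_to_int(*values) -> int:
    data = b"|".join(str(v).encode() for v in values)
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

p, q = 1019, 1031            # toy primes known to the attacker
N = p * q
phi_N = (p - 1) * (q - 1)
g = 2                        # toy generator

z = 31337                    # arbitrary "response" value
u = 1
while True:                  # pick u until the challenge is invertible mod phi(N)
    u += 1
    e = hash_to_int(g, N, u)                 # broken: h is not hashed
    if math.gcd(u, N) == 1 and math.gcd(e, phi_N) == 1:
        break

# Solve u = (g^z)(h^e) mod N for h, exactly as in the derivation above.
e_inv = pow(e, -1, phi_N)
h = pow((u * pow(g, -z, N)) % N, e_inv, N)

# The verifier recomputes e the same (broken) way and accepts the proof,
# even though we never knew the discrete log of h.
assert e == hash_to_int(g, N, u)
assert u == (pow(g, z, N) * pow(h, e, N)) % N
print("forged proof accepted for h =", h)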

Frozen Heart’s Impact on Girault’s Proof of Knowledge

Frozen Heart vulnerabilities are critical in Girault’s proof of knowledge because they allow attackers to forge proofs. However, the impact of this vulnerability on an application using this proof system depends entirely on how the proof system is used. If Girault’s proof of knowledge is used simply for standalone public keys (i.e., keys that are not used as part of some larger protocol), then a Frozen Heart vulnerability may not be that severe.

The reason for this is that, for Girault’s scheme, a Frozen Heart vulnerability makes it possible to forge random h values. But that’s not any more powerful than generating a random x and computing h = g^(-x), which results in a random h that we can construct a proof for. However, if this proof system is used within a larger, more complex protocol—such as a threshold signature scheme, which requires that the proof be unforgeable—then a Frozen Heart vulnerability is likely very severe.

Even though Frozen Heart vulnerabilities might not be critical for Girault’s scheme (and Schnorr’s scheme, for the same reasons) in certain contexts, this is not true for more complex proof systems. To see this in more detail, check out the upcoming part 3 post of this series, in which we explore the Frozen Heart vulnerability on the Bulletproofs proof system.

Coordinated disclosure of vulnerabilities affecting Girault, Bulletproofs, and PlonK

By Jim Miller

Trail of Bits is publicly disclosing critical vulnerabilities that break the soundness of multiple implementations of zero-knowledge proof systems, including PlonK and Bulletproofs. These vulnerabilities are caused by insecure implementations of the Fiat-Shamir transformation that allow malicious users to forge proofs for random statements.

We’ve dubbed this class of vulnerabilities Frozen Heart. The word frozen is an acronym for FoRging Of ZEro kNowledge proofs, and the Fiat-Shamir transformation is at the heart of most proof systems: it’s vital for their practical use, and it’s generally located centrally in protocols. We hope that a catchy moniker will help raise awareness of these issues in the cryptography and wider technology communities.

This is a coordinated disclosure: we have notified those parties who we know were affected, and they remediated the issues prior to our publication. The following repositories were affected:

The vulnerabilities in one of these proof systems, Bulletproofs, stem from a mistake in the original academic paper, in which the authors recommend an insecure Fiat-Shamir generation. In addition to disclosing these issues to the above repositories, we’ve also reached out to the authors of Bulletproofs who have now fixed the mistake.

Vulnerabilities stemming from Fiat-Shamir implementation issues are not new (e.g., see here and here), and they certainly were not discovered by us. Unfortunately, in our experience, they are incredibly pervasive throughout the zero-knowledge ecosystem. In fact, we have reported several similar vulnerabilities to some of our previous clients. What’s surprising is that despite the pervasiveness of these issues, the cryptography and security communities at large do not appear to be aware of them.

In this blog post, I will first describe the Frozen Heart vulnerability at a high level. I will then discuss the reasons Frozen Heart vulnerabilities occur so commonly in practice (spoiler alert: bad documentation and guidance), steps that the community can take to prevent them, and Trail of Bits’ role in leading that charge. Lastly, I will provide the details of our coordinated disclosure.

This is part 1 of a series of posts on the Frozen Heart vulnerability. In parts 2, 3, and 4, I’ll describe how these vulnerabilities actually work in practice by covering their impact on Girault’s proof of knowledge, Bulletproofs, and PlonK, respectively. Make sure to watch out for these posts in the coming days!

Zero-Knowledge Proofs and the Fiat-Shamir Transformation

This post assumes that you possess some familiarity with zero-knowledge proofs. If you would like to read more about them, there are several helpful blog posts and videos available to you, such as Matt Green’s primer. To learn more about the Fiat-Shamir transformation, check out the blog post I wrote explaining it in more detail. You can also check out ZKDocs for more information about both topics.

The Frozen Heart Vulnerability

The Fiat-Shamir transformation is applied to proof systems with the following structure:

  1. The prover generates a random value: the commitment.
  2. The verifier responds with a random value: the challenge.
  3. The prover then uses the commitment, the challenge, and her secret data to generate the zero-knowledge proof.

Each proof system is accompanied by a security proof, which guarantees that it’s infeasible for an attacker to forge proofs as long as certain assumptions are met. For proof systems of this structure, one very important assumption is that the challenge value generated by the verifier is completely unpredictable and uncontrollable by the prover. At a high level, the reason is that the zero-knowledge proof generated by the prover is considered valid only if it satisfies a very specific mathematical relationship. The prover may satisfy this relationship in one of two ways:

  1. He actually has the necessary secret data to satisfy the relationship, and he generates a zero-knowledge proof the honest way.
  2. He does not have the necessary secret data, but he guesses random values and gets lucky.

For a secure proof system, it’s effectively impossible for malicious provers to achieve option 2 (i.e., they have only a one in 2^128 probability of guessing right) as long as the random challenge is completely unpredictable. But if a malicious prover can predict this value in a certain way, it’s actually easy for him to commit a proof forgery by finding random values that will satisfy the necessary mathematical relationship. This is exactly how the Frozen Heart vulnerability works.

This vulnerability is generic; it can be applied to any proof system that insecurely implements the Fiat-Shamir transformation. However, the exact details of the vulnerability depend on the proof system in question, and the impact of the vulnerability depends on how the surrounding application uses the proof system. For examples on how the vulnerability impacts different systems, be sure to read my upcoming posts describing Frozen Heart vulnerabilities in Girault’s proof of knowledge, Bulletproofs, and PlonK. You can also check out my previous Fiat-Shamir blog post, in which I show how this works for the Schnorr proof system.

Preventing Frozen Heart Vulnerabilities

The Frozen Heart vulnerability can affect any proof system using the Fiat-Shamir transformation. To protect against these vulnerabilities, you need to follow this rule of thumb for computing Fiat-Shamir transformations: the Fiat-Shamir hash computation must include all public values from the zero-knowledge proof statement and all public values computed in the proof (i.e., all random “commitment” values).

Here, the zero-knowledge proof statement is a formal description of what the proof system is proving. As an example, let’s look at the Schnorr proof system. The (informal) proof statement for the Schnorr proof system is the following: the prover proves that she knows a secret value, x, such that h = g^x mod q, where q is a prime number and g is a generator of a finite group. Here, h, g, and q are all public values, and x is a private value. Therefore, we need to include h, g, and q in our Fiat-Shamir calculation.

But we’re not done. In step 1 of the protocol, the prover generates a random value, r, and computes u = g^r (here, u is “the commitment”). This u value is then sent to the verifier as part of the proof, so it is considered public as well. Therefore, u needs to be included in the Fiat-Shamir calculation. You can see that all of these values are included in the hash computation in ZKDocs.
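
To make the rule concrete, here is a minimal Python sketch of a Schnorr prover and verifier that follow it. The hash encoding, helper names, and toy parameters are illustrative assumptions rather than the exact ZKDocs construction; the point is simply that g, q, h, and the commitment u all feed into the challenge.

import hashlib
import secrets

def fiat_shamir_challenge(*public_values):
    # Hash every public value from the statement (g, q, h) and every public
    # value computed in the proof (the commitment u).
    data = b"|".join(str(v).encode() for v in public_values)
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def schnorr_prove(x, g, q, h):
    r = secrets.randbelow(q - 1)           # random nonce
    u = pow(g, r, q)                       # the commitment
    e = fiat_shamir_challenge(g, q, h, u)  # include g, q, h, AND u
    z = (r + e * x) % (q - 1)              # response
    return u, z

def schnorr_verify(proof, g, q, h):
    u, z = proof
    e = fiat_shamir_challenge(g, q, h, u)  # recomputed the same way
    return pow(g, z, q) == (u * pow(h, e, q)) % q

# Toy example: q is a small prime, x is the prover's secret.
q, g = 1019, 2
x = 123
h = pow(g, x, q)
assert schnorr_verify(schnorr_prove(x, g, q, h), g, q, h)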

A Frozen Heart vulnerability is introduced if some of these values (specifically, h or u) are missing. If these values are missing, a malicious prover can forge proofs for random h values for which she does not know the discrete log. Again, to see how this works, read my previous Fiat-Shamir blog post.

The details are crucial here. Even for the Schnorr proof system, which is arguably the simplest zero-knowledge proof protocol used in practice, it can be easy to make mistakes. Just imagine how easy it is to introduce mistakes in complex proof systems that use multiple Fiat-Shamir transformations, such as Bulletproofs and PlonK.

The Problem

Why is this type of vulnerability so widespread? It really comes down to a combination of ambiguous descriptions in academic papers and a general lack of guidance around these protocols.

Let’s look at PlonK as an example. If you inspect the details of the protocol, you will see that the authors do not explicitly explain how to compute the Fiat-Shamir transformation (i.e., hash these values). Instead, the authors instruct users to compute the value by hashing the “transcript.” They do describe what they mean by “transcript” in another location, but never explicitly. A few implementations of the proof system got this computation wrong simply because of the ambiguity around the term “transcript.”

Believe it or not, PlonK’s description is actually better than the descriptions in most papers. Sometimes the authors of a paper will actually specify an insecure implementation, as was the case for Bulletproofs (which we’ll see in an upcoming follow-up post). Furthermore, most academic papers present only the interactive version of the protocol. Then, after they’ve described the protocol, they mention in passing that it can be made non-interactive using the Fiat-Shamir transformation. They provide you with the original Fiat-Shamir publication from the 1980s, but they rarely say how this technique should actually be used!

Can you imagine if all cryptographic primitives had such documentation issues? We have an entire RFC dedicated to generating nonces deterministically for ECDSA, but it is still implemented incorrectly. Imagine if there were no standards or guidance for ECDSA, and developers could implement the algorithm only by reading its original paper. Imagine if this paper didn’t explicitly explain how to generate these nonces and instead pointed the reader to a technique from a paper written in the 1980s. That’s essentially the current state of most of these zero-knowledge proof systems.

To be clear, I’m not trying to condemn the authors of these papers. These protocols are incredibly impressive and already take a lot of work to build, and it’s not the job of these authors to write comprehensive implementation details for their protocols. But the problem is that there really is no strong guidance for these protocols, and so developers have to rely largely on academic papers. So, my point is not that we should blame the authors, but rather that we shouldn’t be surprised at the consequences.

The Solutions

The best way to address these issues is to produce better implementation guidance. Academic papers are not designed to be comprehensive guides for implementation. A developer, particularly a non-cryptographer, using only guidance from these papers to implement a complex protocol is likely to make an error and introduce these critical vulnerabilities. This is the exact reason we created ZKDocs. With ZKDocs, we aim to provide clearer implementation guidance, focusing particularly on areas of protocols that are commonly messed up, such as Fiat-Shamir transformations. If you’re creating your own zero-knowledge proof implementation, check out our Fiat-Shamir section of ZKDocs!

It’s also worth mentioning that these issues could be more or less eradicated if test vectors for these protocols were widely available. Implementations with incorrect Fiat-Shamir transformations would fail tests using test vectors from correct implementations. However, given the limited guidance for most of these proof systems, producing test vectors for all of them seems unlikely.

Lastly, investigating these Frozen Heart vulnerabilities was a good reminder of the value of code audits. My team and I reviewed a lot of public repositories, and we found a good number of implementations that performed these transformations correctly. Most of these implementations were built by groups of people who performed internal and, typically, external code audits.

Coordinated Disclosure

Prior to our disclosure, my teammates and I spent the last few months researching and reviewing implementations for as many proof systems as possible. Once our research was complete, we disclosed the vulnerabilities to ZenGo, ING Bank, SECBIT Labs, Adjoint Inc., Dusk Network, Iden3, ConsenSys, and all of their active forks on March 18, 2022. We also contacted the authors of the Bulletproofs paper on this date.

As of April 12, 2022, ZenGo, Dusk Network, Iden3, and ConsenSys have patched their implementations with the required fixes. ING Bank has deleted its vulnerable repository. The authors of the Bulletproofs paper have updated their section on the Fiat-Shamir transformation. We were not able to get in contact with SECBIT Labs or Adjoint Inc.

We would like to thank ZenGo, ING Bank, Dusk Network, Iden3, ConsenSys, and the Bulletproofs authors for working swiftly with us to address these issues.

Acknowledgments

I would like to thank each of my teammates for assisting me in reviewing public implementations and David Bernhard, Olivier Pereira, and Bogdan Warinschi for their work on this vulnerability. Lastly, a huge thanks goes to Dr. Sarang Noether for helping me understand these issues more deeply.

Towards Practical Security Optimizations for Binaries

By Michael D. Brown, Senior Security Engineer

To be thus is nothing, but to be safely thus. (Macbeth: 3.1)

It’s not enough that compilers generate efficient code; they must also generate safe code. Despite the extensive testing and correctness certification that goes into developing compilers and their optimization passes, they may inadvertently introduce information leaks into programs or eliminate security-critical operations the programmer wrote in source code. Let’s take a look at an example.

Figure 1 shows an example of CWE-733, a weakness in which a compiler optimization removes or modifies security-critical code. In this case, the source code written by the programmer sanitizes a variable that previously held a cryptographic key by setting it to zero. This is an important step! If the programmer doesn’t sanitize the variable, the key may be recoverable by an attacker later. However, when this code is compiled, the sanitization operation is likely to be removed by a compiler optimization pass called dead store elimination.

This pass optimizes programs by eliminating what it thinks are unnecessary variable assignment operations. It assumes that values assigned to a variable are unnecessary if they are not used later in the program, which unfortunately includes our sanitization code.

Figure 1: Compiler Optimization Removal or Modification of Security-critical Code (CWE-733)

This example is one of several well-documented instances of a compiler optimization inadvertently introducing a security weakness into a program. Recently, my colleagues at Georgia Tech and I published an extensive study of how compiler design choices impact another security property of binaries: malicious code reusability. We discovered that compiler code generation and optimization behaviors generally do not consider malicious reusability. As a result, they produce binaries that are generally more reusable by an attacker than is necessary.

Building powerful code reuse exploits

Attackers use code reuse exploit techniques such as return- and jump-oriented programming (ROP and JOP) to evade malicious code injection defenses. Instead of injecting malicious code, attackers using these techniques reuse snippets of a vulnerable program’s executable code, called gadgets, to write their exploit payloads.

Gadgets consist of one or more binary instructions that perform a useful computational task followed by an indirect branch instruction (return, indirect jump, indirect call) that terminates the gadget. The final control-flow instruction is used to chain one or more gadgets together. Gadgets can be thought of as individual instructions that can be used to write an exploit program.

Since the gadgets are part of the vulnerable program, defenses that prevent injected code from being executed won’t stop the exploit. The downside from the attacker’s perspective is that they are potentially limited in their exploit programming by the gadgets that are available. Ultimately, the gadgets available to the attacker (and how useful they are) are a function of the compiler’s code generation and optimization behaviors, because it is the compiler that produces the program’s binary code.

A simple example of a ROP payload is depicted in Figure 2. The payload consists of a chain of gadget addresses (i.e. the exploit) with some necessary data interspersed (i.e. the exploit’s inputs). The attacker first exploits a memory corruption vulnerability like a stack-based buffer overflow (CWE-121) to place the payload on the stack in such a way that the return address on the stack is overwritten with the address of the first gadget in the chain. This redirects program execution to the first gadget in the chain.

Figure 2: Example ROP Gadget Chain
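
As a rough illustration of how such a payload is assembled, the short Python sketch below packs a three-entry chain like the one in Figure 2. The gadget addresses, the string address, and the buffer offset are all hypothetical values standing in for what an attacker would recover from the vulnerable binary.

import struct

def p64(value):
    # Pack a 64-bit little-endian value, exactly as it will sit on the stack.
    return struct.pack("<Q", value)

# Hypothetical addresses found in the vulnerable binary (illustrative only).
POP_RDI_RET = 0x401234   # gadget: pop rdi ; ret
BIN_SH_STR  = 0x404050   # address of the string "/bin/sh"
SYSTEM_PLT  = 0x401050   # system() entry in the PLT

OFFSET_TO_RETURN_ADDRESS = 72   # distance from the buffer to the saved return address

payload  = b"A" * OFFSET_TO_RETURN_ADDRESS  # filler that overflows the buffer
payload += p64(POP_RDI_RET)  # first gadget: pops the next stack word into rdi...
payload += p64(BIN_SH_STR)   # ...which is the data interspersed within the chain
payload += p64(SYSTEM_PLT)   # the gadget's ret then "returns" into system("/bin/sh")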

Summary of study findings

In our study, we analyzed over 1,000 variants of 20 different programs compiled with GCC and clang to determine how optimization behaviors affect the set of gadgets available in the output binaries. We used a static analysis tool, GSA, to measure changes in gadget set size, utility, and composability before and after applying optimization options. Starting at a high level, we first discovered that optimizations increased gadget set size, utility, and/or composability in approximately 85% of cases.

Diving deeper, we performed differential binary analysis on several program variants to identify the root causes of these effects. We identified several compiler behaviors that both directly and indirectly contribute to this problem. Two behaviors stood out as the most impactful: duplicating indirect branch instructions and code layout changes.

Duplicating indirect branch instructions

Chaining gadgets together to create exploit programs relies on the control-flow instructions at the end of each gadget. Since each gadget must end with one of these instructions, the more indirect branch transfers a program has, the more likely it is to have a large number of unique and useful gadgets. Many compiler optimizations improve performance by selectively duplicating these instructions, resulting in increases to gadget set size and utility.

This behavior is most apparent with GCC’s omit frame pointer optimization, which eliminates frame pointer setup and restore instructions at the beginning and end of functions that do not need them. In many cases, such as the one shown in Figure 3, eliminating the pointer restore instruction at the end of a function creates an opportunity to optimize further by duplicating the indirect control flow instruction (retn) at the end of the function. While this secondary optimization slightly reduces code size and execution time, it creates one or more copies of the retn instruction. In turn, this introduces several more gadgets into the program, including ones that may be useful to an attacker.

Figure 3: Duplication of retn instruction by GCC Omit Frame Pointer optimization

Binary layout changes

In general, optimization behaviors insert, remove, or alter instructions in such a way that changes the size of code blocks and functions. This causes changes in how blocks and functions are eventually laid out in binary format, which in turn requires changes to displacements used in control-flow instructions throughout the program.

In some cases, the new displacements contain the binary encoding of an indirect branch instruction, like the example shown in Figure 4. Here, a conditional jump instruction with a short 1-byte displacement in unoptimized code changes to an equivalent conditional jump instruction with a near 4-byte displacement as a byproduct of a layout change caused by optimization. This new displacement encodes the retn (i.e., 0xC3) indirect branch instruction. Even though the displacement is not intended to be an instruction, it can be decoded as one during an exploit because x86_64 uses unaligned, variable-length instructions. The bytes preceding an indirect branch instruction encoding can be decoded as a gadget if they happen to encode valid instructions (which is quite likely given the density of the x86_64 ISA). These gadgets are called “unintended” or “unaligned” gadgets.

Figure 4: Binary layout change resulting in introduced gadget terminating instruction encoding
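
To see the effect concretely, the Python sketch below (using the capstone disassembler, which is not part of the study’s tooling) decodes the same six bytes twice: once as the compiler intended, and once starting inside the jump’s displacement, where the 0xC3 byte decodes as a ret. The byte sequence is constructed purely for illustration.

from capstone import Cs, CS_ARCH_X86, CS_MODE_64

# je with a near 4-byte displacement of 0x000000C3; the 0xC3 byte is data here.
code = bytes.fromhex("0f84c3000000")
md = Cs(CS_ARCH_X86, CS_MODE_64)

print("intended (aligned) decode:")
for insn in md.disasm(code, 0x1000):
    print(f"  {insn.address:#x}: {insn.mnemonic} {insn.op_str}")

print("unaligned decode starting inside the displacement:")
for insn in md.disasm(code[2:], 0x1002):
    # The first byte, 0xC3, decodes as ret -- a gadget-terminating instruction.
    print(f"  {insn.address:#x}: {insn.mnemonic} {insn.op_str}")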

What can we do to address this?

The various behaviors we discovered all have a common property: they are secondary to or independent of the desired optimization. This means we can undo gadget-introducing behaviors without completely sacrificing performance.

Ideally, compilers would patch their optimizations to remove these behaviors. Unfortunately, instruction duplication behaviors are common to many different optimization passes, and gadget encodings introduced through displacement changes can’t be detected during optimization, as binary layout happens much later.

Fortunately, binary recompilers such as Egalito are well-suited to address this problem. Egalito allows us to transform program binaries in a layout-agnostic manner regardless of the compiler used to generate the binary. This provides many advantages to the problem at hand. First, we can implement recompiler passes to undo negative behaviors once for Egalito instead of for each problematic optimization in each compiler. Additionally, we can undo negative behaviors in programs without access to source code or special compilers!

Practical binary security optimizations

We built an initial set of five binary optimization passes for Egalito that eliminate gadgets carelessly introduced into binaries by compilers:

  1. Return Merging: Merge all return instructions in a function to a single instance.
  2. Indirect Jump Merging: Merge all indirect jump instructions targeting the same register in a function to a single instance.
  3. Instruction Barrier Widening: Eliminate unintended special-purpose gadgets that span consecutive intended instructions.
  4. Offset/Displacement Sledding: Eliminate gadgets rooted in jump displacements.
  5. Function Reordering: Eliminate gadgets rooted in call offsets.

Next, we evaluated the impact of these optimization passes on gadget sets and performance by applying them to several of our study binaries. We found that our passes:

  1. Eliminated 31.8% of useful gadgets on average
  2. Reduced the overall utility of gadget sets in 78% of variants
  3. Eliminated one or more special purpose gadget types (e.g., syscall gadgets) in 75% of variants
  4. Had no effect on execution speed
  5. Increased code size by only 6.1 kB on average

Conclusion

Compiler code generation and optimization behaviors have a massive impact on a binary’s gadget set. But due to a lack of attention to latent security properties, these behaviors generally create binaries with gadget sets that are easier for attackers to reuse in exploits. There are many root causes for this; however, it is possible to mitigate and undo negative behaviors with simple code transformations that do not sacrifice performance.

While our initial research on this problem has yielded some promising results, it is not exhaustive in dealing with problematic compiler behaviors. Over the coming year, I will be working on additional transformations to address other issues such as problematic register allocations. Additionally, I will be examining how these optimizations may have secondary benefits, such as reducing the performance costs of employing other code reuse defenses such as Control-Flow Integrity (CFI).

Acknowledgements

This research was conducted with my co-authors at Georgia Tech and Georgia Tech Research Institute: Matthew Pruett, Robert Bigelow, Girish Mururu, and Santosh Pande.

Optimizing a smart contract fuzzer

By Sam Alws

During my winternship, I applied code analysis tools, such as GHC’s Haskell profiler, to improve the efficiency of the Echidna smart contract fuzzer. As a result, Echidna is now over six times faster!

Echidna overview

To use Echidna, users provide smart contracts and a list of conditions that should be satisfied no matter what happens (e.g., “a user can never have a negative number of coins”). Then, Echidna generates a large number of random transaction sequences, calls the contracts with these transaction sequences, and checks that the conditions are still satisfied after the contracts execute.

Echidna uses coverage-guided fuzzing; this means it not only uses randomization to generate transaction sequences, but it also considers how much of the contract code was reached by previous random sequences. Coverage allows bugs to be found more quickly since it favors sequences that go deeper into the program and that touch more of its code; however, many users have noted that Echidna runs far slower when coverage is on (over six times slower on my computer). My task for the internship was to track down the sources of slow execution time and to speed up Echidna’s runtime.
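
For readers unfamiliar with the technique, the Python sketch below shows the general shape of a coverage-guided fuzzing loop. It is deliberately simplified and is not Echidna’s (Haskell) implementation; run_transactions, mutate, and new_random_sequence are placeholder callables supplied by the caller.

import random

def coverage_guided_fuzz(run_transactions, mutate, new_random_sequence, iterations=10000):
    # run_transactions(seq) is assumed to return (coverage_set, violations).
    corpus = [new_random_sequence()]
    seen_coverage = set()
    for _ in range(iterations):
        # Usually mutate a sequence that previously reached new code,
        # but occasionally start from a fresh random sequence.
        seed = random.choice(corpus) if random.random() < 0.8 else new_random_sequence()
        candidate = mutate(seed)
        coverage, violations = run_transactions(candidate)
        if violations:
            return candidate                  # a sequence that breaks an invariant
        if not coverage <= seen_coverage:     # reached code we had not seen before
            seen_coverage |= coverage
            corpus.append(candidate)          # keep it as a seed for later mutations
    return None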

Optimizing Haskell programs

Optimizing Haskell programs is very different from optimizing imperative programs since the order of execution is often very different from the order in which the code is written. One issue that often occurs in Haskell is very high memory usage due to lazy evaluation: computations represented as “thunks” are stored to be evaluated later, so the heap keeps expanding until it runs out of space. Another simpler issue is that a slow function might be called repeatedly when it needs to be called only once and have its result saved for later (this is a general problem in programming, not specific to Haskell). I had to deal with both of these issues when debugging Echidna.

Haskell profiler

One feature of Haskell that I made extensive use of was profiling. Profiling lets the programmer see which functions are taking up the most memory and CPU time and look at a flame graph showing which functions call which other functions. Using the profiler is as simple as adding a flag at compile time (-prof) and another pair of flags at runtime (+RTS -p). Then, given the plaintext profile file that is generated (which is very useful in its own right), a flame graph can be made using this tool; here’s an example:

This flame graph shows how much computing time each function took up. Each bar represents a function, and its length represents how much time it took up. A bar stacked on another bar represents one function calling another. (The colors of the bars are picked at random, for aesthetic reasons and readability.)

The profiles generated from running Echidna on sample inputs showed mostly the usual expected functions: functions that run the smart contracts, functions that generate inputs, and so on. One that caught my eye, though, is a function called getBytecodeMetadata, which scans through contract bytecodes and looks for the section containing the contract’s metadata (name, source file, license, etc.). This function needed to be called only a few times at the start of the fuzzer, but it was taking up a large portion of the CPU and memory usage.

A memoization fix

Searching through the codebase, I found a problem slowing down the runtime: the getBytecodeMetadata function is called repeatedly on the same small set of contracts in every execution cycle. By storing the return value from getBytecodeMetadata and then looking it up later instead of recalculating, we could significantly improve the runtime of the codebase. This technique is called memoization.

Adding in the change and testing it on some example contracts, I found that the runtime went down to under 30% of its original time.
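
For readers unfamiliar with memoization, here is a small, self-contained Python illustration of the idea; the Echidna change itself is in Haskell, and the scan below is a stand-in rather than the real metadata parser.

from functools import lru_cache

def expensive_scan(bytecode):
    # Stand-in for the costly search through the contract bytecode.
    return sum(bytecode) % 256

@lru_cache(maxsize=None)
def get_bytecode_metadata(bytecode):
    # With the cache, repeated calls on the same bytecode do the scan only once,
    # mirroring the memoization applied to getBytecodeMetadata in Echidna.
    return expensive_scan(bytecode)

# Every execution cycle queries the same small set of contracts; after the
# first call per contract, the answer comes straight from the cache.
contracts = (b"\x60\x80\x60\x40" * 1000, b"\x60\x60\x60\x40" * 1000)
for _ in range(100_000):
    for bytecode in contracts:
        get_bytecode_metadata(bytecode)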

A state fix

Another issue I found was with Ethereum transactions that run for a long time (e.g., a for loop with a counter going up to one million). Echidna could not process these transactions because it ran out of memory. The cause of this problem was Haskell’s lazy evaluation filling the heap with unevaluated thunks.

Luckily, the fix for this problem had already been found by someone else and suggested on GitHub. The fix had to do with Haskell’s State data type, which is used to make it more convenient (and less verbose) to write functions that pass around state variables. The fix essentially consisted of avoiding the use of the State data type in a certain function and passing state variables around manually instead. The fix hadn’t been added to the codebase because it produced different results than the current code, even though it was supposed to be a simple performance fix that didn’t affect the behavior. After dealing with this problem and cleaning up the code, I found that it not only fixed the memory issue, but it also improved Echidna’s speed. Testing the fix on example contracts, I found that the runtime typically went down to 50% of its original time.

For an explanation of why this fix worked, let’s look at a simpler example. Let’s say we have the following code that uses the State data type to make a simple change on the state for all numbers from 50 million down to 1:

import Control.Monad.State.Strict

-- if the state is even, divide it by 2 and add num, otherwise just add num
stateChange :: Int -> Int -> Int
stateChange num state
  | even state = (state `div` 2) + num
  | otherwise = state + num

stateExample :: Int -> State Int Int
stateExample 0 = get
stateExample n = modify (stateChange n) >> stateExample (n - 1)

main :: IO ()
main = print (execState (stateExample 50000000) 0)

This program runs fine, but it uses up a lot of memory. Let’s write the same piece of code without the State data type:

stateChange :: Int -> Int -> Int
stateChange num state
  | even state = (state `div` 2) + num
  | otherwise = state + num

stateExample' :: Int -> Int -> Int
stateExample' state 0 = state
stateExample' state n = stateExample' (stateChange n state) (n - 1)

main :: IO ()
main = print (stateExample' 0 50000000)

This code uses far less memory than the original (46 KB versus 3 GB on my computer) because of Haskell compiler optimizations. (I compiled with the -O2 flag: ghc -O2 file.hs; ./file, or ghc -O2 -prof file.hs; ./file +RTS -s for memory allocation stats.)

Unoptimized, the call chain for the second example should be stateExample' 0 50000000 = stateExample' (stateChange 50000000 0) 49999999 = stateExample' (stateChange 49999999 $ stateChange 50000000 0) 49999998 = stateExample' (stateChange 49999998 $ stateChange 49999999 $ stateChange 50000000 0) 49999997 = …. Note the ever-growing (... $ stateChange 49999999 $ stateChange 50000000 0) term, which expands to fill up more and more memory until it is finally forced to be evaluated once n reaches 0.

However, the Haskell compiler is clever. It realizes that the final state will eventually be needed anyway and makes that term strict, so it doesn’t end up taking up tons of memory. On the other hand, when Haskell compiles the first example, which uses the State data type, there are too many layers of abstraction in the way, and it isn’t able to realize that it can make the (... $ stateChange 50000000 0) term strict. By not using the State data type, we are making our code simpler to read for the Haskell compiler and thus making it easier to implement the necessary optimizations.

The same thing was happening in the Echidna memory issue that I helped to solve: minimizing the use of the State data type helped the Haskell compiler to realize that a term could be made strict, so it resulted in a massive decrease in memory usage and an increase in performance.

An alternate fix

Another way to fix the memory issue with our above example is to replace the line defining stateExample n with the following:

stateExample n = do
  s <- get
  put $! stateChange n s
  stateExample (n-1)

Note the use of $! in the third line. This forces the evaluation of the new state to be strict, eliminating the need for optimizations to make it strict for us.
While this also fixes the problem in our simple example, things get more complicated with Haskell’s Lens library, so we chose not to use put $! in Echidna; instead we chose to eliminate the use of State.

Conclusion

The performance improvements that we introduced are already available in the 2.0.0 release. While we’ve achieved great success this round, this does not mean our work on Echidna’s fuzzing speed is done. Haskell is a language compiled to native code, and with enough profiling effort it can be very fast. We’ll continue to benchmark Echidna and keep an eye on slow examples. If you believe you’ve encountered one, please open an issue.

I thoroughly enjoyed working on Echidna’s codebase for my winternship. I learned a lot about Haskell, Solidity, and Echidna, and I gained experience in dealing with performance issues and working with relatively large codebases. I’d specifically like to thank Artur Cygan for setting aside the time to give valuable feedback and advice.

Maat: Symbolic execution made easy

By Boyan Milanov

We have released Maat, a cross-architecture, multi-purpose, and user-friendly symbolic execution framework. It provides common symbolic execution capabilities such as dynamic symbolic execution (DSE), taint analysis, binary instrumentation, environment simulation, and constraint solving.

Maat is easy to use, is based on Ghidra’s popular intermediate representation (IR) language, p-code, prioritizes runtime performance, and has both a C++ and a Python API. Our goal is to create a powerful and flexible framework that can be used both by experienced security engineers and by beginners who want to get started with symbolic execution.

While our Manticore tool offers a high-level interface to symbolically explore binaries, Maat is a lower-level symbolic execution toolkit that can be easily integrated into other projects or used to build stand-alone analysis tools. For a straight-to-the-point example, read our tutorial on how to solve a basic reverse engineering challenge with Maat.

A user-friendly, flexible API

Maat has a C++ programmatic API that can be used in low-level or performance-critical projects. It also offers Python bindings allowing users to easily and quickly write portable analysis scripts.

The API has been designed to give the user as much control as possible. Its debugger-like interface can be used to start, pause, and even rewind the symbolic execution process. Users can instrument the target code with arbitrary callback functions that are triggered by certain events (such as register and memory accesses and branch operations), write custom dynamic analyses, modify the program state at runtime, specify a particular state at which the process should stop, and even perform path exploration on a portion of a binary.

Last but not least, Maat’s execution engine has customizable settings that allow users to control its fundamental behavior in processing symbolic data. It includes policies for dealing with symbolic pointers, saving state constraints, and making symbolic simplifications, among other customizations. The default settings prioritize soundness over performance and suit the most general use cases, but advanced users can tailor the engine to their own use cases and bypass certain limitations of the defaults.

Rich architecture support

With Maat, we want to bring symbolic execution capabilities to as many architectures as possible. To do so, we based Maat’s symbolic execution engine on p-code, the IR language used by Ghidra. By basing Maat on p-code, we were able to leverage Ghidra’s awesome C++ library, sleigh, for disassembling and lifting binary code, which already supports a very broad range of architectures. The cherry on top: Maat uses a separate standalone version of sleigh, so you don’t have to install Ghidra to use Maat.

The use of sleigh brings three major advantages to Maat:

  • The ability to perform symbolic execution on any architecture supported by Ghidra
  • The reliability of a very popular, open-source, and actively supported disassembling and lifting library
  • The possibility to add additional architectures using the sleigh specification language

While Maat has been tested only on X86 and X64 so far, we plan to add interfaces for other architectures soon. We are particularly excited by the prospect of introducing support in Maat for exotic architectures that are not currently supported by any existing tool; sleigh’s unrivaled architecture support makes this possible. Another thrilling opportunity is the use of Maat to perform symbolic execution on virtual machine bytecode such as Java, Dalvik, and Ethereum.

Performance-driven

It can be a struggle to scale symbolic execution to real-world applications. For generic, binary-only symbolic execution tools, significant runtime overhead is inherent to lifting and executing an IR; it is simply unavoidable. That being said, in any reasonable day-to-day workflow, scripts that run within minutes instead of hours can make all the difference. We thus put care into the design and implementation of Maat so that it runs as fast as possible while also yielding useful results.

The core of Maat is written entirely in C++, many developers’ language of choice for optimizations and performance. We do our best to write efficient code without sacrificing code readability or restricting features. Maat’s runtime performance can vary widely depending on the amount of symbolic computations, on calls to the SMT solver, and on user-provided analysis callbacks; but our early experimental measurements are quite promising, with 100,000 to 300,000 instructions symbolically executed per second on a typical laptop (2.3 GHz Intel Core i7, 32 GB RAM).

We also plan on adding and exposing introspection capabilities to allow users to identify runtime bottlenecks. This will not only help end users to optimize their analysis scripts for their specific use cases but also enable us to make more fundamental improvements to Maat’s core components.

How to get started

Simply install Maat with python3 -m pip install pymaat! Check out our series of tutorials for guidance on using it. While this series offers a few basic tutorials, our long-term goal is to provide a more comprehensive series that covers the basics of the framework and advanced applications and complex features.

Curious readers can check out Maat’s source code on GitHub! Along with the tutorials, you will find installation instructions and C++/Python API documentation on Maat’s website.

Finally, join our GitHub discussions for questions and feedback—let us know what you think!

Part 2: Improving crypto code in Rust using LLVM’s optnone

By Henrik Brodin

Let’s implement crypto!

Welcome to the second part of our posts on the challenges of implementing constant-time Rust code. Part 1 discussed challenges with constant-time implementations in Rust and WebAssembly and how optimization barriers can mitigate risk. The Rust crypto community has responded with several approaches, and in this post, we will explore one such approach: the implementation of a feature in the Rust compiler (rustc) that provides users greater control over generated code. We will also explore the intermediate representation (IR) generated when this feature is implemented and consider problems that arise in later phases of code generation, such as instruction selection.

Revisiting constant time

A constant-time cryptographic algorithm always executes in the same amount of time regardless of the input. Not all of the operations need to execute in the same amount of time, but timing variations must not depend on secret data. If they did, an adversary could draw conclusions about the secret data. However, keeping secret data secret when compiling constant-time cryptographic code can be difficult. The compiler is required to preserve the “observable behavior” of the program, but since it has no notion of time (except for specialized compilers), it is also free to change the constant-time properties of code.

In practice, to prevent the compiler from altering carefully implemented constant-time code, we have to lie to the compiler. We have to tell it that we know things about the code that it cannot know—that we will read or write memory in ways that it cannot see, similar to multithreaded code. The option we are about to explore will instead tell the compiler not to use all its skills to generate efficient code.

We will start with the example in part 1: a choice, or a function that chooses either parameter a or b, depending on the choice parameter. Here it is without considering constant-time execution.

#[inline(never)]
fn conditional_select(a: u32, b: u32, choice: bool) -> u32 {
    if choice { a } else { b }
}

By adding the #[inline(never)] attribute, it’s easier to see the effects of our change. First, let’s compile the code and generate LLVM-IR from it.

rustc  --emit llvm-ir,link -C opt-level=0  test.rs

From this, a test.ll file is generated, in which we’ll find the IR for conditional_select.

; Function Attrs: noinline uwtable
define internal i32 @_ZN4test18conditional_select17h01ca56cd2cc74a72E(i32 %a, i32 %b, i1 zeroext %choice) unnamed_addr #1 {
start:
  %0 = alloca i32, align 4
  br i1 %choice, label %bb1, label %bb2

bb2:                                              ; preds = %start
  store i32 %b, i32* %0, align 4
  br label %bb3

bb1:                                              ; preds = %start
  store i32 %a, i32* %0, align 4
  br label %bb3

bb3:                                              ; preds = %bb2, %bb1
  %1 = load i32, i32* %0, align 4
  ret i32 %1
}

We can see there is a branch that depends on the choice variable. We need to avoid branches to keep information about the value of choice secret. Now let’s reimplement the function in a way that removes branches.

Constant-time implementation

#[inline(never)]
fn conditional_select(a: u32, b: u32, choice: bool) -> u32 {
    let mask = -(choice as i32) as u32;
    b ^ (mask & (a ^ b))
}

According to Rust’s documentation, “The bool represents a value, which could only be either true or false. If you cast a bool into an integer, true will be 1 and false will be 0.” Here, we use this property, converting our bool to an integer.

The variable mask will now be 0xffffffff or 0, depending on whether choice is true or false.

The xor and and operations will ensure that the return value is a if choice is true and b if choice is false.

Perfect! Let’s see whether we were able to generate the code without branches:

define internal i32 @_ZN4test18conditional_select17h01ca56cd2cc74a72E(i32 %a, i32 %b, i1 zeroext %choice) unnamed_addr #1 {
start:
  %0 = icmp ule i1 %choice, true
  call void @llvm.assume(i1 %0)
  %_6 = zext i1 %choice to i32
  %_8 = icmp eq i32 %_6, -2147483648
  %1 = call i1 @llvm.expect.i1(i1 %_8, i1 false)
  br i1 %1, label %panic, label %bb1

bb1:                                              ; preds = %start
  %_5 = sub i32 0, %_6
  %_12 = xor i32 %a, %b
  %_10 = and i32 %_5, %_12
  %2 = xor i32 %b, %_10
  ret i32 %2

panic:                                            ; preds = %start
; call core::panicking::panic
  call void @_ZN4core9panicking5panic17h7dfe23182f4d1104E([0 x i8]* 
nonnull align 1 bitcast ([31 x i8]* @str.4 to [0 x i8]*), i64 31, %
"core::panic::location::Location"* align 8 dereferenceable(24) bitcast 
(<{ i8*, [16 x i8] }>* @alloc56 to %"core::panic::location::Location"*)) 
#9
  unreachable
}

Yikes! What happened? The branches were removed from the core operations (see bb1). That’s a success, but there is more going on here. Now there is a conditional branch to panic, depending on the value of choice. This conditional branch is added by rustc in debug builds to detect signed integer overflows. In a production build, the signed integer overflow check will not be emitted and the branch will not exist.

However, one concerning detail is that there is a call to @llvm.assume, which depends on the value of choice. According to LLVM’s documentation,

The intrinsic allows the optimizer to assume that the provided condition is always true whenever the control flow reaches the intrinsic call. No code is generated for this intrinsic, and instructions that contribute only to the provided condition are not used for code generation. If the condition is violated during execution, the behavior is undefined.

Code generation could therefore be influenced by the value of choice. Examining the condition more closely, we see that it merely asserts that choice falls in the range [0,1]. What a relief! There is no leakage of secret information, since the assumption reveals only the range of choice (information that is already known), not its specific value.

It seems that we’ve reached our goal. Let’s ensure that things still look OK in an optimized build.

rustc  --emit llvm-ir,link -C opt-level=3  test.rs:
define internal fastcc i32 
@_ZN4test18conditional_select17h01ca56cd2cc74a72E(i32 %a, i32 %b, i1 
zeroext %choice) unnamed_addr #5 {
start:
  %0 = select i1 %choice, i32 %a, i32 %b
  ret i32 %0
}

Depending on the target architecture, the compiler may lower the select statement to different instructions. On x86, it could be lowered to a cmov instruction, while on other architectures, it may become a conditional branch. What’s worse, if you were to compile the non-constant-time version we started out with, you would get the exact same IR. All that work for nothing!

We can see that as long as the code is not compiled with optimizations enabled, the end result is what we would expect. On the other hand, enabling optimizations could break the constant-time properties of the code. This leads us to the question: can we prevent the compiler from optimizing the conditional_select function? Cargo and rustc accept parameters that disable optimizations globally, but doing that for the full system is typically not viable. One possible solution could be to prevent optimizations for a specific function. (This has previously been suggested as a way to improve the situation.)

Helping the compiler

Now that we have the desired IR code in debug builds, let’s explore how an LLVM attribute, optnone, can be used to disable optimizations at the function level. The LLVM documentation states the following:

This function attribute indicates that most optimization passes will skip this function, with the exception of interprocedural optimization passes. Code generation defaults to the “fast” instruction selector. This attribute cannot be used together with the alwaysinline attribute; this attribute is also incompatible with the minsize attribute and the optsize attribute.

This attribute requires the noinline attribute to be specified on the function as well, so the function is never inlined into any caller. Only functions with the alwaysinline attribute are valid candidates for inlining into the body of this function.

Our next goal is to mark the conditional_select function with the optnone attribute. According to the documentation, the function also requires the noinline attribute. As it happens, we already marked the function with that attribute with #[inline(never)].

We will implement an attribute in Rust that, when compiled, will generate the optnone and noinline attributes for the function.

Building a Rust compiler

To build and run the Rust compiler, refer to this guide. From this point on, we will assume that the command used to compile is rustc +stage1. To verify that the custom compiler is used, invoke it with the additional -vV flag. You should see output similar to the following:

rustc 1.57.0-dev
binary: rustc
commit-hash: unknown
commit-date: unknown
host: x86_64-apple-darwin
release: 1.57.0-dev
LLVM version: 13.0.0

Note the -dev version string, indicating a custom build.

Implementing the optnone attribute

There is already work done in this area; the Rust optimize attribute is already implemented to optimize for either the speed or size of a program. We are aiming to implement a “never” option for the optimize attribute. The goal is to write the conditional_select like this. (There are discussions about naming the “never” attribute. Naming is important, but for our purposes, we don’t need to focus on it.)

#[optimize(never)]
fn conditional_select(a: u32, b: u32, choice: bool) -> u32 {
    let mask = -(choice as i32) as u32;
    b ^ (mask & (a ^ b))
}

Annotating the function with the attribute in a non-optimized build would have no effect. In an optimized build, it would ensure that the optimizer does not touch the function.

To implement such an option, the first step is to extend the OptimizeAttr attribute with the Never member. We will use this value as an information carrier, from parsing to code generation.

#[derive(Clone, Encodable, Decodable, Debug, HashStable_Generic)]
pub enum OptimizeAttr {
    None,
    Never,
    Speed,
    Size,
}

Next, we add the following lines to codegen_fn_attr so that, when the symbol never is found in the optimize attribute, the OptimizeAttr::Never member we just added is emitted:

} else if list_contains_name(&items[..], sym::never) {
                    OptimizeAttr::Never

At this point, we can annotate a function internally in the Rust compiler with OptimizeAttr::Never. What remains is to ensure it is applied to the LLVM IR as well.

To do so, we add the following to from_fn_attrs. This code is what actually marks the LLVM function with the desired attributes when rustc discovers a function with the #[optimize(never)] attribute.

OptimizeAttr::Never => {
            llvm::Attribute::MinSize.unapply_llfn(Function, llfn);
            llvm::Attribute::OptimizeForSize.unapply_llfn(Function, llfn);
            llvm::Attribute::OptimizeNone.apply_llfn(Function, llfn);
            llvm::Attribute::NoInline.apply_llfn(Function, llfn); 
// noopt requires noinline
        }

Now, we can add the optnone and noinline attributes to the LLVM IR from an #[optimize(never)] Rust attribute. Still, there remains some bookkeeping to do.

We need to update the feature gate to include information about the never option in the optimize attribute.

// RFC 2412
    gated!(
        optimize, Normal, template!(List: "size|speed|never"), 
optimize_attribute,
        experimental!(optimize),
    ),

We can build a stage1 compiler to test our changes.

./x.py build -i library/std
rustup toolchain link stage1 build/x86_64-apple-darwin/stage1

Results

Finally, we are ready to test the new attribute. Let’s mark the conditional_select function with the #[optimize(never)] attribute and compile for opt-level=3. To enable the optimize attribute, we add #![feature(optimize_attribute)] to the test.rs file.

rustc  +stage1 --emit llvm-ir,link -C opt-level=3  test.rs:

#[optimize(never)]
fn conditional_select(a: u32, b: u32, choice: bool) -> u32 {
    let mask = -(choice as i32) as u32;
    b ^ (mask & (a ^ b))
}

You’ll find that the corresponding IR is now:

; test::conditional_select
; Function Attrs: noinline optnone uwtable
define internal fastcc i32 @_ZN4test18conditional_select17h01ca56cd2cc74a72E(i32 %a, i32 %b, i1 zeroext %choice) unnamed_addr #5 {
start:
  %0 = icmp ule i1 %choice, true
  call void @llvm.assume(i1 %0)
  %_6 = zext i1 %choice to i32
  %_5 = sub i32 0, %_6
  %_11 = xor i32 %a, %b
  %_9 = and i32 %_5, %_11
  %1 = xor i32 %b, %_9
  ret i32 %1
}

Success! The optnone and noinline attributes are in use and the IR instructions are as desired. Are we done now? Just create the pull request and merge? Hold your horses! Before doing so, we should of course implement tests (the interested reader can find them here).

But we will leave that aside for now. Instead, let’s turn to a different aspect of what we’ve just accomplished: the instruction-selection phase of code generation.

There is always a ‘but’

It seems we’ve made great progress or even solved the problem of generating constant-time code. This is partly true, but as is common with cryptography (and with compilers), it is not that simple. Although we’ve prevented the optimizer from rewriting the function, there is still an instruction-selection phase during code generation. During this phase, the compiler back end chooses any target instruction(s) it sees fit. This is an aspect that we addressed briefly. We implicitly assumed that an instruction in LLVM IR, such as an xor, would become an equivalent instruction in the target instruction set, such as the x86 xor instruction. While it is likely that an IR xor instruction would be implemented as xor in the target architecture, there’s no guarantee that it will. Code generation could also evolve over time, and what once became a specific instruction could change with a different version of the compiler.

To make things worse, there are optimizations in the machine code generation process. An example for x86 is the X86CmovConverterPass that will convert cmov into conditional branches in certain circumstances. This essentially translates a constant-time operation (cmov) to a non-constant-time conditional branch, which could re-enable timing-based side-channel attacks.

It doesn’t stop there. Once we reach the actual target-specific operations, there could still be data-dependent timing, such as executing a div on AMD:

The hardware integer divider unit has a typical latency of 8 cycles plus 1 cycle for every 9 bits of quotient. The divider allows limited overlap between two consecutive independent divide operations. “Typical” 64-bit divides allow a throughput of one divide per 8 cycles (where the actual throughput is data dependent).

Summary

Claims of constant-time execution become weak when the code is written in a high-level language such as Rust. This holds true for languages like C and C++ as well. There are too many factors that we cannot control.

Does this mean that all is lost? Is every crypto implementation not written in target-specific assembly language broken? Probably not, but these implementations have to rely on tricks and hopes of reasonable code generation.

There is almost always a trade-off, as is true in many areas—size versus speed, time to market versus quality, etc. There are large gains in implementing crypto in a memory-safe, modern language with strong analysis tooling available. However, hand-written, target-specific assembly language can make stronger claims about constant-time properties, with the drawback of potentially introducing memory safety issues.

To be able to make such claims for code written in Rust, there needs to be strong support from the compiler, from the front end, and all the way through to the target machine code generation in the back end. We probably need constant time to be a property that the compiler is aware of in order for it to preserve it. This is a major undertaking, and there are several ongoing discussions and proposals to get us there.

For now, we have to rely on what we have. A small step forward could be incorporating the never option for the optimize attribute to help.

Part 1: The life of an optimization barrier

By Fredrik Dahlgren

Many engineers choose Rust as their language of choice for implementing cryptographic protocols because of its robust security guarantees. Although Rust makes safe cryptographic engineering easier, there are still some challenges to be aware of. Among them is the need to preserve constant-time properties, which ensure that, regardless of the input, code will always take the same amount of time to run. These properties are important in preventing timing attacks, but they can be compromised by compiler optimizations.

Recently, a client asked us how packaging a library as an npm module using wasm-pack and then running it using node would affect constant-time cryptographic code implemented in Rust. Writing constant-time code is always a fight against the optimizer, but in this case, the code would be optimized twice before being executed, first by LLVM and then by the Turbofan JIT compiler. How would this affect the constant-time cryptographic code used by the library?

We ran a number of small test cases to explore how optimizing the codebase twice would affect the constant-time properties of the code. This post will focus on the challenges of implementing constant-time Rust code and show that LLVM may introduce a new side-channel when compiling constant-time code to WebAssembly (Wasm). In part 2, we look at whether it is possible to selectively disable optimizations for security-critical code that needs to execute in constant time.

Constant-time what?

Cryptography is difficult to implement correctly. This is true when you are implementing both high-level protocols and low-level cryptographic primitives. Apart from worrying about overall correctness and edge cases that could expose secrets in unexpected ways, potential side-channel leakages and timing attacks are also deep concerns.

A timing attack is an attempt to exploit the fact that the application’s execution time may subtly depend on the input. If the application makes control flow-related decisions based on secret data, like the seed for a random number generator or a private key, this could ever so slightly affect the execution time of the application. Likewise, if secret data is used to determine which location in memory to read from, this could cause cache misses, which in turn would affect the execution time of the application. In both cases, information about the secret data is leaked through timing differences during the execution of the program.
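
As a classic illustration (not from the original post, and sketched in Python for brevity), compare an early-exit byte comparison with a constant-time one. The first returns as soon as a byte differs, so its running time reveals how long a prefix of the secret the attacker has guessed correctly; the second examines every byte regardless of where the mismatch occurs.

import hmac

def insecure_equals(a, b):
    # Leaks timing: returns at the first mismatching byte.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equals(a, b):
    # hmac.compare_digest compares the full inputs without early exit.
    return hmac.compare_digest(a, b)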

To prevent such timing differences, cryptography engineers typically avoid implementing decisions based on secret data. However, in situations in which code needs to make decisions based on secret data, there are clever ways to implement them in constant time, that is, in a way that always executes in the same amount of time regardless of the input. For example, consider the following function, which performs a conditional selection between variables a and b in Rust.

#[inline]
fn conditional_select(a: u32, b: u32, choice: bool) -> u32 {
    if choice { a } else { b }
}

The function returns a if choice is true and b otherwise. Depending on the compiler toolchain and the targeted instruction set, the compiler could choose to implement the conditional selection using a branching instruction like jne on x86 or bne on ARM. This would introduce a timing difference in the execution of the function, which could leak information about the choice variable. The following Rust implementation uses a clever trick to perform the same conditional selection in constant time.

#[inline]
fn conditional_select(a: u32, b: u32, choice: u8) -> u32 {
    // if choice = 0, mask = (-0) = 0000...0000
    // if choice = 1, mask = (-1) = 1111...1111
    let mask = -(choice as i32) as u32;
    b ^ (mask & (a ^ b))
}

Here, we make no choices based on the choice secret value, which means that there is only one path through the function. Consequently, the execution time will always be the same.
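As a quick functional sanity check (a hypothetical test, not part of any real codebase), we can confirm that the branchless version agrees with a straightforward branching implementation for both values of choice; what such a test cannot do, of course, is say anything about timing.

#[test]
fn branchless_select_matches_branching_select() {
    fn select_branching(a: u32, b: u32, choice: bool) -> u32 {
        if choice { a } else { b }
    }

    fn select_branchless(a: u32, b: u32, choice: u8) -> u32 {
        let mask = -(choice as i32) as u32;
        b ^ (mask & (a ^ b))
    }

    // choice = 0 selects b, choice = 1 selects a.
    assert_eq!(select_branchless(123, 456, 0), select_branching(123, 456, false));
    assert_eq!(select_branchless(123, 456, 1), select_branching(123, 456, true));
}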

Fighting the compiler

Ideally, this should be the end of the story, but in practice, there are risks inherent to this approach. Since the compiler has no concept of time, it doesn’t view timing differences as observable behavior. This means that it is free to rewrite and optimize constant-time code, which could introduce new timing leaks into the program. A carefully written constant-time implementation like the one above could still be optimized down to a branching instruction by the compiler, which would leak the value of choice!

This feels like an impossible situation. If this is really the case, the compiler is actually working against us to break our carefully crafted constant-time implementation of conditional_select. So what could we do to stop the compiler from optimizing the function and potentially breaking the constant-time properties of the code?

The most obvious solution is the nuclear option—to turn off all optimizations and compile the entire codebase with the -C opt-level=0 flag. This is almost always an untenable solution, however. Cryptographic code typically handles huge amounts of data, which means that it needs all the optimizations it can get from the compiler. A more attractive solution is to attempt to stop the compiler from optimizing sensitive code paths using what is known as optimization barriers. The subtle crate uses the following construction to attempt to thwart LLVM’s attempts to optimize constant-time code paths.

#[inline(never)]
fn black_box(input: u8) -> u8 {
    unsafe { core::ptr::read_volatile(&input as *const u8) }
}

Here, the call to core::ptr::read_volatile tells the compiler that the memory at &input is volatile and that it cannot make any assumptions about it. This call functions as an optimization barrier that stops LLVM from “seeing through the black box” and realizing that the input is actually a boolean. This in turn prevents the compiler from rewriting boolean operations on the output as conditional statements, which could leak timing information about the input. The Rust Core Library documentation has the following to say about core::ptr::read_volatile:

“Rust does not currently have a rigorously and formally defined memory model, so the precise semantics of what ‘volatile’ means here is subject to change over time. That being said, the semantics will almost always end up pretty similar to C11’s definition of volatile.”

This doesn’t seem very reassuring, but remember that timing differences are not viewed as observable by the compiler, so the compiler is always free to rewrite constant-time code and introduce new side-channel leaks. Any attempt to stop the compiler from doing so is bound to be on a best-effort basis until there is built-in language and compiler support for secret types. (There is a Rust RFC introducing secret types, but this has been postponed, awaiting LLVM support.)

Let’s see what happens with the conditional_select function if we compile it without an optimization barrier. To better illustrate this, we will target an instruction set that has no conditional-move instruction like cmov (which x86_64 and aarch64 both have); such an instruction would let the compiler optimize the function without breaking the constant-time properties of the implementation. The following function simply calls the constant-time version of conditional_select to return either a or b.

pub fn test_without_barrier(a: u32, b: u32, choice: bool) -> u32 {
    let choice = choice as u8;
    conditional_select(a, b, choice)
}

Compiling the function for the ARM Cortex M0+ (which is used by the Raspberry Pi Pico) and inspecting the decompiled output, we see that the compiler has replaced our carefully crafted conditional selection with a simple branch on the value of choice (in r2), completely destroying the constant-time properties of the function!
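In effect, the optimizer has collapsed the mask-and-xor trick back into a branch. The following Rust rendering is purely illustrative (it is not compiler output), but it captures what the generated code now computes:

#[inline]
fn conditional_select_as_compiled(a: u32, b: u32, choice: u8) -> u32 {
    // The branchless arithmetic has been rewritten into an explicit branch,
    // so the execution time depends on the secret value of choice.
    if choice != 0 { a } else { b }
}

Now, let’s see what happens if we insert an optimization barrier.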

pub fn test_with_barrier(a: u32, b: u32, choice: bool) -> u32 {
    let choice = black_box(choice as u8);
    conditional_select(a, b, choice)
}

The corresponding disassembly consists of a single basic block, leaving a single path through the function, independent of the value of choice. This means that we can be reasonably sure that the function will always run in constant time.

So what about Wasm?

Now, let’s come back to the original problem. Our client was running Rust code that had been compiled to Wasm and executed with node. This means that the library is first compiled to Wasm by LLVM and then compiled again by node’s Turbofan JIT compiler. We expect LLVM to respect the optimization barriers inserted by libraries like the subtle crate, but what about Turbofan?

To see how the codebase would be affected, we compiled the test_with_barrier function defined above using wasm-bindgen and wasm-pack. We then dumped the code generated by the Turbofan JIT and examined the output to see whether the optimization barrier remained and whether the constant-time properties of the implementation had been preserved.

The following code is the result of compiling our example using wasm-pack and dumping the resulting Wasm in text format using wasm2wat. (We annotated some of the functions and removed some sections related to wasm-bindgen to make the code more readable.)

(module
    (type (;0;) (func (param i32) (result i32)))
    (type (;1;) (func (param i32 i32 i32) (result i32)))
 
    (func $black_box (type 0) (param $input i32) (result i32)
      (local $__frame_pointer i32)
      global.get $__stack_pointer
      i32.const 16
      i32.sub
      local.tee $__frame_pointer   ;; push __stack_pointer - 16
      local.get $input
      i32.store8 offset=15         ;; store input at __stack_pointer - 1
      local.get $__frame_pointer
      i32.load8_u offset=15)       ;; load output from __stack_pointer - 1
 
    (func $test_without_barrier (type 1) 
      (param $a i32) (param $b i32) (param $choice i32) (result i32)
      local.get $a
      local.get $b
      local.get $choice
      select)
 
    (func $test_with_barrier (type 1)
      (param $a i32) (param $b i32) (param $choice i32) (result i32)
      local.get $b
      local.get $a
      i32.xor                     ;; push a ^ b
      i32.const 0
      local.get $choice
      i32.const 0
      i32.ne                      ;; push input = choice != 0
      call $black_box             ;; push output = black_box(input)
      i32.const 255
      i32.and                     ;; push output = output & 0xFF
      i32.sub                     ;; push mask = 0 - output
      i32.and                     ;; push mask & (a ^ b)
      local.get $b
      i32.xor)                    ;; push b ^ (mask & (a ^ b))
 
    (table (;0;) 1 1 funcref)
    (memory (;0;) 17)
    (global $__stack_pointer (mut i32) (i32.const 1048576))
    (export "memory" (memory 0))
    (export "test_without_barrier" (func $test_without_barrier))
    (export "test_with_barrier" (func $test_with_barrier)))

We see that black_box has been compiled down to a simple i32.store8 followed by an (unsigned) i32.load8_u from the same offset. This initially looks like it could be optimized away completely since the memory is never read outside black_box.

We also see that test_with_barrier has not been optimized across the call to black_box. The function still performs a branchless conditional selection controlled by the output from the optimization barrier. This looks good and gives us some confidence that the constant-time properties provided by the subtle crate are preserved when targeting Wasm. However, as soon as the Wasm module is loaded by node, it is passed off to the Liftoff and Turbofan JIT compilers to optimize the code further.

To investigate how this affects our small example, we load the compiled Wasm module using JavaScript and dump the trace output from Turbofan using node. This can be done by passing the --trace-turbo flag to the node runtime. The trace generated by node can then be viewed in the Turbolizer web GUI (which can be found in the V8 repository).

Turbolizer can be used to analyze each step of the Turbofan compilation pipeline. Here, we are interested in displaying only what the emitted assembly code looks like for a given function. Looking at the output for test_with_barrier, we see that no optimizations are performed across the black_box function call on line 2c. The output is essentially identical to the decompiled Wasm code above.

  B0:
       0  push             rbp
       1  REX.W movq   rbp,rsp
       4  push             0x8
       6  push             rsi
       7  REX.W subq   rsp,0x18
       b  REX.W movq   rbx,[rsi+0x2f]
       f  REX.W movq   [rbp-0x18],rdx           
      13  REX.W movq   [rbp-0x20],rax
      17  REX.W cmpq   rsp,[rbx]
      1a  jna          B2 <+0x4a>
 
  B1:
      20  cmpl             rcx,0x0         ;; rax = choice? 1: 0
      23  setnzl       bl
      26  movzxbl      rbx,rbx
      29  REX.W movq   rax,rbx
      2c  call             0x7ffc2400fa31  ;; call to black_box(rax)
      31  movzxbl      rbx,rax             ;; rbx = -black_box(rax)
      34  negl             rbx
      36  REX.W movq   rdx,[rbp-0x20]      ;; rdx = a ^ b
      3a  xorl             rdx,[rbp-0x18]
      3d  andl             rbx,rdx         ;; rbx = rbx & rdx
      3f  REX.W movq   rax,[rbp-0x18]      ;; rax = b ^ (rbx & (a ^ b))
      43  xorl             rax,rbx
      45  REX.W movq   rsp,rbp
      48  pop          rbp
      49  retl                             ;; return rax
 
  B2:
      4a  REX.W movq   [rbp-0x28],rcx
      4e  call             0x7ffc2400fa7b
      53  REX.W movq   rsi,[rbp-0x10]
      57  REX.W movq   rcx,[rbp-0x28]
      5b  jmp          B1 <+0x20>
      5d  nop
      5e  nop

It is also interesting to see what the Turbolizer output for black_box looks like. Looking at the emitted assembly for black_box, we see that, apart from setting up the local stack frame, the function simply stores and then immediately loads the input from memory (lines 14 and 18) before returning.

   B0:
       0  push             rbp
       1  REX.W movq   rbp,rsp
       4  push             0x8
       6  push             rsi
       7  REX.W movq   rbx,[rsi+0x17]
       b  REX.W movq   rdx,[rsi+0x67]
       f  movl             rdx,[rdx]
      11  subl             rdx,0x10
      14  movb             [rdx+rbx*1+0xf],al   ;; store input to memory
      18  movzxbl      rax,[rdx+rbx*1+0xf]   ;; load output from memory
      1d  REX.W movq   rsp,rbp
      20  pop          rbp
      21  retl

You may be surprised that this function is not inlined or optimized away by Turbofan. Since there is nothing in Wasm that corresponds to the volatile read in Rust, there is really no reason for Turbofan to keep black_box around anymore. However, since black_box writes to memory, it is not completely side-effect free, and so cannot be optimized away completely by the JIT compiler.

Introducing a new side-channel

The fact that the compiled version of black_box writes the input to memory before returning it is actually somewhat surprising. Since black_box takes its input by value and core::ptr::read_volatile takes a pointer, LLVM needs to make the input value addressable somehow. When compiling for architectures like x86 or ARM, LLVM can simply use the address of the input on the stack, but the Wasm stack is not addressable in this way, which means that LLVM has to spill the input to linear memory to be able to reference it. As a result, the secret value that we wanted to protect with an optimization barrier is leaked to Wasm memory by LLVM. Moreover, looking at the compiled Wasm code above, we see that this memory is exported by the Wasm module, which means that it can be read from JavaScript. If we call the exported test_with_barrier function and examine the memory before and after the call, we can see that the secret value passed to black_box is now accessible from JavaScript.

  const path = require('path').join(__dirname, 'ct_wasm.wasm');
  const bytes = require('fs').readFileSync(path);
 
  // Load Wasm module from file.
  const wasmModule = new WebAssembly.Module(bytes);
  const wasmInstance = new WebAssembly.Instance(wasmModule);
  const wasmMemory = new Uint8Array(wasmInstance.exports.memory.buffer);
 
  const testWithBarrier = wasmInstance.exports.test_with_barrier;
 
  // __stack_pointer defined by the Wasm module.
  const stackPointer = 1048576;
 
  // Print memory[__frame_pointer + 15] before call to black_box.
  const before = wasmMemory[stackPointer - 1];     
  console.log("Before the call to black_box: " + before);
    
  // Call test_with_barrier which calls black_box with secret value 1.
  testWithBarrier(123, 456, 1);
 
  // Print memory[__frame_pointer + 15] after call to black_box.
  const after = wasmMemory[stackPointer - 1];
  console.log("After the call to black_box:  " + after);

Running this small test produces the following output, showing that the secret value passed to black_box is indeed leaked by the program.

 node js/ct_wasm.js
  Before the call to black_box: 0
  After the call to black_box:  1

Since the purpose of the black_box function is to protect the code from optimizations based on secret values, every value that goes into black_box is sensitive by definition. This is not a good situation.

Using a different optimization barrier

There have been some discussions in the Rust Cryptography Interest Group about defining a new Rust intrinsic based on the classic C++ optimization barrier built around an empty inline assembly block. The corresponding Rust implementation would then look something like the following (here using the now deprecated llvm_asm macro).

#[inline(never)]
fn black_box(mut input: u8) -> u8 {
    unsafe { llvm_asm!("" : "+r"(input) : : : "volatile"); }
 
    input
}

After recompiling the codebase with wasm-pack and decompiling the resulting Wasm module, we see that black_box is now given by a single local.get $input (returning the first argument to the function), which is what we want. This function does not leak secret values to memory, but is it preserved by Turbofan?

By running the corresponding test_with_barrier function through Turbofan, we see that it results in machine code that is identical to the previous constant-time version above. Thus, with the llvm_asm-based barrier, we get a constant-time implementation that does not leak secret values to the surrounding JavaScript runtime.

However, as we have already noted, there is no reason to expect Turbofan not to inline the black_box function in future versions of the compiler. (In fact, if we look at the source code responsible for the Wasm compilation pipeline in the V8 repository, we see that the FLAG_wasm_inlining flag, which controls the WasmInliningPhase in the compiler pipeline, defaults to false in version 9.7.24 of V8; but we expect this optimization phase to be enabled at some point.)
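As an aside, the llvm_asm macro has since been removed from nightly Rust. On current toolchains, the same empty-assembly idea would be expressed with the stable core::arch::asm! macro. The following is only a sketch for native targets (inline assembly is not currently available when compiling to wasm32, so it does not replace the read_volatile-based construction for the Wasm use case):

#[inline(never)]
fn black_box(input: u8) -> u8 {
    // Widen to u32 so the value fits the general-purpose register class on
    // common targets, then route it through an asm block whose body is just a
    // comment; the compiler must treat the operand as opaque.
    let mut value = input as u32;
    unsafe {
        core::arch::asm!("/* {0} */", inout(reg) value, options(nostack, preserves_flags));
    }
    value as u8
}

Note that the standard library now also provides std::hint::black_box, but it is documented as a best-effort hint aimed at benchmarking rather than a guaranteed optimization barrier.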

Going forward

It is clear that fighting LLVM by inserting optimization barriers is not a great way to provide constant-time guarantees. There are ongoing efforts to address this problem at the language level. The secret types RFC and the CT-Wasm project, which introduce secret types for Rust and Wasm respectively, are two great examples of such efforts. What is missing is a way forward for getting secret types and the corresponding semantics into LLVM. This is most likely a precondition for the Rust implementation to move forward. (The Rust RFC is currently postponed, awaiting a corresponding RFC for LLVM.) Without LLVM support, it is difficult to see how higher-level languages that depend on LLVM could provide any absolute constant-time guarantees. Until then, we are all left playing hide-and-seek with the compiler back end.

In this post, we examined the use of optimization barriers to prevent the optimizer from wreaking havoc on constant-time cryptographic implementations in Rust, and the security guarantees optimization barriers provide when targeting Wasm. In the upcoming second part of this blog post, we will explore how constant-time properties of the implementation may be preserved by selectively disabling optimizations at the function level.