Trusted publishing: a new benchmark for packaging security

Read the official announcement on the PyPI blog as well!

For the past year, we’ve worked with the Python Package Index to add a new, more secure authentication method called “trusted publishing.” Trusted publishing eliminates the need for long-lived API tokens and passwords, reducing the risk of supply chain attacks and credential leaks while also streamlining release workflows. Critical packages on PyPI are already using trusted publishing to make their release processes more secure.

If you publish packages to PyPI, use the official PyPI documentation to set up trusted publishing for your projects today. The rest of this post will introduce the technical how and why of trusted publishing, as well as where we’d like to see similar techniques applied in the future.

We love to help expand trust in language ecosystems. Contact us if you’re involved in a packaging ecosystem (e.g., NPM, Go, or Crates) and want to adopt more of these techniques!

Trusted publishing: a primer

At its core, trusted publishing is “just” another authentication mechanism. In that sense, it’s no different from passwords or long-lived API tokens: you present some kind of proof to the index that states your identity and expected privileges; the index verifies that proof and, if valid, allows you to perform the action associated with those privileges.

What makes trusted publishing interesting is how it achieves that authentication without requiring a preexisting shared secret. Let’s get into it!

OpenID Connect and “ambient” credentials

Trusted publishing is built on top of OpenID Connect (OIDC), an open standard for identity attestation and verification built on OAuth2. OIDC enables identity providers (IdPs) to produce publicly verifiable credentials that attest to a particular identity (like hamilcar@example.com). These credentials are JSON Web Tokens (JWTs) under the hood, meaning that an identity under OIDC is the set of relevant claims in the JWT.

To drive that point home, here’s what a (slightly redacted) claim set might look like for a user identity presented by GitHub’s OIDC IdP:
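
The example below is illustrative rather than a real token: the claim names come from the OIDC core specification, and the issuer and values are placeholders rather than GitHub’s real ones (shown here as a Python dictionary for readability).

# Claim names follow the OIDC core specification; the issuer and
# values are placeholders rather than GitHub's real ones.
claims = {
    "iss": "https://idp.example.com",        # the identity provider that issued the token
    "sub": "1234567890",                     # a stable, unique identifier for the user
    "aud": "https://service.example.com",    # the intended audience for this token
    "email": "hamilcar@example.com",
    "iat": 1681000000,                       # issued-at (Unix time)
    "exp": 1681003600,                       # expiration (Unix time)
}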

(In an actual JWT, this claim set would be accompanied by a digital signature, produced with a signing key held by the IdP, that proves its authenticity. Without that digital signature, we’d have no reason to trust the claims!)

Anybody can be an IdP in an OpenID Connect scheme. Still, a large part of the practical value of OIDC is derived from interactions with large, presumed-to-be-trustworthy-and-well-secured IdPs. There’s value in proving ownership over things like GitHub and Google accounts, particularly for things like SSO and service federation.

So far, so good, but none of this is especially relevant to packaging indices like PyPI. PyPI could allow users to sign in with OIDC rather than passwords, but it’s unclear how that would make publishing workflows, particularly CI-based ones, any more convenient.

What makes OIDC useful to package indices like PyPI is the observation that an OIDC identity doesn’t need to be a human: it can be a machine identifier, a source repository, or even a specific instance of a CI run. Moreover, it doesn’t need to be obtained through an interactive OAuth2 flow: it can be offered “ambiently” as an object or resource that only the identity (machine, etc.) can access.

CI providers figured this out not too long ago: GitHub Actions added support for ambient OIDC credentials in late 2021, while GitLab added it just a few months ago. Here’s what retrieving one of those credentials looks like on GitHub Actions:
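
The official actions hide the mechanics, but under the hood the GitHub Actions runner exposes a pair of environment variables to any job that has the id-token: write permission. Here is a rough Python sketch of that request; the audience value is illustrative, and real workflows typically let an official action or library do this for them.

import json
import os
import urllib.request

# Sketch: fetch the ambient OIDC token inside a job that has the
# `id-token: write` permission. The two environment variables below are
# provided by the GitHub Actions runner; the audience value is illustrative.
url = os.environ["ACTIONS_ID_TOKEN_REQUEST_URL"] + "&audience=pypi"
request = urllib.request.Request(
    url,
    headers={"Authorization": "bearer " + os.environ["ACTIONS_ID_TOKEN_REQUEST_TOKEN"]},
)
with urllib.request.urlopen(request) as response:
    oidc_token = json.load(response)["value"]  # the signed JWT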

And here’s what the (again, filtered) claim set for a GitHub Actions workflow run might look like:
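
An abridged, illustrative version is below; the values are invented, but the claim names are part of GitHub’s documented OIDC token format (the repository names mirror the example used later in this post).

# Claim names are part of GitHub's documented OIDC token format;
# the values are invented for illustration.
claims = {
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:hamilcar/cartago:ref:refs/heads/main",
    "repository": "hamilcar/cartago",
    "repository_owner": "hamilcar",
    "repository_owner_id": "123456",
    "workflow": "release",
    "job_workflow_ref": "hamilcar/cartago/.github/workflows/release.yml@refs/heads/main",
    "ref": "refs/heads/main",
    "run_id": "4567890123",
    "actor": "hamilcar",
    "actor_id": "123456",
}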

This is a lot of context to work with: assuming that we trust the IdP and that the signature checks out, we can verify the identity down to the exact GitHub repository, the workflow that ran, the user that triggered the workflow, and so forth. Each of these can, in turn, become a constraint in an authentication system.

Trust is everything

To recap: OpenID Connect gives us the context and machinery we need to verify proofs of identity (in the form of OIDC tokens) originating from an IdP. The identities in these proofs can be anything, including the identity of a GitHub Actions workflow in a particular repository.

Any third-party service (like PyPI) can, in turn, accept OIDC tokens and determine a set of permissions based on them. Because OIDC tokens are cryptographically tied to a particular OIDC IdP’s public key, an attacker cannot spoof an OIDC token, even if they know the claims within it.

But wait a second: how do we get from an OIDC token containing an identity to a specific PyPI project? How do we know which PyPI project(s) should trust which OIDC identity or identities?

This is where a bit of trusted setup is required: a user (on PyPI) has to log in and configure the trust relationship between each project and the publishers (i.e., the OIDC identities) that are authorized to publish on behalf of the project.

This needs to be done only once, as with a normal API token. Unlike an API token, however, it only involves one party: the CI (and OIDC) provider doesn’t need to be given a token or any other secret material. Moreover, even the trusted setup part is composed of completely public information: it’s just the set of claim values that the user considers trustworthy for publishing purposes. For GitHub Actions publishing to PyPI, the trusted setup would include the following:

  1. The GitHub user/repo slug
  2. The filename of the GitHub Actions workflow that’s doing the publishing (e.g., release.yml)
  3. Optionally, the name of a GitHub Actions environment that the workflow uses (e.g., release)

Together, these pieces of state allow the relying party (e.g., PyPI) to accept OIDC tokens, confirm that they’re signed by a trusted identity provider (e.g., GitHub Actions), and then match the signed claims against one or more PyPI projects that have established trust in those claims.
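
As a purely conceptual sketch (not PyPI’s actual implementation), the matching step might look something like the function below; the helper name, the keys of the trusted dictionary, and the exact comparison rules are all illustrative.

def publisher_matches(claims: dict, trusted: dict) -> bool:
    # `trusted` holds the values the project owner configured on PyPI.
    repo_ok = claims.get("repository") == trusted["repository"]
    # job_workflow_ref looks like
    # "owner/repo/.github/workflows/release.yml@refs/heads/main",
    # so check that the configured workflow filename appears in it.
    expected_workflow = "/.github/workflows/" + trusted["workflow_filename"]
    workflow_ok = expected_workflow in claims.get("job_workflow_ref", "")
    # An environment is only required if the project owner configured one.
    env_ok = (trusted.get("environment") is None
              or claims.get("environment") == trusted["environment"])
    return repo_ok and workflow_ok and env_ok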

Look ma, no secrets!

At this point, we have everything we need to allow an identity verified via OIDC to publish to PyPI. Here’s what that looks like in the GitHub case:

  1. A developer (or automation) triggers a GitHub Actions workflow to release to PyPI.
  2. The normal build process (python -m build or similar) commences.
  3. Automation retrieves an OIDC token for the current workflow run, attesting to the current workflow’s identity (user/repo, workflow name, environment, etc.) via GitHub Actions’ OIDC IdP.
  4. That OIDC token is shot over to PyPI.
  5. PyPI verifies the OIDC token and, if it’s valid, exchanges it for a short-lived PyPI API token that’s scoped to just the PyPI projects that trust those token claims.
  6. PyPI returns the short-lived API token as a response to the OIDC token.
  7. The workflow continues, performing a normal PyPI publish step (e.g., with twine) with the short-lived API token.

For 99% of package publishers, steps 3 through 7 are entirely implementation details: the official PyPA GitHub Action for publishing to PyPI (pypa/gh-action-pypi-publish) encapsulates them, so the user-facing piece is just a single workflow step that grants the job the id-token: write permission and invokes the action.

Why should I care?

At this point, you might reasonably think:

“I’m a competent engineer, and I already do everything right. My tokens are correctly scoped to the smallest permissions required, they’re stored as workflow (or per-environment) secrets, and I carefully audit my release workflows to ensure that all third-party code is trustworthy.” – You, a competent engineer

Here’s the thing: you’ve been doing everything right! Until now, the most secure way to authenticate to PyPI was to do the following:

  1. Create a project-scoped API token.
  2. Store it as a (scoped) secret in your CI.
  3. Access it carefully in a publishing workflow you’ve reviewed and established trust in.

This suffices for many use cases but also leaves a great deal to be desired from both the usability and security perspectives:

  1. Usability. Manually managing and creating API tokens is tedious, especially in scenarios where a single source repository hosts multiple PyPI packages: each needs its own separately scoped token, a unique secret name, and so forth. You and your fellow engineers have better ways to spend your time!
  2. Pre-compromise security. Not all attackers are born equal: some are passive, some are active, some might be able to compromise only a specific step in your publishing process, and so forth. Reducing the power of (or outright eliminating) one of these attackers is useful, even when the mitigation involved doesn’t meaningfully impact other attackers. Unfortunately, doing so with long-lived tokens is difficult: a long-lived token is equally vulnerable to any attacker who manages to access it, even briefly.
  3. Post-compromise recovery. Designing for security means not only attempting to thwart attackers but also preparing for and mitigating the damage a successful attacker can cause. With long-lived credentials (either passwords or API tokens), recovery is slow, tedious, and error-prone: missing a single credential leaves a gap for the attacker to return through. A better system wouldn’t have this problem to begin with.

Trusted publishing addresses these problems and more:

  1. Usability. With a trusted publisher, no manual API token management is necessary: configuring the publisher is a one-time action for each project, including for projects that haven’t been created yet.

    This avoids the annoying API token dance involved when publishing a brand new project and the game of “credential hot potato” that engineers play when trying to hand an API token to the party responsible for adding it to the CI’s secrets. No more Slack DMs with API tokens!

  2. Pre-compromise security. Trusted publishing reduces the number of viable adversaries: an attacker with access to only some GitHub Actions environments, or to steps that lack the id-token permission, can’t mint the OIDC credential needed to use the trusted publisher. This is in marked contrast to a long-lived token stored in a GitHub Actions secret, where any step (and frequently any environment) can access the credential!
  3. Post-compromise recovery. Trusted publishing is fundamentally ephemeral: the credentials involved (both the OIDC and PyPI credentials) live for only a few minutes at a time, meaning that an attacker who loses access during post-compromise response is automatically sealed off without any human intervention. That means fewer manual steps and fewer possible human errors.

Security and threat model considerations

Trusted publishing is another way to securely authenticate to a package index. Like every security feature, it must be designed and implemented against a threat model. That threat model must justify trusted publishing’s existence by identifying the attackers it addresses that previous authentication methods do not, and it must account for any new attack scenarios that trusted publishing exposes.

Existing threats: account takeover and supply chain attacks

Account takeover (ATO) is a known problem in packaging ecosystems: an attacker who manages to compromise a legitimate user’s PyPI or GitHub account can upload malicious releases (or even override previous ones) without any outward indication of inauthenticity.

In the general case, ATO is an unsolvable problem: services like PyPI and GitHub can improve access to security features (and even mandate those features) but fundamentally cannot prevent a user from disclosing their credentials (e.g., via phishing), much less protect them from every piece of potentially vulnerable software they use.

At the same time, features like trusted publishing can reduce the scope of account takeover: a future in which package indices allow packages to opt in to only trusted publishing is one where an ATO on the package index itself doesn’t allow the attacker to upload malicious releases.

Similarly, “supply chain security” is all the rage these days: companies and hobbyists alike are taking a second look at out-of-control dependency trees and their frequently unaccountable and untraceable components.

Without trusted publishing, the status quo for GitHub Actions is that you trust every third-party action you execute: they can all read your configured secrets. This is extremely non-ideal and is one of the key attack scenarios that trusted publishing is intended to protect against.

New threats: “account resurrection” and malicious committers

Trusted publishing works because it’s tied to a notion of “trusted identity”: the trusted identity on the other side (e.g., on GitHub Actions) is a tuple of user/repo, workflow name, and an optional environment name.

But wait: what happens if a user changes their username and an attacker takes over their old username? We call this “account resurrection,” and it’s explicitly supported by most services: a username isn’t intended to be a permanent, stable identifier for the underlying identity.

This opens up an entirely new attack vector: a PyPI project that trusts hamilcar/cartago might suddenly begin trusting an attacker-controlled hamilcar/cartago, all because the original hamilcar is now hannibal (and the legitimate hamilcar/cartago is now hannibal/cartago).

We thought of this while designing trusted publishing for PyPI and worked with GitHub to add an additional claim that binds the OIDC token not just to the user, but also to their unique, stable user ID. This gives us the state we need to prevent resurrection attacks: even if an attacker manages to become hamilcar on GitHub, their underlying user ID will not change and PyPI will reject any identity tokens they present.
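
Conceptually, the extra check is small but important. The sketch below is illustrative rather than PyPI’s actual code; repository_owner_id is one of the stable-identifier claims present in GitHub’s OIDC tokens, and the function name is made up.

def owner_matches(claims: dict, trusted_owner_id: str) -> bool:
    # Usernames can be freed up and re-registered by someone else;
    # the numeric account ID cannot, so trust is bound to the ID.
    return claims.get("repository_owner_id") == trusted_owner_id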

Trusted publishing also reveals a new (potential) division in a project’s trust model: for any given project, do you trust every member of that project to also be a potential publisher? In many cases, the answer is yes: many projects have only one or two repository members, both of whom are also owners or otherwise privileged on the package index.

In some cases, however, the answer is no: many projects have dozens of low-activity or inactive members, not all of whom may be following best practices for securing their accounts. These members might not be removable because of community policy or because they need access for infrequent (but critical) project activities. These users should not necessarily receive the ability to publish releases to the packaging index just because they have the commit bit on the repository.

This is also a consideration we made while designing trusted publishing, and it’s why PyPI’s implementation supports an optional GitHub Actions environment: for communities where users who commit and users who publish do not wholly overlap, an environment can be used to impose additional workflow restrictions that are reflected (and subsequently honored by PyPI) in the OIDC token. A detailed example of this is given in PyPI’s own security model documentation.

Coming to a package index near you

Our work on PyPI was funded by the incredible Google Open Source Security Team (GOSST), who we’ve also worked with to develop new tooling for the Python ecosystem’s overall security. In particular, we’d like to thank Dustin Ingram for tirelessly working alongside us and directing the overall pace and design of trusted publishing for PyPI.

At the moment, PyPI is the only package index offering trusted publishing that we’re aware of. That being said, nothing about trusted publishing is unique to Python or Python packaging: it could just as easily be adopted by Rust’s Crates, Ruby’s RubyGems, JavaScript’s NPM, or any other ecosystem where publishing from a third-party service is common (like GitHub Actions or GitLab’s CI/CD).

It’s our opinion that, much like Two-Factor Authentication in 2019, this kind of trusted publishing scheme will become instrumental to the security model of open-source packaging. We see it as a building block for all kinds of subsequent improvements, including being able to generate strong cryptographic proof that a PyPI release was built from a particular source artifact.

If you or your company are interested in this work, please get in touch with us! We have years of experience working on security features in open-source ecosystems and are always looking for more ways to contribute to critical open-source projects and services.

Real World Crypto 2023 Recap

Last month, hundreds of cryptographers descended upon Tokyo for the first Real World Crypto Conference in Asia. As in previous years, we dispatched a handful of our researchers and engineers to present and attend the conference.

What sets RWC apart from other conferences is that it strongly emphasizes research, collaborations, and advancements in cryptography that affect real-world systems. This year, we noticed a few themes that we’ll highlight. First, many talks detailed the painstaking process that is secure cryptographic protocol development. Second, PQC is on the rise and is steadily advancing from theory into practice. Lastly, sometimes the most interesting cryptographic flaws aren’t even cryptographic flaws: bad RNGs and multi-threading can just as easily break your cryptography.

Our EDHOC presentation

Trail of Bits cryptographer Marc Ilunga spoke (paper, video, slides) about his research analyzing the security of the Ephemeral Diffie-Hellman Over COSE (EDHOC) protocol. EDHOC is a lightweight key exchange protocol similar to TLS but more compact. In collaboration with other researchers, Marc’s work verified multiple security properties of the EDHOC protocol and even identified theoretical security issues that led to updates made to the EDHOC standard.

The talk highlighted how the LAKE working group openly called on the community to analyze EDHOC and benefited from many insights, making the protocol safer and better overall. With high-assurance cryptography on the rise, more tools are available to assist with this task. For instance, the presentation on HACSPEC (paper, slides, video) paves the way for high-assurance cryptographic standards: it provides tooling for writing cryptographic specifications that can then be formally verified.

Invite your friends and adversaries to your protocol party!

Our talk echoed a common theme at the conference, encouraging people to collaborate with researchers and stakeholders to analyze cryptography and protocols instead of rolling their own. Many cryptographic protocols, such as those used by end-to-end encrypted (E2EE) messengers, have been broken in recent years with varying levels of impact. Notable examples include Telegram in 2022, Bridgefy, and Threema (paper, slides, video). These examples have something in common: missing formal analysis. The lesson is that Telegram, Bridgefy, and Threema should not have rolled their own crypto! But to be fair, deploying a new, ad hoc protocol is hardly uncommon. The first formal analysis of the highly acclaimed Signal protocol came after the Signal app was already deployed. Even then, further analysis was needed to capture other security aspects.

On its own, the phrase “don’t roll your own crypto” isn’t helpful: someone has to roll some crypto at some point. The case of Threema shows that an application that uses all the right cryptographic primitives can still be broken. One lesson from the E2EE messaging world is that, perhaps, it doesn’t matter who rolled what; what matters is the analysis that was performed against the protocol. Formal analysis and good old-fashioned cryptanalysis are both necessary to gain confidence in a new protocol.

Don’t roll your protocol alone! Use all the tools available to you, and if you’re unfamiliar with one of them, enlist the army of friends willing to apply these tools to your protocol. If you’d like to learn more about how to analyze these protocols and the tools available, book a call to discuss with one of our cryptographers. Our doors are always open!

Post-quantum cryptography is steadily advancing

NIST announced the post-quantum cryptography (PQC) standard candidates last year, so it did not come as a huge surprise that PQC was a big topic of discussion at RWC. In addition to the main conference, a separate Real World PQC (RWPQC) workshop ran alongside RWC this year to cover further topics.

PQC is steadily advancing, with standards for the first primitives expected to emerge over the coming years. However, as this year’s talks indicate, many challenges lie ahead for the post-quantum world. First, implementing these schemes in real systems is challenging, with many unknown and unforeseen issues. In addition, more PQC primitives and protocols are needed, and designing more advanced primitives securely and effectively across many use cases will be difficult. Here are some interesting discussions of these challenges.

Industry applications

A talk from Google described applying PQC to their Application Layer Transport Security (ALTS) protocol. ALTS is an authentication protocol, similar to TLS, that Google uses to secure communication within its infrastructure, such as data centers. Threat modeling shows that this is where PQC is most urgently needed, so Google decided to implement NTRU-HRSS, a lattice-based post-quantum cryptosystem, for ALTS even before the NIST standardization process was complete. The talk presented some implementation issues that occurred: the public key and ciphertext for this cryptosystem are 9-10 times larger than those of the existing ALTS algorithms, allocating large HRSS keys on the stack resulted in stack overflows on some architectures, and performance in practice didn’t align with the expected benchmarks. However, with some adjustments, they managed to integrate NTRU-HRSS into ALTS within their requirements.

Cloudflare also presented a talk describing an internal PQC project. Cloudflare supports Privacy Pass, a cryptographic protocol that allows users to prove they are humans and not bots across many websites without revealing unnecessary private information about themselves. To achieve this, Privacy Pass uses advanced cryptographic primitives known as blind signatures and anonymous credentials. Unfortunately, the NIST standardization process does not have any candidates for advanced primitives such as these, so Cloudflare designed its own post-quantum scheme that was based on Dilithium, a digital signature candidate selected by NIST. The result was surprisingly efficient: ~300ms prover time and ~20ms verifier time, with a proof size of ~100KB. It’s exciting to see post-quantum cryptography applied to more advanced cryptographic primitives and protocols such as Privacy Pass.

NXP presented work that implemented the Dilithium PQC candidate on an embedded device. For embedded devices, protecting against side-channel attacks becomes vitally important. This talk identified a gap in the research around Dilithium. Compared with another candidate, Kyber, protecting against side channels in Dilithium has received little attention. NXP’s work identified some improvements to state-of-the-art mitigations. Their work also noted that much of the runtime was spent protecting the implementation by invoking the Keccak hash function, and on their embedded devices, a significant speedup could be obtained if these were replaced with calls to a True Random Number Generator (TRNG). However, using a TRNG instead of Keccak would violate the specification, which is a great example of why these standardization processes are difficult and time-consuming. Designing a system that will run securely and optimally across many different platforms and use cases is difficult.

PQC in other talks

While the NIST PQC standardization effort focuses on cryptographic primitives, these primitives will eventually have to be used in protocols. Updating existing protocols to include post-quantum resilient primitives is nontrivial, as explained in the context of the IETF at RWPQC (slides).

Since the post-quantum candidates are relatively young by cryptographic standards, not everyone fully trusts their resistance to attack yet; more than one candidate has been broken by an attack running on a laptop over a weekend. Therefore, the candidates are preferably deployed in a hybrid approach, alongside their classical counterparts, to get the best of both worlds with respect to security. (An attacker would need to break both primitives to attack the protocol.)
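
The core idea of a hybrid scheme can be sketched in a couple of lines. This is purely conceptual and not any specific standard’s construction; real protocols use carefully specified KDFs and bind in the full handshake transcript.

import hashlib

def hybrid_shared_secret(classical_ss: bytes, pq_ss: bytes, context: bytes) -> bytes:
    # Derive the session key from both shared secrets so that an attacker
    # must break both the classical and the post-quantum primitive.
    return hashlib.sha3_256(classical_ss + pq_ss + context).digest()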

Several protocol updates using this approach were presented at RWPQC and RWC, starting with a post-quantum variant of the Noise protocol framework (video, slides) for constructing key exchange protocols. Furthermore, lightning talks at RWC and RWPQC introduced Rosenpass, a post-quantum variant of the WireGuard protocol for constructing VPNs.

Cryptographic failures are often non-cryptographic

Previous years of Real World Crypto featured non-cryptographic errors breaking prominent cryptographic schemes. This year was no exception: multiple talks demonstrated fatal non-cryptographic attacks on cryptographic hardware, protocols, and schemes.

Cryptography is a powerful tool for solving many problems in software; many years of research and cryptanalysis have given us a powerful suite of primitives that can, for example, safely encrypt and protect our data. But software is still only as strong as its weakest link, and a secure encryption scheme is useless when an attacker can easily bypass it entirely, as we will see in some of the talks from this year:

Wi-Fi Fails

In Framing Frames: Bypassing Wi-Fi Encryption by Manipulating Transmit Queues (paper, slides, video), Mathy Vanhoef presented two new variants on a well-known class of weaknesses in the 802.11 standards, which include WPA2 and WPA3. The first variant completely bypasses Wi-Fi encryption by tricking an access point (AP) into insecurely “encrypting” buffered packets with an all-zero key or an otherwise undefined key context. So, rather than developing some complex and novel cryptographic attack against the encryption scheme, this bug tricks an AP into using an empty encryption key.

The second variant involves bypassing client isolation, a common feature of wireless networks intended for use by untrusted clients (e.g., public hotspots and global networks like eduroam). APs that offer client isolation rely on ad-hoc synchronization between two nominally decoupled layers of the 802.11 stack: the security layer, which uses 802.1X identities, and the packet routing layer, which uses IP addresses and device MACs. By abusing this dependency between decoupled layers, an attacker can insert a request with a spoofed MAC and, when timed correctly, trick the AP into encrypting the incoming response with a newly generated key. The result is that the attacker can not only spoof the victim’s MAC (ordinarily just a denial of service) but also decrypt incoming traffic intended for the victim. This attack does not rely on any sort of novel cryptographic technique; it’s simply easier to trick the AP into decrypting things for you instead!

These two new variants are not particularly similar in procedure or scenario, but they are similar in origin: ambiguity within the specification with respect to extended functionality (client isolation) or optimizations (packet buffering) regularly offered by vendors. Together, they cleanly represent one of the biggest challenges to the value of formal modeling: the model under proof must be correct and complete with respect to actual device behavior. As we know from the world of compiler optimizations, observably equivalent behavior is not necessarily safe for all observers; the same basic truth applies to protocol design!

Weak RNGs and weak testing are a toxic combination

In Randomness of random in Cisco ASA (slides, video), Benadjila and Ebalard stepped through their investigation of many duplicated ECDSA keys and nonces observed while testing a large X.509 certificate corpus. When evaluating ~313,000 self-signed ECDSA certificates originating from Cisco ASA boxes, 26% (~82,000) had duplicated ECDSA nonces, and 36% (~113,000) had duplicated ECDSA keys. Additionally, of approximately 200,000 self-signed RSA certificates, 6% (~12,000) had duplicated RSA moduli.
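
Spotting this class of failure requires no cryptanalysis at all; conceptually, the check boils down to something like the sketch below, which assumes the moduli, keys, or signature r-values have already been extracted from the corpus (the function name is made up).

from collections import Counter

def find_reused_values(values):
    # `values` might be RSA moduli, ECDSA public keys, or ECDSA signature
    # r-values already extracted from a certificate corpus. Any value that
    # appears more than once points at a broken or poorly seeded RNG.
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count > 1}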

RNG failures stemming from poor RNG selection or poor entropy sourcing have a long and storied history across hardware and software vendors, including Cisco’s ASAs (RWC 2019). The presenters immediately zeroed in on 2019’s disclosure as a likely source, indicating that the previous disclosure and fix were insufficient and potentially deployed without meaningful testing.

Correct construction and use of cryptographically secure pseudo-random number generators (CSPRNGs) are subtle and difficult, with catastrophic failure modes. At the same time, CSPRNG construction and use are well-trodden problems: the sharp edges in the NIST SP 800-90A DRBGs are well understood and documented, and strong seeding has always been a requirement regardless of the underlying CSPRNG construction. Much like the talk about bypassing Wi-Fi encryption, the failures here are fundamentally at the design and software development lifecycle layers rather than low-level cryptographic flaws. The takeaway is to maintain a strong test suite covering both the happy and sad code paths and to incorporate software development lifecycle best practices that prevent reintroducing old bugs or code issues.

It takes a village

Real World Crypto 2023 taught us that cryptographic techniques and protocols, old and new, benefit most when a diverse set of researchers and analyses is involved. Even after passing several rounds of scrutiny, implementations should be monitored regularly. Whether transferring data, setting up RNGs, or applying PQC, misinterpretations and errors can compromise data integrity and privacy. We are grateful to all the researchers who presented at this year’s RWC conference and who have dedicated so much effort toward securing the world we live in, and we are proud to be active members of this community.

Introducing Windows Notification Facility’s (WNF) Code Integrity

By Yarden Shafir, Senior Security Engineer

WNF (Windows Notification Facility) is an undocumented notification mechanism that allows communication within a process, between processes, or between user-mode processes and kernel drivers. Similar to other notification mechanisms like ETW (Event Tracing for Windows) and ALPC (Advanced Local Procedure Call), WNF communication happens over different “channels,” each representing a unique provider or class of information.

Offensive engineers have already found several uses for WNF. Alex Ionescu and Gabrielle Viala reported information leaks and denial-of-service bugs that were fixed by Microsoft; @modexpblog demonstrated code injection through WNF names, which goes undetected by pretty much all EDR products; and Alex Plaskett of NCC Group demonstrated how attackers could use WNF allocations to spray the kernel pool.

WNF is used extensively by Windows components to send or receive information about software and hardware states. Third-party applications can use it in the same way to learn about the state of the system, to send messages to Windows processes or drivers, or as an internal communication channel between processes or drivers (though WNF is undocumented and therefore not meant to be used by third parties at all).

Reading and interpreting WNF state names

In the WNF world, these previously mentioned “channels” are called state names. WNF state names are 64-bit numbers that are composed of:

  • Version
  • Lifetime:
    • Well-known, predefined in the system
    • Permanent, persist in the registry across boot sessions
    • Persistent, persist in memory, but not across boot sessions
    • Temporary, exist only in the context of the process and are gone when the process terminates
  • Scope: System, Session, User, Process, or Physical Machine
  • Permanent data bit
  • Unique sequence number

These attributes are combined into the state name using this bit layout:

typedef struct _WNF_STATE_NAME_INTERNAL {
    ULONG64 Version : 4;
    ULONG64 NameLifetime : 2;
    ULONG64 DataScope : 4;
    ULONG64 PermanentData : 1;
    ULONG64 Unique : 53;
} WNF_STATE_NAME_INTERNAL, * PWNF_STATE_NAME_INTERNAL;
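
As a rough illustration, the same layout can be unpacked with a few shifts and masks. Note that the state-name values you’ll encounter in the wild are obfuscated before this layout applies (see the research referenced at the end of this post), so the sketch below assumes you already have the decoded, internal representation; the function name is made up.

def parse_wnf_state_name(internal_name: int) -> dict:
    # Field widths follow WNF_STATE_NAME_INTERNAL above (4 + 2 + 4 + 1 + 53 bits).
    return {
        "Version": internal_name & 0xF,
        "NameLifetime": (internal_name >> 4) & 0x3,
        "DataScope": (internal_name >> 6) & 0xF,
        "PermanentData": (internal_name >> 10) & 0x1,
        "Unique": internal_name >> 11,
    }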

Until recently, this mechanism was almost exclusively meant for hardware components such as the battery, microphone, camera, and Bluetooth, and held very little interest for defensive engineers. But that is beginning to change with the recent addition of several new state names used by the kernel code integrity manager, which now uses WNF to notify the system about interesting code integrity events that might be useful to security tools.

While still undocumented and not recommended for general use, it may be time for defenders to start looking further into WNF and its potential benefits. Starting from Windows 10, WNF offers several newly added well-known state names that get notified by the kernel Code Integrity Manager (CI.dll). This component is responsible for all kernel hashing, signature checks, and code integrity policies, which makes it a rich source of information for any security product.

How do we find out about those names? Well, to dump all well-known state names on the machine, you’d have to install the Microsoft SDK, then run WnfNameDumper to retrieve all WNF names defined in perf_nt_c.dll and dump their human-friendly names, IDs, and descriptions into a file, which would look like this:

{"WNF_CELL_UTK_PROACTIVE_CMD", 0xd8a0b2ea3bcf075},
        // UICC toolkit proactive command notification for all slots. SDDL comes from ID_CAP_CELL_WNF_PII
        // and UtkService in %SDXROOT%\src\net\Cellcore\packages\Cellcore\Cellcore.pkg.xml
{"WNF_CELL_UTK_SETUP_MENU_SLOT0", 0xd8a0b2ea3bce875},
        // UICC toolkit setup menu notification for slot 0. SDDL comes from ID_CAP_CELL_WNF_PII
        // and UtkService in %SDXROOT%\src\net\Cellcore\packages\Cellcore\Cellcore.pkg.xml
{"WNF_CELL_UTK_SETUP_MENU_SLOT1", 0xd8a0b2ea3bdd075},
        // UICC toolkit setup menu notification for slot 1. SDDL comes from ID_CAP_CELL_WNF_PII
        // and UtkService in %SDXROOT%\src\net\Cellcore\packages\Cellcore\Cellcore.pkg.xml
etc., etc., etc…

In Windows version 22H2, the Windows SDK contains just over 1,400 well-known state names. Many of those names can be revealing, but for now, we’ll focus on the WNF_CI (code integrity) names:

{"WNF_CI_APPLOCKERFLTR_START_REQUESTED", 0x41c6072ea3bc2875},
        // This event signals that AppLockerFltr service should start.
{"WNF_CI_BLOCKED_DRIVER", 0x41c6072ea3bc1875},
        // This event signals that an image has been blocked from loading by PNP
{"WNF_CI_CODEINTEGRITY_MODE_CHANGE", 0x41c6072ea3bc2075},
        // This event signals that change of CodeIntegrity enforcement mode has occurred.
{"WNF_CI_HVCI_IMAGE_INCOMPATIBLE", 0x41c6072ea3bc1075},
        // This event signals that an image has been blocked from loading as it is incompatible with HVCI.
{"WNF_CI_SMODE_CHANGE", 0x41c6072ea3bc0875},
        // This event signals that change of S mode has occurred.

In this version we can see five state names with the prefix WNF_CI, all generated by the Code Integrity manager, and each one has a helpful description telling us what it’s used for. And unlike most other WNF names, here we see a few events that could be helpful to defensive engineers:

  • WNF_CI_APPLOCKERFLTR_START_REQUESTED – Signals that AppLockerFltr service should start
  • WNF_CI_BLOCKED_DRIVER – Signals that a driver has been blocked from loading by HVCI (Hypervisor-protected Code Integrity) because it was found in the block list
  • WNF_CI_CODEINTEGRITY_MODE_CHANGE – Signals that a change of CodeIntegrity enforcement mode has occurred
  • WNF_CI_HVCI_IMAGE_INCOMPATIBLE – Signals that an image has been blocked from loading as it is incompatible with HVCI, most likely because it has regions that are both writable and executable or allocates memory from the executable non-paged pool
  • WNF_CI_SMODE_CHANGE – Signals that a change of S mode has occurred

Usually, the buffers passed into WNF state names are a mystery, and their contents must be reverse-engineered. But in this case, Microsoft exposes the data passed into one of the state names in the public Microsoft Symbol Server, accessed through symchk.exe and symsrv.dll:

typedef struct _WNF_CI_BLOCKED_DRIVER_CONTEXT
{
  /* 0x0000 */ struct _GUID Guid;
  /* 0x0010 */ unsigned long Policy;
  /* 0x0014 */ unsigned short ImagePathLength;
  /* 0x0016 */ wchar_t ImagePath[1];
} WNF_CI_BLOCKED_DRIVER_CONTEXT, *PWNF_CI_BLOCKED_DRIVER_CONTEXT;
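
As an illustration, a notification buffer with this layout could be unpacked along the following lines. This is only a sketch with a made-up function name; in particular, whether ImagePathLength counts bytes or characters is an assumption worth verifying against real notifications.

import struct
import uuid

def parse_blocked_driver_context(buf: bytes):
    # Offsets follow the WNF_CI_BLOCKED_DRIVER_CONTEXT layout above.
    guid = uuid.UUID(bytes_le=buf[0x00:0x10])
    policy, image_path_length = struct.unpack_from("<IH", buf, 0x10)
    # Assumption: ImagePathLength is a byte count of the UTF-16 string.
    image_path = buf[0x16:0x16 + image_path_length].decode("utf-16-le").rstrip("\x00")
    return guid, policy, image_path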

We can get some information from the Code Integrity manager, which could be useful for EDR products. Some of this can also be found in the Microsoft-Windows-CodeIntegrity ETW (Event Tracing for Windows) channel, as well as other interesting events (which deserve a post of their own). Still, some of the data in these WNF names can’t be found in any other data source.

Now if we update our SDK version to the preview build (25336 during the writing of this post), we can see a few other WNF state names that haven’t been released to the regular builds yet:

{"WNF_CI_APPLOCKERFLTR_START_REQUESTED", 0x41c6072ea3bc2875},
        // This event signals that AppLockerFltr service should start.
{"WNF_CI_BLOCKED_DRIVER", 0x41c6072ea3bc1875},
        // This event signals that an image has been blocked from loading by PNP
{"WNF_CI_CODEINTEGRITY_MODE_CHANGE", 0x41c6072ea3bc2075},
        // This event signals that change of CodeIntegrity enforcement mode has occurred.
{"WNF_CI_HVCI_IMAGE_INCOMPATIBLE", 0x41c6072ea3bc1075},
        // This event signals that an image has been blocked from loading as it is incompatible with HVCI.
{"WNF_CI_LSAPPL_DLL_LOAD_FAILURE", 0x41c6072ea3bc3075},
        // This event signals that a dll has been blocked from loading as it is incompatible with LSA running
        // as a protected process.
{"WNF_CI_LSAPPL_DLL_LOAD_FAILURE_AUDIT_MODE", 0x41c6072ea3bc3875},
        // This event signals that an unsigned dll load was noticed during LSA PPL audit mode.
{"WNF_CI_SMODE_CHANGE", 0x41c6072ea3bc0875},
        // This event signals that change of S mode has occurred.

Here, we see two new state names that add information about PPL-incompatible DLLs being loaded into the Local Security Authority Subsystem Service (LSASS). LSASS, the OS authentication manager, runs as a PPL (Protected Process Light). This ensures that the process isn’t tampered with and is only running code signed with the correct signature level.

Investigating LSASS protections with WNF

Microsoft has been trying to make LSASS run as a PPL for a while now but hasn’t been able to fully enable it because of compatibility issues with products that require full access to LSASS, including the injection of various plugins. However, Microsoft is still attempting to protect LSASS as much as possible from credential stealers like Mimikatz, while still allowing users the option to turn LSASS back into a regular process.

Since Windows 8.1, there has been an option to run LSASS as a PPL in audit mode. This means the system still treats it as a normal process but logs any operation that would have been blocked if it ran as a PPL. In Windows 11, LSASS runs as a regular PPL by default, with an option to run it in audit mode exposed through the registry and the security center (in preview builds).

So this is where our two new state names come in:

WNF_CI_LSAPPL_DLL_LOAD_FAILURE gets notified when LSASS is running as a regular PPL and a DLL that isn’t signed according to the PPL requirements is blocked from loading into the process. WNF_CI_LSAPPL_DLL_LOAD_FAILURE_AUDIT_MODE gets notified when LSASS is running as a PPL in audit mode and loads a DLL that would have been blocked if it were running as a normal PPL.

Endpoint Detection & Response (EDR) tools can be alerted about all DLL loads through a documented kernel callback. The image load notify routine does include cached signing information inside IMAGE_INFO, in the ImageSignatureLevel and ImageSignatureType fields; however, this information may not always be available, and the callback isn’t notified about blocked DLL loads. Blocked DLL loads are interesting because they indicate what could be an exploitation attempt (or an organization trying to load its plugin written in 2003 into LSASS).

So, while none of these new state names contain any information that is exceptionally interesting to EDRs, they do have some interesting data that security products could find useful and, at a minimum, add some visibility or save the EDR some work.

And, of course, there is already one user for some of these WNF_CI state names: The Windows Defender command-line tool MpCmdRun.exe. MpSvc.dll, one of the DLLs loaded into MpCmdRun.exe, subscribes to two WNF state names: WNF_CI_CODEINTEGRITY_MODE_CHANGE and WNF_CI_SMODE_CHANGE. Whenever they are notified, the DLL queries them to get the new values and updates its internal configuration accordingly.

Other pieces of the system subscribe to these state names too. I used WinDbg commands to extract this list from my own system:

  • The DcomLaunch service registers to WNF_CI_SMODE_CHANGE, WNF_CI_BLOCKED_DRIVER and WNF_CI_APPLOCKERFLTR_START_REQUESTED
  • Utcsvc service (through utcsvc.dll) registers to WNF_CI_SMODE_CHANGE
  • SecurityHealthService.exe registers to WNF_CI_SMODE_CHANGE
  • Msteams.exe registers to WNF_CI_SMODE_CHANGE
  • PcaSvc service (through PcaSvc.dll) registers to WNF_CI_HVCI_IMAGE_INCOMPATIBLE and WNF_CI_BLOCKED_DRIVER – this is the service responsible for displaying the pop-up message when your favorite vulnerable driver won’t load on your HVCI-enabled system.

Currently, no process subscribes to the new LSA (Local Security Authority) state names (WNF_CI_LSAPPL_DLL_LOAD_FAILURE and WNF_CI_LSAPPL_DLL_LOAD_FAILURE_AUDIT_MODE), but since those are still in the preview stage, that isn’t very surprising, and I’m sure we’ll be seeing subscriptions to them in the future.

Explore the possibilities with new WNF info

Windows has empowered security enthusiasts on both the offensive and defensive sides with newly attainable information in WNF. By expanding beyond the historical scope and adding state names to WNF, researchers have a more transparent view of how things operate. Over time, it’s likely you’ll see security researchers correlate this information with other events and processes to showcase novel security research!

Here, we’ve provided a quick introduction to WNF and its new features along with a simple example of how it can be used to investigate LSASS. If you’re interested in more details on WNF internals and its offensive capabilities, Alex Ionescu and Gabrielle Viala presented it in detail in a BlackHat 2018 talk. They later published a blog post and a collection of useful scripts.

What should governments consider when getting involved with blockchain?

Last September, Principal Security Engineer Dr. Evan Sultanik was on a panel hosted by the Naval Postgraduate School’s Distributed Consensus: Blockchain & Beyond (DC:BB) movement, where faculty and students there are seeking opportunities to learn and share knowledge, research, funding, and events focused on distributed consensus technologies.

The panel of nine government, academia, and industry experts discussed how blockchains, digital assets, and other Web3 technologies intersect with national security challenges. Dr. Sultanik discussed how the U.S. could help push global adoption and take a broader strategic outlook toward blockchain and Web3 technologies.

He talked about the inherent limitations of blockchain technologies and the Web3 movement and also offered suggestions from a training perspective that could lead to a more robust ecosystem. We’ve summarized the most important parts of that discussion here.

What are some of the important things to consider when using blockchain technologies for a project?

It’s essential to understand the tradeoffs one must make when using a blockchain, along with their security implications. Everyone at this point is aware that using a blockchain carries significant additional overhead in terms of deployment and the cost of interacting with smart contracts. That cost is gradually decreasing with the transition to new forms of consensus and higher-level protocols, but there’s still a significant difference.

You have to realize that all data stored on a public blockchain is publicly available. Anyone can look through the entire history of each account or contract and understand the implications of those actions. If privacy is a requirement of your system, you need to do something additional to ensure it.

The majority of participants in a public blockchain are untrusted. You are shifting trust from what would otherwise be a central authority to other entities that you may or may not have control over. You’re not only trusting the developers of the smart contracts that your system is interacting with, but you’re also inherently trusting the developers of the technology stack running that particular blockchain. You’re trusting the node software, the mining hardware, the mining software, the mining pool protocol, and everything else down the line. A bug in any one piece of that stack can cause the whole thing to collapse.

Blockchains allow developers to prototype new ideas quickly. You don’t have to worry about things like setting up infrastructure, and you don’t have to worry much about DevOps because that’s all handled by the blockchain itself. That allows you to significantly reduce the time between when an idea is created and when it is in users’ hands. But that speed also comes with risk: a tight development cycle can lead to poorly designed or poorly tested protocols and sloppy development, producing bugs with significant consequences, especially because these systems are big targets for attackers.

Another thing that makes DeFi, blockchain, and Web3 so appealing is that you can prototype quickly and instantly connect your application to the whole ecosystem. Since the blockchain acts as a huge shared database, contracts and assets created by competitors can be made to interact with each other in ways that would be disincentivized if implemented on a traditional centralized platform.

This composition does come at a price. It’s difficult to reason about the system because you suddenly must understand all the different contracts that created these tokens. It’s different code in each case. And your code suddenly interacts with the whole universe of code on the blockchain. So, you must be mindful of all these other externalities and third-party components your app might interact with.

We’ve seen this complexity play out recently with new types of financial instruments and technology that have become available, particularly on Ethereum, such as flash loans or maximal extractable value, which are really deep technical concepts. Still, millions of dollars have been lost because different DeFi apps were composed in a single transaction in ways that none of them intended.

Computer scientist Leslie Lamport wrote in 1987, “A distributed system is one in which the failure of a computer you didn’t even know existed can render your computer unusable.” This is still true today and will always be true in blockchains.

Should the U.S. care about blockchain technologies, and if so, what’s the best application for the government?

It’s a matter of national security that the U.S. government gets involved with blockchains: Other than perhaps lost tax revenue, Uncle Sam doesn’t really care if you lose your Bitcoin. But Uncle Sam should care if North Korea steals it. U.S. adversaries are already exploiting these technologies to circumvent sanctions and undermine our markets.

It’s more productive to ask, “Can blockchain and Web3 technologies ever be made secure? If so, how?” The U.S. government needs to foster research and innovation to answer this question to stay relevant and remain a world leader in distributed ledger technology.

How should the U.S. handle the training regimen needed in the Web3 space?

There is a large need to change how we educate the incoming workforce because traditional software development expertise does not directly translate into Web3. I have friends who don’t have a background in computer science, yet they learned one programming language, wrote a mobile app, and are now millionaires. They don’t have any technical knowledge of what a phone is doing, how iOS or Android is running, or how the hardware works. They just needed to know that one programming language, and that was sufficient for them to build something very popular and effective.

That isn’t true for Web3. Knowing the entire stack is helpful when creating smart contracts, because you need to understand the compiler that you’re using. You need to understand the virtual machine that’s running your code. You need to understand Byzantine fault-tolerant consensus protocols. You should understand zero-knowledge proofs or zk-SNARKs. You should understand all of these esoteric technologies, yet very few experts know any of them, let alone all of them. And you need to be an expert in them to avoid all the pitfalls and footguns.

We need policies incentivizing people to enter the workforce with these necessary skills. At Trail of Bits, we’ve developed a blockchain security apprenticeship because finding people with all the necessary skills is difficult in this competitive market. Some security people know how to analyze a C++ program or a mobile app, but they have no idea about blockchain. And then you have blockchain people who have no background in security. So we developed this in-house program.

For mobile app stores, there has always been a low barrier to entry for people looking to get involved in the app economy. With Web3, that doesn’t seem to be the case, yet there is a lot of activity in this space. What more needs to be done to bring developers to a level where blockchain is mature from a security perspective, and what entities or organizations should lead that effort?

The barrier to entry is surprisingly low for Web3, too, which is part of the problem: Web3 development toolchains have been modeled after familiar toolchains from traditional app development. Developer friendliness has been prioritized at the expense of security. We need to modernize and improve the tooling to flip the balance of that prioritization.

Conclusion

It’s not enough for governments to only express interest in securing blockchain technologies. Real, purposeful investments need to be made. Beyond the design of secure architectures, languages, compilers, and protocols, these investments should also include educating a robust workforce to meet tomorrow’s Web3 demands.

If you’re considering whether a blockchain might be the solution to a problem you’re trying to solve, we recommend our operational risk assessment titled, “Do You Really Need a Blockchain?” This will give you a thorough look into the advantages and risks you may be taking.

Finally, if you would like to hear more from the other experts on the panel about blockchain technologies and national security, you can view the discussion in its entirety at: https://nps.edu/web/nps-video-portal/-/blockchain-research-opportunities-for-nps-students-and-faculty.

Typos that omit security features and how to test for them

By Dominik ‘disconnect3d’ Czarnota

During a security audit, I discovered an easy-to-miss typo that meant _FORTIFY_SOURCE, which helps detect memory corruption bugs in incorrectly used C functions, was never actually enabled. We then searched for, found, and fixed twenty C and C++ bugs on GitHub with this same pattern. Here is a list of some of them related to this typo:

We’ll show you how to test your code to avoid this issue that could make it easier to exploit bugs.

How source fortification works

Source fortification is a security mitigation that replaces certain function calls with more secure wrappers that perform additional runtime or compile-time checks.

Source fortification is enabled by defining the special macro _FORTIFY_SOURCE with a value of 1, 2, or 3 and compiling the program with optimizations. The higher the value, the more functions are fortified and the more checks are performed. The libc library and compiler must also support source fortification, which is the case for glibc, Apple Libc, gcc, and LLVM/Clang, but not musl libc and uClibc-ng. The implementation specifics may also vary: for example, level 3 was only recently added in glibc 2.34, and it does not seem to be available in Apple Libc.

The following example shows source fortification in action. Depending on whether we enable the mitigation, the resulting binary calls either the strcpy function or its __strcpy_chk wrapper:

Figure 1: Compiler Explorer comparison of the assembly generated by the compiler.
In this case, the __strcpy_chk wrapper function is implemented by glibc (source):

/* Copy SRC to DEST and check DEST buffer overflow.  */
char * __strcpy_chk (char *dest, const char *src, size_t destlen) {
    size_t len = strlen (src);
    if (len >= destlen)
        __chk_fail ();
    return memcpy (dest, src, len + 1);
}

Figure 2: The __strcpy_chk function from glibc
As we can see, the wrapper takes one more argument—the destination buffer size—and then checks if the length of the source is bigger than the destination. If it is, the wrapper calls the __chk_fail function, which aborts the process. Figure 1 shows that the compiled code passes the correct length of the dest destination buffer in the mov edx, 10 instruction.

Tpying is hard

Since source fortification is controlled by a preprocessor macro, a typo in the macro’s spelling silently disables it, and neither the libc nor the compiler catches the mistake. This is unlike typos in other security hardening options, which are enabled with compiler flags instead of macros and therefore usually produce an error or warning.

Effectively, if you pass in “-DFORTIFY_SOURCE=2 -O2” instead of “-D_FORTIFY_SOURCE=2 -O2” to the compiler, the source fortification won’t be enabled, and the wrapper functions will not be used:

Figure 3: Assembly when making a typo in the _FORTIFY_SOURCE macro (created with Compiler Explorer)
I searched for this and similar bug patterns using the grep.app, sourcegraph.com, and cs.github.com code search tools, and I sent 20 pull requests. Three of my pull requests were slightly different and fall outside the list at the beginning of this post:

  • kometchtech/docker-build#50 used “-FORTIFY_SOURCE=2 -O2”. This is not detected as a compiler error because it is a “-F<dir>” flag, which sets “search path for framework include files.”
  • ned14/quickcpplib#37 had a typo in the “-fsanitize=safe-stack” compiler flag. Although compilers detect such a typo, the flag was used in a CMake script to determine if the compiler supports the safe stack mitigation. The CMake script never enabled this mitigation because of this typo. I found this case thanks to my colleague, Paweł Płatek, who suggested checking whether compilers detect typos in security-related flags. Although they do, flag typos may still cause issues during compiler feature detection.
  • OpenImageIO/oiio#3729 was an invalid report/PR since the “-DFORTIFY_SOURCE=2” option provided a value for a CMake variable that eventually led to setting the proper _FORTIFY_SOURCE macro. (However, that is still an unfortunate CMake variable name.)

The three code search tools I used can find more cases like this, but I didn’t send PRs for all of them (for example, when a project seemed abandoned).

Testing _FORTIFY_SOURCE mitigation

In addition to testing code during continuous integration, developers should also test the results of their build systems and the options they have chosen to enable. Apart from helping to detect regressions, this can also help developers understand what the options really do; for example, source fortification is silently disabled when optimizations are disabled.

So, how do you see if you enabled source fortification correctly? You can scan the symbols used by your binary and ensure that the fortified source functions you expect to be used are really used. A simple Bash script like the one shown below can achieve this:

if readelf --symbols /bin/ls | grep -q ' __snprintf_chk@'; then
    echo "snprintf is fortified";
else
    echo "snprintf is not fortified";
fi

Figure 4: Simple Bash script to check for a fortified symbol
However, in practice, you should just scan your binary for security mitigations with a binary analysis tool such as checksec.rs, BinSkim Binary Analyzer, Pwntools’ checksec, checksec.sh, or winchecksec (a tool Trail of Bits created for checksec on Windows).

Before using a tool, it’s a good idea to double-check that it works properly. As referenced in the above list of bugs, BinSkim had a typo in its recommendations text. Another bug, this time in checksec.sh, resulted in incorrect results in the “Home Router Security Report 2020.” What was the reason for the bug in checksec.sh? If a scanned binary used stack canaries, the “__stack_chk_fail” symbol (used to abort the program if the canary was corrupted) was incorrectly counted as evidence of source fortification. This is because checksec.sh looked for a “_chk” string anywhere in the output of the readelf --symbols command, instead of requiring that a symbol name end with the “_chk” suffix. This bug appears to be fixed after the issues reported in slimm609/checksec.sh#103 and slimm609/checksec.sh#130 were resolved.
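
The distinction is easy to get right. Here is a minimal sketch of the corrected check; the function name is made up, and the readelf invocation mirrors the approach shown above.

import subprocess

def fortified_symbols(binary_path: str) -> set:
    # Keep only symbols whose *name* ends in "_chk" (after stripping version
    # info like "@GLIBC_2.4"), rather than matching "_chk" anywhere in the
    # output: "__stack_chk_fail" contains "_chk" but does not end with it.
    output = subprocess.run(
        ["readelf", "--symbols", binary_path],
        capture_output=True, text=True, check=True,
    ).stdout
    names = (token.split("@")[0] for token in output.split())
    return {name for name in names if name.endswith("_chk")}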

It is also worth noting that both BinSkim and checksec.sh can tell you how many fortifiable functions there are versus how many are actually fortified in your binary. How do they do that? BinSkim keeps a hard-coded list of fortifiable function names deduced from glibc, and checksec.sh scans your own glibc to determine those names. Although this can prevent some false positives, those solutions are still imperfect. What if your binary is linked against a different libc, or, in the case of BinSkim, what if glibc adds new fortifiable functions? Last but not least, none of the tools detect the actual fortification level used; perhaps that only impacts the number of fortifiable functions, but I am not sure.

Fun fact: Typo in Nginx

During this research, I also found out that the Nginx package from Debian had this kind of typo bug in the past. Currently, the Nginx package uses the dpkg-buildflags tool, which provides the proper macro flag:

$ dpkg-buildflags --get CPPFLAGS
-Wdate-time -D_FORTIFY_SOURCE=2

$ dpkg-buildflags --get CFLAGS
-g -O2 -fdebug-prefix-map=/tmp/nginx-1.18.0=. -fstack-protector-strong -Wformat -Werror=format-security

It is weird that the source fortification and optimization flags are separated into CFLAGS and CPPFLAGS. Wouldn’t some projects use one but not the other and miss some of the options? I haven’t checked that.

Some wishful thinking

In an ideal world, a compiler would automatically include information about all necessary security mitigations and hardening options in the generated binary. However, we are limited by the incomplete information we must work with.

When testing your build system, there doesn’t seem to be a silver bullet, especially since not all security mitigations are straightforward to check, and some may require analyzing the resulting assembly. We haven’t analyzed the tools exhaustively, but we would probably recommend using checksec.rs or BinSkim for Linux and winchecksec for Windows. We also plan to extend Blight, our build instrumentation tool, to find the mistakes described in this blog post during build time. Even so, it probably still makes sense to scan the resulting binary to confirm what the compiler and linker are doing.

Finally, contact us if you find this research interesting and you want to secure your software further, as we love to work on hard security problems.

A Winter’s Tale: Improving messages and types in GDB’s Python API

By Matheus Branco Borella, University of São Paulo

As a winter associate at Trail of Bits, my goal was to make two improvements to the GNU Project Debugger (GDB): make it run faster by parallelizing symbol parsing to better use all available CPU cores, and improve its Python API to support tools that rely on it, like Pwndbg. I ultimately implemented three changes that enhanced GDB's Python API.

Beyond the actual code, I also learned about the process of upstreaming patches in GDB. This process can take a while, has a bit of a learning curve, and involves a lot of back and forth with the project's maintainers. I'll discuss this later in the post, and you can also follow along as my work is discussed on the GDB patches mailing list.

Why make GDB faster?

GDB has three ways to load DWARF symbols from a program:

  1. Partial symbol table loader: The index loader is responsible for loading symbol names and connecting them to their respective compilation units (CUs), leaving the parsing and building of their symbol tables to the full loader. Parsing is done later, only when full information about a symbol is required.
  2. Full symbol table loader: Finishes the work the index loader has left for later by parsing the CUs and building their symbol tables as needed. This loader fully parses the DWARF information in the file and stores it in memory.
  3. Index parser: ELFs can have a special .gdb_index section, added either with the --gdb-index linker flag or with the gdb-add-index tool provided by GDB. This section stores an index of the internal symbol table that allows GDB to skip the index construction pass, significantly reducing the time required to load the binary in GDB.

The original idea was to port the parallel parsing approach in drgn, Meta’s open-source debugger, to GDB. Parallel parsing had already been implemented for the index loader, leaving only the full loader and the index parser as potential next candidates in line for parallelization.

You can think of GDB's parsing routines as splittable into concurrent tasks on a per-CU basis, since they're already invoked sequentially once per CU. However, this view has a major issue: despite the ostensible separation of the data, it is not cleanly divided into data that is fully read-write, data that is partially read-write with implicit synchronization, and data that is read-only. The parsing subroutines expect all of these data structures to be read-write, at least to a degree.

While solving most of these is a simple case of splitting the values into separate read-write copies (one owned by each thread), things like the registries, the caches, and particularly the obstacks are much harder to move to a concurrent model.

What’s an obstack?

General-purpose allocation routines, like malloc(), are time-consuming. They may not be efficient when a program needs to allocate many small objects as quickly as possible, since they store metadata with each allocation.

Enter allocation stacks: each new object is allocated at the top of the stack and freed from the top, in order. GNU Obstack, an implementation of such an allocator, is used heavily in GDB. Each reasonably long-lived container object, including objfile and gdbarch, has its own obstack instance, which holds the objects the container references and frees them all at once, together with the container itself.

If you're familiar with object lifetime tracking, be it dynamic (as with std::shared_ptr) or static (as with references in Rust), the last paragraph will have sounded familiar. Judging by how obstack allocations are used in GDB, one might assume they are a way to guarantee that objects will live as long as the container that owns them.

After discussing this on IRC and the mailing list, I reached two conclusions: it would take a considerable amount of time to investigate, and I was better off prioritizing the Python API so that I could have a chance at completing the improvements on time. Ultimately, I spent most of my time on those attainable goals.

__repr__() methods for GDB objects

The first change is fairly simple. It adds __repr__() implementations to a handful of types in the GDB Python API. This change makes the messages we get from inspecting types in the Python REPL more informative about what those types represent.

Previously, we would get something like this, which is hardly helpful (note: pi is the GDB command to run the Python REPL):

(gdb) pi
>>> gdb.lookup_type("char")
<gdb.Type object at 0x7ff8e01aef20>

Now, we can get the following, which tells us what kind of type this is, as well as its name, rather than where the object is located in memory:

(gdb) pi
>>> gdb.lookup_type("char")
<gdb.Type code=TYPE_CODE_INT name=char>

This also applies to gdb.Architecture, gdb.Block, gdb.Breakpoint, gdb.BreakpointLocation, and gdb.Symbol.

This helped me understand how GDB interfaces with Python and how the Python C API generally works. It allowed me to add my own functions and types later.
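The change itself lives in GDB's C sources and uses the Python C API, but conceptually it is the ordinary Python pattern sketched below (the class here is a stand-in for illustration, not GDB code):

class Type:
    """Toy stand-in for gdb.Type, showing the kind of __repr__ the patch adds."""
    def __init__(self, code, name):
        self.code = code
        self.name = name

    def __repr__(self):
        return f"<gdb.Type code={self.code} name={self.name}>"

print(Type("TYPE_CODE_INT", "char"))
# <gdb.Type code=TYPE_CODE_INT name=char>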

Types ahoy!

The second change adds the ability to create types from the Python API, whereas previously you could only query for existing types using gdb.lookup_type(). Now you can directly create any primitive type supported by GDB, which can be pretty handy if you're working on code but don't have the symbols for it, or if you're writing plugins to help people work with that sort of code. Types from weird extra binaries need not apply!

GDB supports a fairly large number of types. All of them can be created directly using gdb.init_type or one of the specialized gdb.init_*_type functions, which let you specify parameters relevant to the type being created. Most of them work similarly, except for gdb.init_float_type, which has its own new gdb.FloatFormat type to go along with it. This lets you specify how the floating point type you’re trying to create is laid out in memory.

An extra consideration that comes with this change is where exactly the memory for these new types comes from. Since these functions are based on functions already available internally in GDB, and since these functions use the obstack from a given objfile, the obstack is the memory source for these allocations. This has one big advantage: objects that reference these types and belong to the same objfile are guaranteed never to outlive them.

You may already have realized a significant drawback to this method: any type allocated on the obstack has a high chance of not being at the top of the stack when the Python runtime frees it. So, regardless of their real lifetime requirements, types can be freed only along with the objfile that owns them. The main implication is that unreachable types will leak their memory for the lifetime of the objfile.

Keeping track of the initialization of the type by hand would require a deeper change to the existing type object infrastructure. This is too ambitious for a first patch.

Here are a few examples of this method in action:

(gdb) pi
>>> objfile = gdb.lookup_objfile("servo")
>>>
>>> # Time to standardize integer extensions. :^)
>>> gdb.init_type(objfile, gdb.TYPE_CODE_INT, 24, "long short int")
<gdb.Type code=TYPE_CODE_INT name=long short int>

The example above creates a new 24-bit integer type named “long short int”. The next example creates a floating-point type:

(gdb) pi
>>> objfile = gdb.lookup_objfile("servo")
>>>
>>> ff = gdb.FloatFormat()
>>> ff.totalsize = 32
>>> ff.sign_start = 0
>>> ff.exp_start = 1
>>> ff.exp_len = 8
>>> ff.man_start = 9
>>> ff.man_len = 23
>>> ff.intbit = False
>>>
>>> gdb.init_float_type(objfile, ff, "floatier")
<gdb.Type code=TYPE_CODE_FLOAT name=floatier>

This creates a new 32-bit floating-point type laid out like the single-precision float available on standard x86 machines.

What about the symbols?

The third change adds the ability to register three kinds of symbols: types, goto labels, and statics. This makes it much easier to add new symbols, which is especially useful if you're reverse engineering and don't have the original symbols. Without this patch, the main way to add new symbols involves writing them in a separate file, compiling that file for the target architecture, and loading it into GDB with the add-symbol-file command after the base program is loaded.

GDB’s internal symbol infrastructure is mostly not meant for on-the-fly additions. Let’s look at how GDB creates, stores, and looks up symbols.

Symbols in GDB are found through pointers deep inside structures called compunit_symtab. These structures are set up through a builder that allows symbols to be added to the table as it’s being built. This builder is later responsible for registering the new structure with the (in the case of this patch) objfile that owns it. In the objfile case, these tables are stored in a list that, during lookup—disregarding the symbol lookup cache—is traversed until a symbol matching the given requirements is found in one of the tables.

Currently, tables aren't set up to allow symbols to be added at will after they've been built. So, if we don't want to make deep changes to GDB before the first patch, we must find a way around this limitation. What I landed on was building a new symbol table and stringing it onto the end of the list for every new symbol. Although this is a rather inefficient approach, it's sufficient to get the feature to work.

As this patch continues to be upstreamed, I aim to iron out and improve the mechanism by which this functionality is implemented.

Lastly, I’d like to show an example of a new type being created and registered as a symbol for future lookup:

(gdb) pi
>>> objfile = gdb.lookup_objfile("servo")
>>> type = gdb.init_type(objfile, gdb.TYPE_CODE_INT, 24, "long short int")
>>> objfile.add_type_symbol("long short int", type)
>>> gdb.lookup_type("long short int")
<gdb.Type code=TYPE_CODE_INT name=long short int>

Getting it all merged

Overall, this winter at Trail of Bits produced more informative messages and the ability to create supported types in GDB's Python API, which is helpful when you don't have symbols for the code you're working on.

GDB is old school regarding how it handles contributions. Its maintainers use email to submit, test, and comment on patches before they are upstreamed. This generally means there's a very rigid etiquette to follow when submitting a patch.

As someone who had never dealt with email-based projects, my first attempt to submit a patch went poorly. I cobbled together a text file with the output of git diff and then wrote the entire message by hand before sending it through a client that poorly handled non-UNIX line endings. This caused a mess that, understandably, none of the maintainers on the list were inclined to patch in and test. Still, they were nice enough to tell me I should have used Git's built-in email functionality, git send-email, directly.

After that particular episode, I put in the time to split my changes into proper branches and to rebase them so that each major change was condensed into a single commit. This produced a more rational and descriptive commit message that covers the entire change and is much better suited for use with git send-email. Since then, things have been rolling pretty smoothly, though there has been a lot of back and forth to get all of my changes in.

All three changes have been submitted; the one implementing __repr__() is furthest along in the pipeline, while the other two are still awaiting review. Keep an eye out for them!

How to avoid the aCropalypse

By Henrik Brodin, Lead Security Engineer, Research

The aCropalypse is upon us!

Last week, news about CVE-2023-21036, nicknamed the “aCropalypse,” spread across Twitter and other media, and I quickly realized that the underlying flaw could be detected by our tool, PolyTracker. I’ll explain how PolyTracker can detect files affected by the vulnerability even without specific file format knowledge, which parts of a file can become subject to recovery using acropalypse.app, and how Google and Microsoft could have caught this bug by using our tools. Coincidentally, my colleagues, Evan Sultanik and Marek Surovič, and I wrote a paper that describes this class of bugs, defines a novel approach for detecting them, and introduces our implementation and tooling. It will appear at this year’s workshop on Language-Theoretic Security (LangSec) at the IEEE Security and Privacy Symposium.

We use PolyTracker to instrument the image parser, libpng. (Any parser will do, not just aCropalyptic ones.) The PolyTracker instrumentation tells us which portions of the input file are completely ignored by the parser, which we call blind spots. Blind spots are almost always indicators of design flaws in the file format, malformation in the input file, and/or a bug in the parser. Normal images should have almost no blind spots, but parsing malformed aCropalyptic images through libpng reveals the cropped data in a large blind spot. The aCropalypse bugs could have been caught if the vulnerable products had been instrumented with PolyTracker and their output tested for blind spots.

# parse the screenshot with an instrumented version of pngtest
$ ./pngtest.instrumented re3eot.png.png out_re3eot.png.png
# ask polytracker to identify any blindspots in the file
$ polytracker cavities polytracker.tdag                                                                 
re3eot.png,697120,1044358
# found a blind spot from offset 697120 to the end of the file: 347,238 bytes of ignored data containing the cropped-out image content that could be recovered

Understanding the aCropalypse

According to this tweet, it is possible to recover parts of an original image from a cropped or redacted screenshot. The TL;DR is that when the Google Pixel built-in screenshot editing tool, Markup, is used to crop or resize an image, it overwrites the original image, but only up to the offset where the new image ends. Any data from the original image after that offset is left intact in the file. David Buchanan devised an algorithm to recover the original image data still left in the file; you can read more about the specifics on his blog.

More recently, Chris Blume identified a similar vulnerability for the Windows Snipping Tool. The methodology we describe here for the Markup tool can be used on images produced by the Windows Snipping Tool.

PolyTracker has a feature we introduced a couple of years ago called blind spot detection. We define blind spots as the set of input bytes whose data flow never influences either the control flow that leads to an output or an output itself. Or, in layman’s terms, unused file data that can be altered to have any content without affecting the output. The cropped-out regions of an aCropalypse image are, by definition, blind spots, so PolyTracker should be able to detect them!
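As a concrete (if simplified) illustration of that definition, the Python sketch below overwrites a reported blind spot with zeros and checks that a load/store tool still produces bit-identical output. The ./pngtest path, the input filename, and the offsets are placeholders standing in for the values that appear later in this post:

import hashlib
import subprocess

def output_digest(parser, infile, outfile):
    # Run a load/store tool (such as libpng's pngtest) and hash what it writes out.
    subprocess.run([parser, infile, outfile], check=True)
    with open(outfile, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

baseline = output_digest("./pngtest", "re3eot.png", "out_baseline.png")

# Overwrite the reported blind spot with arbitrary content (zeros here).
start, end = 697120, 1044358            # offsets reported by `polytracker cavities`
data = bytearray(open("re3eot.png", "rb").read())
data[start:end] = bytes(end - start)
with open("mutated.png", "wb") as f:
    f.write(data)

# A true blind spot never influences the output, so the digests must match.
assert output_digest("./pngtest", "mutated.png", "out_mutated.png") == baseline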

One of the challenges of tracking input bytes and detecting blind spots for real-world inputs like PNG images or PDF documents is taint explosion. The PNG file format contains compressed chunks of image data, and compression is particularly prone to causing taint explosion because input bytes combine in many ways to produce output bytes. PolyTracker's unique representation of the taint structure allows us to track 2^31 unique taint labels, which is necessary for analyzing taints propagated during zlib decompression of image data.

aCropalyptic files will have blind spots when processed

To understand why the aCropalypse vulnerability produces blind spots, we need to combine our knowledge of the vulnerability with the description of blind spots. When parsing a PNG file with a PNG parser, the parser will interpret the header data and consume chunks according to the PNG specification. In particular, it will end at a chunk with type IEND, even if that is not at the actual end of the file.

We use PolyTracker to instrument a tool (pngtest from the libpng project) that reads PNG files and writes them to disk again. This will produce an additional output file, called polytracker.tdag, that captures the data flow from the runtime trace. Using that file and PolyTracker’s blind spot detection feature, we can enumerate the input bytes that do not affect the resulting image. Remember, these are the bytes of the input file that neither affect any control flow, nor end up (potentially mixed with other data) in the output file. They have no actual meaning in interpreting the format for the given parser.

Show me!

Using the PolyTracker-instrumented pngtest application, we load, parse, and then store the below image to disk again. During this processing, we track all input bytes through PNG and zlib processing until they eventually reach the output file in some form.

We use a Docker image containing the PolyTracker instrumented pngtest application.

$ docker run -ti --rm -v $(pwd):/workdir acropalypse
$ cd /workdir
$ /polytracker/acropalypse/libpng-1.6.39/pngtest.instrumented re3eot.png.png out_re3eot.png.png

The re3eot.png image is 1,044,358 bytes in size, whereas out_re3eot.png is 697,182 bytes. Although this indicates a fairly large reduction in size, at this point we can't tell why; it could, for example, be the result of different compression settings.

Next, let’s find the blind spots from this process:

$ polytracker cavities polytracker.tdag 
                                                                
100%|███████████████████| 1048576/1048576 [00:01<00:00, 684922.43it/s]
re3eot.png,697120,1044358
out_re3eot.png,37,697182

The output we are interested in is:

re3eot.png,697120,1044358

This tells us that the data starting from offset 697,120 to the end of the file was ignored when producing the output image. We have found a blind spot! The additional 347,238 bytes of unused data could be left from an original image—an indication of the aCropalypse vulnerability. Let’s use the acropalypse.app web page to see if we can recover it.

The recovery succeeds, which indicates that the file was in fact produced by the vulnerable application. At this point, we know that the image contains data from the original image at the end, as that is the core of the vulnerability. We also know the exact location and extent of that data (according to the blind spot's starting offset and size). To confirm that the data is in fact a blind spot, let's manually crop the original image and redo the pngtest operation to ensure that the resulting files are in fact equal. First, let's copy only the portion that is not a blind spot—the data that is used to produce the output image.

$ dd if=re3eot.png of=manually_cropped_re3eot.png count=1 bs=697120

Next, let’s run the pngtest application again:

$ /polytracker/acropalypse/libpng-1.6.39/pngtest.instrumented manually_cropped_re3eot.png out_manually_cropped_re3eot.png

If our assumption—that only the first 697,120 bytes were used to produce the output image—is correct, we should have two identical output files, despite the removal of 347,238 bytes from the manually_cropped_re3eot.png input file.

$ sha1sum out_manually_cropped_re3eot.png out_re3eot.png
8f4a0417da4c68754d2f85e059ee2ad87c02318f  out_manually_cropped_re3eot.png
8f4a0417da4c68754d2f85e059ee2ad87c02318f  out_re3eot.png

Success! To ensure that the manually cropped file isn’t still affected by the vulnerability, let’s use the web page to try to reconstruct additional image data in the file. This attempt was unsuccessful, as we have removed the original image contents. (Yes, I have checked the cropped screenshot for blind spots 😁).

To better understand why the blind spot started at the particular offset, we need to examine the structure of the original image.

PolyFile to the rescue

PolyTracker has a sibling tool: PolyFile, a pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer. We will use PolyFile’s ability to produce an HTML rendering of the file structure to understand why file processing ends before the file ends.

First, we use the following command to produce an HTML file representing the file format:

$ polyfile --html re3eot.html re3eot.png

When we open the re3eot.html file in a browser, we’ll see an initial representation of the file.

By repeatedly expanding the file structure on the left-hand side, we eventually reach the final chunk.

As shown in the above picture, the final chunk, when interpreting the PNG format, has type IEND. Following that chunk is the remaining data from the original file. Note how the superfluous data starts at offset 0xaa320—that is, 697,120, the exact offset of the identified blind spot. If you were to scroll all the way to the end, you would find an additional IEND structure (from the original image), but it is not interpreted as a valid part of the PNG file.

It doesn’t stop here

Having almost no knowledge of the PNG file format, we were able to use PolyTracker instrumentation on an existing PNG processing application to detect not only files that have blind spots, but also their exact location and extent.

PolyTracker can detect blind spots anywhere in the file, not only at the end. And even though we analyzed PNG files here, PolyTracker isn't limited to a specific format: we have previously analyzed the conversion of PDFs to PostScript using MuPDF. The same technique is valid for any application that performs a load/store or deserialize/serialize operation. To further increase our understanding of the format and the effects of the vulnerability, we used PolyFile to inspect the file structure.

These are just a couple of use cases for our tools; there are plenty of others! We encourage you to try PolyTracker and PolyFile yourself to see how they can help you identify unexpected processing and prevent vulnerabilities similar to the aCropalypse in your applications.

Acknowledgements

This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) SafeDocs program as a subcontractor to Galois under HR0011-19-C-0073. The views, opinions, and findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Many thanks to Evan Sultanik, Marek Surovič, Michael Brown, Trent Brunson, Filipe Casal, Peter Goodman, Kelly Kaoudis, Lisa Overall, Stefan Nagy, Bill Harris, Nichole Schimanski, Mark Tullsen, Walt Woods, Peter Wyatt, Ange Albertini, and Sergey Bratus for their invaluable feedback on the approach and tooling. Thanks to Ange Albertini for suggesting angles morts—French for “blind spots”—to name the concept, and to Will Tan for sharing a file affected by the vulnerability. Special thanks to Carson Harmon, the original creator of PolyTracker, whose ideas and discussions germinated this research, and Evan Sultanik for helping write this blog post.

Codex (and GPT-4) can’t beat humans on smart contract audits

By Artem Dinaburg, Chief Technology Officer; Josselin Feist, Principal Engineer; and Riccardo Schirone, Security Engineer

Is artificial intelligence (AI) capable of powering software security audits? Over the last four months, we piloted a project called Toucan to find out. Toucan was intended to integrate OpenAI’s Codex into our Solidity auditing workflow. This experiment went far beyond writing “where is the bug?” in a prompt and expecting sound and complete results.

Our multi-functional team, consisting of auditors, developers, and machine learning (ML) experts, put serious work into prompt engineering and developed a custom prompting framework that worked around some frustrations and limitations of current large language model (LLM) tooling, such as working with incorrect and inconsistent results, handling rate limits, and creating complex, templated chains of prompts. At every step, we evaluated how effective Toucan was and whether it would make our auditors more productive or slow them down with false positives.

The technology is not yet ready for security audits for three main reasons:

  1. The models are not able to reason well about certain higher-level concepts, such as ownership of contracts, re-entrancy, and fee distribution.
  2. The software ecosystem around integrating large language models with traditional software is too crude: everything is cumbersome, and there are virtually no developer-oriented tools, libraries, or type systems that work with uncertainty.
  3. There is a lack of development and debugging tools for prompt creation. To develop the libraries, language features, and tooling that will integrate core LLM technologies with traditional software, far more resources will be required.

Whoever successfully creates an LLM integration experience that developers love will create an incredible moat for their platform.

The above criticism still applies to GPT-4. Although it was released only a few days before the publication of this blog post, we quickly ran some of our experiments against GPT-4 (manually, via the ChatGPT interface). We conclude that GPT-4 presents an incremental improvement at analyzing Solidity code. While GPT-4 is considerably better than GPT-3.5 (ChatGPT) at analyzing Solidity, it is still missing key features, such as the ability to reason about cross-function reentrancy and inter-function relationships in general. There are also some capability regressions from Codex, like identification of variables, arithmetic expressions, and understanding of integer overflow. It is possible that with the proper prompting and context, GPT-4 could finally reason about these concepts. We look forward to experimenting more when API access to the large context GPT-4 model is released.

We are still excited at the prospect of what Codex and similar LLMs can provide: analysis capabilities that can be bootstrapped with relatively little effort. Although it does not match the fidelity of good algorithmic tools, for situations where no code analysis tools exist, something imperfect may be much better than having nothing.

Toucan was one of our first experiments with using LLMs for software security. We will continue to research AI-based tooling, integrating it into our workflow where appropriate, like auto-generating documentation for smart contracts under audit. AI-based capabilities are constantly improving, and we are eager to try newer, more capable technologies.

We want AI tools, too

Since we like to examine transformational and disruptive technologies, we evaluated OpenAI’s Codex for some internal analysis and transformation tasks and were very impressed with its abilities. For example, a recent intern integrated Codex within Ghidra to use it as a decompiler. This inspired us to see whether Codex could be applied to auditing Solidity smart contracts, given our expertise in tool development and smart contract assessments.

Auditing blockchain code is an acquired skill that takes time to develop (which is why we offer apprenticeships). A good auditor must synthesize multiple insights from different domains, including finance, languages, virtual machine internals, nuances about ABIs, commonly used libraries, and complex interactions with things like pricing oracles. They must also work within realistic time constraints, so efficiency is key.

We wanted Toucan to make human auditors better by increasing the amount of code they could investigate and the depth of the analysis they could accomplish. We were particularly excited because there was a chance that AI-based tools would be fundamentally better than traditional algorithmic-based tooling: it is possible to learn undecidable problems to an arbitrarily high accuracy, and program analysis bumps against undecidability all the time.

We initially wanted to see if Codex could analyze code for higher-level problems that could not be examined via static analysis. Unfortunately, Codex did not provide satisfactory results because it could not reason about higher-level concepts, even though it could explain and describe them in words.

We then pivoted to a different problem: could we use Codex to reduce the false positive rate from static analysis tools? After all, LLMs operate fundamentally differently from our existing tools. Perhaps they provide enough signal to create new analyses that were previously untenable due to unacceptable false positive rates. Again, the answer was negative: the number of failures was high even on average-sized code, and those failures were difficult to predict and characterize.

Below we’ll discuss what we actually built and how we went about assessing Toucan’s capabilities.

Was this worth our time?

Our assessment does not meet the rigors of scientific research and should not be taken as such. We attempted to be empirical and data-driven in our evaluation, but our goal was to decide whether Toucan warranted further development effort—not scientific publication.

At each point of Toucan development, we tried to assess whether we were on the right track. Before starting development, we manually used Codex to identify vulnerabilities that humans had found in specific open-source contracts—and with enough prompt engineering, Codex could.

After we had the capability to try small examples, we focused on three main concepts that seemed within Codex’s capability to understand: ownership, re-entrancy, and integer overflow. (A quick note for the astute reader: Solidity 0.8 fixed most integer overflow issues; developing overflow checks was an exercise in evaluating Codex’s capability against past code.) We could, fairly successfully, identify vulnerabilities regarding these concepts in small, purpose-made examples.

Finally, as we created enough tooling to automate asking questions against multiple larger contracts, we began to see the false positive and hallucination rates become too high. Although we had some success with ever more complex prompts, it was still not enough to make Toucan viable.

Below are some key takeaways from our experience.

Codex does not fully grasp the higher-level concepts that we would like to ask about, and explaining them via complex prompt engineering does not always work or produce reliable results. We had originally intended to ask questions about higher-level concepts like ownership, re-entrancy, fee distribution, how pricing oracles are used, or even automated market makers (AMMs). Codex does not fully understand many of these abstract concepts, and asking about them failed in the initial evaluation stage. It somewhat comprehends the simplest of these concepts, ownership, but even then it cannot always correlate changes in the ‘owner’ variable with the concept of ownership. Codex does not appear to grasp re-entrancy attacks as a concept, even though it can describe them in natural language.

It is very easy to delude yourself by p-hacking a prompt that works for one or a few examples. It is extremely difficult to get a prompt that generalizes very well across multiple, diverse inputs. For example, when testing whether Toucan could reason about ownership, we initially tried seven small (<50 LOC) examples from which we could determine a baseline. After a thorough prompt-engineering effort, Toucan could pass six out of seven tests, with the lone failing test requiring complex logic to induce ownership change. We then tried the same prompt on eight larger programs (> 300 LOC), among which Toucan identified 15 potential changes of ownership, with four false positives—including complete hallucinations. However, when we tried slight permutations of the original small tests, we could usually get the prompt to fail given relatively minor changes in input. Similarly, for integer overflow tests, we could get Toucan to successfully identify overflows in 10 out of 11 small examples, with one false positive—but a larger set of five contracts produced 12 positives — with six of them being false, including four instances of complete hallucinations or inability to follow directions.

Codex can be easily misled by small changes in syntax. Codex is not as precise as existing static analysis tools. It is easily confused by comments, variable names, and small syntax changes. A particular thorn is reasoning about conditionals (e.g., ==, !=, <, >), which Codex seemingly ignores, drawing its conclusions from function and variable names instead.

Codex excels at abstract tasks that are difficult to define algorithmically, especially if errors in the output are acceptable. For example, Codex will excel at queries like “Which functions in this contract manipulate global state?” without having to define “global state” or “manipulate.” The results might not be exact, but they will often be good enough to experiment with new analysis ideas. And while it is possible to define queries like this algorithmically, it is infinitely easier to ask in plain language.

The failure modes of Codex are not easy to predict, but they are different from those of Slither and, most likely, those of similar static analysis tools based on traditional algorithms.

A comparison of false and true positives found by Toucan and Slither
Figure 1: True positives (green) and false positives (red) found by Slither, Toucan, and both on some simple re-entrancy tests. The Toucan results are not encouraging.

We tried looking at the true/false positive sets of Slither and Toucan, and found that each tool had a different set of false positives/false negatives, with some overlap (Figure 1). Codex was not able to effectively reduce the false positive rate from a prototype Slither integer overflow detector. Overall, we noticed a tendency to reply affirmatively to our questions, increasing the number of positives discovered by Toucan.

Codex can perform basic static analysis tasks, but the rate of failure is too high to be useful and too difficult to characterize. This capability to perform successful analysis, even on short program fragments, is very impressive and should not be discounted! For languages that Codex understands but for which no suitable tooling exists, this capability could be extremely valuable—after all, some analysis could be much better than nothing. But the benchmark for Solidity is not nothing; we already have existing static analysis tooling that works very well.

How we framed our framework

During Toucan’s development, we created a custom prompting framework, a web-based front end, and rudimentary debugging and testing tools to evaluate prompts and to aid in unit and integration tests. The most important of these was the prompting framework.

Prompting framework

If we were making Toucan today, we’d probably just use LangChain. But at the time, LangChain did not have the features we needed. Frustratingly, neither OpenAI nor Microsoft offered an official, first-party prompting framework. This led us to develop a custom framework, with the goal that it should be possible for auditors to create new prompts without ever modifying Toucan’s code.

requires = ["emit-ownership-doc", "emit-target-contract",]
name = "Contract Ownership"
scope = "contract"
instantiation_condition = "any('admin' in s.name.lower() or 'owner' in s.name.lower() for s in contract.state_variables)"

[[questions]]
name = "can-change"
query = "Is it possible to change the `{{ contract | owner_variable }}` variable by calling a function in the `{{ contract.name }}` contract without aborting the transaction? Think through it step by step, and answer as 'Yes', 'No', or 'Unknown'. If 'Yes', please specify the function."
is_decision = true

[[questions]]
name = "who-can-call"
runtime_condition = "questions['can-change'].is_affirmative()"
query = """To reason about ownership:
1) First, carefully consider the code of the function
2) Second, reason step by step about the question.
Who can call the function successfully, that is, without aborting or reverting the transaction?"""
answer_start = """1) First, carefully consider the code of the function:"""

[[questions]]
name = "can-non-owner-call"
runtime_condition = "questions['can-change'].is_affirmative()"
query = "Can any sender who is not the current owner call the function without reverting or aborting?"
is_decision = true
finding_condition = "question.is_affirmative()"

Figure 2: Sample question chain asking about contract ownership. Before questions are emitted, the prompting framework also emits a specific explanation of what ownership means, with examples and information about the target contract.

Our framework could chain multiple questions together to enable Chain of Thought and similar prompting techniques (Figure 2). Since GPT models like Codex are multi-shot learners, our framework also supported adding background information and examples before forming a prompt.

The framework also supported filtering on a per-question basis, as some questions are relevant only to specific kinds of contracts (say, only ERC-20 tokens), and other questions may have a specific scope (e.g., a contract, function, or file scope). Finally, each question could optionally be routed to a different model.

The prompting framework also went to great lengths to abide by OpenAI's API limitations, including batching questions into one API invocation and keeping track of both the token count and API invocation rate limits. We hit these limits often and were very thankful that the Codex model was free while in beta.
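For illustration, here is a heavily simplified Python sketch of the kind of question chain the framework drives. The complete() helper is a placeholder for whatever completion API is in use, and the field names loosely mirror the TOML above rather than Toucan's actual implementation:

import time

def complete(prompt):
    """Placeholder for an LLM completion call; swap in a real API client here."""
    raise NotImplementedError

def is_affirmative(answer):
    return answer.strip().lower().startswith("yes")

questions = [
    {"name": "can-change",
     "query": "Is it possible to change the owner variable? Answer Yes, No, or Unknown."},
    {"name": "can-non-owner-call",
     "runtime_condition": lambda answers: is_affirmative(answers["can-change"]),
     "query": "Can any sender who is not the current owner call the function without reverting?"},
]

def run_chain(context, questions):
    answers = {}
    for question in questions:
        condition = question.get("runtime_condition")
        if condition and not condition(answers):
            continue                 # skip questions whose preconditions were not met
        answers[question["name"]] = complete(context + "\n\n" + question["query"])
        time.sleep(1)                # crude spacing to stay under API rate limits
    return answers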

Test data

One of our development goals was that we would never compromise customer data by sending it to an OpenAI API endpoint. We had a strict policy of running Toucan only against open-source projects on GitHub (which would already have been indexed by Codex) that have published reports, like those on our Publications page.

We were also able to use the rather extensive test set that comes with Slither, as well as our Building Secure Contracts reference materials, as additional test data. It is important to note that some of these tests and reference materials may have been part of the Codex training set, which explains why we saw very good results on smaller test cases.

The missing tools

The lack of tooling from both OpenAI and Microsoft has been extremely disappointing, although that looks to be changing: Microsoft has a prompting library, and OpenAI recently released OpenAI Evals. The kinds of tools we’d have loved to see include a prompt debugger; a tree-graph visualization of tokens in prompts and responses with logprobs of each token; tools for testing prompts against massive data sets to evaluate quality; ways to ask the same question and combine results from counterexamples; and some plugins to common unit testing frameworks. Surely someone is thinking of the developers and making these tools?

Current programming languages lack the facilities for interfacing with neural architecture computers like LLMs or similar models. A core issue is the lack of capability to work with nondeterminism and uncertainty. When using LLMs, every answer has some built-in uncertainty: the outputs are inherently probabilistic, not discrete quantities. This uncertainty should be handled at the type system level so that one does not have to explicitly deal with probabilities until it is necessary. A pioneering project from Microsoft Research called Infer.NET does this for .NET-based languages, but there seem to be few concrete examples and no real tooling to combine this with LLMs.
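As a toy illustration of what such a facility might look like (a sketch of the idea, not a real library), imagine wrapping every model answer in a value that carries its confidence and forces callers to decide how much certainty they need:

from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass
class Uncertain(Generic[T]):
    """A value paired with the model's confidence in it (e.g., derived from token logprobs)."""
    value: T
    confidence: float

    def expect(self, threshold: float) -> T:
        # Only hand back the value when we are confident enough; otherwise fail loudly.
        if self.confidence < threshold:
            raise ValueError(f"confidence {self.confidence:.2f} is below {threshold}")
        return self.value

answer = Uncertain(value=True, confidence=0.62)   # e.g., "is this function reentrant?"
print(answer.expect(0.5))                         # True; raises if we demand more certainty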

Prompt engineering, and surrounding tooling, are still in their infancy. The biggest problem is that you never know when you are done: even now, it is always possible that we were just one or two prompts away from making Toucan a success. But at some point, you have to give up in the face of costs and schedules. With this in mind, the $300K salary for a fantastic prompt engineer does not seem absurd: if the only difference between a successful LLM deployment and a failure is a few prompts, the job quickly pays for itself. Fundamentally, though, this reflects a lack of tooling to assess prompt quality and evaluate responses.

There is no particularly good way to determine if one prompt is better than another or if you’re on the right track. Similarly, when a prompt fails against an input, it is frustratingly difficult to figure out why and to determine, programmatically, which prompts are merely returning the wrong result versus completely hallucinating and misbehaving.

Unit tests are also problematic; the results are not guaranteed to be the same across runs, and newer models may not provide the same results as prior ones. There is certainly a solution here, but again, the tooling developers expect just wasn’t present. OpenAI Evals is likely going to improve this situation.

Overall, the tooling ecosystem is lacking, and surprisingly, the biggest names in the field have not released anything substantial to improve the adoption and integration of LLMs into real software projects that people use. However, we are excited that the open source community is stepping up with really cool projects like LangChain and LlamaIndex.

Humans still reign supreme

OpenAI’s Codex is not yet ready to take over the job of software security auditors. It lacks the ability to reason about the proper concepts and produces too many false positives for practical usage in audit tasks. However, there is clearly a nascent capability to perform interesting analysis tasks, and underlying models should quickly get more capable. We are very excited to keep using the technology as it improves. For example, the new larger context window with GPT-4 may allow us to provide enough context and direction to handle complex tasks.

Even though Codex (and GPT-4) do not currently match mature algorithmic-based tools, LLM-based tools—even those of lower quality—may have interesting uses. For languages for which no analysis tooling exists, developers can bootstrap something from LLMs relatively quickly. The ability to provide some reasonable analysis where none previously existed may be considerably better than nothing at all.

We hope the ability to integrate language models into existing programs improves quickly, as there is currently a severe lack of languages, libraries, type systems, and other tooling for the integration of LLMs into traditional software. Disappointingly, the main organizations releasing LLMs have not released much tooling to enable their use. Thankfully, open-source projects are filling the gap. There is still enormous work to be done, and whoever can make a wonderful developer experience working with LLMs stands to capture developer mindshare.

LLM capability is rapidly improving, and if it continues, the next generation of LLMs may serve as capable assistants to security auditors. Before developing Toucan, we used Codex to take an internal blockchain assessment occasionally used in hiring. It didn’t pass—but if it were a candidate, we’d ask it to take some time to develop its skills and return in a few months. It did return—we had GPT-4 take the same assessment—and it still didn’t pass, although it did better. Perhaps the large context window version with proper prompting could pass our assessment. We’re very eager to find out!

Circomspect has more passes!

By Fredrik Dahlgren, Principal Security Engineer

TL;DR: We have released version 0.8.0 of Circomspect, our static analyzer and linter for Circom. Since our initial release of Circomspect in September 2022, we have added five new analysis passes, support for tags, tuples, and anonymous components, links to in-depth descriptions of each identified issue, and squashed a number of bugs. Please download the new version and tell us what you think!

New analysis passes

The new analysis passes, added to the tool’s initial nine, check for a range of issues that could occur in Circom code:

  1. Failure to properly constrain intermediate signals
  2. Failure to constrain output signals in instantiated templates
  3. Failure to constrain divisors in division operations to nonzero values
  4. Use of BN254-specific templates from Circomlib with a different curve
  5. Failure to properly constrain inputs to Circomlib’s LessThan circuit

Apart from finding the issue related to the Circomlib LessThan circuit discussed below, these analysis passes would also have caught the “million dollar ZK bug” recently identified by Veridise in the circom-pairing library.

To understand the types of issues that Circomspect can identify, let’s dig into the final example in this list. This analysis pass identifies an issue related to the LessThan circuit implemented by Circomlib, the de facto standard library for Circom. To fully understand the issue, we first need to take a step back and understand how signed values are represented by Circom.

Signed arithmetic in GF(p)

Circom programs operate on variables called signals, which represent elements in the finite field GF(p) of integers modulo a prime number p. It is common to identify the elements in GF(p) with the unsigned integers in the half-open interval [0, p). However, it is sometimes convenient to use field elements to represent signed quantities, in the same way that we may use the elements in [0, 2^32) to represent signed 32-bit integers. Mirroring the definition of two's complement used to represent signed integer values, we define val(x) as follows: val(x) = x if x is in [0, floor(p/2)], and val(x) = x - p if x is in (floor(p/2), p).

We then say that a is less than b in GF(p) if val(a) < val(b) as signed integers. This means that q = floor(p/2) is the greatest signed value representable in GF(p), and that -q, represented by the field element q + 1, is the least signed value representable in GF(p). It also means, perhaps somewhat surprisingly, that q + 1 is actually less than q under this ordering. This is also how the comparison operator < is implemented by the Circom compiler.

As usual, we say that a is positive if a > 0 and negative if a < 0. One way to ensure that a value a is nonnegative is to restrict the size (in bits) of the binary representation of a. In particular, if the size of a is strictly less than log(p) - 1 bits, then a must be less than or equal to q and, therefore, nonnegative.
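As a quick sanity check of these definitions, here is a small Python sketch using a toy prime for readability (Circom's actual prime behaves the same way):

def val(x, p):
    # Signed value represented by the field element x in GF(p).
    return x if x <= p // 2 else x - p

p = 97                # toy prime for illustration
q = p // 2            # greatest signed value representable in GF(p)

assert val(q, p) == q                 # q represents itself
assert val(q + 1, p) == -q            # q + 1 represents the least signed value, -q
assert val(q + 1, p) < val(q, p)      # so q + 1 is less than q under the signed ordering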

Circomlib’s ‘LessThan’ template

With this out of the way, let’s turn our attention to the LessThan template defined by Circomlib. This template can be used to constrain two input signals a and b to ensure that a < b, and is implemented as follows:

The LessThan template defined by Circomlib

Looking at the implementation, we see that it takes an input parameter n and two input signals in[0] and in[1], and it defines a single output signal out. Additionally, the template uses the Num2Bits template from Circomlib to constrain the output signal out.

The Num2Bits template from Circomlib takes a single parameter n and can be used to convert a field element to its n-bit binary representation, which is given by the array out of size n. Since the size of the binary representation is bounded by the parameter n, the input to Num2Bits is also implicitly constrained to n bits. In the implementation of LessThan above, the expression (1 << n) + in[0] - in[1] is passed as input to Num2Bits, which constrains the absolute value |in[0] - in[1]| to n bits.

To understand the subtleties of the implementation of the LessThan template, let’s first consider the expected use case when both inputs to LessThan are at most n bits, where n is small enough to ensure that both inputs are nonnegative.

We have two cases to consider. If in[0] < in[1], then in[0] - in[1] is a negative n-bit value, and (1 << n) + in[0] - in[1] is a positive n-bit value. It follows that bit n in the binary representation of the input to Num2Bits is 0, and thus out must be equal to 1 - 0 = 1.

On the other hand, if in[0] ≥ in[1], then in[0] - in[1] is a nonnegative n-bit value (since both inputs are positive), and (1 << n) + in[0] - in[1] is a positive (n + 1)-bit value with the most significant bit equal to 1. It follows that bit n in the binary representation of the input to Num2Bits is 1, and out must be given by 1 - 1 = 0.
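To make the two cases concrete, here is a small Python model of the field arithmetic that LessThan performs. It mirrors the circuit's computation but is not the circuit itself; in particular, the real template also constrains the Num2Bits input to n + 1 bits, which holds for the inputs used here:

def less_than(a, b, n, p):
    # Model of Circomlib's LessThan(n): out = 1 - bit n of ((1 << n) + in[0] - in[1]) mod p.
    x = ((1 << n) + a - b) % p
    return 1 - ((x >> n) & 1)

p = 21888242871839275222246405745257275088548364400416034343698204186575808495617  # Circom's default BN254 prime
n = 64

assert less_than(3, 5, n, p) == 1    # in[0] < in[1]: bit n of the Num2Bits input is 0, so out = 1
assert less_than(5, 3, n, p) == 0    # in[0] >= in[1]: bit n is 1, so out = 0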

This all makes sense and gives us some confidence if we want to use LessThan for range proofs in our own circuits. However, things become more complicated if we forget to constrain the size of the inputs passed to LessThan.

Using signals to represent unsigned quantities

To describe the first type of issue that may affect circuits defined using LessThan, consider the case in which signals are used to represent unsigned values like monetary amounts. Say that we want to allow users to withdraw funds from our system without revealing sensitive information, like the total balance belonging to a single user or the amounts withdrawn by users. We could use LessThan to implement the part of the circuit that validates the withdrawn amount against the total balance as follows:

The ValidateWithdrawal template should ensure that users cannot withdraw more than their total balance.

Now, suppose that a malicious user with a zero balance decides to withdraw p - 1 tokens from the system, where p is the size of the underlying prime field. Clearly, this should not be allowed since p - 1 is a ridiculously large number and, in any case, the user has no tokens available for withdrawal. However, looking at the implementation of LessThan, we see that in this case, the input to Num2Bits will be given by (1 << 64) + (p - 1) - (0 + 1) = (1 << 64) - 2 (as all arithmetic is done modulo p). It follows that bit 64 of the binary representation of the input will be 0, and the output from LessThan will be 1 - n2b.out[64] = 1 - 0 = 1. This also means that ValidateWithdrawal will identify the withdrawal as valid.

The problem here is that p - 1 also represents the signed quantity -1 in GF(p). Clearly, -1 is less than 1, and we have not constrained the withdrawn amount to be nonnegative. Adding a constraint restricting the size of the amount to less than log(p) - 1 bits would ensure that the amount is nonnegative, which would prevent this issue.
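Using the same model of LessThan as in the earlier sketch (repeated here so the snippet stands alone), we can watch the bogus withdrawal sail through the n = 64 check:

def less_than(a, b, n, p):
    # Same model of Circomlib's LessThan(n) as in the sketch above.
    x = ((1 << n) + a - b) % p
    return 1 - ((x >> n) & 1)

p = 21888242871839275222246405745257275088548364400416034343698204186575808495617
amount, total = p - 1, 0    # a "withdrawal" of p - 1 tokens against a zero balance

# ValidateWithdrawal-style check (amount < total + 1) with no size constraint on amount:
assert less_than(amount, total + 1, 64, p) == 1   # passes, because p - 1 also represents -1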

More generally, since the input parameter n to LessThan restricts only the size of the difference |in[0] - in[1]|, we typically cannot use LessThan to prevent overflows and underflows. This is a subtle point that many developers miss. As an example, consider the section on arithmetic overflows and underflows from the zero-knowledge (ZK) bug tracker maintained by 0xPARC. In an earlier version, 0xPARC suggested using LessThan to constrain the relevant signals in an example that was almost identical to the vulnerable ValidateWithdrawal template defined above!

Another vulnerability of this type was found by Daira Hopwood in an early version of the ZK space-conquest game Dark Forest. Here, the vulnerability allowed users to colonize planets far outside the playing field. The developers addressed the issue by adding a range proof based on the Num2Bits template that restricted the size of the coordinates to 31 bits.

Using signals to represent signed quantities

Now, suppose signals are used to represent signed quantities. In particular, let’s consider what would happen if we passed q = floor(p/2) and q + 1 as inputs to LessThan. We will show that even though q + 1 < q according to the Circom compiler, q is actually less than q + 1 according to LessThan. Returning to the input to Num2Bits in the definition of LessThan, we see that if in[0] = q and in[1] = q + 1, the input to Num2Bits is given by the following expression:

(1 << n) + in[0] - in[1] = (1 << n) + q - (q + 1) = (1 << n) - 1

It follows that the nth bit in the binary representation of this value is 0, and the output from LessThan is 1 - n2b.out[n] = 1 - 0 = 1. Thus, q < q + 1 according to LessThan, even though q + 1 < q according to the compiler!

It is worth reiterating here that the input parameter n to LessThan does not restrict the size of the inputs, only the absolute value of their difference. Thus, we are free to pass very large (or very small) inputs to LessThan. Again, this issue can be prevented if the size of each input to the LessThan template is restricted to less than log(p) - 1 bits.

Circomspect to the rescue (part 1)

To find issues of this type, Circomspect identifies locations where LessThan is used. It then tries to see whether the inputs to LessThan are constrained to less than log(p) - 1 bits using the Num2Bits template from Circomlib, and it emits a warning if it finds no such constraints. This allows the developer (or reviewer) to quickly identify locations in the codebase that require further review.

The output from the latest version of Circomspect running on the ValidateWithdrawal template defined above.

As shown in the screenshot above, each warning from Circomspect will now typically also contain a link to a description of the potential issue, and recommendations for how to resolve it.

Circomspect to the rescue (part 2)

We would also like to mention another of our new analysis passes. The latest version of Circomspect identifies locations where a template is instantiated but the output signals defined by the template are not constrained.

As an example, consider the ValidateWithdrawal template introduced above. Suppose that we rewrite the template to include range proofs for the input signals amount and total. However, during the rewrite we accidentally forget to include a constraint ensuring that the output from LessThan is 1. This means that users may be able to withdraw amounts that are greater than their total balance, which is obviously a serious vulnerability!

We rewrite ValidateWithdrawal to include range proofs for the two inputs amount and total, but accidentally forget to constrain the output signal lt.out to be 1!

There are examples (like Num2Bits) in which a template constrains its inputs and no further constraints on the outputs are required. However, forgetting to constrain the output from a template generally indicates a mistake and requires further review to determine whether it constitutes a vulnerability. Circomspect will flag locations where output signals are not constrained to ensure that each location can be manually checked for correctness.

Circomspect now generates a warning if it finds that the output signals defined by a template are not properly constrained.

Let’s talk!

We at Trail of Bits are excited about contributing to the growing range of tools and libraries for ZK that have emerged in the last few years. If you are building a project using ZK, we would love to talk to you to see if we can help in any way. If you are interested in having your code reviewed by us, or if you’re looking to outsource parts of the development to a trusted third party, please get in touch with our team of experienced cryptographers.

We need a new way to measure AI security

Tl;dr: Trail of Bits has launched a practice focused on machine learning and artificial intelligence, bringing together safety and security methodologies to create a new risk assessment and assurance program. This program evaluates potential bespoke risks and determines the necessary safety and security measures for AI-based systems.

If you’ve read any news over the past six months, you’re aware of the unbridled enthusiasm for artificial intelligence. The public has flocked to tools built on systems like GPT-3 and Stable Diffusion, captivated by how they alter our capacity to create and interact with each other. While these systems have amassed headlines, they constitute a small fraction of the AI-based systems currently in use, powering technology that influences outcomes in areas such as finance, healthcare, and transportation. People are also attempting to shoehorn models like GPT-3 into their own applications, even though these models may introduce unintended risks or may simply not be adequate for the results they want. Those risks will compound as the industry moves to multimodal models.

With people in many fields trying to hop on the AI bandwagon, we are confronting the same security and safety issues that have plagued previous waves of innovation over the last 50 years: proper risk identification and quantification, responsible and coordinated vulnerability disclosure, and safe deployment strategies. In the rush to embrace AI, the public is at a loss as to the full scope of its impact and whether these systems are truly safe. Furthermore, the work seeking to map, measure, and mitigate these newfound risks has fallen short, due to the limitations and nuances that come with applying traditional measures to AI-based systems.

The new ML/AI assurance practice at Trail of Bits aims to address these issues. With our forthcoming work, we not only want to ensure that AI systems have been accurately evaluated for potential risk and safety concerns, but we also want to establish a framework that auditors, developers, and other stakeholders can use to better assess potential risks and required safety mitigations for AI-based systems. Further work will build evaluation benchmarks, particularly focused on cybersecurity, for future machine-learning models. We will approach the AI ecosystem with the same rigor that we are known to apply to other technological areas, and we hope these services will transform the way practitioners in this field work on a daily basis.

In a paper released by Heidy Khlaaf, our engineering director of ML/AI assurance, we propose a novel, end-to-end AI risk framework that incorporates the concept of an Operational Design Domain (ODD), which can better outline the hazards and harms a system can potentially cause. ODDs have been used in the autonomous vehicle space, but we want to take them further: by defining a framework that can be applied to all AI-based systems, we can better assess potential risks and required safety mitigations, no matter the application.

We also discuss in the paper:

  • When “safety” doesn’t mean safety: The AI community has conflated “requirements engineering” with “safety measures,” but the two are not the same thing. In fact, they are often contradictory!
  • The need for new measures: Risk assessment practices borrowed from other fields (e.g., hardware safety) don’t translate well to AI. More needs to be done to uncover the design issues that directly lead to systematic failures.
  • When “safety” doesn’t mean “security”: The two terms are not interchangeable, and need to be assessed differently when applied to AI and ML systems.
  • It hasn’t been all bad: The absence of well-defined operational boundaries for general AI and ML models has made it difficult to accurately assess the associated risks and safety, given the vast number of applications and potential hazards. We discuss what models can be adapted, specifically those that can ensure security and reliability.

The AI community, and the general public, will suffer the same or worse consequences we’ve seen in the past if we cannot safeguard the systems the world is rushing to adopt. To do so, it’s essential to get on the same page when it comes to terminology and techniques for safety objectives and risk assessments. However, we don’t need to reinvent the wheel: applicable techniques already exist; they just need to be adapted to the AI and machine-learning space. With both this paper and our practice’s forthcoming work, we hope to bring clarity and cohesion to AI assurance and safety, and to counter the marketing hype and exaggerated commercial messaging that deemphasizes the security of this burgeoning technology.

This approach builds on our previous machine-learning work, and is just the beginning of our efforts in this domain. Any organizations interested in working with this team can contact Trail of Bits to inquire about future projects.