Trail of Bits Named in Forrester Wave as a Leader in Midsize Cybersecurity Consulting Services

Trail of Bits was among the select companies that Forrester invited to participate in its recent report, The Forrester Wave™: Midsize Cybersecurity Consulting Services, Q2 2019. In this evaluation, Trail of Bits was cited as a Leader. We received the highest score among all participants in the current offering category, among the highest scores in the strategy category, and the highest scores possible in sixteen of the evaluation’s criteria.

You can download the full report here!

What is the Forrester Wave™?

Forrester is a leading global research and advisory firm that offers a variety of market reports, research, analysis, and consulting services. The Forrester Wave is a specific Forrester evaluation that provides “a guide for buyers considering their purchasing options in a technology marketplace.”

When Forrester reached out to us to participate in their study, we jumped at the chance. We hold their publication in high regard. In our view, the Wave is:

  • A trusted source of truth for top companies – Forrester reports are the gold standard for citations on market data.
  • Not a paid publication – If Forrester finds a weakness in a participant’s company, it won’t gloss over it.
  • Built on thoughtful criteria – It’s hard to fully comprehend what’s important in cybersecurity consulting, especially for someone relatively new to our niche industry. Assessing efficacy is tough even if you know what you’re looking for. Forrester overcomes this by getting feedback on its questions and ranking criteria from participants.

What happened?

Forrester reached out to us and our competitors for an introductory call on how to prepare as participants. We were given the option to opt out of participating but were told we would be included in the report regardless. Participants then provided feedback on the criteria that the Forrester Wave would use to assess our competencies. Once the criteria were finalized, Forrester gathered data from us and from some of our clients on how we performed against those criteria. When the report was complete, Forrester let all participants fact-check it. Needless to say, we were pleased to see that we’d received the highest score in the Current Offering category, and among the highest scores in the Strategy category.

The results

In addition to our ranking, Forrester made this analysis of our strengths and weaknesses:

Trail of Bits’ innovation efforts set its technical services apart. Its unique services include binary analysis, blockchain security, cryptography, and software development — many of which rely on tools that Trail of Bits developed in-house. Trail of Bits is also actively involved in building the cybersecurity community, especially emerging talent. It hosts a variety of office hours and local meetups and develops free educational resources for all levels of cybersecurity professionals. Reference customers highlighted Trail of Bits’ thought leadership, deep expertise in fields like blockchain security and cryptography, and thoroughness as strengths. However, those high standards come with some drawbacks as customers also noted limited resources and price as concerns. Trail of Bits’ deliverables are quite technical, and clients needed to do some extra translation of those deliverables for nonsecurity[sic] executives. For clients seeking a high level of technical cybersecurity proficiency from its services firm, Trail of Bits is a top-tier choice.

Our reflections

We’re celebrating the good

Trail of Bits’ innovation efforts set its technical services apart.

Our Take: Indeed! Our investments in R&D, our focus on cutting-edge open-source tooling, and our preference for tackling tough problems help us hone our advanced techniques and innovative approach.

Trail of Bits is also actively involved in building the cybersecurity community, especially emerging talent.

Our Take: We’re happy for this work to shine through! We’re passionate about empowering emerging talent with the information and skills necessary to break into our industry and push science forward with us. Open-sourcing our proprietary tools, sharing knowledge through online resources and our Empire Hacking meetup, and sponsoring emerging research are all core to our mission.

Reference customers highlighted Trail of Bits’ thought leadership, deep expertise in fields like blockchain security and cryptography, and thoroughness as strengths.

Our Take: We are intentional about focusing deeply on our niche skillset because it prepares us for solving our clients’ most challenging problems. We produce and use publicly available tools in our assessments, resulting in repeatable, verifiable results that other firms can’t offer.

We’re finding more ways to improve

Even as a Leader, we see opportunities for improvement in this report. Growing at a pace that meets increasing market demand for our services is a challenge. We prefer to hire well rather than hire quickly. We know that the price point of our niche services puts our paid expertise out of reach for some smaller and under-resourced companies. We will address that by continuing to offer our knowledge on our blog, our open-source tools on GitHub, and our community resources like Empire Hacking and the NYC-Infosec directory. Finally, we are committed to translating summaries of our highly technical work for clients’ non-security executives. You can check out how we’re doing that in our public reports and presentations.

Overall, we’re honored to be included in this year’s report, encouraged by its findings, and excited to share the results.

Could you use a Forrester Wave Leader’s advice on a cybersecurity problem you’re facing? Contact us.

On LibraBFT’s use of broadcasts

LibraBFT is the Byzantine Fault Tolerant (BFT) consensus algorithm used by the recently released Libra cryptocurrency. LibraBFT is based on another BFT consensus algorithm called HotStuff. While some have noted the similarities between the two algorithms, they differ in some crucial respects. In this post we highlight one such difference: in LibraBFT, non-leaders perform broadcasts. This has implications for liveness analysis, communication complexity, and leader election. We delve into each of these topics following a brief review of what BFT consensus algorithms are and why they are useful.

A brief history of BFT consensus algorithms

The term Byzantine Fault Tolerant (BFT) originates from a paper by Lamport, Shostak, and Pease. In that paper, the authors consider a fictional scenario in which a group of Byzantine generals are considering whether to attack a city. The problem is that some of the generals are traitors. The question is, if the traitors spread misinformation, then under what conditions can the group agree to attack or retreat from the city? Note that neither attacking nor retreating is considered more favorable than the other: the problem is simply to agree on one action!

This scenario is a metaphor for one that commonly arises in distributed systems: the generals are nodes in a network, the traitors are faulty nodes, and attacking the city is, say, committing a transaction to a database.

The Lamport et al. paper proposes a solution to this problem. However, the solution implicitly assumes a synchronous network, which means that messages between nodes are delivered within a fixed, known time bound. Compare this to an asynchronous network, where messages can be arbitrarily delayed and even reordered.

Dwork, Lynch, and Stockmeyer were the first to propose deterministic algorithms that solve the Byzantine generals problem for “partially synchronous” networks. (Earlier, Rabin and Ben-Or had proposed randomized algorithms.) Dwork et al.’s analysis of what it means for a network to be “partially synchronous” was a significant contribution to the study of BFT algorithms. However, the purpose of Dwork et al.’s algorithms appears to have been to establish the existence of such solutions. Thus, their algorithms are of more theoretical than practical interest.

Toward the end of the century, Castro and Liskov proposed a solution on which many contemporary algorithms have been based (e.g., Tendermint, described below). Castro and Liskov’s algorithm, called Practical Byzantine Fault Tolerance (PBFT), works as follows. Each round has a designated leader chosen from among the nodes, and each round is composed of three phases. Roughly, in the first phase, the leader proposes a command for all nodes to execute; in the second phase, the nodes vote on the command; and in the third phase, the nodes acknowledge receipt of each other’s votes and execute the command. When a node believes that a round has gone on for too long, it sends a timeout message to all other nodes. Nodes must agree that a round should time out, and the process for doing so is somewhat involved. PBFT, as a whole, tolerates up to ⌊(n-1)/3⌋ faulty nodes out of n, which is the best that one can do.

In order to circumvent a classic impossibility result of Fischer, Lynch, and Paterson, PBFT prioritizes safety over liveness. This means that, say, a leader could propose a command and that command could fail to be executed, e.g., because of network problems. However, the algorithm is safe in the sense that if two non-faulty nodes execute some sequence of commands, then one is necessarily a prefix of the other.

It is interesting to note how PBFT uses broadcasts. During the first phase of each round, the leader broadcasts to all nodes. But, during the second and third phases, all nodes (not just the leader) broadcast to one another.

Figure 1 from the PBFT paper, which calls the three phases of each round “pre-prepare,” “prepare,” and “commit.” The numbers represent network nodes and the arrows represent messages. Note how non-leader nodes 1 and 2 broadcast to all other nodes during the second and third phases.

Many variants of PBFT have been proposed. One notable example is Tendermint. Like PBFT, Tendermint proceeds in rounds, each round is divided into three phases, and the use of broadcasts in each phase is similar. However, whereas timeouts in PBFT occur with respect to entire rounds, timeouts in Tendermint occur with respect to individual phases, which many regard as a simpler strategy. (See here for additional discussion.) Also, Tendermint has a well-tested implementation that is actively developed. For this reason, if none other, Tendermint has seen widespread use.

HotStuff may also be regarded as a variant of PBFT. The HotStuff algorithm is “pipelined” so that timing out in a round is essentially no different than timing out in a phase. But perhaps HotStuff’s most notable improvement over PBFT is its reduced “authenticator complexity.” As defined in the HotStuff paper, “an authenticator is either a partial signature or a signature” (page 4). The paper argues that “Authenticator complexity is a useful measure of communication complexity… it hides unnecessary details about the transmission topology… n messages carrying one authenticator count the same as one message carrying n authenticators.”

HotStuff achieves reduced authenticator complexity, in part, by using threshold signatures. This allows non-leaders to send their messages to only the leader and not to one another. For example, when voting on a command, nodes in HotStuff send their “share” of a signature to the leader. Once the leader has accumulated 2⌊(n-1)/3⌋ + 1 such shares, it broadcasts the combined signature to all other nodes. The other phases of the HotStuff algorithm are similar. In this way, HotStuff achieves linear authenticator complexity.
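To make the quorum arithmetic concrete, here is a minimal Python sketch of a leader accumulating vote shares until it reaches the 2⌊(n-1)/3⌋ + 1 threshold. It is an illustration only: the “shares” are opaque stand-ins, not real partial signatures, and nothing here is HotStuff’s actual implementation.

```python
# Minimal sketch of quorum collection; the vote "shares" are opaque byte
# strings standing in for partial signatures, not a real threshold scheme.

def max_faulty(n: int) -> int:
    """Largest number of Byzantine nodes tolerated among n nodes."""
    return (n - 1) // 3

def quorum_size(n: int) -> int:
    """Number of vote shares the leader must accumulate: 2f + 1."""
    return 2 * max_faulty(n) + 1

class Leader:
    def __init__(self, n: int):
        self.n = n
        self.shares = {}  # node id -> simulated partial signature

    def on_vote(self, node_id: int, share: bytes) -> bool:
        """Record a share; return True once a quorum has been reached."""
        self.shares[node_id] = share
        return len(self.shares) >= quorum_size(self.n)

# Example: with n = 4 nodes, f = 1, so a quorum is 3 shares.
leader = Leader(n=4)
assert not leader.on_vote(1, b"share-1")
assert not leader.on_vote(2, b"share-2")
assert leader.on_vote(3, b"share-3")
```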

LibraBFT

LibraBFT is the consensus algorithm used by the Libra cryptocurrency. LibraBFT is based on HotStuff. According to the Libra authors, there were “three reasons for selecting the HotStuff protocol as the basis for LibraBFT: (1) simplicity and modularity of the safety argument; (2) ability to easily integrate consensus with execution; and (3) promising performance in early experiments.”

LibraBFT and HotStuff are very similar. For example, LibraBFT retains HotStuff’s pipelined design. Furthermore, when voting on a command (or “block” to use LibraBFT’s terminology), nodes send their votes to only the leader and not to all other nodes.

Figure 3 from the LibraBFT paper. The letters B, V, and C are mnemonic for “block,” “vote,” and “[quorum] certificate,” respectively. Note how nodes send their votes to only the leader (like in HotStuff) and do not broadcast them to all other nodes (like in PBFT). On the other hand, note the caption, which indicates that “round synchronization” is not reflected in the figure. Round synchronization does involve broadcasts.

However, to achieve certain goals (explained below), LibraBFT uses broadcasts in ways that HotStuff does not. (In this way, LibraBFT resembles its distant ancestor, PBFT.) Specifically, LibraBFT requires non-leader nodes to perform broadcasts under the following circumstances:

  1. Nodes regularly synchronize their states using broadcasts.
  2. When a timeout is reached, e.g., because of network problems or a faulty leader, nodes send a timeout message to all other nodes.

The use of broadcasts for state synchronization makes it easier to establish a liveness result for LibraBFT. The reason for broadcasting timeouts is not given in the LibraBFT paper. However, as we explain below, we suspect it is to allow for the future use of an adaptive leader election mechanism. Despite these advantages, the additional broadcasts have the unfortunate side-effect of increasing the algorithm’s communication complexity. However, such an increase in communication complexity may be unavoidable. (Neither the state synchronization broadcasts nor the timeout broadcasts should affect the algorithm’s safety.)

Note: A recent blog post entitled “Libra: The Path Forward” makes clear that LibraBFT is a work-in-progress. Perhaps for this reason, there are discrepancies between LibraBFT as defined in the LibraBFT paper and as implemented in the Libra source code. For the remainder of this blog post, we focus on the former. We point out a couple of the differences between the LibraBFT specification and the Libra source code below. (Also note that, while the LibraBFT paper does contain code, that code is not part of Libra itself, but of a simulator. The LibraBFT authors “intend to share the code for this simulator and provide experimental results in a subsequent version of the report” (page 4). That version of the report has not yet been released.)

Liveness analysis

In LibraBFT, “nodes broadcast their states at least once per period of time [the minimum broadcast interval] 𝐼 > 0” (page 22). No similar notion exists in HotStuff. Unsurprisingly, this makes it easier to establish a liveness result for LibraBFT. However, liveness analysis of the two algorithms differ in other ways, as we now explain.

HotStuff’s liveness theorem asserts that, under certain technical conditions including that the round leader is non-faulty, there exists a time bound within which all non-faulty nodes execute a command and move on to the next round. The liveness theorem is parameterized by two functions: Leader and NextView. Leader is a function of the round number and determines the leader of that round. The arguments of NextView are not given specifically, but include at least the round number. NextView determines when round timeout “interrupts” are generated.

HotStuff’s liveness theorem

LibraBFT’s liveness theorem has a similar form. However, there are two significant differences. First, LibraBFT’s analogs of the Leader and NextView functions are not left as parameters, but are given explicitly. Second, the time bound within which a command is executed is not merely asserted to exist, but is also given explicitly.


LibraBFT’s liveness theorem

Of note is the fact that LibraBFT’s explicit time bound features 𝐼, the minimum broadcast interval, mentioned above. The appearance of 𝐼 demonstrates that LibraBFT’s liveness analysis differs fundamentally from that of HotStuff, and is not merely a byproduct of explicitly given Leader and NextView mechanisms.

We would also point out that this is a place where the LibraBFT specification and the Libra source code do not exactly match. In LibraBFT, nodes broadcast their states to one another using DataSyncNotification messages (defined in Appendix A.3 of the LibraBFT paper). Nothing analogous to a DataSyncNotification message seems to exist within the Libra source code. For example, if one looks at the start_event_processing function from Libra’s chained_bft module, one can see calls to functions that process proposals, votes, etc. However, there is no call that would seem to process something resembling a DataSyncNotification message.

fn start_event_processing(
    &self,
    event_processor: ConcurrentEventProcessor,
    executor: TaskExecutor,
    ...
) {
    executor.spawn(Self::process_new_round_events(...)...);
    executor.spawn(Self::process_proposals(...)...);
    executor.spawn(Self::process_winning_proposals(...)...);
    executor.spawn(Self::process_block_retrievals(...)...);
    executor.spawn(Self::process_chunk_retrievals(...)...);
    executor.spawn(Self::process_votes(...)...);
    executor.spawn(Self::process_new_round_msg(...)...);
    executor.spawn(Self::process_outgoing_pacemaker_timeouts(...)...);
}

An excerpt of the “start_event_processing” function from Libra’s “chained_bft” module. Note that none of the calls seem to process something resembling a LibraBFT DataSyncNotification message.

The bottom line is that the LibraBFT liveness analysis does not directly apply to the Libra source code in its present form. However, as mentioned above, LibraBFT is a work-in-progress, so the existence of some discrepancies between the LibraBFT specification and the Libra source code is not surprising.

Communication complexity

As mentioned above in the description of HotStuff, HotStuff achieves linear authenticator complexity, where “an authenticator is either a partial signature or a signature” sent in a network message in a single round. What is LibraBFT’s authenticator complexity? The answer is less clear, and the paper is somewhat ambiguous on this point.

The LibraBFT paper does not mention “authenticator complexity” specifically, though it does claim that “LibraBFT has two desirable properties that BFT consensus protocols preceding HotStuff were not able to simultaneously support — linearity and responsiveness. … Informally, linearity guarantees that driving transaction commits incurs only linear communication…” (page 2). The claim is not addressed later in the paper. As we argue, LibraBFT’s authenticator complexity is at least O(n²), where n is the number of nodes. So, either the authors were not referring to authenticator complexity, or the claim is some sort of oversight.

When a LibraBFT node believes that a round has gone on for too long, it broadcasts a timeout message to all other nodes. These timeout messages are signed. Therefore, LibraBFT’s authenticator complexity is at least O(n²).

How does LibraBFT’s use of broadcasts in state synchronization affect authenticator complexity?

State synchronization is parameterized by the minimum broadcast interval, 𝐼, mentioned above in the discussion of liveness analysis. The LibraBFT paper states that “nodes broadcast their states at least once per period of time 𝐼 > 0” (page 22). Taking this into account, the authenticator complexity of LibraBFT should be not just a function of n, but also of 𝐼.

However, the paper’s Appendix A.3 states “we have assumed that [data-synchronization messages] are transmitted over authenticated channels and omitted message signatures” (page 36). One could use this fact to argue that data synchronization does not affect authenticator complexity (though, that would feel a lot like cheating).

So, LibraBFT’s authenticator complexity is at least O(n²) due to its use of timeout broadcasts. It could be more, depending upon one’s assumptions about the network.

To be clear, this is a worst case analysis. In the common case, timeouts will not occur. Thus, if 𝐼 is large, say, many multiples of the typical round duration, then LibraBFT’s performance will be close to linear.

One final point is worth mentioning. HotStuff uses a predictable leader election strategy. As explained in the next section, the Libra authors are trying to avoid using such a strategy. To the best of our knowledge, no sub-quadratic adaptive leader election strategy currently exists. Thus, in comparing the communication complexity of HotStuff and LibraBFT, some increase may be unavoidable.

Leader election

At the time of this writing, the Libra source code’s leader election strategy is round-robin. However, the LibraBFT authors note that this strategy makes round leaders predictable, which facilitates denial-of-service attacks. In considering alternatives, the authors note that depending on hashes in a naive way could facilitate “grinding attacks,” e.g., where an attacker influences the hash in such a way as to increase the likelihood of a particular node being selected as leader.

To address the former problem and circumvent the latter, the authors state, “we intend to use a verifiable random function (VRF) in the future.” (In fact, a VRF is already implemented in the Libra source code, though it does not yet seem to be used.) In the next two paragraphs, we explain what VRFs are. Then, following a brief explanation of what it means for a LibraBFT block to be “committed,” we explain how LibraBFT’s proposed use of VRFs in leader elections is enabled by its use of broadcasts.

Intuitively, a VRF is a function that, given some input, produces two outputs: a random-looking value, and a proof that the value was derived from the input. The function is expected to have two properties. First, it should not be obvious how the random-looking value was derived from the input. Second, it should be possible to convince an observer that the random-looking value was derived from the input, even though that fact is not obvious. This latter property is made possible via the proof.

A simple example of a VRF is given in Section 4 of Goldberg, Reyzin, Papadopoulos, and Vcelak’s draft RFC. In that example, the random-looking value is an RSA signature computed over the hash of the input, and the proof is the hash. An observer can verify the proof by checking that the input has the hash, and that the signature corresponds to the hash. (Note: for production code, we recommend using the elliptic-curve-based VRF of Section 5 of the RFC, and not the RSA-based VRF of Section 4.)

We now briefly explain what it means for a LibraBFT block to be “committed.” In LibraBFT, blocks and quorum certificates alternate to form a chain:

B0 ← C0 ← B1 ← C1 ← B2 ← C2 ← ⋯

These blocks (certificates) need not be proposed (assembled) in contiguous rounds; there could be arbitrarily large gaps between them. A block is said to be “committed” when: (1) it appears below two other blocks, (2) all three blocks have associated quorum certificates, and (3) the blocks were proposed in contiguous rounds.

The reason for this rule is a bit technical. But, intuitively, a “committed” block is considered permanent by all non-faulty nodes. In contrast, a block that appears below fewer than two certified blocks is considered impermanent, and may have to be abandoned in favor of another block.
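To make the commit rule concrete, here is a minimal Python sketch of the three-chain check described above. The data types are simplifications invented for this example, not Libra’s actual structures.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CertifiedBlock:
    """A block together with its quorum certificate (simplified)."""
    block_id: str
    round: int  # the round in which the block was proposed

def newly_committed(chain: List[CertifiedBlock]) -> Optional[CertifiedBlock]:
    """Return the block committed by the last three certified blocks, if any.

    A block is committed when it is followed by two more certified blocks
    and all three were proposed in contiguous rounds.
    """
    if len(chain) < 3:
        return None
    b0, b1, b2 = chain[-3], chain[-2], chain[-1]
    if b1.round == b0.round + 1 and b2.round == b1.round + 1:
        return b0
    return None

# Rounds 5, 6, 7 are contiguous, so the block proposed in round 5 commits.
chain = [CertifiedBlock("B_m-2", 5), CertifiedBlock("B_m-1", 6), CertifiedBlock("B_m", 7)]
assert newly_committed(chain).block_id == "B_m-2"

# A round gap (round 8 was skipped) means nothing new commits.
chain.append(CertifiedBlock("B_m+1", 9))
assert newly_committed(chain) is None
```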

The LibraBFT authors intend to use VRFs in leader elections as follows. Each node will have its own instance of some VRF (e.g., a VRF instantiated with a private key belonging to the node). The leader for round i will be determined using the VRF instance of the round leader for the most recently committed block. That VRF instance will determine a seed for a pseudo-random function (PRF). The resulting PRF instance will then determine the leader for round i and each subsequent round until a new block becomes committed. The reason for using two functions like this is (we believe) so that round leaders can be determined even if the proposer of the most recently committed block goes offline.
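Here is a minimal sketch of how such a two-stage mechanism could look, based on our reading of the paper. The HMAC-based “VRF” and “PRF” below are stand-ins for the real primitives (a real VRF also produces a verifiable proof), and none of this is Libra’s implementation.

```python
import hashlib
import hmac

def vrf_output(node_secret: bytes, committed_block_id: bytes) -> bytes:
    """Stand-in for the committed block proposer's VRF: derives a PRF seed."""
    return hmac.new(node_secret, committed_block_id, hashlib.sha256).digest()

def leader_for_round(seed: bytes, round_number: int, nodes: list) -> str:
    """PRF keyed by the seed; maps each round to a leader until a new
    block commits and a new seed takes over."""
    digest = hmac.new(seed, round_number.to_bytes(8, "big"), hashlib.sha256).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]

nodes = ["alice", "bob", "carol", "dave"]
# Seed derived from the proposer of the most recently committed block.
seed = vrf_output(node_secret=b"proposer-key", committed_block_id=b"B_m-2")
# The same seed determines leaders for every round until the next commit.
print([leader_for_round(seed, r, nodes) for r in range(10, 14)])
```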

We now explain how LibraBFT’s proposed use of VRFs in leader elections is enabled by its use of broadcasts. Or, more specifically, we explain why LibraBFT’s proposed use of VRFs would not work if a node could send a timeout message to just one other node, as is the case in HotStuff. The purpose of a timeout message is to tell the next round leader to start the next round. But, as we show in the next several paragraphs, who that next leader is can depend upon the cause of the timeout.

Suppose that in contiguous rounds i-2, i-1, and i, blocks Bm-2, Bm-1, and Bm are proposed. Further suppose that the leader for round i assembles a quorum certificate Cm and broadcasts it to all other nodes. Thus, at the start of round i+1, the most recently committed block is Bm-2:

B0 ← C0 ← ⋯ ← Bm-2 ← Cm-2 ← Bm-1 ← Cm-1 ← Bm ← Cm

But suppose that the leader for round i does not receive a block proposal for round i+1 within a reasonable timeframe. If the leader for round i could send a timeout message to just one other node, to whom should she send it? (Note that the leader for round i plays no special role in detecting or announcing the timeout of round i+1.)

Consider the following two explanations for why the leader for round i does not receive a block proposal for round i+1.

  • Case 1: The leader for round i+1 is faulty and never proposed a block. In this case, the most recently committed block at round i+2 is still Bm-2. Thus, the leader for round i+2 should be determined by the same PRF instance that determined the leader for round i+1. If the leader for round i could send a timeout message to just one other node, she should send it to the leader determined by this existing PRF instance.
  • Case 2: There is a network delay. The leader for round i+1 proposed a block and assembled a quorum certificate, but the leader for round i did not receive them in time. In this case, the most recently committed block at round i+2 is Bm-1. Thus, the leader for round i+2 should be determined by a new PRF instance, one seeded by the VRF instance belonging to the proposer of Bm-1. If the leader for round i could send a timeout message to just one other node, she should send it to the leader determined by this new PRF instance.

This ambiguity is resolved by having the leader for round i send its timeout message to all other nodes. We would not be surprised if the LibraBFT authors introduced timeout broadcasts to address this type of scenario specifically.

Finally, note that a similar problem does not exist in HotStuff. In HotStuff, the round leader is determined by “some deterministic mapping from view number to a replica, eventually rotating through all replicas,” and does not depend on the most recently committed block. On the other hand, the predictability of round leaders makes HotStuff susceptible to denial-of-service attacks, which the LibraBFT authors are trying to avoid.

LibraBFT and HotStuff are distinct algorithms

LibraBFT and HotStuff are very similar, but the two algorithms differ in some crucial respects. In LibraBFT, non-leaders perform broadcasts. The use of broadcasts in state synchronization makes it easier to establish a liveness result for LibraBFT. We suspect that timeout broadcasts were introduced to allow for the future use of an adaptive leader election mechanism, e.g., one based on VRFs. Despite these advantages, the additional broadcasts increase the algorithm’s communication complexity. However, this increase in communication complexity may be unavoidable.

We intend to keep a close eye on Libra. If you are developing a Libra application, please consider us for your security review.

Updated July 16, 2019.

Seriously, stop using RSA

Here at Trail of Bits we review a lot of code. From major open source projects to exciting new proprietary software, we’ve seen it all. But one common denominator in all of these systems is that for some inexplicable reason people still seem to think RSA is a good cryptosystem to use. Let me save you a bit of time and money and just say outright—if you come to us with a codebase that uses RSA, you will be paying for the hour of time required for us to explain why you should stop using it.

RSA is an intrinsically fragile cryptosystem containing countless foot-guns which the average software engineer cannot be expected to avoid. Weak parameters can be difficult, if not impossible, to check, and its poor performance compels developers to take risky shortcuts. Even worse, padding oracle attacks remain rampant 20 years after they were discovered. While it may be theoretically possible to implement RSA correctly, decades of devastating attacks have proven that such a feat may be unachievable in practice.

What is RSA again?

RSA is a public-key cryptosystem that has two primary use cases. The first is public key encryption, which lets a user, Alice, publish a public key that allows anyone to send her an encrypted message. The second use case is digital signatures, which allow Alice to “sign” a message so that anyone can verify the message hasn’t been tampered with. The convenient thing about RSA is that the signing algorithm is basically just the encryption algorithm run in reverse. Therefore for the rest of this post we’ll often refer to both as just RSA.

To set up RSA, Alice needs to choose two primes p and q that will generate the group of integers modulo N = pq. She then needs to choose a public exponent e and private exponent d such that ed ≡ 1 mod (p-1)(q-1). Basically, e and d need to be modular inverses of each other.

Once these parameters have been chosen, another user, Bob, can send Alice a message M by computing C = M^e (mod N). Alice can then decrypt the ciphertext by computing M = C^d (mod N). Conversely, if Alice wants to sign a message M, she computes S = M^d (mod N), which any user can verify was signed by her by checking M = S^e (mod N).
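To make the arithmetic concrete, here is a toy Python sketch of textbook RSA with absurdly small primes. It is for illustration only: it has none of the padding or parameter checks discussed below and must never be used for real cryptography.

```python
# Toy textbook RSA with tiny parameters -- illustration only, never use this.
p, q = 61, 53
N = p * q                      # 3233
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent: e*d = 1 mod phi (Python 3.8+)

M = 42
C = pow(M, e, N)               # encryption:   C = M^e mod N
assert pow(C, d, N) == M       # decryption:   M = C^d mod N

S = pow(M, d, N)               # signing:      S = M^d mod N
assert pow(S, e, N) == M       # verification: M = S^e mod N
```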

That’s the basic idea. We’ll get to padding—essential for both use cases—in a bit, but first let’s see what can go wrong during parameter selection.

Setting yourself up for failure

RSA requires developers to choose quite a few parameters during setup. Unfortunately, seemingly innocent parameter-selection methods degrade security in subtle ways. Let’s walk through each parameter choice and see what nasty surprises await those who choose poorly.

Prime Selection

RSA’s security is based on the fact that, given a (large) number N that’s the product of two primes p and q, factoring N is hard for people who don’t know p and q. Developers are responsible for choosing the primes that make up the RSA modulus. This process is extremely slow compared to key generation for other cryptographic protocols, where simply choosing some random bytes is sufficient. Therefore, instead of generating a truly random prime number, developers often attempt to generate one of a specific form. This almost always ends badly.

There are many ways to choose primes such that factoring N is easy. For example, p and q must be globally unique. If p or q is ever reused in another RSA modulus, then both moduli can be easily factored using the GCD algorithm. Bad random number generators make this scenario somewhat common, and research has shown that roughly 1% of TLS traffic in 2012 was susceptible to such an attack. Moreover, p and q must be chosen independently. If p and q share approximately half of their upper bits, then N can be factored using Fermat’s method. In fact, even the choice of primality testing algorithm can have security implications.
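The shared-prime failure is easy to demonstrate. The tiny, made-up moduli below share a prime factor, and a single GCD computation recovers it:

```python
from math import gcd

# Two "independently generated" moduli that accidentally share the prime p.
p, q1, q2 = 101, 113, 127
N1, N2 = p * q1, p * q2

shared = gcd(N1, N2)           # recovers p = 101 instantly
assert shared == p
# With p known, both moduli factor immediately and both keys are broken.
assert N1 // shared == q1 and N2 // shared == q2
```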

Perhaps the most widely-publicized prime selection attack is the ROCA vulnerability in RSALib which affected many smartcards, trusted platform modules, and even Yubikeys. Here, key generation only used primes of a specific form to speed up computation time. Primes generated this way are trivial to detect using clever number theory tricks. Once a weak system has been recognized, the special algebraic properties of the primes allow an attacker to use Coppersmith’s method to factor N. More concretely, that means if the person sitting next to me at work uses a smartcard granting them access to private documents, and they leave it on their desk during lunch, I can clone the smartcard and give myself access to all their sensitive files.

It’s important to recognize that in none of these cases is it intuitively obvious that generating primes in such a way leads to complete system failure. Really subtle number-theoretic properties of primes have a substantial effect on the security of RSA. To expect the average developer to navigate this mathematical minefield severely undermines RSA’s safety.

Private Exponent

Since using a large private key negatively affects decryption and signing time, developers have an incentive to choose a small private exponent d, especially in low-power settings like smartcards. However, it is possible for an attacker to recover the private key when d is less than the 4th root of N. Instead, developers are encouraged to choose a large d such that Chinese remainder theorem techniques can be used to speed up decryption. However, this approach’s complexity increases the probability of subtle implementation errors, which can lead to key recovery. In fact, one of our interns last summer modelled this class of vulnerabilities with our symbolic execution tool Manticore.
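As a rough illustration of the bound mentioned above, the check below flags a private exponent that falls under roughly the fourth root of N. The precise Wiener bound includes a constant factor, and the numbers are made up for the example.

```python
from math import isqrt

def fourth_root(n: int) -> int:
    """Integer fourth root via two integer square roots."""
    return isqrt(isqrt(n))

def private_exponent_dangerously_small(d: int, N: int) -> bool:
    """True when d falls below ~N^(1/4), the regime of Wiener-style attacks.
    (The exact bound has a constant factor; this is a coarse check.)"""
    return d < fourth_root(N)

# With a 2048-bit modulus, any d below roughly 2^512 is in the danger zone.
N = 1 << 2048                          # stand-in magnitude for a 2048-bit modulus
print(fourth_root(N).bit_length())     # about 513 bits
print(private_exponent_dangerously_small(1 << 128, N))  # True: far too small
```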

People might call me out here and point out that normally when setting up RSA you first generate a modulus, use a fixed public exponent, and then solve for the private exponent. This prevents low private exponent attacks because if you always use one of the recommended public exponents (discussed in the next section) then you’ll never wind up with a small private exponent. Unfortunately this assumes developers actually do that. In circumstances where people implement their own RSA, all bets are off in terms of using standard RSA setup procedures, and developers will frequently do strange things like choose the private exponent first and then solve for the public exponent.

Public Exponent

Just as in the private exponent case, implementers want to use small public exponents to save on encryption and verification time. It is common to use Fermat primes in this context, in particular e = 3, 17, and 65537. Despite cryptographers recommending the use of 65537, developers often choose e = 3 which introduces many vulnerabilities into the RSA cryptosystem.

Developers have even used e = 1, which doesn’t actually encrypt the plaintext.

When e = 3, or a similarly small number, many things can go wrong. Low public exponents often combine with other common mistakes to either allow an attacker to decrypt specific ciphertexts or factor N. For instance, the Franklin-Reiter attack allows a malicious party to decrypt two messages that are related by a known, fixed distance. In other words, suppose Alice only sends “chocolate” or “vanilla” to Bob. These messages will be related by a known value and allow an attacker Eve to determine which are “chocolate” and which are “vanilla.” Some low public exponent attacks even lead to key recovery. If the public exponent is small (not just 3), an attacker who knows several bits of the secret key can recover the remaining bits and break the cryptosystem. While many of these e = 3 attacks on RSA encryption are mitigated by padding, developers who implement their own RSA fail to use padding at an alarmingly high rate.
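One classic low-exponent pitfall, in the same family as the attacks above but simpler to demonstrate: with e = 3 and no padding, any message small enough that M^3 < N can be recovered with an ordinary integer cube root, no private key required. A toy sketch:

```python
def icbrt(n: int) -> int:
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# A 2048-bit modulus (its exact value is irrelevant as long as M**3 < N) and e = 3.
N = (1 << 2047) + 1            # stand-in modulus for illustration
e = 3
M = int.from_bytes(b"attack at dawn", "big")
C = pow(M, e, N)               # unpadded "encryption"

# Since M**3 < N, the modular reduction never kicked in; just take a cube root.
recovered = icbrt(C)
assert recovered.to_bytes((recovered.bit_length() + 7) // 8, "big") == b"attack at dawn"
```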

RSA signatures are equally brittle in the presence of low public exponents. In 2006, Bleichenbacher found an attack which allows attackers to forge arbitrary signatures in many RSA implementations, including the ones used by Firefox and Chrome. This means that any TLS certificate from a vulnerable implementation could be forged. This attack takes advantage of the fact that many libraries use a small public exponent and omit a simple padding verification check when processing RSA signatures. Bleichenbacher’s signature forgery attack is so simple that it is a commonly used exercise in cryptography courses.

Parameter Selection is Hard

The common denominator in all of these parameter attacks is that the domain of possible parameter choices is much larger than that of secure parameter choices. Developers are expected to navigate this fraught selection process on their own, since all but the public exponent must be generated privately. There are no easy ways to check that the parameters are secure; instead developers need a depth of mathematical knowledge that shouldn’t be expected of non-cryptographers. While using RSA with padding may save you in the presence of bad parameters, many people still choose to use broken padding or no padding at all.

Padding oracle attacks everywhere

As we mentioned above, just using RSA out of the box doesn’t quite work. For example, the RSA scheme laid out in the introduction would produce identical ciphertexts if the same plaintext were ever encrypted more than once. This is a problem, because it would allow an adversary to infer the contents of the message from context without being able to decrypt it. This is why we need to pad messages with some random bytes. Unfortunately, the most widely used padding scheme, PKCS #1 v1.5, is often vulnerable to something called a padding oracle attack.

Padding oracles are pretty complex, but the high-level idea is that adding padding to a message requires the recipient to perform an additional check–whether the message is properly padded. When the check fails, the server throws an invalid padding error. That single piece of information is enough to slowly decrypt a chosen message. The process is tedious and involves manipulating the target ciphertext millions of times to isolate the changes which result in valid padding. But that one error message is all you need to eventually decrypt a chosen ciphertext. These vulnerabilities are particularly bad because attackers can use them to recover pre-master secrets for TLS sessions. For more details on the attack, check out this excellent explainer.

The original attack on PKCS #1 v1.5 was discovered way back in 1998 by Daniel Bleichenbacher. Despite being over 20 years old, this attack continues to plague many real-world systems today. Modern versions of this attack often involve a padding oracle slightly more complex than the one originally described by Bleichenbacher, such as server response time or a protocol downgrade in TLS. One particularly shocking example was the ROBOT attack, which was so bad that a team of researchers were able to sign messages with Facebook’s and PayPal’s secret keys. Some might argue that this isn’t actually RSA’s fault – the underlying math is fine, people just messed up an important standard several decades ago. The thing is, we’ve had a standardized padding scheme with a rigorous security proof, OAEP, since 1998. But almost no one uses it. Even when they do, OAEP is notoriously difficult to implement and is often vulnerable to Manger’s attack, which is another padding oracle attack that can be used to recover plaintext.

The fundamental issue here is that padding is necessary when using RSA, and this added complexity opens the cryptosystem up to a large attack surface. The fact that a single bit of information, whether the message was padded correctly, can have such a large impact on security makes developing secure libraries almost impossible. TLS 1.3 no longer supports RSA key exchange, so we can expect to see fewer of these attacks going forward, but as long as developers continue to use RSA in their own applications there will be padding oracle attacks.

So what should you use instead?

People often prefer using RSA because they believe it’s conceptually simpler than the somewhat confusing DSA protocol or moon math elliptic curve cryptography (ECC). But while it may be easier to understand RSA intuitively, it lacks the misuse resistance of these other more complex systems.

First of all, a common misconception is that ECC is super dangerous because choosing a bad curve can totally sink you. While it is true that curve choice has a major impact on security, one benefit of using ECC is that parameter selection can be done publicly. Cryptographers make all the difficult parameter choices so that developers just need to generate random bytes of data to use as keys and nonces. Developers could theoretically build an ECC implementation with terrible parameters and fail to check for things like invalid curve points, but they tend to not do this. A likely explanation is that the math behind ECC is so complicated that very few people feel confident enough to actually implement it. In other words, it intimidates people into using libraries built by cryptographers who know what they’re doing. RSA on the other hand is so simple that it can be (poorly) implemented in an hour.

Second, any Diffie-Hellman based key agreement or signature scheme (including elliptic curve variants) does not require padding and therefore completely sidesteps padding oracle attacks. This is a major win considering RSA has had a very poor track record avoiding this class of vulnerabilities.

Trail of Bits recommends using Curve25519 for key exchange and digital signatures. Encryption needs to be done using a protocol called ECIES which combines an elliptic curve key exchange with a symmetric encryption algorithm. Curve25519 was designed to entirely prevent some of the things that can go wrong with other curves, and is very performant. Even better, it is implemented in libsodium, which has easy-to-read documentation and is available for most languages.
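For example, assuming the PyNaCl bindings for libsodium are installed, signing and verifying with Ed25519 (the signature scheme built on the same curve) takes a few lines and leaves no parameters or padding to get wrong. A quick sketch of the library’s basic usage:

```python
from nacl.signing import SigningKey

signing_key = SigningKey.generate()           # private key, kept secret
verify_key = signing_key.verify_key           # public key to publish

signed = signing_key.sign(b"attack at dawn")  # message plus signature
verify_key.verify(signed)                     # raises BadSignatureError if tampered
```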

Seriously, stop using RSA

RSA was an important milestone in the development of secure communications, but the last two decades of cryptographic research have rendered it obsolete. Elliptic curve algorithms for both key exchange and digital signatures were standardized back in 2005 and have since been integrated into intuitive and misuse-resistant libraries like libsodium. The fact that RSA is still in widespread use today indicates both a failure on the part of cryptographers for not adequately articulating the risks inherent in RSA, and also on the part of developers for overestimating their ability to deploy it successfully.

The security community needs to start thinking about this as a herd-immunity problem—while some of us might be able to navigate the extraordinarily dangerous process of setting up or implementing RSA, the exceptions signal to developers that it is in some way still advisable to use RSA. Despite the many caveats and warnings on StackExchange and GitHub READMEs, very few people believe that they are the ones who will mess up RSA, and so they proceed with reckless abandon. Ultimately, users will pay for this. This is why we all need to agree that it is flat out unacceptable to use RSA in 2019. No exceptions.

Avoiding Smart Contract “Gridlock” with Slither

A denial-of-service (DoS) vulnerability, dubbed ‘Gridlock,’ was publicly reported on July 1st in one of Edgeware’s smart contracts deployed on Ethereum. As much as $900 million worth of Ether may have been processed by this contract. Edgeware has since acknowledged and fixed the “fatal bug.”

When we heard about Gridlock, we ran Slither on the vulnerable and fixed Edgeware contracts. Its publicly available dangerous-strict-equality detector correctly identifies the dangerous assertion in the vulnerable contract, and shows the absence of this vulnerability in the fixed contract.

This blog post details why this vulnerability is so subtle, the implementation details behind Slither’s dangerous-strict-equality detector that identified this vulnerability, and how Slither can help prevent developers from introducing such vulnerabilities in the future.

Strict Equality and DoS Vulnerability

The Gridlock vulnerability was identified in the snippet below. Upon discovery, the bug was acknowledged by the maintainers and a second version addressing the issue was deployed.

   /**
     * @dev        Locks up the value sent to contract in a new Lock
     * @param      term         The length of the lock up
     * @param      edgewareAddr The bytes representation of the target edgeware key
     * @param      isValidator  Indicates if sender wishes to be a validator
     */
    function lock(Term term, bytes calldata edgewareAddr, bool isValidator)
        external
        payable
        didStart
        didNotEnd
    {
        uint256 eth = msg.value;
        address owner = msg.sender;
        uint256 unlockTime = unlockTimeForTerm(term);
        // Create ETH lock contract
        Lock lockAddr = (new Lock).value(eth)(owner, unlockTime);
        // ensure lock contract has all ETH, or fail
        assert(address(lockAddr).balance == msg.value); // BUG
        emit Locked(owner, eth, lockAddr, term, edgewareAddr, isValidator, now);
    }

Specifically, the source of this vulnerability is the assertion which performs a strict equality check between the balance of the newly created Lock contract and the msg.value sent to this contract.

From the contract developer’s perspective, this assertion should hold. The new Lock contract was just created in the previous line. Also, it was credited with an Ether value equal to the msg.value sent as part of the current transaction.

However, this assumes that the newly created Lock contract address has a zero Ether balance before the contract’s creation. This is incorrect. Ether can be sent to a contract address before a contract is instantiated at that address. This is possible because contract addresses are derived deterministically from the creator’s address and nonce.

The DoS attack consists of pre-calculating the next Lock contract address and sending some Wei to that address. This forces the lock() function to fail at the assertion in all future transactions, bringing the contract to a “Gridlock.”
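A sketch of that pre-calculation, assuming the rlp and eth-utils Python packages: an address created via CREATE is the last 20 bytes of keccak256(rlp([creator, nonce])). The creator address and nonce below are made up.

```python
import rlp
from eth_utils import keccak, to_checksum_address

def next_contract_address(creator: str, nonce: int) -> str:
    """Address of the next contract this account will deploy via CREATE."""
    creator_bytes = bytes.fromhex(creator[2:] if creator.startswith("0x") else creator)
    return to_checksum_address(keccak(rlp.encode([creator_bytes, nonce]))[12:])

# Hypothetical Lockdrop address and nonce; an attacker sends a few wei to the
# predicted address *before* lock() deploys the Lock contract there.
print(next_contract_address("0x" + "11" * 20, nonce=7))
```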

The fix is to replace the assertion with the one below, where the strict equality ‘==’ is replaced by ‘>=’, accounting for Ether already present at the address of the new Lock contract being created.

assert(address(lockAddr).balance >= msg.value);

Avoiding strict equality when determining whether an account has enough Ether or tokens is a well-understood defensive programming technique in Solidity.

Slither’s Dangerous-Strict-Equality Detector

Slither has had a publicly available dangerous-strict-equality detector targeting this vulnerability since version 0.5.0, released on January 14th, 2019. We classify results from this detector as Medium impact and High confidence because misuse of strict equality nearly always appears in logic fundamental to the operation of the contract. The results of this check are worth reviewing closely!

Running Slither on the Lockdrop.sol contract immediately identifies the vulnerable assertion:

$ slither --detect incorrect-equality Lockdrop.sol 
INFO:Detectors:
Lockdrop.lock(Lockdrop.Term,bytes,bool) (Lockdrop.sol#53-67) uses a dangerous strict equality:
        - assert(bool)(address(lockAddr).balance == msg.value)
Reference: https://github.com/crytic/slither/wiki/Detector-Documentation#dangerous-strict-equalities
INFO:Slither:Lockdrop.sol analyzed (2 contracts), 1 result(s) found

This detector is implemented using a lightweight taint analysis, where the tainted sources are program constructs with msg.value, now, block.number, block.timestamp, and the results of ERC token balanceOf() function calls. The taint sinks are expressions using strict equality comparisons, i.e., ‘==‘. The analysis works on Slither’s intermediate language representation, SlithIR, and tracks the propagation of tainted values across assignments and function calls. An alert is generated when the taint sinks have a data dependency on the tainted sources.
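To illustrate the idea (this toy is nothing like Slither’s actual SlithIR engine), a minimal taint propagation over a made-up three-address-code program might look like:

```python
# Toy taint analysis over a made-up IR: tainted sources flow through
# assignments; a strict-equality comparison on tainted data is flagged.
TAINT_SOURCES = {"msg.value", "now", "block.number", "block.timestamp", "balanceOf()"}

ir = [
    ("assign", "eth", "msg.value"),            # eth = msg.value
    ("assign", "bal", "balanceOf()"),          # bal = token.balanceOf(addr)
    ("binary", "ok", "==", "bal", "eth"),      # ok = (bal == eth)   <-- sink
]

tainted = set(TAINT_SOURCES)
for op in ir:
    if op[0] == "assign" and op[2] in tainted:
        tainted.add(op[1])                      # propagate taint through assignment
    elif op[0] == "binary" and op[2] == "==" and (op[3] in tainted or op[4] in tainted):
        print(f"dangerous strict equality on tainted data: {op[3]} == {op[4]}")
```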

A simple textual search might have caught this vulnerability, but syntactic regular expressions would raise a fog of false alerts or miss it entirely. This is because of the many ways this vulnerability pattern can manifest, including across function calls and variable assignments. Hardcoding such a regular expression is challenging. Other security tools lack a detector for this vulnerability, or produce a substantial number of false positives. The lightweight semantic taint analysis enabled by SlithIR greatly improves this detector’s accuracy and reduces false positives.

In the case of the Lockdrop contract, Slither’s dangerous-strict-equality detector generates such an alert because msg.value and an address balance are used in a strict equality comparison within an assertion. This is a textbook example of a strict equality vulnerability which is caught effortlessly by Slither. We also verified that this alert is not present in the recently fixed code.

Crytic.io correctly identifies “Gridlock” in the Edgeware smart contracts

Besides this detector, Slither has 35 more that catch many Solidity smart contract vulnerabilities. They work together with 30 additional proprietary detectors in crytic.io, our continuous assurance system (think “Travis-CI but for Ethereum”). So, go ahead and give Slither a shot. We would love to hear about your experience, and welcome feedback.

State of the Art Proof-of-Work: RandomX

RandomX is a new ASIC and GPU-resistant proof-of-work (PoW) algorithm originally developed for Monero, but potentially useful in any blockchain using PoW that wants to bias towards general purpose CPUs. Trail of Bits was contracted by Arweave to review this novel algorithm in a two person-week engagement and provide guidance on alternate parameter selection. But what makes it unusual and why should you care about it?

Standard Proof-of-work

In classical PoW algorithms (such as hashcash, used in Bitcoin), the core is typically a cryptographic hash function where the only variable is the data input to the function. To achieve a target “hardness,” a number of zero bits are required as the prefix of the hash output. Each zero bit added doubles the difficulty of mining. However, this type of algorithm is highly amenable to acceleration via ASICs and GPUs because a fixed set of operations is performed on every input with limited memory requirements. This is undesirable.
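A minimal hashcash-style sketch in Python (illustrative only; Bitcoin’s real target encoding and header format differ):

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a 256-bit hash digest."""
    return 256 - int.from_bytes(digest, "big").bit_length()

def mine(header: bytes, difficulty_bits: int) -> int:
    """Find a nonce whose hash has the required zero-bit prefix.
    Each extra bit of difficulty doubles the expected work."""
    for nonce in count():
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce

print(mine(b"block header", difficulty_bits=16))   # ~65k attempts on average
```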

Why do we care about ASIC resistance?

Blockchain mining is ideally a heavily decentralized task with no singular entity controlling a significant amount of the hashing power. Blockchains are susceptible to 51% attacks, where a malicious majority can override the global state, e.g., allowing double-spends. If an ASIC can be built that significantly improves mining efficiency over a general purpose CPU, then economic factors will disincentivize CPU-based mining. The result of this can be seen in the Bitcoin network, where ASIC manufacturers have built large-scale mining farms and a handful of entities control a shockingly high percentage of the hash rate.

For the last several years, ASIC manufacturers have shown great capacity for rapid design and fabrication of ASICs. A project that wants to be ASIC-resistant (without switching to a proof-of-stake model, which has its own set of tradeoffs) must therefore seek to take advantage of the specific strengths a general-purpose CPU possesses over a hypothetical ASIC.

This desire has led to a significant amount of research around ASIC resistance. RandomX represents a concrete implementation of the most modern ASIC-resistant ideas as applied to cryptocurrency.

How RandomX works

The core of RandomX is the concept of randomized execution. Put simply, we want to execute a series of random instructions to take advantage of a general-purpose CPU’s flexibility with dynamic code execution. The RandomX developers have extensively documented their design rationale and provided a specification with a more rigorous explanation, but with some simplification the algorithm does the following:

Step 1

A data structure called the Cache is created using argon2d with the input key K. Argon2d was originally designed as a memory-hard password hashing function. Computers generally have large pools of fast memory available to them, but memory is expensive on an ASIC. Requiring large amounts of memory is one of the most common defenses against specialized hardware. Argon2 uses a variety of techniques to ensure that a large (configurable) quantity of memory is used and that any time/memory tradeoff attacks are ineffective. You can read more about them in the argon2 specification.
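To get a feel for the memory-hardness knob, the argon2-cffi package exposes raw Argon2d output directly. This is not RandomX’s actual Cache construction, and the parameters below are arbitrary:

```python
from argon2.low_level import hash_secret_raw, Type

# Arbitrary illustrative parameters: 256 MiB of memory, 3 passes, 1 lane.
cache_seed = hash_secret_raw(
    secret=b"input key K",
    salt=b"RandomX-example",     # salt must be at least 8 bytes
    time_cost=3,
    memory_cost=256 * 1024,      # in KiB
    parallelism=1,
    hash_len=64,
    type=Type.D,                 # Argon2d: data-dependent memory access
)
print(cache_seed.hex())
```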

Step 2

The Dataset (a read-only memory structure) is expanded from the Cache. Datasets are designed to be a large segment of memory that contain data the virtual machine will read. There are two values that control the size of the Dataset (RANDOMX_DATASET_BASE_SIZE and RANDOMX_DATASET_EXTRA_SIZE). Together, they place an upper bound on the total memory the algorithm requires. Extra size is used to push the memory slightly beyond a power of two boundary, which makes life harder for ASIC manufacturers. The actual Dataset generation is performed by loading data from the Cache, generating a set of SuperscalarHash instances, and then invoking those instances to get a final output. SuperscalarHash is designed to consume power while waiting for data from DRAM. This hurts an ASIC that attempts to compute Datasets dynamically from the Cache.

Step 3

The Scratchpad (read/write memory) is initialized by performing a blake2 hash on the input data and using the resulting output to seed the AesGenerator, which uses AES transformations to fill the Scratchpad. AES is already hardware-accelerated (AES-NI) on modern CPUs, so an ASIC will gain no advantage implementing it. The Scratchpad itself is a (relatively) large read/write data structure designed specifically to fit in the caches available in CPUs.

Step 4

Now we get to the core of the algorithm: the randomized programs running on a virtual machine. The VM is executed by building a program from random bytes produced by another generator. The RandomX virtual machine architecture is carefully designed to allow any sequence of 8-byte words to be a valid instruction. These instructions are designed to:

  • Require double precision floating point operations
  • Use 128-bit vector math
  • Use all four IEEE 754 floating point rounding modes
  • Read and write to the Scratchpad, which as previously stated is designed to fit entirely in CPU caches and thus be very fast
  • Take advantage of branch prediction via a low probability branch instruction
  • Execute instructions using the superscalar and out-of-order execution capabilities of a CPU

Each of these properties is a particular strength of general-purpose CPUs and requires additional die area to implement on an ASIC, reducing its advantage.

The resulting state of the VM after program execution is hashed and used to generate a new program. The number of loops executed in this fashion is configurable but is set to eight by default. This looping behavior was chosen to avoid situations where an ASIC miner might only implement a subset of possible operations and only run “easy” programs on the virtual machine. Since the subsequent program cannot be determined until the current one has been fully executed, a miner cannot predict whether the entirety of the chain will be “easy,” so it becomes impractical to implement a partial set of instructions.
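A purely structural sketch of that chaining, where both program generation and “execution” are hash-based stand-ins, shows why the next program cannot be predicted before the current one finishes:

```python
import hashlib

def program_from(seed: bytes) -> bytes:
    """Stand-in: expand a seed into 'random program' bytes."""
    return hashlib.blake2b(seed, digest_size=64).digest()

def execute(program: bytes, state: bytes) -> bytes:
    """Stand-in for running the program on the VM; returns the final VM state."""
    return hashlib.blake2b(program + state, digest_size=64).digest()

seed, state = b"initial seed", b"\x00" * 64
for _ in range(8):                      # eight chained programs by default
    state = execute(program_from(seed), state)
    seed = state                        # the next program depends on this result
print(hashlib.blake2b(state, digest_size=32).hexdigest())
```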

Step 5

Finally, the Scratchpad and Register File (the virtual machine’s registers) are hashed using AesHash, followed by a final blake2 hash. This step offers no significant ASIC resistance beyond the use of AES instructions, but is included here to show how the final 64-byte value is produced.

What we found

In the course of our two person-week review we found three issues (two low severity, one informational).

Single AES rounds used in AesGenerator

The AES encryptions described by the RandomX specification refer to a single round of AES, which is insufficient for full mixing of the input data. RandomX doesn’t depend on the encryption of AES for its security. Instead, AES is used as a CPU-biased fast transformation that provides diffusion across the output. However, diffusion of bits through the output is dependent upon the number of rounds of AES.

The severity of this finding is “low” because exploiting the lack of diffusion would require finding additional bias in almost every step of the algorithm, and then crafting an input that can propagate that bias through the entire chain.

Subsequent to the disclosure of this finding, the RandomX team developed a new AesGenerator4R function that performs four rounds. This functionality has been merged into RandomX as of pull request 46. Using four rounds as part of program generation resolves the concerns documented in this issue.

Insufficient Testing and Validation of VM Correctness

The RandomX codebase lacks test coverage validating the semantics of the virtual machine (VM). Trail of Bits devoted half of this engagement (one person-week) to assessing the general security properties of the algorithm implementation. While this effort revealed several code-quality findings, it was insufficient to validate the absence of semantic errors in the VM implementation.

The severity of this finding is “low” because the correctness of RandomX is irrelevant as long as: (1) its output is deterministic, (2) its output is cryptographically random, and (3) its reference implementation is the sole one used for mining in a blockchain. However, any discrepancy between the specification and the reference implementation can lead to consensus issues and forks in the blockchain.

Consider the case where a third-party cleanroom implementation of the RandomX specification becomes popular on a blockchain using RandomX for proof-of-work. The blockchain will fork—potentially at some distant point in the past—if there is even a subtle semantic difference between the miners’ implementations.

Configurable parameters are brittle

RandomX contains 47 configurable parameters, including flags for parallelization, memory consumption, iterations of the initial KDF, the memory size of the Dataset, the sizing of three levels of cache for the virtual CPU, the size and iteration count of programs executed on the VM, and cache access/latency. The default parameters were chosen to maximize the algorithm's CPU advantage. However, the threat of 51% attacks forces alternate blockchains interested in using RandomX to make different choices for these parameters, and those choices must be made without clear insight into which ones may compromise the algorithm's advantages. This brittleness could impede third-party adoption.

Subsequent to this finding, the RandomX team removed a few unnecessary parameters, wrote additional guidance about which configuration values are safe to change, and added checks that prohibit unsafe configurations.

Assessment depth

There is a belief in the blockchain industry that many small reviews are a better use of review capital than one large one. This belief is predicated on the notion that every review team will approach the codebase differently and apply different expertise, and that this diversity of approaches and expertise will provide better coverage and shake out more bugs than having a single team do a deep dive on a given project.

We believe that larger, singular assessments provide better overall value to the client. As a customer, you are paying the assessment team to provide their expert opinion, but like any new employee they need time to ramp up on your codebase. Once that initial learning period is complete, the quality and depth of the assessment rapidly increases. Many large-scale, long-term code assessments do not reveal their most critical or meaningful findings until late in the engagement.

As a client, you should hire a single firm for a larger engagement rather than multiple firms for smaller ones, for precisely the same reasons you place value on employee retention: replacing a firm that has already built up domain knowledge costs time and money.

This principle of software assurance is captured well in an old talk from Dave Aitel: The Hacker Strategy. Namely, the quality of vulnerabilities a researcher can find is strongly correlated to the time spent. One hour nets you extremely shallow problems, one week significantly more depth, and with one month you can find vulnerabilities that no one else is likely to discover.

This is further discussed in Zero Days, Thousands of Nights — the authoritative reference on vulnerability research based on dozens of interviews with experts in the field:

The method of finding vulnerabilities can have an impact on which vulnerabilities are actually found. As one example, recent research claims that fuzzers find only the tip of the iceberg in terms of vulnerabilities (Kirda et al., no date). Vulnerabilities can be found via fuzzing in newer or less mature products, or those with simple code bases. For products that have a longer life, are more complex, are popular with a large market share, or are high revenue generators, more people have evaluated the code bases, and finding vulnerabilities often requires more in-depth auditing, logic review, and source code analysis, in order to go several layers deep.

For example, manually validating the correctness of the RandomX VM (which is the most serious issue we discovered) would take several person-weeks alone. It’s highly likely that, at the end of all four audits, there will be no guarantee that the VM implementation is free of semantic errors.

Similarly, analyzing the cryptographic strength of each function in RandomX was achievable within the engagement, but exploring whether there exist methods to propagate potential bias across steps requires a deeper look. Small engagements prohibit this type of work, favoring shallower results.

Current project status

Our two person-week audit was the first of multiple reviews the RandomX team has scheduled. Over the next few weeks the project is undergoing three additional small audits, the results of which should be published later this month. Once these audits are published and any additional findings are resolved by the RandomX authors, it is the intent of both Arweave and Monero to adopt this algorithm in their respective products in a scheduled protocol upgrade.

Siderophile: Expose your Crate’s Unsafety

Today we released a tool, siderophile, that helps Rust developers find fuzzing targets in their codebases.

Siderophile trawls your crate’s dependencies and attempts to find every unsafe function, expression, trait method, etc. It then traces these up the callgraph until it finds the function in your crate that uses the unsafety. It ranks the functions it finds in your crate by badness—the more unsafety a function makes use of, the higher its badness rating.

Siderophile ([ˈsidərəˌfīl]) – Having an affinity for metallic iron

We created Siderophile for an engagement where we were delivered a massive Rust codebase with a tight timeframe for review. We wanted to fuzz it but weren’t even sure where to start. So, we created a tool to determine which functions invoked the most unsafe behavior. We were able to speed up our bug discovery by automating the targeting process with siderophile. We’re now open-sourcing this tool so everyone can benefit from it!

Sample Output

Here is a sample of siderophile when run on molasses, a crate we’re building that fully implements the MLS cryptographic protocol:

Badness  Function
    012  molasses::crypto::hash::HashFunction::hash_serializable
    005  molasses::crypto::hash::HashContext::feed_serializable
    003  molasses::utils::derive_node_values
    003  molasses::application::encrypt_application_message
    003  molasses::application::decrypt_application_message
    003  molasses::group_ctx::GroupContext::new_from_parts
    003  molasses::group_ctx::GroupContext::from_welcome
    003  molasses::group_ctx::GroupContext::update_transcript_hash
    003  molasses::group_ctx::GroupContext::update_tree_hash
    003  molasses::group_ctx::GroupContext::update_epoch_secrets
    003  molasses::group_ctx::GroupContext::apply_update
    ...

As you can see, much of the unsafety comes from the serialization and crypto-heavy routines. We’ll be sure to fuzz this bad boy before it goes 1.0.

Limitations

This is not guaranteed to catch all the unsafety in a crate’s deps. For instance, we can’t inspect macros or resolve dynamically dispatched methods, since unsafe tagging only occurs at the source level. The ergonomics of the tool could be better, and we’ve already identified some incorrect behavior on certain crates. If you’re interested in helping out, please do! We are actively maintaining the project and have some issues written out.

Try it Out

Siderophile is on GitHub along with a better explanation of how it works and how to run the tool. Run it on your Rust crate and set up fuzzers for what it finds. Check it out!

Finally, thanks to cargo-geiger and rust-praezi for establishing current best practices. This project builds largely on their work.

Use constexpr for faster, smaller, and safer code

With the release of C++14, the standards committee strengthened one of the coolest modern features of C++: constexpr. Now, C++ developers can write constant expressions and force their evaluation at compile-time, rather than at every invocation by users. This results in faster execution, smaller executables and, surprisingly, safer code.

Undefined behavior has been the source of many security bugs, such as the Linux kernel privilege escalation in CVE-2009-1897 and myriad poorly implemented integer overflow checks that compilers remove because they rely on undefined behavior. When designing constexpr, the C++ standards committee decided that code marked constexpr cannot invoke undefined behavior. For a comprehensive analysis, read Shafik Yaghmour’s fantastic blog post titled “Exploring Undefined Behavior Using Constexpr.”

I believe constexpr will evolve into a much safer subset of C++. We should embrace it wholeheartedly. To help, I created a libclang-based tool to mark as much code as possible as constexpr, called constexpr-everything. It automatically applies constexpr to conforming functions and variables.

Constexpr when confronted with undefined behavior

Recently in our internal Slack channel, a co-worker was trying to create an exploitable binary where the vulnerability was an uninitialized stack local, but he was fighting the compiler. It refused to generate the vulnerable code.

/* clang -o example example.cpp -O2 -std=gnu++14 \
   -Wall -Wextra -Wshadow -Wconversion
 */
typedef void (*handler)();

void handler1();
void handler2();
void handler3();

handler handler_picker(int choice) {
    handler h;
    switch(choice) {
    case 1:
        h = handler1;
        break;
    case 2:
        h = handler2;
        break;
    case 3:
        h = handler3;
        break;
    }
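    // Note: there is no default case, so if choice is not 1, 2, or 3,
    // h is returned uninitialized (undefined behavior).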

    return h;
}

When compiling the example code with a modern compiler (clang 8.0), the compiler silently eliminates the vulnerable cases. If the caller specifies a choice not handled by the switch (such as 0 or 4), the function returns handler2. This is true on optimization levels greater than -O0. Try it for yourself on Compiler Explorer!

My default set of warnings (-Wall -Wextra -Wshadow -Wconversion) doesn’t warn about this on clang at all (Try it). gcc prints a warning, but only with optimizations enabled (-O1, not -O0)!

Note: If you want to print all the warnings clang knows about, use -Weverything on clang when developing.

The reason for this is, of course, undefined behavior. Since the compiler may assume undefined behavior never occurs, it is free to make assumptions about the code: in this case, that handler h can never be returned uninitialized.

Right now the compiler silently accepts this bad code and just assumes we know what we’re doing. Ideally, it would error out. This is where constexpr saves us.

/* clang -o example example.cpp -O2 -std=gnu++14 \
   -Wall -Wextra -Wshadow -Wconversion
 */
typedef void (*handler)();

void handler1();
void handler2();
void handler3();

constexpr handler handler_picker(int choice)
{
    handler h;
    switch(choice) {
    case 1:
        h = handler1;
        break;
    case 2:
        h = handler2;
        break;
    case 3:
        h = handler3;
        break;
    }

    return h;
}
# https://gcc.godbolt.org/z/gKrZV3
<source>:9:13: error: variables defined in a constexpr function must be initialized
    handler h;
            ^
1 error generated.
Compiler returned: 1

constexpr forced an error here, which is what we want. It works on most forms of undefined behavior but there are still gaps in the compiler implementations.

constexpr everything!

After some digging in the clang source, I realized I could reuse the machinery clang uses during semantic analysis to determine whether something can be constexpr, and apply it to automatically mark functions and methods as constexpr. While this won’t detect more undefined behavior directly, it helps mark as much code as possible as constexpr.

Initially I started writing a clang-tidy pass, but ran into trouble with the available APIs and the context available in the pass. I decided to create my own stand-alone tool: constexpr-everything. It is available on our GitHub and should work with recent libclang versions.

I wrote two visitors. The first tries to identify whether a function can be marked constexpr (the second, covered below, handles variables). This turned out to be fairly straightforward: I iterate over all the clang::FunctionDecls in the current translation unit and ask whether they can be evaluated in a constexpr context with clang::Sema::CheckConstexprFunctionDecl, clang::Sema::CheckConstexprFunctionBody, and clang::Sema::CheckConstexprParameterTypes. I skip functions that are already constexpr or can’t be (like destructors or main). When the analysis detects a function that can be constexpr but isn’t already, it issues a diagnostic and a FixIt:

$ ../../../build/constexpr-everything ../test02.cpp
constexpr-everything/tests/02/test02.cpp:13:9: warning: function can be constexpr
        X(const int& val) : num(val) {
        constexpr
constexpr-everything/tests/02/test02.cpp:17:9: warning: function can be constexpr
        X(const X& lVal)
        constexpr
constexpr-everything/tests/02/test02.cpp:29:9: warning: function can be constexpr
        int getNum() const { return num; }
        constexpr
3 warnings generated.

FixIts can be automatically applied with the -fix command line option.

Trouble applying constexpr variables

We need to mark variables as constexpr in order to force evaluation of constexpr functions. Automatically applying constexpr to functions is easy. Doing so on variables is quite difficult. I had issues with variables that weren’t previously marked const getting marked const implicitly through the addition of constexpr.

After trying to apply constexpr as widely as possible and fighting with my test cases, I switched tactics and went with a much more conservative approach: only mark variables that are already const-qualified and have constexpr initializers or constructors.

$ ../../../build/constexpr-everything ../test02.cpp -fix
constexpr-everything/tests/02/test02.cpp:47:5: warning: variable can be constexpr
const X x3(400);
    constexpr
constexpr-everything/tests/02/test02.cpp:47:5: note: FIX-IT applied suggested code changes
1 warning generated.

While this approach won’t apply constexpr in every case possible, it can safely apply it automatically.

Try it on your code base

Benchmark your tests before and after running constexpr-everything. Not only will your code be faster and smaller, it’ll be safer. Code marked constexpr can’t bitrot as easily.

constexpr-everything is still a prototype – it has a couple of rough edges left. The biggest issue is FixIts only apply to the source (.cpp) files and not to their associated header files. Additionally, constexpr-everything can only mark existing constexpr-compatible functions as constexpr. We’re working on using the machinery provided to identify functions that can’t be marked due to undefined behavior.

The code is available on our GitHub. To try it yourself, you’ll need cmake, llvm and libclang. Try it out and let us know how it works for your project.

Panicking the right way in Go

A common Go idiom is to (1) panic, (2) recover from the panic in a deferred function, and (3) continue on. In general, this is okay, so long as there are no global state changes between the entry point to the function calling defer and the point at which the panic occurs. Such global state changes can have a lasting effect on the program’s behavior. Moreover, it is easy to overlook them and to believe that all actions are undone by a call to recover.

At Trail of Bits, we have developed a tool called OnEdge to help detect such incorrect uses of the “defer, panic, recover” pattern. OnEdge reduces the problem of finding such global state changes to one of race detection. Go’s outstanding race detector can then be used to find these errors. Moreover, as we explain below, you can incorporate OnEdge into your own programs in order to find these types of errors.

OnEdge is one of the tools that we use to verify software. For example, we audit a lot of blockchain software written in Go, where it is common to panic upon receiving an invalid transaction, to recover from the panic, and to continue processing transactions. However, care must be taken to ensure that an invalid transaction is reverted completely, as a partially applied transaction could, say, cause the blockchain to fork.

“Defer, Panic, and Recover”

The definitive reference on this technique is Andrew Gerrand’s blog post, referenced above. We will not give such a thorough account here, though we will walk through an example.

Figure 1 shows a simple program that employs the “defer, panic, and recover” pattern. The program randomly generates deposits and withdrawals. If there are not sufficient funds to cover a withdrawal, the program panics. The panic is caught in a deferred function that reports the error, and the program continues on.

package main

import (
	"fmt"
	"log"
	"math/rand"
)

var balance = 100

func main() {
	r := rand.New(rand.NewSource(0))
	for i := 0; i < 5; i++ {
		if r.Intn(2) == 0 {
			credit := r.Intn(50)
			fmt.Printf("Depositing %d...\n", credit)
			deposit(credit)
		} else {
			debit := r.Intn(100)
			fmt.Printf("Withdrawing %d...\n", debit)
			withdraw(debit)
		}
		fmt.Printf("New balance: %d\n", balance)
	}
}

func deposit(credit int) {
	balance += credit
}

func withdraw(debit int) {
	defer func() {
		if r := recover(); r != nil {
			log.Println(r)
		}
	}()
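	// Bug: the global balance is modified before the check below can panic.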
	balance -= debit
	if balance < 0 {
		panic("Insufficient funds")
	}
}

Figure 1: A program that uses the “defer, panic, and recover” pattern incorrectly.

Running the program in Figure 1 produces the output in Figure 2.

Depositing 14...
New balance: 114
Withdrawing 6...
New balance: 108
Withdrawing 96...
New balance: 12
Withdrawing 77...
<time> Insufficient funds
New balance: -65
Depositing 28...
New balance: -37

Figure 2: Output of the program in Figure 1.

Note that there is a bug: even though there are not sufficient funds to cover one of the withdrawals, the withdrawal is still applied. This bug is a special case of a more general class of errors: the program makes global state changes before panicking.

A better approach would be to make such global state changes only after the last point at which a panic could occur. Rewriting the withdraw function to use this approach would cause it to look something like Figure 3.

func withdraw(debit int) {
	defer func() {
		if r := recover(); r != nil {
			log.Println(r)
		}
	}()
	if balance-debit < 0 {
		panic("Insufficient funds")
	}
	balance -= debit
}

Figure 3: A better implementation of the withdraw function from Figure 1.

Following a brief introduction to Go’s race detector, we describe a method for finding improper global state changes like those in Figure 1.

The Go Race Detector

The Go Race Detector is a combination of compiler instrumentation and a runtime library. The compiler instruments (1) memory accesses that cannot be proven race-free, and (2) uses of known synchronization mechanisms (e.g., sending and receiving on a channel). The runtime library, based on Google’s ThreadSanitizer, provides the code to support the instrumentation. If two instrumented memory accesses conflict and cannot be proven synchronized, then the runtime library produces a warning message.

The Go race detector can produce “false negatives” i.e., it can fail to detect some races. However, provided that synchronization mechanisms known to the runtime library are used, every warning message that it produces is a “true positive,” i.e., an actual race.

One enables the Go race detector by passing the -race flag to commands such as go run or go build. The flag tells the Go compiler to instrument the code as described above and to link in the required runtime library.

Using the Go race detector is not cheap. It increases memory usage by an estimated 5-10x, and increases execution time by 2-20x. For this reason, the race detector is typically not enabled for “release” code, and is used only during development. Nonetheless, the strong guarantees that come with the detector’s reports can make the overhead worthwhile.

Detecting Global State Changes

The problem of detecting global state changes has obvious similarities to the problem of detecting data races: both involve memory accesses. Like data races, detecting global state changes would seem amenable to dynamic analysis. So, a question that one might ask is: can one leverage the Go race detector to find global state changes? Or, more precisely, can one make a global state change look like a data race?

We solve this problem by executing code that could modify global state twice: once in the program’s main thread, and once in a second, “shadow” thread. If the code does modify global state, then there will be two conflicting memory accesses, one in each thread. So long as the two threads do not appear synchronized (which is not hard to ensure), the two memory accesses can be reported as a data race.

OnEdge

OnEdge detects improper global state changes using the approach described above. OnEdge is a small library that exports a handful of functions, notably, WrapFunc and WrapRecover. To incorporate OnEdge into a project, do three things:

  1. Wrap function bodies that defer calls to recover in WrapFunc(func() { … }).
  2. Within those wrapped function bodies, wrap calls to recover in WrapRecover( … ).
  3. Run the program with Go’s race detector enabled.

If a panic occurs in a function body wrapped by WrapFunc, and that panic is caught by a recover wrapped by WrapRecover, then the function body is re-executed in a shadow thread. If the shadow thread makes a global state change before calling recover, then that change appears as a data race and can be reported by Go’s race detector.

Figure 4 is the result of applying steps 1 and 2 above to the withdraw function from Figure 1.

func withdraw(debit int) {
	onedge.WrapFunc(func() {
		defer func() {
			if r := onedge.WrapRecover(recover()); r != nil {
				log.Println(r)
			}
		}()
		balance -= debit
		if balance < 0 {
			panic("Insufficient funds")
		}
	})
}

Figure 4: The withdraw function from Figure 1 with OnEdge incorporated.

A complete source file to which the above steps have been applied can be found here: account.go. Running the modified program with the race detector enabled, e.g.,

go run -race account.go

produces the output in Figure 5.

Depositing 14...
New balance: 114
Withdrawing 6...
New balance: 108
Withdrawing 96...
New balance: 12
Withdrawing 77...
==================
WARNING: DATA RACE
Read at 0x0000012194f8 by goroutine 8:
  main.withdraw.func1()
      <gopath>/src/github.com/trailofbits/on-edge/example/account.go:61 +0x6d
  github.com/trailofbits/on-edge.WrapFunc.func1()
      <gopath>/src/github.com/trailofbits/on-edge/onedge_race.go:82 +0x3d
  github.com/trailofbits/on-edge.shadowThread.func1()
      <gopath>/src/github.com/trailofbits/on-edge/onedge_race.go:239 +0x50
  github.com/trailofbits/on-edge.shadowThread()
      <gopath>/src/github.com/trailofbits/on-edge/onedge_race.go:240 +0x79

Previous write at 0x0000012194f8 by main goroutine:
  main.withdraw.func1()
      <gopath>/src/github.com/trailofbits/on-edge/example/account.go:61 +0x89
  github.com/trailofbits/on-edge.WrapFunc.func1()
      <gopath>/src/github.com/trailofbits/on-edge/onedge_race.go:82 +0x3d
  github.com/trailofbits/on-edge.WrapFuncR()
      <gopath>/src/github.com/trailofbits/on-edge/onedge_race.go:132 +0x3d4
  github.com/trailofbits/on-edge.WrapFunc()
      <gopath>/src/github.com/trailofbits/on-edge/onedge_race.go:81 +0x92
  main.withdraw()
      <gopath>/src/github.com/trailofbits/on-edge/example/account.go:50 +0x84
  main.main()
      <gopath>/src/github.com/trailofbits/on-edge/example/account.go:39 +0x3cf

Goroutine 8 (running) created at:
  github.com/trailofbits/on-edge.WrapFuncR()
      <gopath>/src/github.com/trailofbits/on-edge/onedge_race.go:126 +0x3a1
  github.com/trailofbits/on-edge.WrapFunc()
      <gopath>/src/github.com/trailofbits/on-edge/onedge_race.go:81 +0x92
  main.withdraw()
      <gopath>/src/github.com/trailofbits/on-edge/example/account.go:50 +0x84
  main.main()
      <gopath>/src/github.com/trailofbits/on-edge/example/account.go:39 +0x3cf
==================
<time> Insufficient funds
<time> Insufficient funds
New balance: -142
Depositing 28...
New balance: -114
Found 1 data race(s)
exit status 66

Figure 5: The output of the program from Figure 1 with OnEdge incorporated and the race detector enabled.

What’s going on here? As before, there are insufficient funds to cover one of the withdrawals, so the withdraw function panics. The panic is caught by a deferred call to recover. At that point, OnEdge kicks in. OnEdge re-executes the body of the withdraw function within a shadow thread. This causes a data race to be reported at line 61 in account.go; this line:

balance -= debit

This line makes a global state change by writing to the balance global variable. Executing this line in the main and shadow threads results in two writes, which Go’s race detector recognizes as a race.

Limitations

Like all dynamic analyses, OnEdge’s effectiveness depends upon the workload to which one subjects one’s program. As an extreme example, if one never subjects one’s program to an input that causes it to panic, then OnEdge will find nothing.

A second limitation is that, since Go’s race detector can miss some races, OnEdge can miss some global state changes. This is due in part to a limitation of ThreadSanitizer, which keeps track of only a limited number of memory accesses to any one memory location. Once that limit is reached, ThreadSanitizer starts evicting entries randomly.

OnEdge present and future

OnEdge is a tool for detecting improper global state changes arising from incorrect uses of Go’s “defer, panic, and recover” pattern. OnEdge accomplishes this by leveraging the strength of Go’s existing tools, namely, its race detector.

We are exploring the possibility of using automation to incorporate WrapFunc and WrapRecover into a program. For now, users must do so manually. We encourage the use of OnEdge and welcome feedback.

Creating an LLVM Sanitizer from Hopes and Dreams

Each year, Trail of Bits runs a month-long winter internship aka “winternship” program. This year we were happy to host 4 winterns who contributed to 3 projects. This project comes from Carson Harmon, a new graduate from Purdue interested in compilers and systems engineering, and a new full-time member of our research practice.

I set out to implement a dynamic points-to analysis in LLVM for my winternship. Points-to analyses tell us, for a given address or pointer in our program, the type of data residing in memory at said address, and possibly, where the allocation may have occurred. This is useful because it helps evaluate the accuracy of a static analysis, augments static analysis with additional facts, and provides context into when and where certain objects are created.

The LLVM sanitizer infrastructure provides a natural location for implementing such an analysis. Sanitizers can instrument programs at compile time, and include a runtime support library with a libc implementation, interfaces for function replacement, memory and thread management, operating system-specific syscall support, and much more. To wit, many existing bug-finding tools that enjoy regular use by developers are implemented as sanitizers, including AddressSanitizer (ASan), ThreadSanitizer (TSan), MemorySanitizer (MSan), and UndefinedBehaviorSanitizer (UBSan).

Unfortunately, I didn’t get very far with my points-to analysis due to the lack of documentation on LLVM sanitizers. Instead, I’ll give a high-level overview of LLVM and how to use it to make sanitizers, retracing the steps I went through during my winternship.

How to write your own LLVM sanitizer

I started by reviewing Eli Bendersky’s blog and GitHub repo, Dr. Adrian Sampson’s blog, conference proceedings from EuroLLVM, and LLVM’s extensive toolchain documentation. Some of the information I found was outdated, but it helped identify the modules I needed to interact with in order to build my own sanitizer.

The compiler driver is the glue between all the modules in LLVM; any information that needs to be passed between modules goes through the driver. Building a sanitizer might require modifying any number of these components, so understanding the driver’s design is important for developing on the system.

High-level relationships between the clang frontend, LLVM IR, compiler-rt, and compiler driver

If an analysis pass or sanitizer wants to modify the LLVM type generation process, it needs to modify the driver’s code-generation process.

Dataflow of clang types throughout the code-generation process

The driver is also responsible for scheduling and running LLVM passes. LLVM passes modify the IR to insert, remove, or replace instructions, which is especially useful because it allows sanitizers to modify instructions and insert function calls without any extra effort from the developer. The driver registers passes based on configuration settings passed to the frontend. Sanitizer passes should be registered to run last, since the driver may run optimization passes, analysis passes, or other transformation passes that might affect your instrumentation.

Interactions between the pass manager and driver

Finally, the driver is responsible for linking the sanitizer’s runtime component, compiler-rt. Compiler-rt is a library that provides sanitizers the ability to interact with the target program during its execution. This interaction can happen via LLVM transformation passes inserting calls to functions defined in compiler-rt or by using compiler-rt’s function hooking interface.

Interactions between the transformation pass and runtime components

Make your own sanitizer

I created a tutorial to help make your own sanitizer that includes a prebuilt test sanitizer, a step-by-step guide on developing and integrating passes and runtime components, and other helpful resources for developing on LLVM. I think the winternship was a great way to learn something new over winter break and I wish more companies offered similar programs.

Getting 2FA Right in 2019

Since March, Trail of Bits has been working with the Python Software Foundation to add two-factor authentication (2FA) to Warehouse, the codebase that powers PyPI. As of today, PyPI members can enable time-based OTP (TOTP) and WebAuthn (currently in beta). If you have an account on PyPI, go enable your preferred 2FA method before you continue reading!

2018 and 2019 have been big years for two-factor authentication.

All told, there’s never been a better time to add 2FA to your services. Keep reading to find out how you can do it right.

What 2FA is and is not

Before we get into the right decisions to make when implementing two-factor authentication, it’s crucial to understand what second factors are and what they shouldn’t be.

In particular:

  • Second factor methods should not be knowable. Second factor methods are  something the user has or is, not something the user knows.
  • Second factor methods should not be a replacement for a user’s first factor (usually their password). Because they’re something the user has or is, they are an attestation of identity. WebAuthn is a partial exception to this: it can serve as a single factor due to its stronger guarantees.
  • Second factor methods are orderable by security. In particular, WebAuthn is always better than TOTP, so a user who has both enabled should be prompted for WebAuthn first. Don’t let users default to a less secure second factor. If you support SMS as a second factor for legacy reasons, do let users know that they can remove it once they add a better method.
  • 2FA implementations should not request the user’s second factor before the first factor. This isn’t really feasible with TOTP anyway, but you might be tempted to do it with a WebAuthn credential’s ID or public key. This doesn’t introduce a security risk per se, but inconsistent ordering only serves to confuse users who already have difficulty understanding the role their security key plays in authentication.
  • Recovery codes should be available and should be opt-out with sufficient warnings for users who prefer their accounts to fail-deadly. Recovery codes are not second factors — they circumvent the 2FA scheme. However, users don’t understand 2FA (see below) and will disable it out of frustration if not given a straightforward recovery method. When stored securely, recovery codes represent an acceptable compromise between usability and the soundness of your 2FA scheme.

Your users don’t understand 2FA, and will try to break it

Users, even extremely technical ones (like the average PyPI package maintainer), do not understand 2FA or its constraints. They will, to varying degrees:

  • Screenshot their TOTP QR codes and leave them lying in their Downloads folder. Risk: exposed TOTP secrets. Remedy: documentation; warn users not to save their QR codes.
  • Use the same QR code to provision multiple TOTP applications. Risk: poor understanding of what/where their second factor is. Remedy: documentation; tell users to provision only one device.
  • Use TOTP applications that allow them to export their codes as unencrypted text. Risk: exposed TOTP secrets; unsafe secret management. Remedy: documentation; suggest TOTP applications that don’t support unencrypted export.
  • Use broken TOTP applications, or applications that don’t respect TOTP parameters. Risk: incorrect TOTP code generation; confusing TOTP labeling within the application. Remedy: little to none; virtually every TOTP application ignores or imposes additional constraints on provisioning, so use the default provisioning parameters.
  • Scan the TOTP QR code with the wrong application. Risk: lost second factor; inability to log in. Remedy: require the user to enter a TOTP code before accepting the provisioning request.
  • Attempt to enter the provisioning URI or secret by hand and get it wrong. Risk: lost second factor; inability to log in. Remedy: same as above; require a TOTP code to complete provisioning.
  • Label their TOTP logins incorrectly and get them confused. Risk: mislabeled second factor; inability to log in. Remedy: provide all username and issuer name fields supported by otpauth, and discourage users from using TOTP applications that only support a whitelist of services or require manual labeling.
  • Delete their TOTP secret from their application before your service. Risk: account lockout. Remedy: documentation; warn users against doing this, and recommend TOTP applications that provide similar warnings.
  • Save their recovery codes to a text file on their Desktop. Risk: second factor bypass. Remedy: make recovery codes opt-in, and tell users to save only a printed copy of their recovery codes.
  • Get recovery codes for different services mixed up. Risk: lost bypass; inability to log in. Remedy: prefix recovery codes with the site name or another distinguishing identifier.
  • Ignore their second factors entirely and only use recovery codes. Risk: not real 2FA. Remedy: track recovery code usage and warn repeat offenders.
  • Attempt to use their old RSA SecurID, weird corporate HOTP fob, or pre-U2F key. Risk: not supported by WebAuthn. Remedy: provide explicit errors when provisioning fails; most browsers should do this for pre-U2F keys.
  • Get their hardware keys mixed up. Risk: mislabeled second factor; inability to log in. Remedy: give your users the ability to label their registered keys with human-friendly names on your service, and encourage them to mark the keys physically.
  • Give away or re-sell their hardware keys without deprovisioning them. Risk: second factor compromise. Remedy: documentation; warn users against doing this. For more aggressive security, challenge them to assert each of their WebAuthn credentials on some interval.

Technical users can be even worse: while writing this post, an acquaintance related a tale of using Twilio and a notification-pushing service to circumvent his University’s SMS-based 2FA.

Many of these scenarios are partially unavoidable, and not all of them fundamentally weaken the soundness of your 2FA setup. You should, however, be aware of each of them and seek to user-proof your scheme to the greatest extent possible.

WebAuthn and TOTP are the only things you need

You don’t need SMS or voice codes. If you currently support SMS or voice codes for legacy reasons, then you should be:

  1. Preventing new users from enabling them,
  2. Telling current users to remove them and change to either WebAuthn or TOTP, and
  3. Performing extra logging and alerting on users who still have SMS and/or voice codes enabled.

Paranoid? Yes. But if you hold any cryptocurrency, you probably should be paranoid.

Overkill? Maybe. SIM port attacks remain relatively uncommon (and targeted), despite requiring virtually no technical skill. It’s still better to have 2FA via SMS or voice codes than nothing at all. Google’s own research shows that just SMS prevents nearly all untargeted phishing attacks. The numbers for targeted attacks are more bleak: nearly one quarter of targeted attacks were successful against users with only SMS-based 2FA.

Worried about anything other than SMS being impractical and/or costly? Don’t be. There is a plethora of free TOTP applications for both iOS and Android. On the WebAuthn front, Google will sell you a kit with two security keys for $50. You can even buy a fully-open source key that will work with WebAuthn for $25! Most importantly of all: the fact that TOTP is not as good as a hardware key is not an excuse to continue allowing either SMS or voice codes.

Contrasting TOTP and WebAuthn

TOTP and WebAuthn are both solid choices for adding 2FA to your service and, given the opportunity, you should support both. Here are some factors for consideration:

TOTP is symmetric and simple, WebAuthn is asymmetric and complex

TOTP is a symmetric cryptographic scheme, meaning that the client and server share a secret. This, plus TOTP’s relatively simple code-generation process, makes it a breeze to implement, but results in some gotchas:

  1. Because clients are required to store the symmetric secret, TOTP is only as secure as the containing application or device. If a malicious program can extract the user’s TOTP secrets, then they can produce as many valid TOTP codes as they want without the user’s awareness.
  2. Because the only state shared between the client and server in TOTP is the initial secret and the subsequently generated codes, TOTP lacks a notion of device identity. As a result, a misinformed user can provision multiple devices with the same secret, increasing their attack surface.
  3. TOTP provides no inherent replay protection. Services may elect to guard against replays by refusing to accept a valid code more than once, but this can ensnare legitimate users who log in more than once within a TOTP window.
  4. Potentially brute-forceable. Most services use 6- or 8-digit TOTP codes and offer an expanded validation window to accommodate mobility-impaired users (and clock drift), putting an online brute force just barely on the edge of feasibility. The solution: rate-limit login attempts.
  5. All of the above combine to make TOTP codes ideal phishing targets. Both private and nation-state groups have used fake login forms and other techniques to successfully fool users into sharing their TOTP codes.

By contrast, WebAuthn uses asymmetric, public-key cryptography: the client generates a keypair after receiving a list of options from the server, sends the public half to the server for verification purposes, and securely stores the private half for signing operations during authentication. This design results in a substantially more complex attestation model, but yields numerous benefits:

  1. Device identity: WebAuthn devices are identified by their credential ID, typically paired with a human-readable label for user management purposes. WebAuthn’s notion of identity makes it easy to support multiple security keys per user — don’t artificially constrain your users to a single WebAuthn key per account!
  2. Anti-replay and anti-cloning protections: device registration and assertion methods include a random challenge generated by the authenticating party, and well-behaved WebAuthn devices send an updated sign counter after each assertion.
  3. Origin and secure context guarantees: WebAuthn includes origin information during device registration and attestation and only allows transactions within secure contexts, preventing common phishing vectors.

TOTP is free, WebAuthn (mostly, currently) is not

As mentioned above, there are many free TOTP applications, available for just about every platform your users will be on. Almost all of them support Google’s otpauth URI “standard,” albeit with varying degrees of completeness/correctness.

In contrast, most potential users of WebAuthn will need to buy a security key. The relationship between various hardware key standards is confusing (and could occupy an entire separate blog post), but most U2F keys should be WebAuthn compatible. WebAuthn is not, however, limited to security keys: as mentioned earlier, Google is working to make their mobile devices function as WebAuthn-compatible second factors, and we hope that Apple is doing the same. Once that happens, many of your users will be able to switch to WebAuthn without an additional purchase.

Use the right tools

TOTP’s simplicity makes it an alluring target for reimplementation. Don’t do that — it’s still a cryptosystem, and you should never roll your own crypto. Instead, use a mature and misuse-resistant implementation, like PyCA’s hazmat.primitives.twofactor.
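For example, a minimal server-side TOTP sketch using PyCA’s cryptography package might look like the following. The account name, issuer, and key handling are illustrative assumptions, not Warehouse’s actual code.

# Minimal TOTP sketch with PyCA cryptography; secret storage, user lookup,
# and rate limiting are deliberately omitted.
import os
import time

from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.hashes import SHA1
from cryptography.hazmat.primitives.twofactor import InvalidToken
from cryptography.hazmat.primitives.twofactor.totp import TOTP

secret = os.urandom(20)                     # per-user secret, stored server-side
totp = TOTP(secret, 6, SHA1(), 30, backend=default_backend())

# Encode this URI as the QR code shown during provisioning.
uri = totp.get_provisioning_uri("user@example.com", "ExampleService")

# Later, verify a code submitted by the user's authenticator app.
try:
    totp.verify(b"123456", time.time())
except InvalidToken:
    pass  # reject the login attempt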

WebAuthn is still relatively new, and as such doesn’t have as many server-side implementations available. The fine folks at Duo are working hard to remedy that: they’ve already open sourced Go and Python libraries, and have some excellent online demos and documentation for users and implementers alike.

Learn from our work

Want to add 2FA to your service, but have no idea where to start? Take a look at our TOTP and WebAuthn implementations within the Warehouse codebase.

Our public interfaces are well documented, and (per Warehouse standards) all branches are test-covered. Multiple WebAuthn keys are supported, and support for optional recovery codes will be added in the near future.

If you have other, more bespoke cryptography needs, contact us for help.