Until now, smart contract security researchers (and developers) have been frustrated by limited information about the actual flaws that survive serious development efforts. That limitation increases the risk of making critical smart contracts vulnerable, misallocating resources for risk reduction, and missing opportunities to employ automated analysis tools. We’re changing that. Today, Trail of Bits is disclosing the aggregate data from every full smart contract security review we’ve ever done. The most surprising and impactful results we extracted from our analysis are:
- Smart contract vulnerabilities are more like vulnerabilities in other systems than the literature would suggest.
- A large portion (about 78%) of the most important flaws (those with severe consequences that are also easy to exploit) could probably be detected using automated static or dynamic analysis tools.
- On the other hand, almost 50% of findings are not likely to ever be found by any automated tools, even if the state-of-the-art advances significantly.
- Finally, manually produced unit tests, even extensive ones, likely offer weak protection at best, and none at worst, against the flaws an expert auditor can find.
Continue reading this post for a summary of our study and more details on these results.
Everyone wants to prevent disastrous vulnerabilities in Ethereum smart contracts. Academic researchers have supported that effort by describing some categories of smart contract vulnerabilities. However, the research literature and most online discourse usually focus on understanding a relatively small number of real-world exploits, typically with a bias toward the highly visible. For example, reentrancy bugs are widely discussed because they were responsible for the infamous DAO attack and are largely specific to smart contracts. However, is reentrancy the most common serious problem in real smart contracts? If we don’t know, then we cannot effectively allocate resources to preventing smart contract vulnerabilities. It’s not enough to understand detection techniques; we have to know how and where to apply them. Smart contracts are new, and decision makers have a relatively shallow pool of developer and analyst experience upon which to base their actions. Having real data to draw conclusions from is essential.
So, we collected the findings from the full final reports for twenty-three paid security audits of smart contract code we performed, five of which have been kept private. The public audit reports are available online and make for informative reading. We categorized all 246 smart-contract-related findings from these reports, in some cases correcting the original audit categorization for consistency, and we considered the potential for both static and dynamic analysis tools, in the long run, to detect each finding. We also compared the frequencies of categories to those for fifteen non-smart-contract audits we performed. Using paid, expert audits of code means our statistics aren’t overwhelmed by the large number of relatively silly contracts on the blockchain. Using many audit findings instead of a handful of exploited vulnerabilities gives us a better picture of potential problems to keep watch for in the future.
Category Frequencies Are Different from Other Audits… But Not as Much as You’d Think
The most common type of smart contract finding is also the most common kind of finding in the fifteen non-smart-contract audits we examined for comparison: data validation flaws are extremely common in every setting, constituting 36% of smart contract findings and 53% of non-smart-contract findings. This is no surprise; accepting inputs that should be rejected, and that lead to bad behavior, will always be easy to do and will always be dangerous. Access control is another common source of problems in smart contracts (10% of findings) and in the other systems we audited (18%); it’s easy to accidentally be too permissive, and access control can also be disastrous if too restrictive (for example, when even the owner of a contract can’t perform critical maintenance tasks in some states, due to a contract bug).
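To make the category concrete, here is a minimal sketch in Python (a toy ledger of our own invention, not code from any audited contract) of the pattern behind a typical data validation finding: a state-changing operation that trusts its inputs, next to the checks that should guard it.

```python
class VulnerableLedger:
    """Toy balance tracker that accepts inputs it should reject."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def transfer(self, sender, recipient, amount):
        # Missing validation: amount may be negative or exceed the
        # sender's balance, and the recipient may be bogus (None).
        self.balances[sender] = self.balances.get(sender, 0) - amount
        self.balances[recipient] = self.balances.get(recipient, 0) + amount


class CheckedLedger(VulnerableLedger):
    """Same operation, but bad inputs are rejected before state changes."""

    def transfer(self, sender, recipient, amount):
        if recipient is None:
            raise ValueError("invalid recipient")
        if amount <= 0 or amount > self.balances.get(sender, 0):
            raise ValueError("invalid amount")
        super().transfer(sender, recipient, amount)
```

In Solidity the fix is usually just as small: a missing `require()` at the top of a function. What makes the category dangerous is that the unchecked version often works fine on every input the developer thought to try.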
Some categories of problems are much less common in smart contracts: unsurprisingly, denial of service, configuration, and cryptography issues are less frequent in a context where the blockchain abstracts away communication issues and operating system/platform-specific behavior changes, and gas limits reduce the temptation to roll your own cryptography. Data exposure problems are also less common in smart contracts; most developers seem to understand that data on the blockchain is inherently public, so there seem to be fewer misunderstandings about the consequences of “surprise” visibility. But these cases are somewhat unusual; for a majority of finding types, including overflow/underflow and arithmetic precision, patching, authentication, timing, error reporting, and auditing and logging, the percentages of findings are within 10% of those for non-smart-contract audits.
The Worst of the Worst
In addition to counting how many findings there were in each category, we looked at how serious those findings tended to be: their potential severity, and the difficulty for an attacker to exploit them. We refer to the worst findings as high-low: high severity, low difficulty. These issues can allow an attacker to inflict major damage with relative ease.
Many of our twenty-two categories had no high-low findings, but a small number had high-low rates greater than 10%: access controls (25% high-low), authentication (25%), timing (25%), numerics (23%), undefined behavior (23%), data validation (11%), and patching (11%). Note that the much-dreaded reentrancy, while often serious (50% of reentrancy findings were high severity), had no high-low findings at all, and accounted for only 4 of the 246 total findings.
Tools and Automation: We Can Do a Lot Better
For each finding, we determined, to the best of our knowledge, if it could potentially be detected by automated static analysis (e.g., our Slither tool), using a reasonable detector without too many false positives, or by automated dynamic analysis (e.g., with property-based testing like Echidna or symbolic execution like Manticore), either with off-the-shelf properties like standard ERC20 semantics, or using custom invariants. Rather than restricting ourselves to current tools, we looked to the future, and rated a finding as detectable if a tool that could be produced with significant engineering effort, but without unprecedented advances in the state-of-the-art in software analysis, could potentially find the problem. That is, we asked “could we write a tool that would find this, given time and money?” not “can current tools definitely find this?” Obviously, this is a somewhat subjective process, and our exact numbers should not be taken as definitive; however, we believe they are reasonable approximations, based on careful consideration of each individual finding, and our knowledge of the possibilities of automated tools.
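As a flavor of what “could we write a tool that would find this?” means in the static case, here is a deliberately naive sketch in Python (our own toy; a real detector, like those in Slither, works on a parsed representation of the contract rather than raw text): a scanner that flags the well-known `tx.origin` authorization anti-pattern in Solidity source.

```python
import re

# Flag `require(tx.origin ...)` authorization checks, a classic Solidity
# anti-pattern: tx.origin can be abused via an intermediate contract, so
# authorization should use msg.sender instead.
TX_ORIGIN_AUTH = re.compile(r"\brequire\s*\(\s*tx\.origin\b")


def find_tx_origin_auth(source):
    """Return 1-based line numbers of suspicious require(tx.origin ...) uses."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if TX_ORIGIN_AUTH.search(line)
    ]


SAMPLE = """\
contract Wallet {
    address owner;
    function withdraw() public {
        require(tx.origin == owner);  // should be msg.sender
        msg.sender.transfer(address(this).balance);
    }
}
"""
```

This is the level of effort for the very easiest class of detections; most of the findings we rated as statically detectable would need substantially more engineering, but nothing beyond what analysis frameworks can already support.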
Using this standard, 26% of the full set of findings could likely be detected using feasible static approaches, and 37% using dynamic methods (though usually only with the addition of a custom property to check). However, the potential for automated tools is much better when we restrict our attention to only the worst, high-low, findings. While static tools have less potential for detecting high-low findings than dynamic ones (33% vs. 63%), four of the high-low issues could probably only be detected by static analysis tools, which are also much easier to apply, and require less user effort. Combining both approaches, in the best-case scenario, would result in automatic detection of 21 of the 27 high-low findings: almost 78% of the most important findings. Our estimates of how effective static or dynamic analysis tools might be, in the limit, also vary widely by the kind of finding:
|Finding category|Static|Dynamic|
|---|---|---|
|Denial of service|40%|0%|
Of course, in some cases, these percentages are not particularly informative; for instance, there was only one cryptography finding in our audit set, so it isn’t safe to assume that all cryptography bugs are easy to catch with a static analysis tool. Similarly, the coding bug category, containing what amount to “typos” in code, is likely to have an even higher percentage of easily statically detected problems, but there were only a handful of such problems in our audits.
Our best guess is that the combination of major impact on system behavior (thus high severity) and low difficulty (thus easy to find, in some sense) is not only a boon to would-be attackers, but a big help to automated analysis tools. That’s good news. Ongoing efforts to improve smart contract analysis tools are well worth the effort. That’s part of our motivation in releasing Crytic, a kind of Travis CI for smart contracts — with built-in support for running static analysis (including some Slither detectors not yet available in the public release) and, soon, dynamic analysis, on your code, automatically.
Perhaps the most important upshot here is that using high-quality automated static analysis is a best practice with almost no downside. If you’re writing important smart contracts, looking at a relatively small number of false positives in order to detect, with almost no developer effort, some of the most critical flaws is simply the right thing to do.
Tools and Automation: No Silver Bullet
However, a lot of the findings (almost 49%) are almost impossible to imagine detecting with a tool. In most of these cases, in fact, a tool isn’t even very likely to help. Slither’s code understanding features may assist in finding some issues, but many problems, and almost a quarter of the most important problems, require deeper understanding of the larger context of blockchains and markets. For example, tools can’t inform you about most front-running. Problems requiring human attention are not limited to the obvious categories, such as front-running, configuration, and documentation, either: there are 35 data validation findings, 12 access controls findings, and 10 undefined behavior findings, for example, that are unlikely to be detectable by automated tools – and 3 of these are high-low findings. A full 35% of high severity findings are unlikely to be detected automatically.
Even in the best of possible near-future automated tool worlds, and even with full formal verification (which might take up to 9x the developer effort), a great many problems simply require human attention. Security is a Strong-AI Hard problem. Until the robots replace us, independent expert attention will remain a key component of security for the most essential contracts.
Unit Tests are Great… But Maybe Not For This
Finally, what about unit testing? We didn’t add any unit tests during our audits, but we can look at whether the presence of significant unit tests was correlated with fewer findings during audits, or at least fewer high-low findings. The number of data points is, of course, too small to draw any solid conclusions, but we didn’t find any statistically significant correlation between our estimate of unit test quantity and quality and the presence of either findings in general or high-low findings. In fact, the insignificant relationships we did detect were in the wrong direction: more unit tests meant more problems (though the positive relationship was at least weaker for high-low findings, which is comforting). We hope and believe that’s just noise, or the result of some confounding factor, such as a larger attack surface for more complex contracts. This result is subject to a number of caveats, including our limited ability to gauge the quality of unit tests without examining each line of code in detail, but we do believe that if there were a causal relationship between better unit tests and fewer audit findings, with a large and consistent effect size, we’d have seen better evidence for it than we did.
Obviously, this doesn’t mean you shouldn’t write unit tests! It means that the kinds of things attackers and auditors are looking for may not overlap significantly with the kinds of problems unit tests help you avoid. Unit testing can probably improve your development process and make your users happier, but it may not help you actually be much more secure. It’s widely known that the problems developers can imagine happening, and write unit tests to check for, do not often overlap with the problems that cause security vulnerabilities. That’s why fuzzing and property-based testing are so valuable.
One key point to take away here is that the bugs found by property-based testing can be added as new unit tests to your code, giving you the best of both worlds. The pyfakefs module for creating high-fidelity mock file systems in Python, originally developed at Google, is a widely used library with a fairly extensive set of well-written unit tests. However, using the TSTL property-based testing tool for Python revealed over 100 previously undetected problems in pyfakefs (all of which were fixed), and let the developers of pyfakefs add a large number of new, more powerful unit tests to detect regressions and new bugs. The same workflow can be highly effective with a well-unit-tested smart contract and Echidna; in fact, it can be easier, because Echidna does a better job of automatically figuring out the public interface to a contract than most property-based testing tools do when interacting with a library API.
Stay Tuned for More Details
We’ll be publishing the full results after we add a few more of our own smaller-scale audits and validate our results by comparing to estimates for audits performed by other companies. In the meantime, use our preliminary results to inform your own thinking about defects in smart contracts.