AIxCC finals: Tale of the tape
The results of DARPA’s AI Cyber Challenge (AIxCC) finals will be announced this week, revealing which team will claim the $4 million first prize for building the best AI system that automatically finds and fixes vulnerabilities in real-world code. For real-time updates and access to our CRS tool, Buttercup, follow @dguido on X or visit our Buttercup website.
Over the last few weeks, CTF Radiooo interviewed each of the seven finalists about their differing approaches to creating their own cyber reasoning system (CRS). These interviews reveal a diversity of technical approaches and philosophical differences regarding AI integration and risk tolerance. Should AI integration supplant or supplement traditional tools? How aggressive should teams be in submitting proofs of vulnerability (PoVs) and patches? What’s the best use of the teams’ LLM budgets? While the winner has not yet been announced, these differences show that there are multiple viable paths forward for using AI in vulnerability detection.
A geographically diverse field
Of the seven finalists, four teams are based at universities and three at private companies. While every team’s home base is in the US, their members and collaborators, drawn from other universities and companies, are spread across the globe.
Private companies: Trail of Bits (New York City); LACROSSE (Minneapolis); Theori (Austin, TX)
Academia: 42-b3yond-6ug (Northwestern University); all_you_need_is_a_fuzzing_brain (Texas A&M University); Shellphish (Arizona State University); Team Atlanta (Georgia Institute of Technology)
But geographic diversity is just the tip of the iceberg. What truly separates the teams is their approaches to vulnerability discovery, PoV generation, and patching. What follows is our best guess about each team’s technical strategy, based on their CTF Radiooo interviews; we haven’t seen their code, so treat this as informed speculation about each approach.
Vulnerability discovery
The seven finalists can be split into three philosophical camps based on the vulnerability discovery approach that motivated their system design.
Enhancing traditional security tools with AI
Trail of Bits, Shellphish, and LACROSSE built systems rooted in fuzzing, static analysis, and vulnerability research, then enhanced them with LLMs. Trail of Bits uses LLMs to generate seed inputs for traditional fuzzing tools, improving their code coverage and their ability to find inputs that trigger specific kinds of vulnerabilities. Shellphish’s “Grammar Guy” uses LLMs to generate and evolve progressive grammars based on a feedback loop that analyzes uncovered code paths. LACROSSE deploys 300–500 fuzzing agents (a scale similar to Trail of Bits’), orchestrated by “Optimus Zero,” and uses LLMs for higher-level reasoning tasks that require semantic understanding. They also use LLMs to create “vulnerability objects” when a crash occurs, describing and categorizing the bug and planning its patch.
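To make the “vulnerability object” idea concrete, here’s a minimal sketch in Python. The field names and the naive response parsing are our own guesses, not LACROSSE’s actual schema, and `llm` stands in for whatever completion call they use.

```python
from dataclasses import dataclass, field

@dataclass
class VulnerabilityObject:
    """Hypothetical record an LLM might fill in after a crash (not LACROSSE's real schema)."""
    crash_id: str
    harness: str                     # fuzz harness that produced the crash
    sanitizer_report: str            # raw sanitizer output attached for context
    description: str = ""            # LLM-written summary of the root cause
    category: str = ""               # e.g., a CWE label such as "CWE-787"
    patch_plan: list[str] = field(default_factory=list)  # ordered steps for the patcher

def build_vulnerability_object(crash_id: str, harness: str, report: str, llm) -> VulnerabilityObject:
    # Ask the model to describe, categorize, and plan; parsing here is deliberately naive.
    answer = llm(
        "Summarize this crash on one line, give a CWE on the next line, "
        f"then list patch steps, one per line:\n{report}"
    )
    description, category, *plan = answer.splitlines()
    return VulnerabilityObject(crash_id, harness, report, description, category, plan)
```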
AI-first with traditional validation
all_you_need_is_a_fuzzing_brain and Theori use LLMs as the primary reasoning engine and traditional security tools for validation and fallback. Of all the finalists, all_you_need_is_a_fuzzing_brain has the most AI-forward approach, using LLMs for vulnerability analysis, system architecture, strategic decision-making, and code generation; about 90% of their codebase was written with AI assistance. Theori’s LLM agents follow constrained reverse engineering workflows designed to keep the AI from wandering. Their system uses static analysis tools like Infer to generate thousands of bug candidates, and the LLM agents reason over those candidates to identify actual vulnerabilities and reduce false positives.
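To give a feel for that candidate-then-triage pipeline, here’s a minimal sketch. It assumes Infer’s standard JSON report layout and a hypothetical `llm` callable; Theori’s actual agent workflow is undoubtedly more elaborate.

```python
import json
import subprocess

def infer_candidates(project_dir: str) -> list[dict]:
    """Run Infer on a make-based project and return its reported issues as bug candidates."""
    subprocess.run(["infer", "run", "--", "make"], cwd=project_dir, check=True)
    with open(f"{project_dir}/infer-out/report.json") as f:
        return json.load(f)

def triage(candidates: list[dict], llm) -> list[dict]:
    """Keep only candidates the LLM judges to be real, triggerable bugs (hypothetical prompt)."""
    confirmed = []
    for c in candidates:
        verdict = llm(
            "Is this static-analysis finding a real, triggerable bug? Answer YES or NO.\n"
            f"{c['bug_type']} at {c['file']}:{c['line']}: {c['qualifier']}"
        )
        if verdict.strip().upper().startswith("YES"):
            confirmed.append(c)
    return confirmed
```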
Hybrid approach
Team Atlanta and 42-b3yond-6ug balance AI with traditional methods, each with unique specializations. To our knowledge, Team Atlanta is the only team to use custom models: Llama 7B, extensively fine-tuned for analyzing C code. 42-b3yond-6ug applies “super patches,” an LLM-based patching process that can fix two or more bugs at once, even when those bugs appear unrelated; their system can recognize when multiple different crashes stem from the same underlying vulnerability.
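As a rough illustration of the idea behind super patches, here’s a sketch that groups crashes by a crude stack-trace key; 42-b3yond-6ug’s real root-cause analysis is LLM-driven and far more sophisticated than this.

```python
from collections import defaultdict

def top_frames(stack_trace: str, n: int = 3) -> tuple[str, ...]:
    """Crude root-cause key: the top n frame locations of a sanitizer stack trace."""
    frames = [line.split()[-1] for line in stack_trace.splitlines() if line.lstrip().startswith("#")]
    return tuple(frames[:n])

def group_crashes(crashes: dict[str, str]) -> dict[tuple, list[str]]:
    """Group crash IDs that likely share an underlying bug, so one 'super patch' can cover them."""
    groups = defaultdict(list)
    for crash_id, trace in crashes.items():
        groups[top_frames(trace)].append(crash_id)
    return dict(groups)
```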
Proof of vulnerability (PoV) generation
PoVs serve as the foundation of the AIxCC scoring system because they demonstrate that vulnerabilities can actually be triggered. PoV+patch combinations earn significantly higher point values than patches submitted without a PoV, and the scoring system also rewards speed and accuracy. Furthermore, PoVs can be used to bypass other teams’ patches and reduce competitors’ accuracy multipliers, which adds an interesting game theory element to the competition.
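To see why the accuracy multiplier matters, here’s an illustrative calculation with made-up point values; it is not the actual AIxCC scoring formula, just the shape of the tradeoff.

```python
# Illustrative only: invented point values, not DARPA's real formula.
POV_POINTS, PATCH_POINTS, PATCH_ONLY_POINTS = 2.0, 6.0, 3.0

def score(pov_patch_pairs: int, patch_only: int, correct: int, total: int) -> float:
    """Base points scaled by a simple accuracy multiplier (correct / total submissions)."""
    accuracy = correct / total if total else 0.0
    base = pov_patch_pairs * (POV_POINTS + PATCH_POINTS) + patch_only * PATCH_ONLY_POINTS
    return base * accuracy

# A team that submits fewer, well-validated items can outscore a spray-and-pray strategy:
print(score(pov_patch_pairs=5, patch_only=0, correct=5, total=5))    # 40.0
print(score(pov_patch_pairs=5, patch_only=10, correct=8, total=15))  # ~37.3
```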
Traditional fuzzing-based PoV generation
LACROSSE’s PoV generation occurs through established fuzzing methods, focusing on agent orchestration rather than AI-driven vulnerability discovery. Their approach prioritizes proven fuzzing reliability over experimental AI techniques, with Optimus Zero managing global state and task distribution among traditional security tools.
42-b3yond-6ug also maintains traditional fuzzing as the core PoV generation mechanism. Their approach includes SARIF integration for static analysis report validation and multi-fuzzer coordination through reinforcement-learning-based scheduling.
AI-enhanced traditional methods
Trail of Bits uses LLMs to generate Python programs that create specialized seed inputs for traditional fuzzing tools, leveraging the models’ implicit understanding of complex input patterns like SQL injection and path traversal payloads. These inputs are added to the fuzzer’s coverage-guided corpus to improve fuzzing performance. The approach is optimized for fast harness saturation (to meet competition time constraints) and uses AI to generate semantically aware inputs that traditional mutational fuzzing struggles to produce.
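In rough outline, the kind of corpus-seeding script an LLM might emit looks like the following; the specific payloads and file layout are illustrative, and Buttercup’s real prompts and formats differ.

```python
import os

# Sketch of an LLM-written seed generator: emit inputs that already "look like"
# SQL injection and path traversal payloads, so coverage-guided fuzzing starts
# from semantically meaningful corners of the input space.
SEEDS = [
    b"' OR '1'='1' --",
    b"admin'; DROP TABLE users; --",
    b"../../../../etc/passwd",
    b"..%2f..%2f..%2fetc%2fpasswd",
]

def write_seeds(corpus_dir: str) -> None:
    """Drop each seed into the fuzzer's corpus directory (libFuzzer/AFL-style layout)."""
    os.makedirs(corpus_dir, exist_ok=True)
    for i, seed in enumerate(SEEDS):
        with open(os.path.join(corpus_dir, f"llm_seed_{i}"), "wb") as f:
            f.write(seed)

if __name__ == "__main__":
    write_seeds("corpus/")
```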
Shellphish enhances traditional fuzzing with “Grammar Guy,” which uses LLMs to generate progressive grammars that evolve based on coverage feedback, targeting complex input formats and protocols. This approach improves the ability to fuzz formats like SQL, URLs, and binary protocols, with grammars continuously refined based on program exploration results. This AI-driven grammar generation approach consumes a sizable portion of their LLM budget but significantly increases their bug-finding capability.
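A stripped-down version of that feedback loop might look like this, where `propose_grammar`, `sample_inputs`, and `run_with_coverage` are hypothetical callables standing in for Shellphish’s actual components.

```python
def evolve_grammar(llm, harness, propose_grammar, sample_inputs, run_with_coverage, rounds: int = 5):
    """Iteratively refine an input grammar using coverage feedback.

    `propose_grammar`, `sample_inputs`, and `run_with_coverage` are hypothetical
    stand-ins: draft a grammar with an LLM, expand it into inputs, and measure coverage.
    """
    grammar = propose_grammar(llm, harness)            # LLM drafts an initial grammar
    best_coverage = 0
    for _ in range(rounds):
        inputs = sample_inputs(grammar, n=200)          # expand the grammar into concrete inputs
        coverage, uncovered = run_with_coverage(harness, inputs)
        if coverage <= best_coverage:
            break                                       # no progress; keep the best grammar so far
        best_coverage = coverage
        # Feed uncovered code paths back to the LLM so the next grammar targets them.
        grammar = propose_grammar(llm, harness, hints=uncovered, previous=grammar)
    return grammar
```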
Team Atlanta deploys language-specific PoV strategies across their three specialized CRS systems, with LLMs generating custom Python mutators and input generators tailored to C versus Java vulnerability patterns. Their approach includes directed fuzzing guided by static analysis reports and LLM-generated function-level dictionaries for targeted mutation.
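For flavor, an LLM-generated Python mutator for a C-style target might look like the sketch below; the dictionary entries and mutation strategy are our invention, not Team Atlanta’s code.

```python
import random

# Hypothetical function-level dictionary an LLM might produce for a C parser target.
DICTIONARY = [b"\x00", b"\xff\xff\xff\xff", b"%n%n%n", b"A" * 1024, b"<!--", b"-->"]

def mutate(data: bytes, max_size: int, seed: int) -> bytes:
    """Custom mutator: splice dictionary tokens into the input at random offsets."""
    rng = random.Random(seed)
    token = rng.choice(DICTIONARY)
    pos = rng.randrange(len(data) + 1) if data else 0
    out = data[:pos] + token + data[pos:]
    return out[:max_size]
```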
AI-first PoV generation
all_you_need_is_a_fuzzing_brain generates approximately 90% of PoVs through direct AI reasoning, running thousands of agents in parallel to overcome AI unreliability through scale and model diversity. Traditional fuzzing is activated only as a fallback validation mechanism when the AI methods fail.
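Conceptually, the scale-over-reliability bet looks something like this sketch, where `ask_model` and `validate_pov` are hypothetical callables for querying an LLM and replaying a candidate input against the harness.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MODELS = ["model-a", "model-b", "model-c"]  # placeholder names; diversity reduces correlated failures

def find_pov(harness, task, ask_model, validate_pov, n_agents: int = 1000):
    """Fan out many independent LLM attempts; the first candidate that validates wins.

    `ask_model` and `validate_pov` are hypothetical callables: one queries an LLM,
    the other replays a candidate input against the fuzzing harness.
    """
    with ThreadPoolExecutor(max_workers=64) as pool:
        futures = [
            pool.submit(ask_model, MODELS[i % len(MODELS)], task)
            for i in range(n_agents)
        ]
        for fut in as_completed(futures):
            candidate = fut.result()
            if candidate and validate_pov(harness, candidate):
                return candidate
    return None  # nothing validated; fall back to traditional fuzzing
```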
Theori’s LLM agents use semantic understanding to generate PoVs for targets that demand strict format compliance. This gives them an edge on complex formats that traditional fuzzing struggles with, such as well-formed URLs and intricate binary protocols. When agent-generated PoVs fail, the reasoning attempts become seeds for traditional fuzzing, creating a feedback loop where AI insights inform traditional validation methods.
Patching
Each team’s patching strategy reveals their risk tolerance and understanding of the competition scoring mechanics, which may prove to be the most critical factor in determining final rankings.
Conservative: Trail of Bits, Shellphish, and Team Atlanta never submitted patches without PoVs. Team Atlanta actually disabled their non-PoV patching capabilities before the finals to avoid accuracy penalties.
Aggressive: Theori developed a mathematical model for submitting patches without PoVs, implementing a 2:1 ratio strategy in which they would submit up to two speculative patches for every confirmed PoV-based patch (a minimal sketch of this gating rule follows this list).
Holistic: 42-b3yond-6ug deployed “super patches,” which are single patches that fix multiple seemingly unrelated vulnerabilities, turning the accuracy penalty problem into a scoring advantage.
Strategic: Trail of Bits implemented cross-validation systems to test PoVs against existing patches and to strategically submit PoVs that might break other teams’ patches. LACROSSE chose a middle ground, submitting patches based on LLM consensus and a confidence algorithm.
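Here’s the minimal sketch of a 2:1 gating rule promised above; the bookkeeping and class shape are illustrative, not Theori’s actual model.

```python
class PatchBudget:
    """Allow at most two speculative (non-PoV) patches per confirmed PoV-backed patch."""

    def __init__(self, ratio: int = 2):
        self.ratio = ratio
        self.confirmed = 0    # patches submitted together with a validated PoV
        self.speculative = 0  # patches submitted without a PoV

    def record_confirmed(self) -> None:
        self.confirmed += 1

    def may_submit_speculative(self) -> bool:
        return self.speculative < self.ratio * self.confirmed

    def record_speculative(self) -> None:
        if not self.may_submit_speculative():
            raise RuntimeError("speculative patch budget exhausted")
        self.speculative += 1
```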
What we’ve learned so far
We are eager to learn more technical details from the teams at DEFCON and are excited to check out the other teams’ CRSs when they become open source soon. Regardless of who wins, the AIxCC finals demonstrated that AI-assisted cybersecurity has reached a practical tipping point. Every team achieved meaningful automation of tasks that previously required human experts, from vulnerability discovery to patch generation. The innovations demonstrated here, from grammar-based fuzzing to agent-based analysis, will likely influence cybersecurity tools for years to come.
Most importantly, the competition proved that the question isn’t whether AI will transform cybersecurity, but how quickly and in what forms. The seven teams that made it to the finals each found different answers to that question, and this week, we’ll learn which approach DARPA’s judges found most compelling.
Lastly, we’d like to comment on what we admire about each team based on what we learned.
42-b3yond-6ug: We admire their creativity in the use of “super patches,” which attempt to fix multiple bugs with one patch, even if the bugs appear unrelated. Very clever!
all_you_need_is_a_fuzzing_brain: They get the Dr. Strangelove, or How I Learned to Stop Worrying and Love the LLM Award. We were very impressed to learn that much of their code was written with LLM code generation.
LACROSSE: This team gave its original CRS from almost 10 years ago a glow-up and competed in AIxCC! This says a lot about its ability to write long-lasting software.
Shellphish: We love anyone who is dedicated to making fuzzing tools faster and smarter. With Shellphish’s Grammar Guy, we believe that they have made a considerable leap forward in improving fuzzing for the security community.
Team Atlanta: In keeping with the spirit of the competition, Team Atlanta was the only team to run its CRS on fine-tuned models. This shows they have a good sense of where the security industry is heading.
Theori: Their approach resonated with the true spirit of the competition, using a very LLM-forward approach to building their strategy. We’re very excited to see how well they are able to reduce false positives on a large scale.
Trail of Bits: That’s us!
Thank you to CTF Radiooo for taking the time to interview each of the AIxCC finalists! Their hard work will help everyone understand which strategies were most effective when the results are announced.
For more background, see our previous posts on the AIxCC:
- Buckle up Buttercup: AIxCC’s scored round is underway
- Kicking off AIxCC’s Finals with Buttercup
- Trail of Bits Advances to AIxCC Finals
- Trail of Bits’ Buttercup heads to DARPA’s AIxCC
- DARPA awards $1 million to Trail of Bits for AI Cyber Challenge
- Our thoughts on AIxCC’s competition format
- DARPA’s AI Cyber Challenge: We’re In!