What you'll learn

  • Reproducibility, not detection, is the actual bottleneck in patching: maintainers receive thousands of reports they cannot afford the hours to validate, so real CVEs sit in the backlog while teams chase symptoms.

  • Roughly 30 to 35 percent of LLM-generated patches can be bypassed because they solve a single crash, not the underlying class of vulnerability — and the same pattern shows up in human-written patches that get bypassed in a week or ten days.

  • The defensive chain of verification has to include fuzzing, not just compile plus regression plus unit tests plus an LLM critic, or you ship patches that pass every test you wrote and fail the test the attacker runs.

Description

This is the episode I have been pushing for since CVE Genie hit arXiv. Saad Ullah is a PhD student at Boston University, a member of Team Shellfish, and the lead author behind two of the more important security papers in the LLM era — the 2024 work showing that off-the-shelf LLMs cannot reliably reason about vulnerabilities, and CVE Genie itself, which proved an LLM agent system can build, exploit, and reproduce CVEs end-to-end across more than a hundred and forty vulnerability classes.

The conversation goes where the headlines do not. We spend most of the hour on the part of the problem that does not show up in vendor demos: reproducibility. Open source maintainers receive thousands of issue reports they cannot afford to reproduce, so the real CVEs sit in the backlog while everyone chases symptoms. CVE Genie's contribution is not just exploitation. It is an architecture that builds the project, runs it in a vulnerable state, generates a verifier, and gives a maintainer a reproducible signal worth acting on.

Saad's upcoming paper lands the harder finding. Roughly thirty to thirty-five percent of patches generated by the top LLM patching frameworks can be bypassed in a week or ten days, because they solve a single crash, not the class of vulnerability. The same is true of many developer-written patches in 2025. Team Shellfish won the best patcher award at the DARPA AI Cyber Challenge by adding fuzzing to the verification chain that most systems still skip.

If you run a vulnerability management program, this is the conversation about where the next twelve months go.

What we cover

  • "Attacking is going one epoch up" — Saad's framing of the asymmetry that drove him to publish CVE Genie

  • "Maintainers can't reproduce the issue" — the open source backlog problem behind real CVEs going unpatched

  • "30 to 35 percent of LLM patches can be bypassed" — the headline finding from the upcoming paper

  • "The patch only solves the symptom" — why developer-written patches get bypassed in a week or ten days

  • "Best patcher at AIxCC" — Team Shellfish's chain of verification, with fuzzing baked in

  • "Build access is the moat" — why dynamic analysis tools cannot be applied to most open source projects

  • "Artificial" — Saad's product turning CVE Genie's reproduction stack into autonomous remediation

  • "Security engineers still pick the context" — what does not get automated, even at scale

Thank you to our Sponsors:

Hampton North is the premier US-based cybersecurity search firm. Start building your security team with Hampton North.

Sysdig is the leader in AI-powered real-time cloud defense; stop watching and start defending.

The conversation

The asymmetry that started CVE Genie

Saad's path into vulnerability research started with a question the rest of the field was asking informally and not testing rigorously: can large language models actually reason about vulnerabilities, or are they pattern-matching on training data? His 2024 paper — one of the most cited security papers of the year — answered that they could not do it reliably out of the box. The follow-up question was more provocative. If a small academic team could publish a system that builds, exploits, and reproduces CVEs across more than a hundred and forty classes, who else has the same capability and is not publishing?

Now we can actually generate these proof of concept exploits. And if I am able to do it and I have published something about it, it's very possible that there are people around the world who are already doing it and we don't really know.

— Saad Ullah

The paper landed and the defensive conversation moved. The framing Saad used in our recording was that the attack side was already moving an epoch faster than the defense side, and the only way to force the defensive conversation forward was to publish the attack capability with enough rigor that nobody could dismiss it. CVE Genie did that. The Trail of Bits team — friends of the pod — took second place at the DARPA AI Cyber Challenge final competition at DEF CON, which is the public moment where this stopped being a research debate and started being a procurement question.

Reproducibility is the bottleneck no one talks about 

The piece of CVE Genie that gets lost in the "AI can write exploits" headline is the part that matters most to a defender. Most vulnerability reports against open source projects sit in a backlog because the maintainer cannot reproduce them. Saad estimates roughly ninety percent of inbound reports are false positives, and the maintainers do not have the time to triage them all. The real CVEs disappear into the noise.

We reproduce and we actually figured out that there were some CVEs in that that we were not dealing with because the maintainers don't really have that much time to go through thousands and thousands of warnings.

— Saad Ullah

This is where CVE Genie's architecture matters more than its exploitation results. The system builds the project, instruments it, runs it in a vulnerable state, generates the verifier, and produces a reproducible artifact a maintainer can act on. The defensive on-ramp is autonomous reproduction, not autonomous patching — because reproduction is the gate that unlocks every downstream tool, including dynamic analysis, fuzzing, and root cause inference. Without it, security engineers are stuck reading text reports and guessing. 
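To make that architecture concrete, here is a minimal sketch of what a reproduction pipeline of this shape could look like. The stage names, result fields, and the git/make build steps are illustrative assumptions for a simple C project, not CVE Genie's actual implementation.

```python
# Minimal sketch of a reproduction pipeline of the shape described above.
# Stage names, fields, and the git/make build steps are illustrative
# assumptions, not CVE Genie's actual implementation.
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class ReproductionResult:
    cve_id: str
    built: bool                 # project compiled at the vulnerable commit
    triggered: bool             # verifier observed the vulnerable behaviour
    verifier: Optional[Path]    # script a maintainer can rerun locally

def reproduce(cve_id: str, repo: Path, vulnerable_commit: str,
              verifier_script: Path) -> ReproductionResult:
    """Build the project at the vulnerable commit, then run a generated
    verifier (for example, a crashing input against an ASan-instrumented
    binary) and report a signal the maintainer can act on."""
    subprocess.run(["git", "-C", str(repo), "checkout", vulnerable_commit],
                   check=True)
    build = subprocess.run(["make", "-C", str(repo)], capture_output=True)
    if build.returncode != 0:
        return ReproductionResult(cve_id, built=False, triggered=False,
                                  verifier=None)

    # The verifier is the reproducible artifact: it either triggers the
    # vulnerability (non-zero exit, sanitizer report) or it does not.
    run = subprocess.run(["bash", str(verifier_script)], cwd=repo,
                         capture_output=True)
    return ReproductionResult(cve_id, built=True,
                              triggered=run.returncode != 0,
                              verifier=verifier_script)
```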

The bypass problem: patches that solve symptoms

The number Stuart immediately flagged is the right one to flag. Saad's upcoming paper shows that almost a third of patches generated by the top LLM patching frameworks can be bypassed easily, and the same pattern shows up in developer-written patches throughout 2025. The mechanism is consistent. A vulnerability report describes a crash. The patch fixes the crash. The class of vulnerability — use-after-free, for instance — is still there, and it produces a different crash a week later that bypasses the patch.

Almost like 30, 35 % of the patches that are generated by the top, top patching frameworks can be bypassed very, very easily.

— Saad Ullah

This is the thing that should make a vulnerability management leader pause. The current generation of patching tools and even significant chunks of human patching are operating at a layer above root cause. They are remediating symptoms because that is what the bug report describes and what the unit tests catch. The attacker is not testing your unit tests; the attacker is testing the class of vulnerability and finding the next crash inside it.
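The episode's example is use-after-free; to keep a sketch short and in Python, here is the same symptom-versus-class gap illustrated with a hypothetical path traversal check. The function names and the uploads directory are invented for illustration, not taken from the episode.

```python
# Hypothetical illustration of a symptom-level patch versus a root-cause
# patch. The episode's example is use-after-free; this path traversal
# check is an invented stand-in that shows the same gap in a short sketch.
from pathlib import Path

BASE = Path("/srv/app/uploads")   # invented location for the example

def read_upload_symptom_patch(name: str) -> bytes:
    # "Fixes" the exact crash in the report: a filename containing "../".
    # A follow-up report with an absolute path, or a path that escapes
    # through a symlink, arrives a week later and bypasses the patch.
    if "../" in name:
        raise ValueError("rejected")
    return (BASE / name).read_bytes()

def read_upload_root_cause_patch(name: str) -> bytes:
    # Remediates the class: resolve the final path and require it to stay
    # inside BASE, however the traversal was expressed.
    target = (BASE / name).resolve()
    if not target.is_relative_to(BASE.resolve()):
        raise ValueError("rejected")
    return target.read_bytes()
```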

The chain of verification at AIxCC

Team Shellfish won the best patcher award at the DARPA AI Cyber Challenge for one specific reason: their verification chain treated patch generation as adversarial. The competition was structured so that every team's patches were attacked by every other team's seed generators. A patch that compiled, regressed cleanly, passed unit tests, and satisfied an LLM critic was still not finished. It also had to survive fuzzing.

Your patch needs to actually remediate that root cause of the vulnerability.

— Saad Ullah

The lesson generalizes. Most teams shipping LLM-assisted patching today run the first four checks — compile, regression, unit test, LLM critic — and skip the fuzzer because integration is hard and the build access often is not there. Saad's framing here is that the missing piece is not novel; fuzzing has been a standard security tool for years. What is novel is using it as a closing gate on every machine-generated patch. Without that gate, the patch passes the tests you wrote and fails the test the attacker runs.

The other structural lesson from the challenge is that build access is a moat. The tools that exist for static analysis are noisy by design; the tools that work are dynamic and require the project to be buildable in an instrumented state. Most enterprises and most security vendors do not have that build, which is why so many capable security tools sit unused.
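A hedged sketch of that chain of verification, with fuzzing as the closing gate, is below. The gate names mirror the checks discussed above; the callables they wrap (build system, test runners, an LLM critic, a fuzz harness) are assumed to be supplied by the caller and are not Team Shellfish's actual interfaces.

```python
# Sketch of a patch-verification chain with fuzzing as the closing gate.
# The gate names mirror the checks discussed above; the callables they
# wrap are assumed to exist in the caller's environment.
from typing import Callable, List, Tuple

Gate = Tuple[str, Callable[[], bool]]

def verify_patch(compiles: Callable[[], bool],
                 regression_passes: Callable[[], bool],
                 unit_tests_pass: Callable[[], bool],
                 llm_critic_accepts: Callable[[], bool],
                 fuzzing_survives: Callable[[], bool]) -> Tuple[bool, str]:
    """Run the gates in order and stop at the first failure. A patch that
    clears the first four but crashes under fuzzing is rejected, which is
    exactly where a symptom-level patch tends to fall over."""
    gates: List[Gate] = [
        ("compile", compiles),
        ("regression", regression_passes),
        ("unit tests", unit_tests_pass),
        ("LLM critic", llm_critic_accepts),
        ("fuzzing", fuzzing_survives),   # the gate most pipelines skip
    ]
    for name, gate in gates:
        if not gate():
            return False, f"failed at: {name}"
    return True, "all gates passed"
```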

From benchmark to product: artificial

CVE Genie started as a research benchmark — Saad wanted a diverse, reproducible test set across more than a hundred and forty CWEs to evaluate LLMs against, since most prior work clustered around memory-related vulnerabilities the models had memorized from training data. The system that fell out of that goal turned out to do more than benchmark. It could take a developer-written patch, attempt to bypass it, and frequently succeed.

The next goal should be that how do you remediate and how do you remediate them very, fast? And I think in the very beginning, we said the first step of remediation is to be able to reproduce that issue.

— Saad Ullah

The product Saad is building post-PhD — artificial — is the productization of that pipeline: take a backlog of issues from any source, reproduce the ones that are real, root-cause them, generate patches that pass a chain of verification including fuzzing, and surface the small number that matter. The integration point with existing vulnerability programs is upstream of EPSS and KEV, not downstream. EPSS and KEV tell a security team which vulnerabilities to prioritize. Tools like artificial answer the harder downstream question of whether the team can actually patch them in time, and whether the patch holds. The role for security engineers does not go away — Saad is emphatic that human judgment about context (what code to extract, whether a patch is right for a specific application) remains the bottleneck a model cannot replace — but the speed of remediation moves up an order of magnitude. 
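As a rough illustration of where that reproducible signal sits alongside the prioritization signals a program already uses, here is a hypothetical data shape and queue function. The field names and the sort order are assumptions made for the sketch, not details from the episode.

```python
# Hypothetical shape of a finding after it has been through a reproduction
# and verification pipeline, next to the prioritization signals a program
# already trusts. Field names and sort order are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Finding:
    cve_id: str
    reproduced: bool        # the pipeline produced a working verifier
    patch_verified: bool    # a patch survived the full gate chain, fuzzing included
    epss: float             # Exploit Prediction Scoring System score
    in_kev: bool            # listed in the known-exploited catalog

def remediation_queue(findings: List[Finding]) -> List[Finding]:
    # Surface only the findings with a reproducible signal behind them,
    # then order them with the signals the program already uses.
    actionable = [f for f in findings if f.reproduced]
    return sorted(actionable, key=lambda f: (f.in_kev, f.epss), reverse=True)
```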

Show notes

Guests — Saad Ullah, PhD student at Boston University; member of Team Shellfish; co-creator of CVE Genie; founder of artificial. 

Books mentioned — none 

Frameworks / models / tools named

  • CVE Genie — reproducible-vulnerability benchmark and exploit-generation system, covering more than 140 CWEs

  • SecLLMHolmes — Saad's earlier project evaluating whether LLMs can reason about vulnerabilities and perform root cause analysis

  • DARPA AI Cyber Challenge (AIxCC) — DARPA's autonomous cyber reasoning competition; final competition held at DEF CON

  • Team Shellfish — Saad's team at the AIxCC; received the best patcher award

  • OSS-Fuzz — Google's open source fuzzing infrastructure; the project corpus used in the AIxCC challenge harnesses

  • KEV — known-exploitable vulnerabilities catalog

  • EPSS — Exploit Prediction Scoring System

  • AddressSanitizer (ASan) — referenced as the runtime instrumentation Team Shellfish used to validate memory-class vulnerabilities

Hosted by Conor Sherman and Stuart Mitchell.
