A Critical Look at Anthropic’s 500 Zero-Day Claims
Posted on Wed, 2026-02-11 in genai
Anthropic recently published a blogpost detailing how their latest model, Claude Opus 4.6, discovered over 500 zero-day vulnerabilities in open-source software. The news was quickly picked up by outlets like Golem, but I can’t help but feel a sense of déjà vu. This reminds me of headlines from 15 years ago, like “Mayhem finds 1,200 bugs”, which also sparked excitement—only for the hype to quickly settle. As we all know today, Mayhem and other symbolic execution tools haven't solved the Automated Exploit Generation (AEG) problem.
The idea that LLMs can identify vulnerabilities isn’t new. The AIxCC competition already demonstrated this capability years ago, and it’s reasonable to assume that security tutorials and bug writeups are well-represented in training data. Anthropic’s claim that they tested Opus’s ability to find bugs without specialized prompting is interesting, but not groundbreaking. In fact, reducing reliance on targeted prompts might make the whole system less deterministic: something you’d want to avoid in practice. If I’m paying $10k in LLM tokens for an LLM-based security audit, I’d prefer consistent, reproducible results over a model that might surface different bugs with each run.
Key Questions Left Unanswered
There are multiple questions that Anthropic does not answer in their blog post.
- False Positives and Hallucinations: Anthropic says they found 500 zero-day vulnerabilities, all of which were validated by internal or external teams with a security background. That is impressive. However, how many bugs did Claude report on top of those 500 that could not be validated as actual vulnerabilities/bugs? What is the total number of bugs that Claude reported?
- Severity and Exploitability: Anthropic claims that these are all “high-severity” bugs. What does that mean? Not all memory corruption bugs are equally dangerous. For example, the Linux kernel assigns CVEs to every memory corruption issue, but exploitability depends on context, i.e., whether the bug can be triggered in a realistic attack scenario. A memory corruption bug in module loading isn’t a vulnerability if the attacker already needs root privileges to trigger it (assuming root is allowed to load kernel modules, as on most unhardened Linux systems). Automated tools (and LLMs) often struggle to assess real-world impact because they lack a proper threat model and contextual understanding. Competitions like the AIxCC tried to mitigate this problem by having clear definitions of what counts as a true alarm, or by forcing the automated tools to produce working exploits to prove that a reported bug is truly a vulnerability.
- Representative Projects? Were the analyzed projects truly representative? For example, the cgif project has 180 stars on GitHub at the time of writing and has not released a version 1.0 yet. This suggests it might not be a widely used, and thus not heavily scrutinized, project. Which projects did they analyze? And what is the security posture of those projects? Did they already use standard security tools like SAST or fuzzing? If not, finding 500 bugs isn’t that impressive; it is somewhat expected. Early fuzzing research faced similar criticism: “We found 20 CVEs!” sounds great until you realize the code had never been fuzzed before (see the harness sketch after this list for how low that bar is). Benchmarks aren’t perfect and have plenty of problems of their own, but they at least allow for a fair comparison with the current state of the art.
- Efficiency and Scalability: How cost-effective is this approach? Anthropic can afford to burn Opus tokens for research, but real-world scalability matters. What’s the cost per valid bug found (see the back-of-the-envelope calculation below)? Most organizations do not have the funds to use Opus at scale. Given that GenAI is still heavily subsidized, I expect costs for these high-end models to rise in the future.
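To put the “had the code ever been fuzzed?” question into perspective, here is a minimal coverage-guided fuzzing harness sketch using Google’s Atheris. Everything project-specific is an assumption: `myproject` and `parse_image` are hypothetical placeholders for whatever parsing entry point an audited project exposes. The point is only that a fuzzing baseline costs a handful of lines, which is the bar any “we found N new bugs” claim should be measured against.

```python
# Minimal coverage-guided fuzzing harness (sketch) using Atheris:
#   pip install atheris
# `myproject.parse_image` is a hypothetical target; replace it with the
# actual parsing entry point of the project under test.
import sys

import atheris

with atheris.instrument_imports():
    from myproject import parse_image  # hypothetical module under test


def TestOneInput(data: bytes) -> None:
    try:
        parse_image(data)
    except ValueError:
        # Well-defined parse errors on malformed input are expected;
        # crashes and other uncaught exceptions are what the fuzzer reports.
        pass


if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()
```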
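On the cost question, a back-of-the-envelope sketch makes the issue concrete. All numbers below are assumptions I made up for illustration, since Anthropic publishes neither token counts nor triage effort; the structure of the calculation is the point, not the specific figures.

```python
# Back-of-the-envelope cost per validated bug (all inputs are assumptions).
def cost_per_valid_bug(
    token_cost_usd: float,           # assumed total LLM token spend for the campaign
    reported_bugs: int,              # total findings the model reported
    valid_bugs: int,                 # findings confirmed as real vulnerabilities
    triage_hours_per_report: float,  # assumed human effort to validate one report
    hourly_rate_usd: float,          # assumed cost of a security engineer
) -> float:
    triage_cost = reported_bugs * triage_hours_per_report * hourly_rate_usd
    return (token_cost_usd + triage_cost) / valid_bugs


# Example: $100k in tokens, 2,000 reports of which 500 are valid,
# 2 hours of triage per report at $150/h -> ~$1,400 per valid bug.
print(cost_per_valid_bug(100_000, 2_000, 500, 2.0, 150.0))
```

Note how the false-positive rate from the first question feeds directly into this number: the more unvalidated reports Claude produces, the more human triage dominates the bill.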
So we are left knowing that Claude Opus can perform security research on its own, but not whether this stays a research PoC at Anthropic or whether it is something we can actually run ourselves.
A Call for Responsible Reporting
Despite these questions, it’s fascinating to see how Opus identified these bugs (e.g., performing variant analysis based on a previous git commit is very impressive). I hope Anthropic follows up with a detailed paper. This deserves more than a blogpost. Right now, the hype cycle is in full swing, and even minor announcements from Anthropic make waves. I only became aware of this blogpost because it was shared and discussed at my current workplace by people outside of the security field. If Anthropic wants to lead responsibly, they should prioritize rigorous, peer-reviewed research over flashy marketing or publishing half-finished research as blog posts. I know they have the capability to do so.
PS: This blogpost was written with the help of an LLM, but not one from Anthropic :)