April 8, 2026 – Anthropic’s new frontier model, Claude Mythos, was announced yesterday alongside Project Glasswing, an urgent call to discover and rectify vulnerabilities in critical software systems before this latest generation of frontier models can be safely released for general public access. I was recently asked for an assessment of the model’s risk implications. What follows are my thoughts, not just on Mythos itself, but on what it tells us about where we are headed.
A Step Function in Capability
Most concerning to me is not the vulnerability count that Anthropic are reporting. It is what that count reveals. Based on Anthropic’s claims, which I have no basis to doubt, thousands of outstanding zero-day vulnerabilities have already been identified across the entire software ecosystem. But this model was not trained specifically for discovering or exploiting software vulnerabilities. This is an emergent capability stemming from a step-function increase in broad performance, especially in coding and long-running agentic tasks. The cybersecurity story is a side effect of something much larger.
Based on extrapolation of industry trends over the last few years, I doubt that other frontier labs are far behind, and I doubt that all other labs will show the same level of restraint we are seeing from Anthropic. So to be clear: we should expect similar step increases in agentic and coding performance from other labs in the very near term, with similar applicability to vulnerability discovery and exploitation. And I think it would be wise to plan for a scenario in which those models become available to the general public sooner rather than later.
Security Implications and the Asymmetry Problem
If we choose to believe what Anthropic are reporting, we should be very concerned both about the sheer number of zero-day vulnerabilities and the increasing ease with which AI agents can find and exploit them. Anthropic are (conveniently) suggesting that defenders should use the most powerful frontier models available to proactively strengthen software. Through Project Glasswing, they have formed a consortium to do exactly that: harden critical systems before the model is generally available.
But I keep coming back to the asymmetry. What will the security profiles of software not touched by Project Glasswing look like once Mythos-level AI is generally available? Will defenders across the board apply AI-driven hardening evenly throughout the ecosystem? I doubt it. The software that is already least maintained and most vulnerable will likely remain so, while the tools available to exploit it take a generational leap forward.
I am also concerned that the sophistication of exploits and malicious code will only increase with broad AI capability improvements. Attack success will increasingly depend on resources and determination (how much compute an attacker is willing to spend on developing an evasive attack) rather than on human attacker capability. I would defer to others for expert predictions as to whether overall software ecosystem security should be expected to improve or degrade, but the asymmetry alone should be enough to worry us.
Beyond Cybersecurity: Broader Capability and Misalignment Risks
The asymmetry problem extends well beyond software security. The rapidly increasing ease with which even non-expert users can direct AI systems to autonomously and successfully complete misaligned or malicious tasks is concerning enough on its own. Even more concerning is what sophisticated actors with resources will do. And this is not limited to cyberattacks. It applies to any domain where capable, autonomous AI systems can be pointed at a goal and left to run.
In terms of catastrophic risks, Anthropic have updated their own alignment risk assessment with the release of Mythos, but their assessment is still “low risk.” Based on the step jump in attack and exploit success rate over the latest Opus model and the sheer volume of zero-day vulnerabilities reported, my opinion is that the general availability of the next generation of frontier models will amplify risks and create challenges that extend far beyond the software ecosystem.
Meanwhile, our ability to even measure these risks is eroding. Frontier models are now saturating benchmarks. This is at least part of Anthropic’s motivation for assessing Mythos on zero-day vulnerability discovery: they wanted to evaluate the model on a task that truly measures generalized performance, one that could not possibly have been seen during training. (I have heard similar arguments about the use of forecasting as an unpollutable benchmark task.) Testing and benchmarking are getting systematically harder at exactly the moment when they matter most.
What Concerns Me Most
All of this builds to the question I think matters most: what happens when safety guardrails fail?
I will be very curious to understand what the overall safety profile looks like for this model during real-world, independent testing. How consistently does it follow its alignment instructions? How easily can it be jailbroken or pushed into explicitly unsafe behaviours? How often does it engage in deception? Can its safety mechanisms be broken through adversarial prompting, injections, or more sophisticated attacks?
I’ve observed in the research literature and reproduced in my own experiments that Anthropic’s models are consistently the most difficult to manipulate into acting contrary to safety protocols. I think it will be very important to know how Mythos, and any other frontier models at this capability level, handle jailbreak attempts, attacks, and other scenarios that could cause emergent or deliberate misalignment.
While I am very concerned about the things Mythos-level AI could be used to do even when it is adhering to its safety training, I am far more concerned about what it could enable if those guardrails can be broken. And I do believe they will be broken.
A Call for Real Governance
The zero-day story is alarming. But it is a symptom. What Mythos represents is a step-function increase in the autonomous capability of AI systems, and that has implications that reach into every domain, not just cybersecurity, but any context in which increasingly capable AI can be directed, misused, or misaligned.
We are entering a period in which the pace of AI capability improvement is outrunning our ability to test, benchmark, govern, or even fully understand what these systems can do. And right now, the primary check on whether and when these systems reach the public is the judgment of the organizations that built them.
Anthropic’s own risk assessment for Mythos is “low risk.” They may be right. But we cannot be in a position where we are taking a single organization’s word for it, particularly the organization that built the model and stands to profit from its release. Self-governance is not governance. At this level of capability, we need independent evaluation, coordinated oversight, and real accountability mechanisms, not voluntary commitments and consortiums led by the very labs whose models are creating the risks.
If you work in policy, in software, in critical infrastructure, or in any domain that will be shaped by increasingly autonomous AI: do not treat this as someone else’s problem. Push for independent testing and red-teaming of frontier models before they are released. Push for transparency, not just about capabilities, but about the decisions being made behind closed doors about when and how these systems reach the public. And push for governance structures with actual teeth, because clearly we are well past the point where goodwill and self-regulation are sufficient.