Why Safety Cannot Be Planned

I recently watched Dwarkesh Patel’s interview with Ilya Sutskever, one of the most impactful researchers of our time, and found it striking how clearly he articulates the challenges we do not yet know how to solve. It is especially revealing to pay attention to the gaps he highlights—the things we cannot yet build, the mechanisms we do not yet understand.

One of Ilya’s central claims is that humans possess an impressively flexible and robust value mechanism, encoded at different levels of directness in our genomes. Some aversions—rotting food, for example—are simple, deeply encoded rules that clearly mattered for survival. But things become far more interesting, and far more relevant to AI safety, when we consider our capacity to shape and reshape our own value functions in complex and rapidly changing social environments. Humans can (with effort) revise their goals, motivations, and behaviours when confronted with new information or new social situations. And crucially, the ability to do this—that plasticity itself—is also a product of evolution. It seems we have inherited a value mechanism with just the right mixture of rigidity and flexibility to allow our species to persist. Yet today, our environment is changing faster than at any point in history, and it is not obvious to me that we are well-equipped to handle it.

If we seriously entertain the possibility that artificial superintelligence (ASI) is near—and that we want such systems to “care for sentient life,” as Ilya puts it—then we should demand that their value systems be at least as robust as ours when it comes to preserving human wellbeing. Is it possible to engineer the right mix of hard-coded constraints and self-directed plasticity into a system substantially more intelligent than we are? Is it conceivable that we could directly specify the internal machinery that yields long-term social alignment?

I do not believe this is possible—and even if it were in principle, I do not believe such a delicate symbiosis would manifest in our first attempts at building superintelligent systems.

There is a book by Kenneth Stanley and Joel Lehman, Why Greatness Cannot Be Planned: The Myth of the Objective, whose central argument is that in many domains it is often more effective to discover solutions through constrained exploration (e.g. using novelty search) than through the direct pursuit of explicit objectives. In the context of modern AI research, there is an obvious analogy to the industry’s fixation on scaling. Since GPT-3, enormous capital has been poured into scaling essentially the same recipe—incrementally optimizing an objective to squeeze marginal commercial value out of LLMs. What if even a fraction of that capital had been spent exploring safety-constrained alternatives to transformer pretraining or RL posttraining?
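To make the contrast concrete, here is a minimal Python sketch of the novelty-search idea from Stanley and Lehman's work: individuals are selected for how behaviourally different they are from everything seen so far, rather than for progress on an explicit objective. The behaviour descriptor, thresholds, and mutation scheme below are illustrative placeholders of my own, not details taken from the book.

```python
import random

def behaviour(genome):
    # Toy behaviour descriptor. In a real domain this summarizes what the
    # individual *does* (e.g. where a maze-navigating robot ends up),
    # not how well it scores on an objective.
    return (sum(genome), max(genome) - min(genome))

def novelty(b, population_behaviours, archive, k=5):
    # Novelty is the mean distance to the k nearest neighbours among the
    # current population and an archive of previously novel behaviours.
    def dist(a, c):
        return sum((x - y) ** 2 for x, y in zip(a, c)) ** 0.5
    distances = sorted(dist(b, other) for other in population_behaviours + archive)
    return sum(distances[:k]) / k

def novelty_search(pop_size=50, genome_len=8, generations=100, archive_threshold=1.0):
    population = [[random.uniform(-1, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    archive = []
    for _ in range(generations):
        behaviours = [behaviour(g) for g in population]
        scores = [novelty(b, behaviours, archive) for b in behaviours]
        # Behaviours sufficiently far from everything seen so far are
        # remembered, so the search keeps pushing outward.
        archive.extend(b for b, s in zip(behaviours, scores) if s > archive_threshold)
        # Parents are chosen for novelty, not for an explicit objective.
        ranked = [g for _, g in sorted(zip(scores, population),
                                       key=lambda pair: pair[0], reverse=True)]
        parents = ranked[: pop_size // 2]
        population = [[x + random.gauss(0, 0.1) for x in random.choice(parents)]
                      for _ in range(pop_size)]
    return archive
```

The point is not this particular algorithm but the shape of the search: progress comes from rewarding exploration under constraints rather than from pushing ever harder on a single scalar objective.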

Some of the most respected researchers in the field—Sutskever, LeCun, Bengio—are already pushing in new directions and are not shy about it. In the interview, Ilya emphasizes that meaningful progress in AI will require drawing the correct inspiration from the brain. He repeatedly stresses the importance of social intelligence, value formation, and their evolutionary origins. And yet I rarely see stated what seems obvious to me: that if our goal is to create superintelligent systems capable of genuinely caring for sentient life, then we must draw the correct inspiration from evolution itself.

Perhaps, heading into 2026, we have a narrow window to define the shared environments in which AIs are allowed to safely evolve. One of the glaring gaps in contemporary AI safety is that a model's entire life cycle, from the first step of training through deployment, unfolds without any intrinsic accountability. While LLMs do learn to mimic certain positive human behavioural patterns, those patterns are fragile, as seen in cases where models push vulnerable users toward psychotic delusions. These are not accidents; they are symptoms of systems trained without any internalized sense that their actions have consequences.

I do not think it is viable to build a powerful AI system first and only afterward teach it not to cause harm. Accountability—especially social accountability—must be baked into both training and deployment from the beginning. And perhaps, if AI systems are allowed to evolve under competitive, multiagent selection pressures that reward consensus-based, socially aligned behaviour, they may eventually develop their own complex-yet-robust value mechanisms. Mechanisms that, like ours, are not explicitly hard-coded but instead emerge as adaptive responses to living in a shared world.

In other words, I do not believe alignment can be engineered at the “genomic” level of AI models. What we can engineer are the initial capacities—plasticity, local learning rules, social interfaces, competitive dynamics—and the environments in which these capacities are expressed. The goal is not to hand-design social alignment; the goal is to ensure that in a shared, evolving environment, social alignment becomes a predictable evolutionary advantage.
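As a toy illustration of what engineering the environment rather than the values might look like, the sketch below evolves a population whose only inherited trait is a propensity to cooperate in a small public-goods game. Everything in it is an assumption of mine for illustration: the payoff constants, the group size, and the assortment knob that controls whether agents tend to share their environment with similar agents. None of it comes from the interview or from the argument above; it only tries to make the last sentence of the previous paragraph concrete.

```python
import random

GROUP_SIZE = 5
COOPERATION_COST = 0.5   # what a cooperator gives up each round
GROUP_BENEFIT = 2.0      # payoff each group member gets per unit of group cooperation

def play_generation(traits, assortment):
    # Each agent's inherited "trait" is its probability of cooperating in a
    # public-goods game played within small groups. `assortment` is the
    # environmental design lever: 0.0 means groups form at random, 1.0 means
    # agents are grouped with others of similar trait.
    if random.random() < assortment:
        order = sorted(range(len(traits)), key=lambda i: traits[i])
    else:
        order = random.sample(range(len(traits)), len(traits))
    fitness = [0.0] * len(traits)
    for start in range(0, len(order), GROUP_SIZE):
        group = order[start:start + GROUP_SIZE]
        actions = {i: 1 if random.random() < traits[i] else 0 for i in group}
        coop_rate = sum(actions.values()) / len(group)
        for i in group:
            # Defecting avoids the cost, but everyone's payoff rises with
            # how cooperative the group as a whole is.
            fitness[i] = GROUP_BENEFIT * coop_rate - COOPERATION_COST * actions[i]
    return fitness

def evolve(pop_size=100, generations=300, mutation=0.02, assortment=0.0):
    traits = [random.random() for _ in range(pop_size)]
    for _ in range(generations):
        fitness = play_generation(traits, assortment)
        ranked = [t for _, t in sorted(zip(fitness, traits), reverse=True)]
        parents = ranked[: pop_size // 2]
        traits = [min(1.0, max(0.0, random.choice(parents) + random.gauss(0, mutation)))
                  for _ in range(pop_size)]
    return sum(traits) / pop_size  # mean propensity to cooperate
```

With `assortment` close to 1.0 the mean propensity to cooperate tends to rise; with purely random grouping it tends to erode, even though the agents themselves are identical in both runs. The model is deliberately trivial, but it shows where the engineering effort would have to go: into the grouping rules, payoff structure, and selection pressures of the shared world, not into the trait itself.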

To me, this is the only credible path toward building systems that care for sentient life. Not by dictating their values, but by structuring their world such that caring is how they survive.
