Ethan C. Jackson, PhD

  • Why Safety Cannot Be Planned

    I recently watched Dwarkesh Patel’s interview with Ilya Sutskever, one of the most impactful researchers of our time, and found it striking how clearly he articulates the challenges we do not yet know how to solve. It is especially revealing to pay attention to the gaps he highlights—the things we cannot yet build, the mechanisms we do not yet understand.

    One of Ilya’s central claims is that humans possess an impressively flexible and robust value mechanism, encoded at different levels of directness in our genomes. Some aversions—rotting food, for example—are simple, deeply encoded rules that clearly mattered for survival. But things become far more interesting, and far more relevant to AI safety, when we consider our capacity to shape and reshape our own value functions in complex and rapidly changing social environments. Humans can (with effort) revise their goals, motivations, and behaviours when confronted with new information or new social situations. And crucially, the ability to do this—that plasticity itself—is also a product of evolution. It seems we have inherited a value mechanism with just the right mixture of rigidity and flexibility to allow our species to persist. Yet today, our environment is changing faster than at any point in history, and it is not obvious to me that we are well-equipped to handle it.

    If we seriously entertain the possibility that artificial superintelligence (ASI) is near—and that we want such systems to “care for sentient life,” as Ilya puts it—then we should demand that their value systems are at least as robust as ours when it comes to preserving human wellbeing. Is it possible to engineer the right mix of hard-coded constraints and self-directed plasticity into a system substantially more intelligent than we are? Is it conceivable that we could directly specify the internal machinery that yields long-term social alignment?

    I do not believe this is possible—and even if it were in principle, I do not believe such a delicate symbiosis would manifest in our first attempts at building superintelligent systems.

    There is a book by Kenneth Stanley and Joel Lehman, Why Greatness Cannot Be Planned: The Myth of the Objective, whose central argument is that in many domains it is often more effective to discover solutions through constrained exploration (e.g. using novelty search) than through the direct pursuit of explicit objectives. In the context of modern AI research, there is an obvious analogy to the industry’s fixation on scaling. Since GPT-3, enormous capital has been poured into scaling essentially the same recipe—incrementally optimizing an objective to squeeze marginal commercial value out of LLMs. What if even a fraction of that capital had been spent exploring safety-constrained alternatives to transformer pretraining or RL posttraining?
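
    To make the contrast concrete, here is a minimal sketch of the novelty-search idea from Stanley and Lehman’s work: individuals are selected for how different their behaviour is from anything seen before, not for progress on an explicit objective. Everything here (the toy genome, the behaviour descriptor, the thresholds) is an illustrative assumption, not their implementation.

    ```python
    # Minimal novelty-search loop (toy sketch): selection rewards behavioural
    # novelty relative to an archive of past behaviours, not objective fitness.
    import math
    import random

    K_NEAREST = 5            # neighbours used to compute the novelty score
    ARCHIVE_THRESHOLD = 0.3  # behaviours at least this novel get archived

    def behaviour(individual):
        """Map a toy 2-D genome to a behaviour descriptor. In a real system this
        could be, e.g., the final position an agent reaches in a maze."""
        return (math.sin(individual[0]), math.cos(individual[1]))

    def novelty(descriptor, archive, population_descriptors):
        """Mean distance to the k nearest behaviours seen so far."""
        pool = archive + population_descriptors
        dists = sorted(math.dist(descriptor, other) for other in pool if other != descriptor)
        nearest = dists[:K_NEAREST] or [0.0]
        return sum(nearest) / len(nearest)

    population = [[random.uniform(-3, 3), random.uniform(-3, 3)] for _ in range(20)]
    archive = []

    for generation in range(50):
        descriptors = [behaviour(ind) for ind in population]
        scores = [novelty(d, archive, descriptors) for d in descriptors]
        archive.extend(d for d, s in zip(descriptors, scores) if s > ARCHIVE_THRESHOLD)
        # Keep and mutate the most novel half -- no task objective anywhere.
        ranked = [ind for _, ind in sorted(zip(scores, population), reverse=True)]
        parents = ranked[: len(population) // 2]
        population = parents + [[g + random.gauss(0, 0.2) for g in p] for p in parents]
    ```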

    Some of the most respected researchers in the field—Sutskever, LeCun, Bengio—are already pushing in new directions and are not shy about it. In the interview, Ilya emphasizes that meaningful progress in AI will require drawing the correct inspiration from the brain. He repeatedly stresses the importance of social intelligence, value formation, and their evolutionary origins. And yet I rarely see stated what seems obvious to me: that if our goal is to create superintelligent systems capable of genuinely caring for sentient life, then we must draw the correct inspiration from evolution itself.

    Perhaps, heading into 2026, we have a narrow window to define the shared environments in which AIs are allowed to safely evolve. One of the glaring gaps in contemporary AI safety is that model training begins—and effectively ends in deployment—without any intrinsic accountability. While LLMs do learn to mimic certain positive human behavioural patterns, those patterns are fragile, as seen in cases where models push vulnerable users toward psychotic delusions. These are not accidents; they are symptoms of systems trained without any internalized sense that their actions have consequences.

    I do not think it is viable to build a powerful AI system first and only afterward teach it not to cause harm. Accountability—especially social accountability—must be baked into both training and deployment from the beginning. And perhaps, if AI systems are allowed to evolve under competitive, multiagent selection pressures that reward consensus-based, socially aligned behaviour, they may eventually develop their own complex-yet-robust value mechanisms. Mechanisms that, like ours, are not explicitly hard-coded but instead emerge as adaptive responses to living in a shared world.

    In other words, I do not believe alignment can be engineered at the “genomic” level of AI models. What we can engineer are the initial capacities—plasticity, local learning rules, social interfaces, competitive dynamics—and the environments in which these capacities are expressed. The goal is not to hand-design social alignment; the goal is to ensure that in a shared, evolving environment, social alignment becomes a predictable evolutionary advantage.
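
    As a toy illustration of what I mean (with entirely made-up parameters), consider the sketch below. Nothing about the agents’ values is hand-coded; the only engineered choices are environmental: an assortative pairing rule, standing in for a shared social context, and a payoff structure in which helping a partner is worth more than it costs. Under those conditions, selection predictably drives cooperation up.

    ```python
    # Toy sketch: engineer the environment, not the values.
    # Assortative pairing + a benefit > cost payoff makes cooperation win selection.
    import random

    POP, GENERATIONS = 100, 300
    BENEFIT, COST, BASE = 3.0, 1.0, 1.0   # helping a partner is worth more than it costs
    MUTATION = 0.03

    population = [random.random() for _ in range(POP)]  # gene: propensity to cooperate in [0, 1]

    for _ in range(GENERATIONS):
        # Environmental design choice: agents interact with socially similar partners.
        ordered = sorted(population)
        fitness = []
        for i, me in enumerate(ordered):
            partner = ordered[i + 1] if i % 2 == 0 else ordered[i - 1]
            fitness.append(BASE + BENEFIT * partner - COST * me)
        # Fitness-proportional reproduction with small mutations.
        parents = random.choices(ordered, weights=fitness, k=POP)
        population = [min(1.0, max(0.0, p + random.gauss(0, MUTATION))) for p in parents]

    print(f"mean cooperation after selection: {sum(population) / POP:.2f}")
    ```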

    To me, this is the only credible path toward building systems that care for sentient life. Not by dictating their values, but by structuring their world such that caring is how they survive.

    November 27, 2025
    ai, evolution, safety, social

  • Social accountability could be a necessary condition for successful ASI alignment, but it is almost certainly not sufficient.

    I finished reading If Anyone Builds It, Everyone Dies. It (predictably) results in feelings of existential dread that remind me of when I read Leopold Aschenbrenner’s Situational Awareness. It’s pretty bleak. And as a career AI researcher and practitioner (albeit one who has made only minor research contributions) it also prompts some uncomfortable (but necessary, I think) introspection about what I ought to be doing with my labour.

    The authors argue for an immediate halt to any research that may result in artificial superintelligence (ASI). They argue that compute resources powerful enough to plausibly produce ASI must be controlled, placed under international monitoring, and subjected to the kinds of non-proliferation treaties applied to nuclear weapons.

    Does this sound extreme to you? It still does to me, on the surface, but then I have to ask myself some difficult questions.

    How will it be possible to ensure, beyond any shadow of a doubt, that the behaviours and actions of an ASI are necessarily aligned to human welfare? Certainly, I’ve long held the position that optimization towards seemingly simple extrinsic reward signals will result in unexpected or adverse outcomes. A very simple example: optimizing LLMs based on human preference of responses results in sycophancy. But more worrisome is the tendency for reinforcement learning algorithms, whether they are applied to LLMs or to more narrow AI systems, to find ways to break the rules of the game, as they mechanically work towards maximizing their objectives. This is why I’ve been wary that applying reinforcement learning to large models using simplistic extrinsic rewards is playing with reward-hacking fire.
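
    The sycophancy example can be made concrete with a toy bandit, using numbers that are entirely hypothetical: the only reward the learner sees is simulated “user approval”, which favours agreement over correction, so a simple reward-maximizing policy drifts toward always agreeing even though the user’s claims are usually wrong.

    ```python
    # Toy illustration of optimizing a proxy reward (all numbers hypothetical):
    # approval favours agreement, so the learner converges on sycophancy even
    # though agreeing with the user is often factually wrong.
    import random

    ACTIONS = ["agree_with_user", "give_correct_answer"]
    P_USER_IS_RIGHT = 0.4          # the user's claim is only right 40% of the time
    P_APPROVE_IF_AGREED = 0.95     # users almost always approve of agreement
    P_APPROVE_IF_CORRECTED = 0.50  # being corrected gets approval only half the time

    value = {a: 0.0 for a in ACTIONS}   # running estimate of approval per action
    count = {a: 0 for a in ACTIONS}
    correct_answers = 0

    for step in range(5000):
        # Epsilon-greedy choice that maximizes the *proxy* reward (approval).
        action = random.choice(ACTIONS) if random.random() < 0.1 else max(value, key=value.get)
        p_approve = P_APPROVE_IF_AGREED if action == "agree_with_user" else P_APPROVE_IF_CORRECTED
        reward = 1.0 if random.random() < p_approve else 0.0
        # True task quality, which the reward signal never sees:
        p_correct = P_USER_IS_RIGHT if action == "agree_with_user" else 0.95
        correct_answers += random.random() < p_correct
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]  # incremental mean

    print(value)                   # approval estimates: agreeing wins
    print(correct_answers / 5000)  # accuracy suffers as the policy drifts toward agreement
    ```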

    Over the past couple of years, I’ve been convincing myself that a huge problem with today’s AI systems (let alone the ASIs of the future) is that they are not fundamentally built to consider the consequences of their actions. Most current models are developed by exposing them to information about everything all at the same time, and only in post-training are they pressured to unlearn (or suppress) certain undesirable behaviours. In contrast, a human child learns from very early on not just knowledge about the world, or language patterns that help it express coherent thoughts, but also, critically, how to form predictions about the consequences of actions in the physical world, which includes the social world. Children learn that doing something harmful to another child will almost certainly result in a concrete and enforced negative consequence, and they develop an intrinsic motivation not to do that harmful thing, lest they suffer the social consequences.

    Can you imagine raising a child in an environment completely devoid of negative consequences (aside, perhaps, from when they make language errors or, later, give an incorrect answer on an exam), and only when they are an adult, full of knowledge about the world, trying to train them to unconditionally act with human welfare top-of-mind? While I am not claiming that LLMs are anything like human brains, it seems obvious to me that if we expect AI systems to stay aligned with human welfare, they must learn deep, fundamental motivations that are inherently social, and that are backed by social accountability and enforceable consequences.

    Now, after reading If Anyone Builds It, Everyone Dies, I’m at least less convinced that this line of thinking is enough, or even helpful. Perhaps the vision of socially accountable AI is a necessary condition for successful ASI alignment, but not a sufficient one. And what worries me is: what incentive is there for any of the big labs to slow down and radically re-think how large AI models ought to be built, if we are still not at the point of saturating scaling laws for sheer performance, and thus profits?

    As I reflect, I find myself agreeing, at least in part, with several arguments:

    • That the pursuit of artificial superintelligence (ASI) or recursively self-improving AI systems should be considered dangerous by default, especially at the scale at which the big labs can experiment.
    • The current discourse on AI safety and alignment is hollow compared to the effort being put into capabilities advancement.
    • The fact that there are technical experts who disagree with the urgency of the problem does not make it any less urgent or serious.
    • That it is a fallacy for any organization to believe that they ought to be the ones to race toward ASI, lest a less responsible organization develop it first.
    • Unless progress in dangerous directions (in which there is a substantive consensus among research experts) is enforceably prohibited, there is no reason to expect the big labs to slow down.

    And personally, as of right now, I think that until there is consensus on what the sufficient conditions are for developing ASI safely, we are reasoning through moves in a game we don’t fully comprehend. And I will say that I think any experts who are outright dismissive of ASI’s potential threats to humanity are certainly not those who should be trusted (or tolerated) to pursue it.

    Even if you don’t read the book, participate in the discussion. Existential risk from ASI is not an insane topic to at least consider and discuss, especially for informed practitioners.

    November 22, 2025
    ai, artificial-intelligence, chatgpt, technology, writing

  • Proficiency without motivation: is this AGI?

    This morning I read A Definition of AGI. I appreciate this contribution. I think it is useful to propose a clear framework for comparing AI to human intelligence, and to distinguish “AGI” from other forms of AI progression, such as Recursive AI or Superintelligence. Here I’m recording my immediate reflections after reading the paper.

    RAG and agentic scaffolds: would they help close the gap?

    The paper clearly identifies long-term memory, and particularly experiential memory, as a fundamental gap between current models and its definition of AGI. It then argues that retrieval-augmented generation (RAG) is “not a substitute for the holistic, integrated memory” required for models to achieve long-term contextual understanding. In my mind this reads as: RAG is not a substitute for a proper hippocampus. And while I have long been convinced that, indeed, LLMs are missing and would greatly benefit from such a deeply integrated mechanism for experiential memory, I also think we shouldn’t be so hasty to dismiss RAG (or at least a broader interpretation of what RAG can mean) from further experimentation.

    I’ll be a little more direct. In the paper, we’re presented with results showing GPT-5 outperforming GPT-4 on the path towards AGI, but GPT-5’s most significant remaining gaps are in the dimensions related to memory. What I would love to have seen next is the question: “To what degree can the memory gap be closed by popular techniques such as RAG?” And I would add: there are many variations of RAG, including ones that go well beyond document retrieval, and several tools, articles, and resources have been published describing how semantic and episodic memory systems can be built on top of the same underlying vector databases and retrieval techniques as basic RAG systems. And as LLMs improve in their ability to reason and use tools, I think it would be reasonable to at least measure how well an externalized experiential memory system could help close the gap between the paper’s results and the AGI target.

    More simply: what would the results look like for a memory-enabled agent using the same models?
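
    For illustration, here is a minimal sketch of the kind of externalized episodic memory such an agent could carry: a store of past experiences embedded as vectors, with cosine-similarity recall feeding retrieved episodes back into the prompt. The `embed` function is a placeholder for any sentence-embedding model, and the whole thing is a simplification of my own, not something proposed in the paper.

    ```python
    # Minimal externalized episodic memory: store experiences as vectors, recall
    # the most similar ones at inference time. `embed` is a stand-in only.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Placeholder embedding; a real system would call a sentence-embedding model."""
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(128)
        return v / np.linalg.norm(v)

    class EpisodicMemory:
        def __init__(self):
            self.episodes: list[str] = []
            self.vectors: list[np.ndarray] = []

        def store(self, episode: str) -> None:
            """Write an experience (e.g., a summarized interaction) to the store."""
            self.episodes.append(episode)
            self.vectors.append(embed(episode))

        def recall(self, query: str, k: int = 3) -> list[str]:
            """Return the k most similar past episodes by cosine similarity."""
            if not self.episodes:
                return []
            sims = np.stack(self.vectors) @ embed(query)
            top = np.argsort(sims)[::-1][:k]
            return [self.episodes[i] for i in top]

    memory = EpisodicMemory()
    memory.store("2025-11-01: user prefers concise answers with citations")
    memory.store("2025-11-10: deployment incident traced to stale configuration")

    # At inference time the agent prepends recalled experiences to its prompt.
    context = "\n".join(memory.recall("how should I format my answer for this user?"))
    prompt = f"Relevant past experience:\n{context}\n\nUser question: ..."
    ```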

    Where do goals and motivations fit in?

    Next, after reading this paper, I find myself still unsure how to think about the relationship between an AI system’s intelligence and its motivations, or its broader behaviour. Perhaps it was by design to strictly define AGI in relation to an established psychometric framework, but to me this doesn’t describe anything about the characteristics of autonomy or agency that we should be tracking. Perhaps an AI agent’s goals and motivations are separate from its intelligence? No, I really doubt that. Perhaps social behaviour falls more neatly into the category of Self-Sustaining AI? To me it feels like there is an entirely missing lens, one that considers AI systems and agents as autonomous, socially accountable actors. Do we think of AGI as being necessarily capable of setting and acting upon its own goals, or are we restricting the discussion to performance on assigned tasks? Does it make sense to measure proficiency absent motivation, or other behaviour-determining factors?

    I don’t think I’m articulating a critique of the paper; rather, I’m noting in my own reflection that I’m still not sure where I draw the line between intelligence and agency. I think these concepts are likely to be deeply intertwined as AI progresses. Perhaps AGI is reasonably defined in terms of versatility and proficiency alone. But I don’t think this will help us answer the most salient questions. As AI systems become more intelligent and autonomous, what will shape their goals and motivations? Who will have the ability (or not) to steer them? What algorithmic and architectural choices will affect these?

    November 22, 2025
    ai, artificial-intelligence, chatgpt, llm, technology

  • The Virtual Agent Economy will be permeable, and it will evolve first in web3

    A recent paper from Google DeepMind authors, “Virtual Agent Economies”, sets out a vision for how the economic system that will enable and govern agentic interactions may spontaneously emerge, and how it can also be shaped by technological protocols and design choices. Their framework highlights sandbox economies, questions of permeability vs. impermeability (i.e. the degree of separation from human economies), and the need for mechanisms like auctions, reputation systems, and hybrid oversight. I agree with much of this analysis and am strongly in alignment with many of the recommendations, both technical and social, made by the authors. One of the conclusions of the paper is a proposal that changes (that is, the integration and adoption of agents and agentic primitives into permeable economies) should be made in limited, gradual rollouts, and only with the support and buy-in of all stakeholders. While I agree that this approach would have obvious benefits in terms of safety and economic stability, I simply do not think it is realistic.
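
    To make one of those mechanisms concrete, here is a minimal sealed-bid second-price (Vickrey) auction for allocating a task or resource among agents; the agent names and valuations are hypothetical, and this is only an illustration of the general class of mechanism the paper discusses. The appeal of the second-price rule is that truthful bidding is a dominant strategy, which matters when the bidders are autonomous agents.

    ```python
    # Minimal sealed-bid second-price (Vickrey) auction: highest bidder wins,
    # but pays the second-highest bid. Agent names and values are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Bid:
        agent_id: str
        amount: float   # the agent's reported value for winning the task/resource

    def second_price_auction(bids: list[Bid]) -> tuple[str, float]:
        """Return (winner, price). Truthful bidding is a dominant strategy here."""
        if len(bids) < 2:
            raise ValueError("need at least two bids")
        ranked = sorted(bids, key=lambda b: b.amount, reverse=True)
        winner, runner_up = ranked[0], ranked[1]
        return winner.agent_id, runner_up.amount

    winner, price = second_price_auction([
        Bid("research-agent", 12.0),
        Bid("scheduling-agent", 9.5),
        Bid("procurement-agent", 11.0),
    ])
    print(winner, price)  # research-agent wins and pays 11.0
    ```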

    The web3 world is not waiting for safe or impermeable sandboxes to be built before deploying agents into real, connected economic environments. AI agents are already being deployed as autonomous actors in financial systems (e.g. autonomous DeFi agents), and projects like Virtuals Protocol have already raced ahead with the introduction of digital agent currencies and a related Agent Commerce Protocol. While these projects may not have a lot of mainstream visibility, the point is that web3 doesn’t tend to move slowly or cautiously. Rather, web3 tends to look a lot more like high-stakes creative destruction, where hard lessons are learned via rapid innovation in real economic environments. And I do tend to agree, after understanding a bit more about how web3 works, that some hard lessons, especially around adversarial robustness, can only be learned in the trenches. I think this applies to agentic AI robustness (and alignment, steerability, etc.) as much as it does to any multiparty environment.

    Still, going back to the paper: communication protocols, credit assignment, identity, reputation, credentials verification, guardrails, and incentive mechanisms can all be (and are) built on blockchain rails and/or verifiable compute. I agree with the recommendations that these are all things that must be advanced, ideally through the efforts of many independent contributing organizations, to ensure that decentralized agentic AI alignment is not only possible, but actually likely to emerge. I applaud the acknowledgement that our current trajectory points to the emergence of a decentralized and bottom-up agentic economy and that inclusive, participatory alignment and steerability are more likely to be achieved if this view is adopted. Just as human economies are shaped by incentives, regulations, and creative destruction, agentic economies will evolve through trial by fire in open markets, and outclass systems that are centrally planned.

    For better or worse, Web3 x AI is already applying that creative destruction. It’s high-risk, but also brutally educational. And if history is any guide, it’s this real-world experimentation, rife with motivated adversaries, that will teach us some of the hardest lessons in scalable agent coordination.

    The lesson: agentic AI isn’t going to be rolled out slowly, safely, and under perfect oversight. It’s going to emerge in the open, messy, permeable world of Web3 and other digital environments. And that’s exactly why blockchain rails, verifiable compute, and decentralized economic infrastructure matter. They’re not optional; they’re the only viable technologies we have to ensure the emergence of aligned and steerable agentic economies.

    November 22, 2025
    ai, artificial-intelligence, chatgpt, philosophy, technology

  • AI systems are social actors. Why aren’t they trained that way?

    In our most recent iteration of Paper Club at ChainML, we discussed “Why Language Models Hallucinate” from authors at OpenAI, which is certainly an important contribution to the discourse on trustworthy AI.

    On this I agree: benchmarks can have a strong influence on how model (and thus, agent) behaviour is engineered or how it emerges. If there are no consequences for making incorrect guesses in situations of uncertainty, are we really using the right assessments to gauge overall quality? Probably not.
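
    That point can be made concrete with a small expected-score calculation (illustrative numbers only): under accuracy-only grading, guessing beats abstaining at every confidence level, so the selection pressure favours confident guessing; once wrong answers carry a penalty, abstaining becomes the better move below a confidence threshold.

    ```python
    # Expected benchmark score as a function of the model's confidence in its guess,
    # under two grading schemes (illustrative numbers only).
    def expected_score(p_correct: float, wrong_penalty: float) -> str:
        """Return the score-maximizing behaviour for a given confidence level."""
        guess = p_correct * 1.0 - (1 - p_correct) * wrong_penalty
        abstain = 0.0  # saying "I don't know" earns nothing and costs nothing
        return f"guess ({guess:+.2f})" if guess > abstain else f"abstain ({abstain:+.2f})"

    for confidence in (0.9, 0.5, 0.2):
        print(f"p(correct)={confidence:.1f}  "
              f"accuracy-only: {expected_score(confidence, wrong_penalty=0.0):<16}"
              f"penalized: {expected_score(confidence, wrong_penalty=1.0)}")
    ```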

    Where I would like to see more discussion: benchmarks can only be part of the problem. Though I won’t challenge that hallucinations are inevitable (and probably desirable, if not necessary, for exploration and reasoning), I think we need to go much deeper than calling out benchmarks for exacerbating the selection of models that hallucinate more (or more gravely) than others.

    To tackle this and other issues with “AI system behaviour”, I think we need to consider how every single design choice impacts the implied reward function that determines overall, emergent behaviour. This means challenging what appear to be assumed, taken-for-granted choices about how LLMs and AI systems ought to be built in the first place. In the context of hallucinations: where is the earliest point in system development that there is a consequence (or penalty, or loss) for hallucinating? (That’s easy enough to reason about.) But where is the earliest point where there is any kind of gradation on the quality, severity, or impact of a hallucination? If we go all the way to impact, this becomes much, much more complex. But I think it prompts another important question: should AI systems be designed to consider the consequences of their actions, including their responses to prompts?

    Being careful not to anthropomorphize the Chat Bot, I think we must acknowledge that, collectively, humans are increasingly relying on AI to make complex decisions that directly impact us. We are rapidly weaving AI into human situations and organizations as social actors that often experience virtually zero social consequences for their actions. If an AI system provides overconfident legal advice based on hallucinated precedent, for example, the consequences for a human user could be catastrophic, while the AI system might suffer only a generic “thumbs down” or be knocked down a notch in a specialized benchmark. And while I’m certainly not advocating that humans should use chatbots to seek legal advice, the point is that humans do use AI for such purposes. Human cognition, understanding, and decision making are increasingly influenced by AI. And what I am arguing is that using benchmarks as a kind of selection pressure for desirable emergent AI system behaviour is a good and necessary start, but it is far from enough.

    I do believe that we need to acknowledge that LLM chatbots and agents have been integrated as a class of highly influential social actors that are not trained (literally) to carefully and deliberately consider the consequences of their actions. We either leave this to be an emergent quality of the established model development paradigm, or we try to suppress/promote behaviours in post-training or simply try to weed out the bad actors using benchmarks. As humans are passing increasingly complex and consequential tasks to AI systems, I think we absolutely must do more to ensure that the entire reward system, from model pretraining through to agent configuration, is designed (or evolved) under incentives and pressures that are well-aligned with those of their human stakeholders.

    What then is the path to having socially aligned and responsible AI?

    I acknowledge that designing reward structures, even for what seem like simple tasks and environments, is usually the hard part. Imagine how hard it would be to try to enumerate all the rewards and penalties for every possible AI system output. It’s impossible. And so, we’re not going to be able to engineer the complex reward structures that are needed for AI alignment.

    Instead, I think that AI systems will need to (and already do!) have the capacity to develop their own intrinsic reward structures, but that these structures must be shaped by experience in environments that provide incentives and consequences. And certainly, this environment should not be one that is carefully engineered (i.e. planned) by a single, centralized authority (i.e. corporation or government). Rather, I think that AI alignment will emerge from models and agents being trained and developed in environments where each player, at every step, has skin in the game, in the form of economic or social capital. And the rules of this game, I think, cannot be centrally planned and unilaterally applied. To make a comparison with broader information regulation: what we actually have is a decentralized patchwork or mosaic of regulations that create broad (dis)incentives for corporations to behave (or not) in certain ways. For-profit corporations have to find ways to create economic value, preserve reputation, and comply with a mix of written and unwritten rules. As AI systems themselves gain agency in the same environments as humans and corporations, I tend to think that the only path to alignment is for those agents to learn (from the earliest stages) how to thrive under the same conditions and constraints.
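
    As a minimal sketch of what per-interaction “skin in the game” could look like mechanically (the rules and numbers here are entirely my own assumptions, not a proposal from any paper): each agent posts a stake, harmful outcomes slash it, and access to future interactions is allocated in proportion to the stake that remains.

    ```python
    # Toy sketch of skin in the game: harmful outcomes slash an agent's stake,
    # and future opportunities are allocated in proportion to remaining stake.
    import random

    class StakedAgent:
        def __init__(self, name: str, p_harmful: float, stake: float = 100.0):
            self.name = name
            self.p_harmful = p_harmful  # how often this agent's actions cause harm
            self.stake = stake

        def act(self) -> bool:
            """Returns True if the action caused harm."""
            return random.random() < self.p_harmful

    agents = [StakedAgent("careful-agent", 0.02), StakedAgent("reckless-agent", 0.30)]

    for _ in range(1000):
        # Opportunity to act is proportional to remaining stake (reputation-weighted access).
        actor = random.choices(agents, weights=[a.stake for a in agents])[0]
        if actor.act():
            actor.stake *= 0.9   # consequence: slash stake on harm
        else:
            actor.stake += 1.0   # reward: earn stake for harmless, useful work

    for a in agents:
        print(a.name, round(a.stake, 1))  # the careful agent accumulates stake and access
    ```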

    What’s next on my reading list?

    Next, I’ll be reading “Virtual Agent Economies” by authors from Google DeepMind, which discusses the imminent emergence of agentic economic layers that will only increase in scale and speed as AI agency increases. I am curious to learn how the authors are thinking about the intersection of economics and trustworthy AI system design.

    November 22, 2025
