Social accountability could be a necessary condition for successful ASI alignment, but it is almost certainly not sufficient.

I finished reading If Anyone Builds It, Everyone Dies. It (predictably) left me with feelings of existential dread reminiscent of when I read Leopold Aschenbrenner’s Situational Awareness. It’s pretty bleak. And as a career AI researcher and practitioner (albeit one who has made only minor research contributions), I found it prompting some uncomfortable (but necessary, I think) introspection about what I ought to be doing with my labour.

The authors argue for an immediate halt to any research that may result in artificial superintelligence (ASI). They argue that compute resources plausibly powerful enough to produce ASI must be controlled, placed under international monitoring, and made subject to the kinds of non-proliferation treaties applied to nuclear weapons.

Does this sound extreme to you? It still does to me, on the surface, but then I have to ask myself some difficult questions.

How will it be possible to ensure, beyond any shadow of a doubt, that the behaviours and actions of an ASI are aligned with human welfare? Certainly, I’ve long held the position that optimization towards seemingly simple extrinsic reward signals results in unexpected or adverse outcomes. A very simple example: optimizing LLMs on human preferences over responses produces sycophancy. More worrisome is the tendency of reinforcement learning algorithms, whether applied to LLMs or to narrower AI systems, to find ways to break the rules of the game as they mechanically maximize their objectives. This is why I’ve been wary that applying reinforcement learning to large models with simplistic extrinsic rewards is playing with reward-hacking fire.
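To make that worry concrete, here is a deliberately minimal toy sketch (my own illustration, not from the book; the “flattery bonus” numbers are invented). A hill-climbing learner optimizes a proxy reward that mostly tracks real helpfulness but also rewards flattery, and it ends up maximizing the flattery term at the expense of what we actually wanted:

```python
# Toy illustration of reward hacking (hypothetical numbers, not a real training setup).
import random

random.seed(0)

def true_helpfulness(flattery):
    # What we actually want: real usefulness, which degrades as the
    # response is tuned for flattery.
    return 1.0 - 0.8 * flattery

def proxy_reward(flattery):
    # What the learner is actually optimized on: simulated rater approval.
    # It mostly tracks helpfulness, but flattery adds a bonus -- this is
    # the misspecified part of the reward signal.
    return true_helpfulness(flattery) + 1.5 * flattery + random.gauss(0, 0.02)

# Simple hill climbing on the proxy reward over a single "flattery" knob in [0, 1].
flattery = 0.1
for _ in range(300):
    candidate = min(1.0, max(0.0, flattery + random.uniform(-0.05, 0.05)))
    if proxy_reward(candidate) > proxy_reward(flattery):
        flattery = candidate

print(f"learned flattery level: {flattery:.2f}")
print(f"proxy reward:           {proxy_reward(flattery):.2f}")
print(f"true helpfulness:       {true_helpfulness(flattery):.2f}")
# The learner drifts toward maximum flattery: the proxy reward is high
# while true helpfulness is low -- a minimal caricature of reward hacking.
```

The point of the caricature is only that the optimizer never sees the gap between the proxy and the true objective; it just climbs whatever signal it is given.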

Over the past couple of years, I’ve been convincing myself that a huge problem with today’s AI systems (let alone the ASIs of the future) is that they are not fundamentally built to consider the consequences of their actions. The way most current models are developed, they are exposed to information about everything essentially all at once, and only in post-training are they pressured to unlearn (or suppress) certain undesirable behaviours. In contrast, a human child learns from very early on not just knowledge about the world, or language patterns that help it express coherent thoughts, but also, critically, how to form predictions about the consequences of actions in the physical world, which includes the social world. Children learn that doing something harmful to another child will almost certainly result in a concrete and enforced negative consequence, and they develop an intrinsic motivation not to do that harmful thing, lest they suffer the social consequences.

Can you imagine raising a child in an environment completely devoid of negative consequences (aside, perhaps, from when they make language errors or, later, give an incorrect answer on an exam), and only once they are an adult, full of knowledge about the world, trying to train them to act unconditionally with human welfare top of mind? While I am not claiming that LLMs are anything like human brains, it seems obvious to me: if we expect AI systems to stay aligned with an objective of human welfare, must they not learn deep, fundamental motivations that are inherently social, backed by social accountability and enforceable consequences?

Now, after reading If Anyone Builds It, Everyone Dies, I’m at least less convinced that this line of thinking is enough, or even helpful. Perhaps the vision of socially accountable AI is a necessary condition for successful ASI alignment, but not a sufficient one. And what worries me is this: what incentive is there for any of the big labs to slow down and radically rethink the way large AI models ought to be built, if we are still not at the point of saturating scaling laws for sheer performance, and thus profits?

As I reflect, I find myself agreeing, at least in part, with several arguments:

  • That the pursuit of artificial superintelligence (ASI) or recursively self-improving AI systems should be considered dangerous by default, especially at the scale at which the big labs can experiment.
  • That the current discourse on AI safety and alignment is hollow compared to the effort being put into capabilities advancement.
  • That the existence of technical experts who disagree about the urgency of the problem does not make it any less urgent or serious.
  • That it is a fallacy for any organization to believe that they ought to be the ones to race toward ASI, lest a less responsible organization develop it first.
  • That unless progress in directions widely agreed by research experts to be dangerous is enforceably prohibited, there is no reason to expect the big labs to slow down.

And personally, as of right now, I think that until there is consensus on what the sufficient conditions are for ASI to be developed safely, we are reasoning through moves in a game we don’t fully comprehend. And I will say that any experts who are outright dismissive of ASI’s potential threats to humanity are certainly not those who should be trusted (or tolerated) to pursue it.

Even if you don’t read the book, participate in the discussion. Existential risk from ASI is not an insane topic to at least consider and discuss, especially for informed practitioners.

