In our most recent iteration of Paper Club at ChainML, we discussed “Why Language Models Hallucinate” from authors at OpenAI, which is certainly an important contribution to the discourse on trustworthy AI.
On this I agree: benchmarks can have a strong influence on how model (and thus, agent) behaviour is engineered or how it emerges. If there are no consequences for making incorrect guesses in situations of uncertainty, are we really using the right assessments to gauge overall quality? Probably not.
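To make that concrete, here is a toy sketch (my own illustration, not the paper’s exact scheme) of two ways to score the same answers. Plain accuracy treats a wrong guess and an “I don’t know” identically, so guessing under uncertainty is never penalized; a penalized scheme docks points for confident errors, so calibrated abstention wins. The penalty and credit values are arbitrary assumptions.

```python
# Toy scoring comparison: binary accuracy vs. a scheme that penalizes wrong guesses.
# Values for wrong_penalty and abstain_credit are illustrative assumptions.

def binary_accuracy(answers):
    """Score 1 for a correct answer, 0 otherwise (wrong or abstained)."""
    return sum(1 for a in answers if a == "correct") / len(answers)

def penalized_score(answers, wrong_penalty=1.0, abstain_credit=0.0):
    """Score 1 for correct, -wrong_penalty for wrong, abstain_credit for abstaining."""
    total = 0.0
    for a in answers:
        if a == "correct":
            total += 1.0
        elif a == "wrong":
            total -= wrong_penalty
        else:  # "abstain"
            total += abstain_credit
    return total / len(answers)

# A model that guesses on four uncertain questions and gets one right:
guesser = ["correct", "wrong", "wrong", "wrong"]
# A model that abstains on the same four questions:
abstainer = ["abstain"] * 4

print(binary_accuracy(guesser), binary_accuracy(abstainer))  # 0.25 vs 0.0 -> guessing "wins"
print(penalized_score(guesser), penalized_score(abstainer))  # -0.5 vs 0.0 -> abstaining wins
```

Under the first metric, the overconfident guesser looks strictly better; under the second, it does not. That is the whole selection-pressure argument in miniature.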
Where I would like to see more discussion: benchmarks can only be part of the problem. Though I won’t challenge that hallucinations are inevitable (and probably desirable, if not necessary, for exploration and reasoning), I think we need to go much deeper than calling out benchmarks for encouraging the selection of models that hallucinate more (or more gravely) than others.
To tackle this and other issues with “AI system behaviour”, I think we need to consider how every single design choice impacts the implied reward function that determines overall, emergent behaviour. This means challenging what appear to be assumed, taken-for-granted choices about how LLMs and AI systems ought to be built in the first place. In the context of hallucinations: where is the earliest point in system development at which there is a consequence (or penalty, or loss) for hallucinating? (That’s easy enough to reason about.) But where is the earliest point at which there is any kind of gradation on the quality, severity, or impact of a hallucination? If we go all the way to impact, this becomes much, much more complex. But I think it prompts another important question: should AI systems be designed to consider the consequences of their actions, including their responses to prompts?
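As a purely hypothetical sketch of what such a gradation could look like (this is my own illustration, not an existing training objective), imagine a penalty term that weights each flagged hallucination by an assumed severity label rather than treating every hallucination as equally bad. Obtaining those labels at scale is, of course, the hard part; the point is only to show where severity could enter the implied reward function.

```python
# Hypothetical graded hallucination penalty. Severity labels and weights are
# assumptions for illustration, not values from any real system.

SEVERITY_WEIGHTS = {
    "benign": 0.1,       # e.g. a wrong trivia detail with little downstream impact
    "misleading": 1.0,   # e.g. a fabricated statistic in an analysis
    "harmful": 10.0,     # e.g. invented legal precedent or medical guidance
}

def hallucination_penalty(outputs):
    """Sum a graded penalty over outputs flagged as hallucinated.

    Each output is a dict like {"hallucinated": bool, "severity": str}.
    """
    penalty = 0.0
    for out in outputs:
        if out.get("hallucinated"):
            penalty += SEVERITY_WEIGHTS.get(out.get("severity", "misleading"), 1.0)
    return penalty

# Example: one harmful and one benign hallucination in a batch of three outputs.
batch = [
    {"hallucinated": True, "severity": "harmful"},
    {"hallucinated": True, "severity": "benign"},
    {"hallucinated": False},
]
print(hallucination_penalty(batch))  # 10.1
```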
Being careful not to anthropomorphize the chatbot, I think we must acknowledge that, collectively, humans are increasingly relying on AI to make complex decisions that directly affect us. We are rapidly weaving AI into human situations and organizations as social actors that often experience virtually zero social consequences for their actions. If an AI system provides overconfident legal advice based on hallucinated precedent, for example, the consequences for a human user could be catastrophic, while the AI system might suffer only a generic “thumbs down” or be knocked down a notch in a specialized benchmark. And while I’m certainly not advocating that humans should use chatbots to seek legal advice, the point is that humans do use AI for such purposes. Human cognition, understanding, and decision making are increasingly influenced by AI. And what I am arguing is that using benchmarks as a kind of selection pressure for desirable emergent AI system behaviour is a good and necessary start, but it is far from enough.
I do believe we need to acknowledge that LLM chatbots and agents have been integrated as a class of highly influential social actors that are not literally trained to carefully and deliberately consider the consequences of their actions. We either leave this as an emergent quality of the established model development paradigm, try to suppress or promote behaviours in post-training, or simply try to weed out the bad actors using benchmarks. As humans pass increasingly complex and consequential tasks to AI systems, I think we absolutely must do more to ensure that the entire reward system, from model pretraining through to agent configuration, is designed (or evolved) under incentives and pressures that are well aligned with those of their human stakeholders.
What then is the path to having socially aligned and responsible AI?
I acknowledge that designing reward structures, even for what seem like simple tasks and environments, is usually the hard part. Imagine how hard it would be to try to enumerate all the rewards and penalties for every possible AI system output. It’s impossible. And so, we’re not going to be able to engineer the complex reward structures that are needed for AI alignment.
Instead, I think that AI systems will need to (and already do!) have the capacity to develop their own intrinsic reward structures, but that these structures must be shaped by experience in environments that provide incentives and consequences. And certainly, this environment should not be one that is carefully engineered (i.e. planned) by a single, centralized authority (e.g. a corporation or government). Rather, I think that AI alignment will emerge from models and agents being trained and developed in environments where each player, at every step, has skin in the game, in the form of economic or social capital. And the rules of this game, I think, cannot be centrally planned and unilaterally applied. To make a comparison with broader information regulation: what we actually have is a decentralized patchwork, or mosaic, of regulations that creates broad (dis)incentives for corporations to behave (or not) in certain ways. For-profit corporations have to find ways to create economic value, preserve their reputations, and comply with a mix of written and unwritten rules. As AI systems themselves gain agency in the same environments as humans and corporations, I tend to think that the only path to alignment is for those agents to learn (from the earliest stages) how to thrive under the same conditions and constraints.
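To give a deliberately simple, hypothetical picture of “skin in the game” (none of this maps to a real system), imagine each agent holding a stake of economic or social capital that grows when its actions create value and is slashed when they cause harm, with depleted agents dropping out of the environment. The selection pressure comes from consequences, not from a centrally engineered reward function.

```python
# Toy illustration of consequence-driven selection among agents.
# All names, stakes, and outcomes are made up for the example.

from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    stake: float = 100.0

    def settle(self, value_created: float, harm_caused: float) -> None:
        """Update the agent's stake based on the outcome of one interaction."""
        self.stake += value_created
        self.stake -= harm_caused

    @property
    def active(self) -> bool:
        return self.stake > 0

# One round of interactions with made-up outcomes.
agents = [Agent("careful"), Agent("overconfident")]
agents[0].settle(value_created=5.0, harm_caused=1.0)     # modest, reliable value
agents[1].settle(value_created=8.0, harm_caused=120.0)   # one catastrophic error

survivors = [a.name for a in agents if a.active]
print(survivors)  # ['careful'] -> the overconfident agent is selected out
```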
What’s next on my reading list?
Next, I’ll be reading “Virtual Agent Economies” by authors from Google DeepMind, which discusses the imminent emergence of agentic economic layers that will only increase in scale and speed as AI agency increases. I am curious to learn how the authors are thinking about the intersection of economics and trustworthy AI system design.