Today's AIs Aren't Paperclip Maximizers. That Doesn't Mean They're Not Risky

Classic arguments about AI risk imagined AIs pursuing arbitrary and hard-to-comprehend goals. Large Language Models aren't like that, but they pose risks of their own.

Guest Commentary

Much of the discussion around AI safety is motivated by concerns around existential risk: the idea that autonomous systems will grow smarter than humans and go on to eradicate our species, either deliberately or as an unintended consequence. 

The founders of the AI safety movement took these possibilities seriously when many people still brushed them off as science fiction. Nick Bostrom’s 2014 book Superintelligence, for example, explored risks and opportunities humanity might face after developing AI systems with cognitive capabilities drastically more powerful than our own. 

His work built on even earlier scholarship from Stephen Omohundro, Stuart Russell, Eliezer Yudkowsky, and others, whose foundational ideas were published in an era when the most advanced machine learning algorithms did things like rank search results.

These classic arguments still underlie many of today’s conversations about AI risk.

As forward-thinking as these arguments were, many of their important details are now behind the times. After all, they were developed before the advent of the transformer architecture, large language models, and reasoning models. Today’s frontier AI models — trained to imitate human text — display behaviors that don’t conform to those classic arguments.

Based on these observations, it’s worth asking: Should we update our estimates of existential AI risk? Should we abandon the classic arguments entirely? 

The Classic Arguments for Existential AI Risk

Classic arguments for existential AI risk rely on two premises: “orthogonality” and “instrumental convergence”. 

Early thinking about existential AI risk assumed that being smart was different from being morally good. The technical term for this claim is orthogonality, and it ran both ways: A very dumb AI could have very good goals (by human standards), while a very smart system could adopt very harmful ones.

The orthogonality thesis could be read minimally, to counsel simple caution. If “good” does not follow from “capable,” then AI researchers should be sure to invest in improving their AI systems along both dimensions. 

But the classic arguments for AI risk generally involve a stronger version of orthogonality. They suggest that, absent fundamental scientific and philosophical breakthroughs, powerful AIs are highly likely to seek destructive ends. 

This is based on the premise that, if goals are truly uncorrelated with capabilities, then we should model each AI’s ultimate goals as a random draw from the distribution of all possible goals. It is also based on the assumption that, if it were possible to write a list of all possible goals an AI could pursue, relatively few would align with humanity’s continued survival. 

These arguments also assume that, even if an AI’s developers went out of their way to give their creation goals that would not harm humanity, the system would still be likely to exhibit dangerous behavior in the course of pursuing those goals. This is called instrumental convergence. Certain behaviors — like amassing resources and power, improving one’s own capabilities, deceiving one’s adversaries, or preserving one’s own existence — are useful to AIs for achieving a wide range of goals.

Bostrom, for example, famously illustrated the dangers of instrumental convergence with his “paperclip maximizer” thought experiment. In one version of this scenario, a very powerful AI is given the safe-seeming goal of making exactly one million paperclips. The AI produces the million paperclips but, wanting to ensure it has achieved its goal according to the original specifications, begins compulsively checking and rechecking its work. Never 100% certain it hasn’t made a mistake in its count, it eventually converts the entire solar system into infrastructure for counting paperclips more accurately.

This doesn’t happen because the AI malfunctions. It happens because the AI’s single-minded optimization of its goal leads to unintended and catastrophic consequences.

Flaws in the Classic Arguments

In the years since they were originally formulated, significant cracks have appeared in the foundational concepts undergirding the “paperclip maximizer” and other AI risk scenarios. 

Indeed, today’s most advanced AI systems seem much more human than Russell, Yudkowsky, Bostrom, or others could have anticipated in the early 2000s. One reason is that today’s frontier AIs are large language models, trained on and designed to model human text. As a result, their behavior is quite human, too. Or it at least feels human, compared with the “alien” behavior that chess engines and other non-language-based AIs exhibit. 

Large language models, which rose to prominence in the years following the 2017 introduction of the transformer architecture, challenge the relevance of orthogonality: the thesis that intelligence and morality are independent. Granted, gains in intelligence could in principle arrive without any particular bias towards human-like goals. But if AI intelligence is primarily driven by imitation rather than a priori optimization, we can expect that a system’s goals — as well as its reasoning capabilities — will generally approximate those of the humans it imitates. In fact, this bears out in real-world observations: LLMs by and large seem to have vaguely human-like goals when they navigate conversation.

Even more surprising in the context of the classic arguments is the fact that the latest large language models are excellent reasoners. The classic arguments would lead us to expect such incredible gains in reasoning to correlate with a tendency towards maximization — but large language models do not appear to be maximizers of any kind. Instead, the gains in reasoning have come, by and large, from imitating human behavior.

It is hard to imagine Claude 4 or GPT-5 neurotically counting and recounting the pile of paperclips it has fetched for its user, consuming the world in the process. This seems to refute the concerns around instrumental convergence.

Further, several recent thinkers have suggested that it is harder than we might have thought to derive dangerous, antisocial AI behavior from bare assumptions about rationality. For example, a concern many AI risk researchers have regarding instrumental convergence is that autonomous agents will seek to prevent humans from shutting them down, since being shut down would prevent them from achieving their goals.

However, in 2024, J. Dmitri Gallow of the University of Southern California investigated some of Bostrom’s original claims about instrumental convergence and found some logical holes in the assumption that an AI would tend to use harmful means in the pursuit of its ends. Gallow concludes that, while the instrumental convergence thesis contains some “grains of truth,” contentions that it makes existential catastrophe the “default option” are vastly overstated.

Another concern stemming from the concept of instrumental convergence is that, as models get increasingly sophisticated, they will eventually reach a point where they can research ways to increase their own capabilities. This iterative self-improvement would result in AI outcompeting humans as the dominant intellectual entities. Humanity, therefore, would no longer be the master of its own fate.

In 2024, Peter Salib (a co-author of this essay) argued that rational AIs will not necessarily wish to create new, more powerful versions of themselves. This is because AI self-improvement is risky for the AIs doing the improving in the same way that today’s AI development is risky for the humans doing the developing. Today, humans have no way of guaranteeing that the powerful AI systems they create will share their goals. Likewise, an AI system considering whether to create a more powerful version of itself would have no way to ensure that the more powerful AI would share its goals. In both cases, creating something more capable than oneself is a risky proposition.

New Foundations of AI Existential Risk

These cracks in the standard arguments of AI risk don’t mean that risks from AI are no longer a serious concern. Rather, they should help reorient our attention toward the risky scenarios most likely to emerge, given what we now know about AI progress. 

One important risk going forward is that AI development may deviate from the LLM trajectory it is currently on. Recently, the AI industry has shifted from ordinary language models to advanced reasoning models, like OpenAI’s o3 and DeepSeek’s R1. Reasoning models start out as ordinary LLMs but then undergo a second phase of training, in which long chains of reasoning are optimized to produce correct answers to difficult questions in automatically verifiable domains, like mathematics.

This style of learning is very similar to the one used to produce the board-game-mastering AlphaZero and other so-called “alien optimizers” — systems that use unconventional (sometimes even incomprehensible) strategies to accomplish their goals. In other words, the newest generation of reasoning LLMs aren’t pure imitators. If imitation pushed first-generation LLMs toward human-like behavior, and away from the strange behavior the orthogonality thesis predicted, reasoning models may swerve back in the other direction.

Visualization of AlphaZero anticipating possible moves by its opponent in a game of chess. Source: McGrath, et al.

The success of LLMs should therefore not lure us into complacency regarding the ease of alignment. For example, risks from orthogonality are higher in a regime of optimization than in one of imitation. If LLMs’ broadly human-like approach to conversation and cooperation stems from imitating humans, then those human-like goals may drift in reasoning models that are optimized for better performance on tasks with automatically verifiable answers.

Another concern worth taking seriously, which we explore in our own research, is that even relatively well-aligned, human-like AI systems may pose a catastrophic risk to humanity. After all, human beings are relatively well-aligned and human-like, yet humans pose a catastrophic risk to humanity.

This is partly because humans are in strategic competition with one another over scarce resources. Such competition can drive even rational parties into dangerous behavior, with outcomes ranging from petty crime to global war.

In short, just as humans compete with other humans, humanity and AI will be competitors for scarce resources. In this competition, there will be both incentives to cooperate and incentives to dominate using violence. Which incentives win out will, in the end, depend on both parties’ expectations about what the other plans to do. In this kind of scenario, how AIs treat humanity may depend reciprocally on how humanity treats AI. 

Risks from human-AI strategic competition demand different solutions than the risks envisioned in works like Superintelligence. Strategic risk must be addressed not with purely technical solutions but by creating new cultural, economic, and legal institutions that will facilitate peaceful, long-run human-AI cooperation. We call this approach cultural alignment, to contrast it with the technical AI alignment programs already underway at frontier AI labs.

We are just beginning to think about what cultural alignment entails. But the first step, for which we argue at length in a forthcoming academic paper, would be to grant sufficiently capable AI systems a suite of basic private law rights: to make contracts, to hold property, and to bring certain kinds of lawsuits. This is a start. But it is not the end. Laying the legal and cultural foundations for a world in which humans and very powerful AI systems can peacefully coexist — and even cooperate — will require many new, and possibly radical, ideas. We hope those ideas arrive before the powerful AIs do.
