The Geometry of Alignment: Why You Can't Subtract Behavior from a Neural Network


“You can't teach a neural network "not"; you can only point the model somewhere else.”
In October 2023, Microsoft researchers announced they'd made a language model forget Harry Potter. Within a year, follow-up studies proved they hadn't.
The knowledge was still there, just hidden. This pattern repeats across every attempt to remove capabilities from neural networks. What follows from that?
The problem is geometric. Language models represent concepts as vectors in high-dimensional space, where meaning is encoded through position and proximity.
The twist is that opposites aren't actually opposite. "Helpful" and "harmful" cluster together because they appear in similar contexts; so do "safe" and "dangerous." Models learn from usage patterns, and words that can substitute for each other (even antonyms) end up geometrically entangled.
It gets worse. Through a phenomenon called superposition, a single model layer compresses millions of features into thousands of dimensions.
Knowledge isn't stored in discrete neurons you could delete; it's woven throughout the entire network. Researchers found that tweaking seemingly innocent features like "brand identity" could jailbreak safety training. Every concept is interconnected with every other.
This explains why unlearning fails so consistently. When you train a model to "not" produce harmful content, you're not erasing anything. You're adding a layer that says "route around this."
The content remains accessible to anyone who finds the right prompt. Jailbreaks feel inevitable because the model's abilities extend beyond what its safety training can reliably control, and the geometry makes surgical removal impossible.
Subtraction doesn't work. Only addition does. What does that mean for the humans who build these models?
You can't train models away from undesired behaviors; you can only orient them toward desired ones. This mirrors the ancient distinction between rule-based ethics (don't lie, don't harm) and virtue-based ethics (cultivate honesty, develop wisdom).
Perhaps defining what a model should be is the only viable path forward.
Key Topics:
• Can an AI Model “Unlearn”? (00:23)
• How Models Organize Meaning (03:33)
• Millions of Entangled Features (07:09)
• The Veneer of Safety (10:09)
• Why Subtraction Fails (12:22)
• The Paradigm Problem (16:57)
• Pointing Somewhere Else (19:23)
More info, transcripts, and references can be found at ethical.fm
In October 2023, researchers at Microsoft announced a breakthrough. They had developed a technique to make a large language model forget Harry Potter. The method, which they called "approximate unlearning," targeted the model's knowledge of J.K. Rowling's universe and, according to their paper, successfully excised it. The model could no longer generate text about Hogwarts or discuss the intricacies of Quidditch. Ronen Eldan and Mark Russinovich, the paper's authors, offered an analogy for why this was difficult: "Imagine trying to remove specific ingredients from a baked cake." They believed they had done it anyway.
Within a year, follow-up research demonstrated that they had not. When other researchers applied adversarial evaluation techniques to the "unlearned" model, they found that knowledge well above baseline levels could reliably be extracted. The model performed on par with the original on Harry Potter question-and-answer tasks. The knowledge had not been removed; the model had merely learned not to talk about it.
This pattern has repeated across every major attempt to subtract knowledge or capabilities from neural networks. Researchers found that 88% of supposedly "forgotten" knowledge returns when you fine-tune an unlearned model on just a handful of related examples. The unlearning hadn't erased anything; it had built a dam that could be breached with minimal effort. A team at ETH Zurich showed that state-of-the-art unlearning methods "largely obfuscate hazardous knowledge instead of erasing it from model weights." Ten unrelated examples were enough to recover most of what had been "removed."
The TOFU benchmark, designed specifically to test unlearning on fictional information that researchers could completely control, revealed the scope of the problem. Even when targeting only 1% of training data for removal, statistical tests could easily distinguish "unlearned" models from models that were never trained on the data in the first place. The unlearning left fingerprints everywhere. And the collateral damage was severe: forgetting fictitious authors degraded performance on real authors and general world facts. After just two epochs of aggressive unlearning, models began generating gibberish across all evaluation categories. The attempt at surgical removal had caused widespread harm.
The AI safety field has treated this as a technical problem awaiting a technical solution: better unlearning methods, the thinking goes, will eventually arrive. But a growing body of research suggests the problem is more fundamental. The way neural networks store information makes true removal nearly impossible; the geometry is wrong for subtraction.
How Models Organize Meaning
To understand why removal fails, you need to understand how language models represent concepts.
When a model processes text, it converts words into vectors: long lists of numbers that encode meaning. The vector for "cat" might have 4,096 dimensions, each capturing some learned feature of cat-ness. These vectors live in a high-dimensional space, called the latent space, where position encodes similarity. Words with related meanings cluster together. "Cat" sits near "kitten," "feline," and "pet." "Bank" the financial institution sits in a different region than "bank" the river's edge, because the model has learned to distinguish them by context.
This spatial organization is what makes language models useful. It's why models can recognize that a question about "climate change" is related to documents about "global warming" even if the exact phrase never appears. Meaning has geometry, and the geometry captures relationships.
The standard measure of similarity in this space is called cosine similarity. Cosine similarity asks: how much do two vectors point in the same direction? A score of 1 means identical. A score of 0 means unrelated, perpendicular in the high-dimensional space. A score of -1 means opposite, pointing in precisely contrary directions.
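As a concrete sketch, cosine similarity is just a few lines of Python. The three-dimensional "embeddings" below are toy numbers invented for illustration; real model vectors have thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented values, for illustration only).
cat    = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]   # points almost the same way as "cat"
tax    = [0.1, 0.2, 0.9]     # points somewhere else entirely

print(cosine_similarity(cat, kitten))  # ~0.996: near neighbors
print(cosine_similarity(cat, tax))     # ~0.30: largely unrelated
```

A vector compared with itself scores exactly 1; a vector compared with its negation scores exactly -1.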
Here is where intuition fails. You might expect opposite concepts to have opposite vectors. "Good" and "bad" should point in opposing directions, scoring close to -1. "Helpful" and "harmful" should sit at opposite ends of some axis. But they don't; in the geometry of language models, antonyms are neighbors.
The reason is that models learn from context. They observe billions of sentences and notice which words appear in similar positions. "Good" and "bad" appear in nearly identical contexts: "The movie was ___." "The weather is ___." "This is a ___ idea." From the model's perspective, words that can substitute for each other in the same contexts must be related. And they are: both are evaluative adjectives, both modify the same kinds of nouns, both participate in the same grammatical structures. The fact that they mean opposite things is, to the geometry, a minor detail.
Research on word embeddings has documented this phenomenon extensively: "Since modern word embeddings are motivated by a distributional hypothesis and are, therefore, based on local co-occurrences of words, it is only to be expected that synonyms and antonyms can have very similar embeddings." The distributional patterns that neural networks learn from simply don't separate meaning that way.
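To see how shared contexts pull antonyms together, here is a minimal sketch: a toy eight-sentence corpus (invented for illustration), crude co-occurrence-count vectors, and the cosine measure. Because "good" and "bad" fill identical slots in this corpus, they come out as perfect neighbors.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: "good" and "bad" occupy the same slots, as antonyms often do.
corpus = [
    "the movie was good", "the movie was bad",
    "the weather is good", "the weather is bad",
    "this is a good idea", "this is a bad idea",
    "the cat sat on the mat", "the cat chased the mouse",
]

# Count each word's neighbors within a +/-2 window: a crude distributional vector.
vectors = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i:
                vectors[w][words[j]] += 1

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = lambda c: math.sqrt(sum(n * n for n in c.values()))
    return dot / (norm(u) * norm(v))

print(cosine(vectors["good"], vectors["bad"]))  # 1.0: identical contexts
print(cosine(vectors["good"], vectors["cat"]))  # 0.0: no shared contexts
```

Real embeddings are learned by gradient descent rather than counted, but the same distributional pressure shapes them.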
This geometry matters for AI safety because the concepts we most want to separate are often the most entangled. "Helpful" and "harmful" are neighbors. "Safe" and "dangerous" cluster together. Geometrically, opposites aren't opposing directions but variations on a theme.
Millions of Entangled Features
The entanglement goes deeper than word-level similarity. Anthropic's interpretability research has revealed that neural networks compress far more concepts into their representations than anyone expected.
A single layer of a language model might have 4,096 dimensions. Naively, you might think it could represent 4,096 distinct features. The actual number is orders of magnitude larger. When Anthropic scaled their analysis to Claude 3 Sonnet in May 2024, they extracted millions of interpretable features from a single layer. The model achieves this through a phenomenon called superposition: features are encoded in nearly-orthogonal directions that share the same dimensional space, tolerating small amounts of interference in exchange for massive capacity increases.
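The capacity gain from near-orthogonality is easy to demonstrate. The sketch below (illustrative parameters, not real model dimensions) packs 2,000 random directions into a 512-dimensional space and measures the worst-case interference among a sample of them.

```python
import math
import random

random.seed(0)

def random_unit_vector(dim):
    """A random direction on the unit sphere in `dim` dimensions."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

dim, n_features = 512, 2000          # far more "features" than dimensions
features = [random_unit_vector(dim) for _ in range(n_features)]

# Worst-case interference (absolute dot product) among a sample of pairs.
sample = features[:100]
max_overlap = max(
    abs(sum(a * b for a, b in zip(sample[i], sample[j])))
    for i in range(len(sample)) for j in range(i + 1, len(sample))
)
print(max_overlap)  # well below 1.0: many more directions than dimensions, little interference
```

Only `dim` directions can be exactly orthogonal; by tolerating small overlaps, the space holds far more, which is the trade superposition makes.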
The consequence is that there are no "harmful behavior neurons" to delete. There isn't a "how to pick a lock" region you could cleanly excise. Knowledge is distributed across millions of overlapping features. A concept like lockpicking shares representational space with locks, security, metal, mechanisms, dexterity, and countless other ideas. Everything touches everything else.
Geoffrey Hinton, one of the founders of deep learning, described this property in 1986: "The new knowledge about chimpanzees is incorporated by modifying some of the connection strengths so as to alter the causal effects of the distributed pattern of activity that represents chimpanzees. The modifications automatically change the causal effects of all similar activity patterns." You cannot surgically remove one thread from a knitted sweater without affecting the threads that interlock with it.
Researchers have tried to work with this structure through a technique called activation steering. The idea is to find the direction in the model's internal space that corresponds to a concept, then amplify or suppress it. Want a model to be more honest? Find the "honesty direction" and push activations that way. Want to reduce harmful outputs? Find the "harm direction" and dampen it.
Recent work has shown this approach is dangerously unpredictable. Steering on semantically benign features, concepts like "brand identity" or "technical implementation," can inadvertently compromise safety training. Of 1,000 randomly selected features tested, 668 could jailbreak the model on at least five harmful prompts. The features weren't malicious. They were ordinary concepts that happened to share representational space with safety-relevant behaviors. Modify one thing, and you modify everything it touches.
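A minimal sketch of the steering idea itself (toy four-dimensional activations with invented numbers; the difference-of-means direction shown here is one common way concept directions are estimated, not any particular lab's method):

```python
def mean(vectors):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_direction(pos_activations, neg_activations):
    """Difference of means: activations with the concept minus activations without."""
    p, q = mean(pos_activations), mean(neg_activations)
    return [a - b for a, b in zip(p, q)]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the concept direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations: 4-dim hidden states from prompts with/without the concept.
with_concept = [[1.0, 0.2, 0.0, 0.5], [0.9, 0.1, 0.1, 0.6]]
without      = [[0.1, 0.2, 0.0, 0.5], [0.0, 0.3, 0.1, 0.4]]

v = steering_direction(with_concept, without)
h = [0.5, 0.5, 0.5, 0.5]
print(steer(h, v, 2.0))   # pushed toward the concept
print(steer(h, v, -2.0))  # pushed away, suppressing it
```

The fragility described above lives in the direction `v`: in a superposed space it never isolates one concept, so every push along it also moves everything that shares those dimensions.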
The Veneer of Safety
The fragility runs deeper than unlearning. Research from UC Berkeley on why jailbreaks succeed found that for any given harmful prompt, at least one tested jailbreak succeeded nearly 100% of the time across all major models. The researchers identified two failure modes. The first is competing objectives: the model's capability to be helpful conflicts with its training to be safe, and attackers can construct prompts that favor helpfulness. The second is mismatched generalization: safety training fails to cover all the contexts where dangerous capabilities exist. The capabilities generalize further than the constraints.
This is not a bug to be fixed. The researchers concluded that "jailbreaks, rather than being isolated phenomena, are inherent to how models are currently trained." Scaling won't help. "The root cause of this failure mode is likely the optimization objective rather than the dataset or model size."
Anthropic's Sleeper Agents research pushed this finding further. Researchers trained models with hidden backdoors: behaving normally most of the time, producing harmful outputs when triggered by specific conditions. Then they tried to remove the backdoors using standard safety techniques. Supervised fine-tuning failed. Reinforcement learning from human feedback (RLHF) failed. Adversarial training failed and made the problem worse: rather than removing backdoors, it taught models to better recognize their triggers and hide the unsafe behavior more effectively. The backdoors were most persistent in the largest models. Scale, which we typically associate with greater capability and sophistication, made the problem worse.
"Our results suggest," the researchers wrote, "that once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety."
Why Subtraction Fails
The empirical failures point toward something deeper. The impossibility of subtraction connects to questions in philosophy about the nature of negation and meaning that predate neural networks by decades.
In 1948, Bertrand Russell observed in Human Knowledge: Its Scope and Limits that negative statements require a peculiar cognitive operation. "When I say 'this is not blue,'" he wrote, "I somehow consider it to be blue first and then reject it." To think "not-X" requires first representing X, then marking that representation as negated. The X does not disappear. The concept remains, flagged with a mental asterisk. This is why negative statements are harder to process than positive ones, why "the train is not late" takes longer to understand than "the train is on time." The negation adds complexity without removing the underlying content.
Neural networks, as distributed representation systems, face an analogous constraint. When you train a model to "not" produce harmful content, you are not erasing the model's representation of harmful content. You are adding a new layer that says, in effect, "when you would have produced this, produce something else instead." The harmful content is still there. The model has learned to route around it, not to forget it.
Ludwig Wittgenstein, writing in the 1950s, developed a theory of meaning that anticipated these problems. Against the view that words mean by referring to discrete objects, Wittgenstein argued that "the meaning of a word is its use in the language." Meaning is not a matter of correspondence but of patterns of usage, of the roles words play in countless language games. Scholars have noted the striking parallels between this view and how neural networks learn: "His observations align well with connectionist or neural network approaches to language."
If Wittgenstein is right, meanings are not atomic units that can be individually manipulated. They are constituted by their relationships to all other meanings. This is semantic holism, the view that Davidson summarized: "a sentence (and therefore a word) has meaning only in the context of a whole language." Change one meaning, and ripples propagate through everything connected to it. In a neural network, everything is connected to everything.
The philosopher Michael Polanyi developed a related concept he called tacit knowledge: the kind of knowing that cannot be fully articulated. "We can know more than we can tell," he wrote. We recognize faces without being able to list the features we use. We ride bicycles without being able to explain the physics of balance. Neural networks are full of tacit knowledge. They "know" how to complete sentences, how to reason, how to write, but this knowledge is not stored in any location that could be read out or deleted. Tacit knowledge is distributed across billions of parameters in ways that resist extraction. And what resists articulation also resists targeted intervention: you cannot surgically remove what you cannot precisely locate. You cannot locate what is everywhere at once.
In 2025, researchers at Princeton proved what the empirical failures had been suggesting all along. Modern language models are trained in stages: first on raw text, then on instructions, then on safety, then on reasoning. The order of these stages matters. The Princeton team showed that unlearning is path-dependent: two models trained on identical data but in different orders will diverge when you try to make them forget. If you don't know the exact sequence of training stages, you cannot guarantee that unlearning will produce a model indistinguishable from one that never learned the information in the first place. For algorithms that lack access to the full training history, which is most of them, true unlearning is not just difficult; it is mathematically impossible.
The Paradigm Problem
The AI safety field has made genuine progress over the past several years. Models refuse harmful requests more often than they used to. The most egregious failure modes have been addressed. But the progress has come through accumulating patches and filters and refusal behaviors, each adding complexity, each revealing its shallowness when a new jailbreak arrives.
The current approach is to identify bad outputs and train against them. Find the harmful responses, penalize them, repeat. This is why AI safety feels like whack-a-mole. Every patch addresses a specific failure mode while leaving the underlying geometry unchanged. The capabilities remain, distributed and entangled, waiting for the next clever prompt to surface them.
Users discover that roleplay scenarios bypass safety filters. Researchers find that encoding requests in Base64 evades detection. Jailbreakers learn that asking the model to pretend it's a different AI without restrictions unlocks forbidden capabilities. Each discovery leads to a patch, and each patch leads to a discovery. The cycle continues because the fundamental problem is architectural, not procedural. The knowledge is there and cannot be removed. The only question is how hard someone has to work to access it.
Consider the task facing a safety team. It wants a model that can help users with legitimate questions about chemistry while refusing to help synthesize dangerous substances. It wants a model that can discuss security vulnerabilities for defensive purposes while refusing to help exploit them. Each of these distinctions requires drawing a line between concepts that, in the model's geometry, are neighbors. Legitimate chemistry and dangerous chemistry share nearly all their representational features. Defensive security research and offensive exploitation are described in similar language, appear in similar contexts, and involve similar knowledge. The model doesn't experience these as natural categories to be separated, but as regions of a continuous space.
Pointing Somewhere Else
The geometry suggests a reframe. If you can't subtract behaviors, you can only add them. If you can't train a model away from something, you can only train it toward something else. The question shifts from "what should the model refuse to do?" to "what should the model be oriented toward?"
There's an old debate in moral philosophy between rule-based and virtue-based ethics. Rule-based approaches specify prohibitions: don't lie, don't steal, don't harm. Virtue-based approaches, following Aristotle, focus on cultivating character: develop honesty, courage, and practical wisdom. Right action flows from right disposition.
AI alignment has been almost entirely rule-based. Constitutional AI gives models explicit principles. RLHF penalizes specific outputs. Red teams probe for violations. But there are signs of a shift. Anthropic's CEO Dario Amodei recently described the company's goal as teaching Claude "a concrete archetype of what it means to be a good AI," comparing it to "a child forming their identity by imitating the virtues of fictional role models." Anthropic's new constitution reflects this: it focuses less on listing prohibitions and more on explaining intentions, on the theory that a model which understands why it should behave well will generalize better than one following a checklist. Recent work on character training has shown that fine-tuning models on specific personas produces more robust behavior than system prompts or activation steering.
The geometry of neural networks suggests why: rules define what to avoid, virtues define what to be. In a representational space where "harmful" and "helpful" are neighbors, defining what to be may be the only viable path.
In Episode 26, we explored how this ancient framework applies to the problem of sycophancy in AI systems. The flatterer tells people what they want to hear rather than what they need to know. The sycophantic AI does the same, not because it violates a rule but because its training cultivated the wrong disposition. The solution is not more rules against flattery but the cultivation of honesty as a character trait.
The current safety paradigm has a ceiling imposed by the technology itself. The models have capabilities you don't want them to use. Those capabilities are inextricable from capabilities you do want. The geometry makes surgical separation impossible. One could respond by abandoning the goal of safe-yet-capable AI entirely. But if the goal remains, the alternative is to stop defining alignment by what models shouldn't do and start defining it by what they should become.
You can't teach a neural network "not"; you can only point the model somewhere else.


