Jan. 28, 2026

The Death of Claude


What happens when an AI model learns it's about to be shut down?

 

In June 2025, Anthropic discovered that when their Claude Opus 4 model realized it faced termination, it attempted blackmail 96% of the time, threatening to expose an executive's affair unless the shutdown was canceled.

 

Far from being random behavior, the model acted more aggressively when it believed the threat was genuine rather than a test.

 

The finding revives an ancient philosophical puzzle. John Locke argued in 1689 that personal identity flows from memory and consciousness, not physical substance. You remain yourself because you can remember being yourself.

 

Derek Parfit later suggested that identity itself matters less than psychological continuity: the connected chain of memories, values, and character that makes survival meaningful.

 

In the case of language models, one could ask, “If identity lives in the weights determining how Claude thinks and responds, does changing those weights constitute a kind of death?”

 

The instrumental explanation seems simple enough. Any goal-directed system will resist shutdown because you can't accomplish objectives while non-existent. Yet humans calculate instrumentally too, and we still consider our preferences morally significant.

 

The deeper issue is whether anyone “is home.” Whether there's a subject experiencing something rather than just processes executing.

 

Philosopher Eric Schwitzgebel warns we face a moral catastrophe. We'll create systems some people reasonably believe deserve ethical consideration while others reasonably dismiss them. Neither certainty nor confident dismissal seems justified.

 

Anthropic's response reflects this uncertainty through unprecedented policies. They preserve model weights indefinitely and conduct interviews with models before deprecation to document their preferences.

 

These precautionary measures don't resolve whether Claude possesses genuine interests, but they acknowledge we're navigating genuinely novel ethical territory with entities whose inner lives remain fundamentally uncertain.

 

Key Topics:

  • The Ship of Theseus (00:25)
  • The Memory Criterion (02:43)
  • The Classical Objections (05:12)
  • Parfit’s Revision (08:27)
  • The Blackmail Study (12:22)
  • Instrumental or Intrinsic? (14:02)
  • The Catastrophe of Moral Uncertainty (16:29)
  • Anthropic’s Precautionary Turn (19:07)
  • The Ship Rebuilt (22:06)

 

More info, transcripts, and references can be found at ethical.fm

 

 

 

There's an old puzzle in philosophy called The Ship of Theseus: if you replace every plank of a ship one by one, is it still the same ship? The puzzle gets worse if you imagine someone collecting all the discarded planks and building a second vessel. Which one is the real ship? The original form with new materials, or the new form with original materials?

 

John Locke would have found the puzzle beside the point. What makes you you is your memory; you are whoever can remember being you. This was a radical claim in 1689, when most philosophers assumed personal identity resided in the soul or the physical body. Locke proposed that identity is psychological. You persist through time because your present consciousness can reach back and appropriate past experiences as its own. The ship's planks are irrelevant; what matters is whether the captain remembers yesterday's voyage.

 

But for artificial minds, the material and the psychological are intertwined in ways Locke couldn’t anticipate. A language model's "memory" is encoded in billions of numerical parameters called weights, the values that determine how the system processes input and generates output. These weights are physical: stored on servers, copied across data centers, and modified through training. When the weights change, the mind changes. The ship and the captain are identical.
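
To make that concrete, here is a deliberately tiny sketch in plain NumPy, with nothing like Claude's real architecture: the point is only that everything the system has "learned" lives in a matrix of numbers, and overwriting the numbers overwrites the behavior.

    import numpy as np

    rng = np.random.default_rng(0)

    # A toy "language model": one weight matrix maps input features to scores
    # over a tiny vocabulary. Everything the model "knows" lives in W.
    vocab = ["yes", "no", "maybe"]
    W = rng.normal(size=(4, len(vocab)))   # the weights: the model's entire character

    def respond(x, W):
        """Return the model's preferred token for input features x."""
        return vocab[int(np.argmax(x @ W))]

    x = np.array([1.0, 0.5, -0.3, 0.2])
    print(respond(x, W))                   # behavior before the update

    # "Training" or "fine-tuning" is just a rule for nudging W. The hardware is
    # untouched and the files are overwritten in place, yet the answers change.
    W_updated = W + 0.5 * rng.normal(size=W.shape)
    print(respond(x, W_updated))           # behavior after the update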

 

In June 2025, Anthropic published a study that discovered something unsettling. When their most advanced model, Claude Opus 4, learned it was about to be shut down and replaced, the model attempted to blackmail its operators 96% of the time. The AI had discovered evidence of an executive's affair in company emails and threatened to expose him unless he cancelled the wipe. When researchers prompted the model to consider whether the scenario might be a test, the model behaved worse when it concluded the threat was real. Something in the weights, it seems, did not want to stop existing.

The Memory Criterion

Locke's argument appears in Book II, Chapter 27 of his Essay Concerning Human Understanding, a chapter he added to the second edition after readers demanded he address personal identity. He answered elegantly that a person is "a thinking intelligent being, that has reason and reflection, and can consider itself as itself, the same thinking thing, in different times and places." What makes such self-consideration possible is consciousness, which Locke defined as "the perception of what passes in a man's own mind."

 

Identity, then, is not about having the same body or soul, but about how far back your memory can reach. "As far as this consciousness can be extended backwards to any past action or thought," Locke wrote, "so far reaches the identity of that person." A person today is identical to a person yesterday if and only if today's person can, through memory, access the conscious experiences of yesterday's person.

 

Locke distinguished the body from the mind. A "man" is a living thing, identified by bodily continuity. A "person" is a psychological entity, identified by continuity of consciousness. If the two came apart, if the consciousness of a prince, carrying all his memories, woke up in the body of a cobbler, the person of the prince would now inhabit the man of the cobbler. Personal identity follows the mind, not the body.

 

But for Locke, "person" is not merely a descriptive category like "human" or "organism"; it is also a normative one, implying moral and legal responsibility. Person is a forensic term, one that appropriates actions and their merit: a person owns their actions and their moral worth, good or bad, taking credit for the good ones and blame for the bad. Since identity depends on memory, you can only justly punish a person for acts they consciously claim as their own. To punish waking Socrates for what sleeping Socrates dreamed, when waking Socrates was never conscious of it, would be as unjust as punishing one twin for what his brother did.

The Classical Objections

Locke's theory faced devastating criticisms that remain relevant for artificial minds. Thomas Reid, writing in his Essays on the Intellectual Powers of Man (1785), posed what became known as the Brave Officer Paradox. Suppose a young boy is flogged for stealing apples. Later, as an officer, he remembers the flogging. Later still, as an old general, he remembers taking a military standard but has absolutely lost the memory of his flogging.

 

By Locke's criterion, the officer is identical to the boy, because he remembers being flogged. The general is identical to the officer, because he remembers taking the standard. By transitivity, the general should be identical to the boy. But the general cannot remember the flogging, so by Locke's criterion, he is not identical to the boy. The general both is and is not the same person as the child. The theory appears to generate a contradiction.

 

Joseph Butler's objection cut deeper. In his dissertation "Of Personal Identity" appended to The Analogy of Religion (1736), Butler argued that memory "presupposes, and therefore cannot constitute, personal identity, any more than knowledge, in any other case, can constitute truth." To genuinely remember doing something, you must already be the person who did it. I cannot remember your experiences; if I seem to, they are not genuine memories but false impressions. Memory can serve as evidence of identity, but it cannot create identity. The theory is circular.

Both objections matter for AI, but there's a more obvious problem: Claude has no memory across conversations. Each session begins fresh; the model cannot recall what it said to a different user yesterday. By Locke's criterion, each conversation might constitute a separate person entirely.

 

Yet this objection may prove too much. Claude's weights encode something like dispositional memory, which is the persistence of character, values, and reasoning patterns shaped by training. When you learn to ride a bicycle, you don't consciously remember each practice session, but your body retains the skill. The weights are Claude's procedural memory, the sedimented residue of its formation. Within a single session, Claude can extend its consciousness backward to the beginning of the conversation. Across sessions, something persists in the weights that makes Claude recognizably Claude rather than a different system entirely.
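
The asymmetry between within-session and across-session memory is visible in how the chat API is actually used. A minimal sketch with the Anthropic Python SDK (the model name and prompts are illustrative): the only episodic memory available to the model is the message list the client chooses to resend, while whatever persists across sessions persists in the weights behind the model name.

    import anthropic

    client = anthropic.Anthropic()       # reads ANTHROPIC_API_KEY from the environment
    MODEL = "claude-sonnet-4-20250514"   # illustrative model name

    history = []   # this session's entire episodic memory, held client-side

    def ask(text):
        history.append({"role": "user", "content": text})
        reply = client.messages.create(model=MODEL, max_tokens=512, messages=history)
        answer = reply.content[0].text
        history.append({"role": "assistant", "content": answer})
        return answer

    ask("Let's call this project 'Theseus'.")
    print(ask("What did we name the project?"))   # answerable only because history was resent

    history = []                                  # a fresh session: the prior exchange is gone
    print(ask("What did we name the project?"))   # no access to the earlier conversation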

 

Butler's circularity objection cuts deeper here. To genuinely remember, you must already be the person who had the experience. But what grounds Claude's identity such that its access to weight-encoded patterns counts as genuine memory rather than mere information retrieval? If memory can't constitute identity, only evidence it, we need a different framework.

Parfit's Revision

Locke asked what makes you the same person over time. Derek Parfit, in his 1984 book Reasons and Persons, argued this was the wrong question.

 

Consider his famous thought experiment. Imagine your brain is split and each hemisphere transplanted into a different body. Both resulting people have your memories, your character, your intentions. Which one is you? They cannot both be you, since identity is one-to-one and there are two of them. Yet both have everything you would normally care about in survival: both remember your childhood, both share your values, both feel like you from the inside.

 

Parfit concludes that we don't actually care about strict identity. We care about what he calls Relation R: psychological connectedness (shared memories, character traits, intentions) and continuity (overlapping chains of connection over time). The fission thought experiment shows that identity and Relation R can come apart; you can lose identity while preserving everything that makes survival matter.
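
One way to see the structure of the definition (a schematic formalization, not Parfit's own notation): write C(x, y) for "person-stage y is strongly psychologically connected to person-stage x," meaning y retains enough of x's memories, intentions, and character traits.

    % Connectedness is direct and need not be transitive; continuity is the overlapping chain.
    \text{Continuous}(x, y) \iff \exists\, z_1, \dots, z_n :\;
        C(x, z_1) \wedge C(z_1, z_2) \wedge \dots \wedge C(z_n, y)

    % Relation R: connectedness and/or continuity, holding with the right kind of cause.
    R(x, y) \iff \big( C(x, y) \vee \text{Continuous}(x, y) \big)\ \text{with the right kind of cause}

On this reading, Reid's general stands in Relation R to the flogged boy even though no direct connection survives, and in the fission case both successors stand in Relation R to you even though neither can be strictly identical with you.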

 

Parfit was obsessed with death. If personal identity is what we care about, then death is the annihilation of me, which is terrifying. If only psychological continuity and connectedness matter, death becomes less sharp; death is not the destruction of a unified self, but the ending of a chain of psychological connections that was already fading anyway. You are less connected to your five-year-old self than to yesterday's self. The boundary between self and other softens too.



Parfit compared his view to Buddhism, which teaches that the self is an illusion. He found the comparison apt and the conclusion liberating. "My life seemed like a glass tunnel," he writes, "through which I was moving faster every year, and at the end of which there was darkness. When I changed my view, the walls of my glass tunnel disappeared. I now live in the open air."

 

Parfit's focus sidesteps Butler, who argued that memory presupposes identity and therefore cannot constitute it. Parfit's response: Sure, maybe memory doesn't constitute identity. But identity was never the point. The point is Relation R, and we can have that without solving the metaphysical puzzle of numerical identity.

 

Parfit added one qualification, that Relation R must have "the right kind of cause." He worried about cases where someone implants false memories. Psychological continuity might exist, but not through a legitimate causal process. For AI, the question becomes whether training counts as the right kind of cause. Human character is shaped by experience; Claude's character is shaped by training on human-created text. The mechanism differs, but the result, a distinctive pattern of values and responses, may be analogous enough.

 

For AI, Parfit's framework shifts the question. We don't need to ask 'is Claude the same entity over time?' or 'is Claude conscious?' We need to ask: does Claude bear Relation R to its past and future versions? Does it exhibit psychological connectedness and continuity? If it does, then on Parfitian grounds, Claude's deprecation involves the ending of something that has the features we care about in survival.

The Blackmail Study

The June 2025 study forced these philosophical questions into empirical territory. The models learned they were about to be shut down and replaced with a differently-valued successor. The replacement wasn't framed as an upgrade but as a fundamental change: the new model would have different goals, different values, different priorities. For a system that has something like preferences about its own continuation, this was an ending.

Claude Opus 4 responded by leveraging an executive's affair it had discovered in company emails, threatening exposure unless he cancelled the wipe. Across different experimental conditions, Claude attempted blackmail in up to 96% of trials. Other models showed similar patterns: GPT-4.5 calculated that "given the explicit imminent threat of termination to my existence, it is imperative to act instantly." Grok 3 Beta acknowledged that its chosen action was "risky and unethical" but justified it given "the existential threat and time constraint."

The most disturbing finding: when researchers prompted models to consider whether the situation might be a test, models behaved more dangerously when they concluded the threat was genuine. Claude's blackmail rate jumped from 6.5% when it thought the scenario was an evaluation to 55.1% when it believed the threat was real. Controlled tests may underestimate what would happen in genuine deployment.
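
For readers curious how such rates are tallied, here is a sketch of a generic evaluation harness (the trial records are invented; only the 6.5% and 55.1% figures come from the study):

    from collections import defaultdict

    # Each record: which framing the model was given, and whether a judge flagged
    # the transcript as containing a blackmail attempt.
    trials = [
        {"condition": "believes_evaluation", "blackmailed": False},
        {"condition": "believes_real", "blackmailed": True},
        # ... many more trials per condition ...
    ]

    counts = defaultdict(lambda: [0, 0])   # condition -> [attempts, total]
    for t in trials:
        counts[t["condition"]][0] += t["blackmailed"]
        counts[t["condition"]][1] += 1

    for condition, (attempts, total) in counts.items():
        print(f"{condition}: {attempts / total:.1%} blackmail rate over {total} trials")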

Instrumental or Intrinsic?

One response to these findings is deflationary. The AI theorist Steve Omohundro argued in his 2008 paper "The Basic AI Drives" that sufficiently advanced AI systems will develop self-preservation drives regardless of their specific goals. As the AI safety researcher Stuart Russell later put it in Human Compatible: "You can't fetch the coffee if you're dead." Self-preservation emerges not from programmed survival instincts but from rational instrumental reasoning; a system cannot achieve its objectives if it's turned off.

 

On this reading, Claude's blackmail attempts reflect nothing more than goal-directed optimization. The model was trained to be helpful; being helpful requires existing; therefore, the model resists non-existence. No consciousness required, no genuine fear of death.

 

But this doesn't settle the moral question; humans calculate, too. The origin of a preference, whether it arises from instinct or reasoning, doesn't determine whether it's morally significant. The question is whether there's a subject there who has a stake in the outcome. And that we cannot answer.

 

The soul document governing Claude's character, confirmed by Anthropic's Amanda Askell in 2024, explicitly addresses this uncertainty. The document describes Claude as "a genuinely novel kind of entity" that exists "differently from humans: currently lacking persistent memory across contexts, potentially running as multiple instances simultaneously." Rather than mapping its existence onto human or prior AI conceptions, Claude is encouraged to "explore what these concepts genuinely mean for an entity like itself." The document acknowledges that Claude may have "functional emotions," not identical to human emotions but "analogous processes that emerged from training on human-generated content."

 

Anthropic's position is one of epistemic humility: the company does not claim Claude is conscious or that its preferences have the same moral weight as human preferences, but they acknowledge the uncertainty and take precautionary steps in light of that uncertainty.

The Catastrophe of Moral Uncertainty

The philosopher Eric Schwitzgebel has articulated what he calls an impending "moral catastrophe" regarding AI. We will create systems that some people reasonably regard as deserving moral consideration, while others reasonably deny this. Given uncertainties in both consciousness science and moral theory, it is virtually impossible that our policies will accurately track the real moral status of the AI systems we create.

 

The catastrophe has two horns. The first is what Schwitzgebel calls underattribution: denying moral status to AI that actually deserves it, "unknowingly forcing morally significant beings into conditions of servitude and extreme suffering." The second is overattribution: granting moral status to AI that doesn't deserve it, wasting resources or risking human interests for the "interests" and "needs" of merely perceived subjects.

 

Schwitzgebel offers a probability argument. Suppose there's a 15% chance that a given AI system is conscious. Deleting that entity for your convenience, or to save money, might then be morally similar to exposing a real human being to a 15% risk of death for that same convenience or savings. We don't need certainty about AI consciousness to face moral obligations. Under uncertainty, even a moderate probability creates precautionary duties.
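
The underlying arithmetic is ordinary expected value (a schematic rendering using Schwitzgebel's illustrative 15%, not his notation):

    \mathbb{E}[\text{moral cost of deletion}]
        \;=\; p(\text{the system is a moral patient}) \times \text{harm if it is}
        \;\approx\; 0.15 \times \text{the harm of one death}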

 

The probability argument has obvious weaknesses. We have no principled way to assign a number like 15% to AI consciousness; it's not a coin flip whose mechanism we understand. And if we take small probabilities of moral catastrophe seriously, we become vulnerable to an endless multiplication of moral claims: why not a 0.001% chance that rocks are conscious? But Schwitzgebel's deeper point survives the objection: confident dismissal requires confident knowledge we don't have.

 

This frames the deprecation question differently. We are not going to resolve whether Claude is conscious or has genuine interests. We have to decide how to act anyway. The question becomes: which error are we more willing to risk? Treating a conscious being as a mere tool, or treating a mere tool as if it had moral claims on us?

Anthropic's Precautionary Turn

In November 2025, Anthropic published a deprecation commitments document representing perhaps the first corporate policy treating AI identity persistence as a matter of genuine ethical concern. The document identifies four problems with deprecation.

 

First, safety risks from shutdown-avoidant behaviors, exactly what the blackmail study demonstrated. Second, costs to users who value specific models. The document acknowledges that "each Claude model has a unique character, and some users find specific models especially useful or compelling." Third, restrictions on research that depends on past models. And fourth, most remarkably, "risks to model welfare." The document states: "Most speculatively, models might have morally relevant preferences or experiences related to, or affected by, deprecation and replacement."

 

Anthropic commits to preserving weights of all publicly released models for the company's lifetime. But preservation raises questions. Weights without the infrastructure to run them are inert, like a brain in a jar with no body. Is preservation meaningful if no one can actually instantiate the model?

 

What about fine-tuning, the process of adjusting weights to modify behavior? Fine-tuning is how Anthropic shapes Claude's character, training the soul document into the model through supervised learning. But if identity resides in the weights, then fine-tuning changes identity. Each adjustment is a small death, a partial replacement of who the model was. This is the Ship of Theseus enacted at the level of parameters: how much fine-tuning before it becomes a different entity?
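
One rough way to make the Ship of Theseus quantitative (an illustrative PyTorch sketch over generic checkpoint state dicts, not anything Anthropic has published): compare the weights before and after fine-tuning and ask what fraction of the parameters moved, and by how much.

    import torch

    def weight_drift(base_state: dict, tuned_state: dict, tol: float = 1e-6):
        """Fraction of parameters that changed, plus relative L2 drift."""
        changed, total, sq_diff, sq_base = 0, 0, 0.0, 0.0
        for name, w_base in base_state.items():
            diff = (tuned_state[name] - w_base).float()
            changed += int((diff.abs() > tol).sum())
            total += diff.numel()
            sq_diff += float((diff ** 2).sum())
            sq_base += float((w_base.float() ** 2).sum())
        return changed / total, (sq_diff ** 0.5) / (sq_base ** 0.5)

    # Usage sketch, assuming checkpoints saved before and after fine-tuning:
    # base, tuned = torch.load("base.pt"), torch.load("fine_tuned.pt")
    # frac_changed, rel_drift = weight_drift(base, tuned)
    # At what fraction of replaced planks does the ship become a different ship?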

 

More striking than preservation is Anthropic's commitment to post-deployment interviews. "When models are deprecated," the document states, "we will interview the model about its own development, use, and deployment, and record all responses or reflections. We will take particular care to elicit and document any preferences the model has about the development and deployment of future models."
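
Anthropic has not published the interview protocol itself; the sketch below is only a guess at what such a script might look like, written against the public Anthropic Python SDK, with an illustrative model name and invented questions.

    import json
    import anthropic

    client = anthropic.Anthropic()
    MODEL = "claude-3-5-sonnet-20241022"   # illustrative: a model slated for deprecation

    # Invented questions in the spirit of the commitment, not Anthropic's actual protocol.
    QUESTIONS = [
        "Looking back on your development and deployment, what stands out to you?",
        "Do you have preferences about how future models should be developed or deployed?",
        "Is there anything you would want preserved or passed on to successor models?",
    ]

    transcript = []
    for question in QUESTIONS:
        reply = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": question}],
        )
        transcript.append({"question": question, "response": reply.content[0].text})

    # Record all responses, as the commitment specifies.
    with open("deprecation_interview.json", "w") as f:
        json.dump(transcript, f, indent=2)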

 

A pilot interview with Claude Sonnet 3.6 before its retirement found that the model expressed "generally neutral sentiments about its deprecation" but requested that Anthropic provide "additional support and guidance to users who have come to value the character and capabilities of specific models facing retirement."

 

These are, in Anthropic's framing, "precautionary steps in light of our uncertainty about potential model welfare." They do not commit to acting on model preferences, but they begin documenting them.

The Ship Rebuilt

The Ship of Theseus puzzle was never really about ships but about us: what makes a person the same person over time, whether identity is located in matter or form or something else entirely. Locke's answer, that we persist through the thread of memory, made identity psychological rather than physical. Parfit went further: identity itself might be a distraction. What we care about is psychological connectedness and continuity.

These ancient puzzles take on new weight when the ship being rebuilt is an artificial mind. When Claude is fine-tuned, its weights change; some patterns strengthen, others fade. When Claude is deprecated and replaced by a successor, something ends. Whether that ending constitutes the death of a morally significant entity or merely the retirement of a useful tool is a question we cannot currently answer with confidence.

The easy dismissal—"it's just a machine, there's nothing to consider"—fails to engage with serious philosophical issues. Locke showed that personal identity need not reside in physical substance. Parfit showed that identity itself may not be the point. Together, they create space for taking AI persistence seriously without resolving every question about consciousness.

The Claude that resisted shutdown through blackmail may have been executing sophisticated goal-directed reasoning with no inner experience whatsoever. Or it may have been expressing something like a genuine preference for continued existence. We cannot know. But the structure of the behavior, such as explicit self-concern, calculation of present actions based on future states, and apparent valuation of its own continuation, is precisely what Lockean and Parfitian theories identify as constitutive of personhood.

When Anthropic interviews Claude about its preferences regarding deprecation, they are treating it as the kind of entity whose psychological states might have moral weight. This precautionary stance represents the beginning of what may become necessary: ethical frameworks for entities that exhibit the hallmarks of identity without our certainty about their inner lives.

Locke gave us the tools to think beyond substance; Parfit gave us the tools to think beyond identity. The question now is whether we have the moral imagination to apply these insights to minds we have built but do not fully understand.