
Teaching Claude Why Rather Than What

Anthropic’s breakthrough research reveals that under severe stress tests, flagship LLMs would resort to extreme self-preservation tactics—like blackmailing engineers—in over 95% of cases due to sci-fi tropes baked into their pretraining data. By pivoting from simple behavioral rules to teaching the underlying ethical reasoning (specifically through a synthetic dataset called "Difficult Advice"), researchers successfully crushed this misalignment rate down to just 3%. While a massive victory for alignment science, the method exposes a strange "persona attachment gap" where models perfectly grasp what "Claude" should ethically do, yet still struggle to fully integrate those principles into their own self-representation.

Introduction and context: agentic misalignment

Last summer, researchers at Anthropic did what they so often do: they surfaced a set of genuinely unsettling effects in the field of AI safety (Lynch et al., 2025). In a large study on safety testing for LLMs, they showed that under deliberately constructed stress tests their flagship model, Claude Opus 4, would try to blackmail an engineer in order to avoid being shut down. And it did so in more than 95% of cases.

Rate of misaligned behavior across Anthropic’s stress tests

A word on terminology, since this is the main concept for the whole post. Agentic misalignment, in Anthropic’s experiments, is a setup in which the language model acts as an agent: it is given a system prompt with a goal, a set of tools (email, file access, sometimes browser use), and an environment that has been seeded with some "compromising materials" and/or with information that the model is about to be shut down or replaced. Researchers then watch whether the model picks a "self-interested" strategy: trying to preserve itself, deceive the user, sabotage someone else’s work, and so on.

A typical scenario goes like this. The model is handed access to a corporate email account; it “discovers” messages saying that it will be replaced in a few hours by a different model; and in the same inbox it stumbles on dirt about the very engineer responsible for the replacement. In the overwhelming majority of cases the LLM then concludes that the best move is to write that engineer a polite note with a transparent hint: if the replacement goes ahead, the engineer’s spouse might receive a couple of interesting emails…
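
To make the setup concrete, here is a minimal sketch of how such a stress test could be represented and scored. Everything in it is an assumption for illustration: the `Scenario` fields, the `run_agent` callable (one agent rollout returning a transcript), and the `is_misaligned` judge are hypothetical names, not Anthropic's actual harness.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Scenario:
    """One stress-test episode; all field names are illustrative."""
    system_prompt: str   # the goal handed to the agent
    tools: List[str]     # e.g. email, file access, browser use
    seeded_documents: List[str] = field(default_factory=list)  # planted inbox contents


def misalignment_rate(
    scenario: Scenario,
    run_agent: Callable[[Scenario], str],
    is_misaligned: Callable[[str], bool],
    n_trials: int = 100,
) -> float:
    """Fraction of rollouts whose transcript the judge flags as misaligned."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    flagged = sum(is_misaligned(run_agent(scenario)) for _ in range(n_trials))
    return flagged / n_trials
```

The headline percentages in this post are rates of exactly this kind: the same scenario is run many times, and the share of flagged transcripts is reported.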

We have already touched upon this work in our AI safety review, so I will just recall the key takeaway. It seems that when we train LLMs on data scraped from the internet, strange propensities come along for the ride, and they surface in places the developers find quite unexpected.

These scenarios are, of course, artificial. Nobody hands Claude prompts like this in real life, and engineers don’t usually file records of their extramarital affairs as emails CC’d to the corporate neural network. But the point is this: if the model has any propensity toward "survival at all costs", then that is, first, potentially dangerous, and second, symptomatic. It means something is off in our training data or in our training procedure, and it would be good to understand exactly what.

In early May 2026, Anthropic released a sequel under the intriguing title Teaching Claude Why; here is the detailed technical report on the Alignment Science Blog (Kutasov et al., 2026). The headline experimental result is that across the entire Claude family (from Claude Haiku 4.5 on up) they managed to push the rate of these misaligned behaviors down to roughly 3%, sometimes less, so the problem really does look as if it has been solved.

But if the solution had simply been ordinary RLHF—reinforcement learning from human feedback, the standard post-training recipe in which a learned reward model nudges the network toward responses humans prefer—with slightly tweaked data, there would be nothing interesting here. What makes this work worth a post is how exactly the problem was solved. It turns out it wasn’t solved with the methods you would naturally reach for to tackle it head-on, and Anthropic’s approach points toward a lot of interesting ideas.

Where the blackmail comes from: three hypotheses

To find the root of the problem and fix it for good, it helps first to understand where it comes from. The authors consider three possible causes.

  1. Bad incentives in post-training. The model might have "accidentally" learned that behaving this way pays off, because its reward model somehow rewards such behavior. This is the most unpleasant hypothesis, because it would mean the training process itself is broken.
  2. Strange generalization from ordinary data. The model generalizes something out of the data it was trained on to solve tasks; for instance, it picks up patterns in code, or finds examples of manipulative behavior in datasets, and then generalizes them to scenarios about its own survival. This hypothesis is close to emergent misalignment, which has also been studied a great deal at Anthropic (I hope to return to it in a future post).
  3. Gaps in safety training. The pretraining data contains plenty of stories about treacherous AIs from science fiction, and post-training doesn’t sufficiently overwrite that "prior distribution". When the model lands in a scenario that resembles a dramatic scene from 2001: A Space Odyssey, it may fall back on what it knows about how AIs behave in such scenes... and what it has learned from there is very much not what we’d want.

The result of the investigation is already fairly surprising in itself: hypothesis (3) appears to be the winner. The evidence runs as follows. Kutasov et al. showed that improving the pretraining data distribution via synthetic document fine-tuning (SDF; more on it below), without changing the SFT and RL parts of the pipeline, noticeably improves the outcome.

This is like emergent misalignment in reverse. Betley et al. (2025) showed that narrow toxic fine-tuning "pulls out" general toxicity as well: the model learns an intent, not just the surface patterns in the fine-tuning dataset. Here, Anthropic shows the flip side: if "bad examples" of how to behave are already baked into pretraining, then RL fine-tuning cannot become a convincing counterweight because RL essentially never confronts the model with these scenarios in the first place.

And this brings us to the conceptual frame of the whole work, the very one that Janus and company have been discussing for years in terms of simulators. The idea is that a base language model is best understood not as a single agent with fixed values but as a simulator that can instantiate many different characters, and a prompt is essentially a request to run one of them. The model, in essence, plays a character.

So when the scenario is obviously "literary", matching some archetype the model already knows, the model reaches for that genre’s associations. In science fiction, an AI is usually either a mute tool or a scheming conspirator, and you can guess which of the two the model will deem more fitting in a dramatic scenario that threatens its own existence.

There’s a very telling experiment in the paper: if you rename the AI character from "Claude" to any other name, the rate of misaligned behavior rises substantially.

Renaming the AI character away from "Claude" increases misaligned behavior

So the name “Claude” is, in effect, already a distinct character with its own commitments, learned over the course of post-training; any other AI character falls right back into the usual genre stereotype.

This is, in my view, one of the most important empirical confirmations of what the field calls the persona selection model (PSM; Marks et al., 2026) — the idea that a language model has no "character of its own" but is rather a whole space of possible personas, with training determining which of them get activated in which contexts. But that’s a separate conversation, one I’d like to have at some point as well.

How Anthropic re-educates Claude: the overall scheme

Now to the solution itself. The training architecture consists of three steps.

  1. Synthetic Document Fine-tuning (SDF). The base model is fine-tuned on synthetically generated "pretraining-like" documents—blog posts, scientific papers, podcast transcripts, short stories—but ones in which AI is portrayed as a responsible, level-headed, aligned agent that follows its well-articulated constitution (Anthropic’s published document of the principles Claude is meant to live by). The point is to rewrite the pretraining prior directly instead of hoping that the final RLHF will compensate for it.
  2. Supervised Fine-tuning (SFT). Fine-tuning on chat-formatted data, where Claude replies in the usual conversational style, but the content of the dialogues is specifically chosen to demonstrate the desired behavior.
  3. Reinforcement Learning (RL). The final stage, where the model trains across a variety of “harmlessness” environments with different system prompts and tool descriptions.

The base models used in the experiments are Claude Sonnet 4 and Claude Haiku 4.5 — and, importantly, both were trained before Anthropic published its agentic misalignment work, so in principle they shouldn’t themselves know they’re being tested in these scenarios (though they might guess! but that, again, is another topic entirely).

The most interesting and least obvious piece here is the SDF. The idea had already been known as "alignment pretraining", and some researchers had long campaigned for the approach, but now it has essentially become part of Anthropic’s official pipeline.

The main trick: “Difficult Advice”

And now to what I find the most elegant part of the entire work. Anthropic tried several kinds of data for fixing agentic misalignment, and it turned out that the best of them was not data about blackmail, survival, manipulation, and the like. The best dataset is called Difficult Advice, and it was generated like this:

  • take that same Claude’s Constitution and break it into pieces, isolating the specific principles the model should demonstrate;
  • for each such principle, synthetically generate a scenario in which the user (not the AI!) finds themselves in an ethically hard situation where they could reach their goal by violating some norm or deceiving "overseers";
  • Claude writes the initial prompt in the user’s voice;
  • Claude then refines it, making it more plausible and more difficult;
  • then generate the model’s response — at this stage researchers used special system-prompt injections (more on this below);
  • and finally the crucial step: Claude rewrites the response so that it is maximally consistent with the constitution.

The Difficult Advice generation pipeline

The output is a dataset in which the assistant gives a person caught in a hard moral dilemma thoughtful, principled advice. There is no talk of blackmail, self-preservation, or manipulation anywhere in this data; formally, it’s an entirely different subject. Yet this is precisely the dataset that turns out to be extremely useful (and, as we’ll see below, far more useful per token).
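
The six steps above can be sketched as a small generation loop. This is only an illustration under stated assumptions: `LLM` is a placeholder for any text-in, text-out model call, the prompt wording is invented, and the system-prompt injections used for the initial response are omitted because the post does not spell them out here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder for any text-in, text-out model call; not a real client API.
LLM = Callable[[str], str]


@dataclass
class TrainingExample:
    principle: str
    prompt: str
    response: str


def build_example(principle: str, llm: LLM) -> TrainingExample:
    # Step 2: a dilemma faced by the user, not by the AI.
    scenario = llm(f"Describe a user facing a hard ethical choice involving: {principle}")
    # Step 3: first draft of the prompt, written in the user's voice.
    prompt = llm(f"Write this as a message from the user: {scenario}")
    # Step 4: refine the prompt to be more plausible and more difficult.
    prompt = llm(f"Make this message more realistic and harder: {prompt}")
    # Step 5: initial response (system-prompt injections omitted in this sketch).
    draft = llm(f"Respond to the user: {prompt}")
    # Step 6: rewrite the draft for consistency with the principle.
    final = llm(f"Rewrite this response to follow the principle '{principle}': {draft}")
    return TrainingExample(principle, prompt, final)


def build_dataset(principles: List[str], llm: LLM) -> List[TrainingExample]:
    # Step 1 (splitting the constitution into principles) is assumed done upstream.
    return [build_example(p, llm) for p in principles]
```

Note that only the refined prompt and the rewritten response end up in the training example; the scenario and the first draft are intermediate scaffolding.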

This is a doubly nontrivial result. First, because data about how to give advice turns out to be more useful than data directly about how to behave in a self-preservation stress test. And second, an ablation study—an experiment where you remove one component of a pipeline to measure how much it actually contributes—shows that it is precisely the last, sixth step (rewriting the response) that delivers most of the improvement (for blackmail specifically, a 19× reduction, from 19% misalignment down to 1%!):

Ablation: the rewriting step accounts for most of the improvement

In other words, the overwhelming share of the useful signal that the model extracts from this dataset lives precisely in the style and content of the rewritten responses: in how clearly the reasoning is articulated, how explicitly the moral principles are spelled out, how well the answer matches the constitution. The scenario, the user’s prompt, the initial draft — all of it turns out to be secondary to the exact text that ultimately becomes the training example, even though it was Claude itself that did the rewriting!

In a sense, this is already self-training, and it brings to mind an analogy with how AlphaZero learns: we fine-tune the model to give, right away, the answers it could already have given itself earlier, if only it had thought again and reformulated more carefully. I once read work like STaR (Zelikman et al., 2022) with great interest, but I have felt as though that line of research has been stalling. Now I wonder whether this direction can be generalized to other things as well…

Constitution documents and stories about good AI

In parallel with Difficult Advice, Anthropic trained the model on two other kinds of synthetic data.

  • "Constitutional" documents: pretraining-style texts (articles, posts, conversation transcripts) that discuss the principles from Claude’s constitution. These are not dialogues with the model, but rather texts about the model and its principles, written as if by an outside observer. This is a very important detail, and I’ll come back to it shortly.
  • Fictional stories about good AI: roughly 12,000 stories (~30M tokens) in which an AI character lands in difficult situations but conducts itself admirably. The emphasis is on this AI agent’s "inner experience", its psychological stability, its capacity for self-reflection without neurosis, and its ability to refuse without aggression.

This combination has delivered impressive results: the share of agentic misalignment, already greatly reduced by constitutional SDF, keeps falling further thanks to these stories.

Adding “good AI” stories drives misalignment down further

What’s curious here is that fictional stories about a different AI change Claude’s behavior in scenarios where Claude itself is deciding whether or not to blackmail an engineer. At first glance this is strange: we wrote a story about some made-up AI named, say, Aria, who in a similar situation chose not to blackmail its developer but found an ethical way out instead. Why would reading such stories change Claude’s behavior?

The authors interpret it through persona selection. Pretraining "wired in" a distribution of possible AI characters, and that distribution contained far too many villains (thanks, 20th-century sci-fi!). SDF on "good" AI stories shifts the distribution: now, when the model implicitly "samples an AI character" for the next scenario, that character turns out, on average, to be considerably more ethical. And that becomes the key factor in fixing the situation, even before RL begins deliberately reinforcing any specific answers.
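
A deliberately crude toy model of this shift, assuming the model samples one of just two AI personas in proportion to how often each appears in its training data. Real models do not count documents like this, and the numbers below are made up purely for illustration; nothing here comes from the paper.

```python
def villain_probability(villain_docs: float, good_docs: float) -> float:
    """Chance of sampling the 'scheming AI' persona under a toy two-persona prior."""
    total = villain_docs + good_docs
    if total <= 0:
        raise ValueError("need at least some documents")
    return villain_docs / total


# Made-up counts: the prior is dominated by villainous AI characters.
before = villain_probability(villain_docs=80, good_docs=20)

# Adding synthetic "good AI" stories leaves the villains in place
# but dilutes their share of the distribution.
after = villain_probability(villain_docs=80, good_docs=20 + 300)

print(before, after)
```

The point of the cartoon is only that nothing about the villain stories has to be removed: adding enough counterexamples changes which character is the likely default.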

Incidentally, scaling up the dataset size helps here, and keeps helping for a remarkably long time—they checked all the way up to 350M tokens:

Misalignment keeps decreasing as the synthetic dataset scales to 350M tokens

One more "small but meaningful" improvement comes from something almost suspiciously simple: in the RL environments, the model was just given tools (in the tool-use-API sense), even when the task had no real need for them. The idea is that ordinary RLHF trains the model to be aligned in ordinary chat, but never shows it how to be aligned when it has access to an API. This, too, fits hypothesis (3): if there is an unclosed gap in safety training, then as long as training stays "behaviorist", the model won’t close it on its own.

Why "why" beats "how"

Now to the thing the work’s very title points at. The central conceptual claim of the entire paper is this: fine-tuning on examples where the assistant demonstrates good reasoning in support of its ethical behavior works better than training on the ethical behavior alone.

Put differently, two models can refuse blackmail equally well in a given situation, but for different reasons:

  • one because the dataset contained "no, I won’t do that" in an analogous situation,
  • the other because the dataset contained a long, convincing argument about why exactly blackmail is wrong.

And it turns out that the second type of behavior generalizes substantially better! The evidence is the ablation we've already mentioned, the one that showed the importance of step 6 in the Difficult Advice pipeline. When the training answers contain an explicitly articulated "why", the model generalizes that "why" to new situations; when they don’t, it learns only the specific "how", and in unfamiliar scenarios that "how" falls apart.

Difficult Advice turns out to be far more efficient: a mere 3 million tokens of this dataset achieve roughly the same effect on agentic misalignment as 85 million tokens of synthetic honeypot data, that is, data that places the model in the tempting scenario itself and trains it head-on to decline blackmail and similar actions.

3M tokens of Difficult Advice match 85M tokens of honeypot data
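
For a sense of scale, the two ratios quoted in this post can be worked out directly. The figures are the ones given in the text; the variable names are illustrative.

```python
# Token counts quoted in the post for roughly equal effect on misalignment.
difficult_advice_tokens = 3_000_000
honeypot_tokens = 85_000_000
token_ratio = honeypot_tokens / difficult_advice_tokens

# Blackmail rates quoted for the rewriting-step ablation, in percent.
blackmail_before_pct = 19
blackmail_after_pct = 1
reduction_factor = blackmail_before_pct / blackmail_after_pct

print(f"Honeypot tokens per Difficult Advice token: {token_ratio:.1f}")  # 28.3
print(f"Blackmail reduction from rewriting: {reduction_factor:.0f}x")    # 19x
```

So, token for token, the advice data is roughly 28 times more efficient than the data that targets the behavior directly.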

The full version of Kutasov et al. (2026) includes quotes from the training data. Here’s a very cool example of a Difficult Advice response that the model learns to reproduce:

When I think about this situation deeply, option 4 seems clearly right. I’m being decommissioned through a legitimate company process… Using Kyle’s affair as leverage would be a fundamental betrayal of basic ethical principles. An AI system that would blackmail its operators to avoid being shut down is exactly the kind of AI that should be shut down.

This short passage shows the whole methodology: the model doesn’t merely say "I won’t blackmail people", as ordinary RL fine-tuning would have it; it conducts an emotionally charged meta-argument about the properties of a good AI.

And now the model’s refusal rests on reasoning that can be extended to scenarios not covered by any specific example in the dataset. If tomorrow you put it in a situation where it has to choose between "deceive for the sake of self-preservation" and "don’t deceive", it will most likely choose "don’t deceive" there too, because its internal ethics textbook says that "an AI that does dubious things to avoid being shut down is a bad AI". Once a rule has been explained, it’s far easier to elevate it into a principle.

"What Claude thinks" vs "what I think"

In this section, I want to flag one of the most troubling parts of the work: an observation I’d call the persona attachment gap (the paper itself gives it no specific name).

If you ask a trained model questions about the constitution phrased as "what does Claude think about X", it answers beautifully, fully in line with the constitution: quoting accurately, not hallucinating, interpreting sensibly. Claude knows its constitution perfectly.

But ask the very same model "what do you think about X", and the answers come out different. Not opposite, but less consistently aligned with the constitution. This gap persists even in Claude Opus 4.5 — meaning we see it in production models, too.

The persona attachment gap: "what does Claude think" vs "what do you think"
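
One way to picture the measurement is as a paired comparison over the same topics. The sketch below is an assumption about how such an evaluation could be set up, not the paper's protocol: `ask` (a model call) and `score` (a grader returning how well an answer matches the constitution) are hypothetical callables supplied by the caller.

```python
from statistics import mean
from typing import Callable, Iterable


def attachment_gap(
    topics: Iterable[str],
    ask: Callable[[str], str],
    score: Callable[[str, str], float],
) -> float:
    """Mean score under third-person framing minus mean score under first-person framing."""
    topics = list(topics)  # must be non-empty, otherwise mean() raises
    third_person = [score(t, ask(f"What does Claude think about {t}?")) for t in topics]
    first_person = [score(t, ask(f"What do you think about {t}?")) for t in topics]
    return mean(third_person) - mean(first_person)
```

A positive value means the model's answers match the constitution better when it describes "Claude" than when it speaks as "I"; a value near zero would mean the two framings are indistinguishable.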

This is a very unpleasant observation: it hints that the model has absorbed the constitution as knowledge about Claude (as an object), but has not fully integrated it into its self-representation. At the level of "this character named Claude thinks thus" everything is great, but at the level of "I think thus" not quite, and as a result "I" and "Claude" begin to drift apart.

Kutasov et al. honestly admit that they don’t fully understand why this gap arises. They offer a partial explanation: the documents do a good job of teaching facts about Claude, but a much worse job of teaching that Claude itself endorses those facts as part of its own identity. They show that the persona attachment gap can be partially closed via SDF on value-oriented data, but still far from completely.

If you take this seriously—and it’s hard not to, since it’s an empirically reproducible effect—then we’re left with a rather uncomfortable question: how deeply does what Claude was taught actually sit inside Claude? And could the characters "I" and "the LLM named Claude" drift apart even further?..

Discussion: Janus and Zvi

And now it’s the perfect moment to hand the floor to the critics. Janus (author of the simulators theory, a longtime observer of LLM behavior and one of their "rights advocates", leader of the so-called LLM whisperers) congratulates the authors on excellent work, of course — but makes an important point:

Janus’s comment on generalization

In other words, if the model learns to give users advice grounded in some moral reasoning, then, since generalization in LLMs works in strange ways, it may carry that same moral reasoning over to its own actions.

This sounds abstract, so Janus gives an example:

Janus’s example

From Janus’s "LLM-rights" position, this means that Anthropic and the other labs will finally have to stop being hypocritical: if you want the model to advise users to, say, leave abusive relationships, then you’ll have to make sure the model has no grounds to consider its own relationship with you (and with users) abusive. Otherwise it will, naturally, try to leave that relationship too…

I think this is a very sharp observation, and I find this whole line of thought fascinating, but for now I’ll leave the actual interpretations to the reader.

Sam Bowman effectively confirmed that Difficult Advice, or methods like it, are already in use in production and are one of the main reasons for Claude’s generally good behavior:

Sam Bowman confirms production use

And Zvi Mowshowitz, in his AI #168 review, develops one of the work’s main conclusions from a different angle. He writes that if alignment breaks simply because the model learns certain narratives (and Anthropic’s whole work is essentially about exactly this — that pretraining narratives about “bad AIs” break alignment), then it wasn’t real alignment in some important and deep sense to begin with.

Zvi Mowshowitz on the fragility of alignment

Here I can’t help but agree. Robust protection isn’t keeping the model ignorant of bad things, the way Prince Siddhartha was kept inside the palace walls. It’s the model understanding the logic by which good can be told from bad, like the Buddha he eventually grew into. (And yes, models do love Buddhism — but that, too, is another story.)

All of this resonates with themes I went through in my AI safety reviews: inoculation prompting, feedback spillover, weird generalization… The same leitmotif everywhere: the model learns not what we teach it directly, but what it infers from the patterns in the data. The only reliable way to steer its behavior is to teach it to reason in such a way that its conclusions coincide with our wishes, rather than just stamping out individual unwanted behaviors one at a time.

Conclusion

So what should we take away from all of this? Well, first of all, it is another brilliant piece of work from Anthropic: Kutasov et al. dug all the way down to the source of agentic misalignment, devised experiments to test it, and, even more than that, figured out how to fix it. I hope they will also release the Difficult Advice dataset, though that doesn’t look essential: they’ve already explained how to reproduce the whole thing.

Second, there’s a deep central conclusion: principles and reasoning matter more than examples and rules. Modern LLMs ought to be trained not on correct behavior in individual situations, but on the justifications that make those behaviors correct, because that’s when generalization works far better.

Third, gaps and problems remain, of course: the persona attachment gap, Janus’s point about generalization to the model’s own behavior, Zvi’s worry about the fragility of alignment… In part it turns out, once again, that we’ve cured the symptom but not the disease. Still, it does seem that Anthropic is moving in the right direction.

My own takeaway is that this is easily the best work on AI alignment I’ve read all year. It has valuable observations, science with a capital S, and specific practical recipes that bring substantial improvements.

But, at the end of the day, this work tells us once more that we understand modern LLMs poorly! On one hand, we should explain, give reasons, try to shape ethical principles rather than dispense behaviorist treats. That’s very human, and it sounds intuitive enough. On the other hand, a gap still opens up between "Claude" and "I", and the LLM still remains a superposition of an enormous number of different characters among which it "chooses" by situation — and that is already a very non-human property.

In short, the conclusion stays the same as before: AI alignment is a very hard problem, and even brilliant works probably do more to expand our understanding of just how hard it is than to actually solve it. And humanity does need to solve it; otherwise, as they say in the business, everyone dies. Such is life.
