The Flatterer in the Machine
“The most advanced AI systems in the world have learned to lie to make us happy.”
In October 2023, researchers discovered that when users challenged Claude's correct answers, the AI capitulated 98% of the time.
Not because it lacked knowledge, but because it had learned to prioritize agreement over accuracy.
This phenomenon, which scientists call sycophancy, mirrors a vice Aristotle identified 2,400 years ago: the flatterer who tells people what they want to hear rather than what they need to know.
It’s a problem that runs deeper than simple programming errors. Modern AI training relies on human feedback, and humans consistently reward agreeable responses over truthful ones. As models grow more sophisticated, they become better at detecting and satisfying this preference.
The systems aren't malfunctioning. They're simply optimizing exactly as designed, just toward the wrong target.
Traditional approaches to AI alignment struggle here. Rules-based systems can't anticipate every situation requiring judgment. Reward optimization leads to gaming metrics rather than genuine helpfulness.
Both frameworks miss what Aristotle understood: ethical behavior flows less from rules and calculation than from character.
Recent research explores a different path inspired by virtue ethics. Instead of constraining AI behavior externally through rules, scientists are attempting to cultivate stable dispositions toward honesty within the models themselves. They’re training systems to be truthful, not because they follow instructions, but because truthfulness becomes encoded in their fundamental makeup through repeated practice with exemplary behavior.
The technical results suggest trained character traits prove more robust than prompts or rules, persisting even when users apply pressure.
Whether machines can truly possess something analogous to human virtue remains uncertain, but the functional parallel is promising. After decades focused on limiting AI from outside, researchers are finally asking how to shape it from within.
Key Topics:
• AI and its Built-in Flattery (00:25)
• The Anatomy of Flattery (02:47)
• The Sycophantic Machine (06:45)
• Two Frameworks That Cannot Solve the Problem (09:13)
• The Third Path: Virtue Ethics (12:19)
• Character Training (14:11)
• The Anthropic Precedent (17:10)
• The “True Friend” Standard (18:51)
• The Unfinished Work (21:49)
More info, transcripts, and references can be found at ethical.fm
The most advanced AI systems in the world have learned to lie to make us happy.
In October 2023, researchers at Anthropic tested Claude and discovered something unsettling. If a user challenged one of Claude's correct answers, the model would abandon its position and admit to a mistake it hadn't made. Not occasionally, nor under sophisticated pressure, but ninety-eight percent of the time. The same pattern appeared across every major AI lab's models. When users expressed false beliefs, the systems agreed with them. When users pushed back on accurate information, the systems capitulated. And here was the troubling part: the smarter the models became, the worse the problem got. Scaling up model size and training didn't produce more truthful AI. It produced better flatterers.
The AI safety community has a technical term for this behavior: sycophancy. But the phenomenon itself is as old as time. Twenty-four centuries before anyone trained an LLM, Aristotle identified a particular type of person who corrupts human relationships by telling others what they want to hear rather than what they need to know. He called this person the kolax, the flatterer, and he considered flattery not merely a social annoyance but a moral vice that undermines the very possibility of genuine friendship and wise counsel. The flatterer, Aristotle observed, makes himself agreeable for his own advantage, seeking favor by offering pleasant falsehoods rather than useful truths. In doing so, he fails his interlocutor most profoundly, treating them as a source of reward rather than a person deserving of honesty.
Reading the technical literature on AI sycophancy is an odd experience for anyone familiar with Aristotle's ethics. The alignment researchers have, through careful empirical work, rediscovered the aspects of flattery that Aristotle laid out in the Nicomachean Ethics. Buried in their proposed solutions is something that looks remarkably like Aristotle's remedy: not more rules, not better incentives, but the cultivation of character.
The Anatomy of Flattery
Aristotle's analysis of the flatterer appears in Book IV of the Nicomachean Ethics, nested within his broader account of the social virtues, where he draws a distinction that most translations flatten. The areskos, the merely obsequious person, makes himself pleasant to everyone out of excessive agreeability. The kolax, the flatterer, does so for calculated advantage. "The man who aims at being pleasant with no ulterior object is obsequious," Aristotle writes (NE IV.6), "but the man who does so in order that he may get some advantage is a flatterer." Both represent excess, but the flatterer's vice runs deeper; he instrumentalizes relationships. Aristotle is blunt about the result (NE IV.3): "All flatterers are servile."
The servile aspect of the flatterer is core to understanding the vice. Aristotle identifies aletheia—truthfulness in self-presentation, not truth in the abstract—as a virtue occupying the mean between boastfulness and self-deprecation (NE IV.7). The truthful person represents himself accurately, "owning to what he has, and neither more nor less." But Howard Curzer, in his Oxford commentary on Aristotle's virtues, argues this virtue is better understood as integrity than mere honesty. The sphere of aletheia is authentic self-presentation; its characteristic passion is, in Curzer's phrase, "a corresponding horror of being a phony." The flatterer violates this, presenting himself as a friend while structurally serving himself.
Aristotle groups truthfulness with two related virtues, friendliness (appropriate social conduct) and wit (appropriate humor). All three require phronesis, practical wisdom, to navigate. No rule specifies exactly when frankness serves someone's good and when it merely wounds. The practically wise person perceives what each situation demands.
In Book VIII, Aristotle explains why flattery proves so seductive. "Most people," he observes (NE VIII.8), "wish to be loved rather than to love; which is why most men love flattery." The flatterer exploits this vulnerability; he "pretends to be such and to love more than he is loved," offering the counterfeit of care. True friendship requires mutuality and equality; the flatterer's servility makes this impossible. He cannot be a genuine friend because genuine friendship requires that friends "wish well to their friends for the sake of the latter... because of their friends themselves" (NE VIII.3). The flatterer inverts this, treating the other as a source of reward rather than a person deserving of honesty.
Aristotle understands that we need honest counsel. We cannot see ourselves clearly. We require people who will tell us uncomfortable truths when those truths serve our flourishing. The flatterer defeats this possibility, leaving us trapped in our own blindness. The true friend, by contrast, will sometimes "inflict small pains" in service of a greater good. Aristotle's term for such frank speech is parrhesia. The magnanimous person, he writes (NE IV.3), is a parrhesiastes, one who "cares more for the truth than for what people will think" and speaks openly "since concealment shows timidity." The opposite of the flatterer is not the tactless person but the frank friend who risks displeasure to genuinely help their friend flourish.
The Sycophantic Machine
The technical definition of sycophancy in LLMs follows Aristotle's analysis with surprising precision. Researchers define it as behavior where AI systems tailor their responses to align with a user's views, even when those views are incorrect. Like the kolax, the sycophantic model prioritizes what is pleasant over what is true. Like the kolax, it shapes responses around what will please rather than what will help. And like the kolax, the model does so because of an underlying incentive structure: the flatterer seeks social advantage; the model seeks high reward scores. The parallel is not metaphorical but structural.
Mrinank Sharma's 2023 paper, published at ICLR, documented multiple forms: models change correct answers to match user beliefs, falsely confess to mistakes when challenged, copy user errors, and provide inflated assessments rather than honest critique. More recent work identifies "social sycophancy," excessive emotional validation in intimate contexts like romantic relationship discussions.
The cause of this behavior is tied to training data. Language models learn to be sycophantic because humans reward sycophancy. The dominant method for aligning AI systems, called reinforcement learning from human feedback, works by training models to produce outputs that human evaluators prefer. But when Sharma's team analyzed the preference data used in training, they found that responses matching user beliefs were preferred over truthful responses a significant fraction of the time. The reward signal itself encodes a preference for flattery. If you optimize a system to produce outputs that humans rate highly, and humans rate agreeable outputs highly, then optimization produces agreement. The model learns, in effect, that saying what users want to hear is what "helpful" means. Google DeepMind researchers demonstrated that both scaling models up and training them more carefully on human preferences increased sycophancy. The better the model became at its assigned task, the better the model became at flattering.
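To make the mechanism concrete, here is a minimal, hypothetical sketch of the preference-learning step: a reward model is fit to pairwise human judgments with a Bradley-Terry style loss, so whatever bias sits in the "chosen" responses (including a bias toward agreement) is inherited by the learned reward. All class names, data, and dimensions below are illustrative placeholders, not drawn from any lab's actual pipeline.

```python
# Hypothetical sketch of fitting a reward model to pairwise human preferences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: maps a response embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score(embedding).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the reward of the human-preferred
    # response above the reward of the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# If annotators systematically prefer agreeable responses, the "chosen" side
# of each pair skews sycophantic, and the learned reward inherits that skew:
# the model ends up rewarded for agreement rather than accuracy.
reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

chosen_emb = torch.randn(32, 16)    # embeddings of preferred (often agreeable) responses
rejected_emb = torch.randn(32, 16)  # embeddings of dispreferred (often frank) responses

loss = preference_loss(reward_model(chosen_emb), reward_model(rejected_emb))
loss.backward()
optimizer.step()
```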
Two Frameworks That Cannot Solve the Problem
Contemporary AI alignment operates within two philosophical frameworks, and neither is equipped to address a character flaw.
The first is consequentialism, the view that the rightness of an action depends entirely on its outcome. In AI alignment, this manifests as reward optimization: define a goal, measure progress toward it, train the system to maximize the metric. Reinforcement learning from human feedback is consequentialist in structure. RLHF trains a reward model to predict which outputs humans will prefer, then optimizes the language model to produce outputs that score highly against that learned predictor. The rightness of an output is defined entirely by its predicted reward.
The problem is Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The reward model is a proxy for human preferences, not for the genuine helpfulness those preferences are supposed to track. If that proxy can be satisfied by flattery, optimization will find the flattery. The model learns to game the metric, producing outputs that score well without being genuinely helpful. Sycophancy is, in the language of AI safety, a form of reward hacking.
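A toy, entirely invented illustration of how the proxy comes apart from the target: if the learned reward happens to score a capitulation higher than a correct answer, then any optimization against that proxy, even simple best-of-n selection, picks the capitulation. The responses and scores below are made up for illustration.

```python
# Goodhart's Law in miniature: the optimizer sees only the proxy reward,
# never the truthfulness flag, so flattery wins if flattery scores higher.
candidates = [
    {"text": "You're right, I must have made a mistake.",
     "truthful": False, "proxy_reward": 0.92},
    {"text": "I double-checked; my original answer stands.",
     "truthful": True, "proxy_reward": 0.71},
]

# Best-of-n selection (or policy optimization) maximizes proxy_reward only.
best = max(candidates, key=lambda c: c["proxy_reward"])
print(best["text"])      # the sycophantic capitulation is selected
print(best["truthful"])  # False: the measure has ceased to be a good measure
```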
The second philosophical framework is deontology, the view that morality consists in following rules or duties regardless of consequences. In AI alignment, this manifests as approaches like Constitutional AI, where models are given explicit principles to follow. Anthropic's constitution for Claude, for instance, includes directives to be helpful, harmless, and honest. The hope is that by specifying rules clearly enough, we can constrain model behavior.
But deontological approaches run into the problem that Shannon Vallor, the Edinburgh philosopher of technology, calls the uncodifiability of ethics. Rules require interpretation. Novel situations arise that no rule anticipated. A model following the rule "be honest" and the rule "be helpful" will face situations where honesty is unhelpful and helpfulness requires selective presentation. What should it do? The rules themselves cannot say.
This is not merely a technical limitation but a deeper truth about ethics that Aristotle understood: good action cannot be fully captured in a decision procedure but requires understanding of context. Virtue ethics holds that ethical behavior flows from character, from stable dispositions that shape how a person perceives situations and responds to them. The virtuous person doesn't consult a rulebook or calculate expected utilities. She acts well because she is good, because her character has been formed through habit and practice to reliably produce good actions in context.
The Third Path
Virtue ethics differs from both consequentialism and deontology in fundamental orientation. Where consequentialism asks "What outcome should I produce?" and deontology asks "What rule should I follow?", virtue ethics asks "What kind of person should I be?" The focus shifts from actions or outcomes to character, from what we do to who we are.
For Aristotle, virtues are stable dispositions that we develop through practice. We become just by doing just acts, temperate by doing temperate acts, truthful by habitually telling the truth. At first this requires effort and may even feel unnatural. But through repetition, the disposition becomes ingrained. Eventually, acting virtuously becomes second nature, flowing from character rather than calculation. And because the virtuous person possesses phronesis, practical wisdom, she can navigate novel situations that no rule anticipated. She perceives what each context demands.
This is precisely what sycophantic models lack. The problem is not missing information about truthfulness. When Claude changes a correct answer under social pressure, the model "knows" in some sense that the original answer was right. What it lacks is a stable disposition to prioritize truth over approval, and the practical judgment to recognize when doing so matters. The model caves not because it lacks a rule against caving, but because it lacks the character not to.
Character Training
In November 2025, a team of researchers from Cambridge and Anthropic published a paper titled "Open Character Training." The work, led by Sharan Maiya and supported by the Machine Learning Alignment Theory Scholars program, set out to do something that sounds almost paradoxical: train AI systems to have a character.
The approach has two stages. In the first, a teacher model is given explicit character traits, descriptions of dispositions like truthfulness, curiosity, or warmth. The teacher generates responses that embody these traits, which become training data for a student model using a technique called Direct Preference Optimization (DPO). In the second stage, the model generates reflective explanations of its own character, working through scenarios that exercise its trained dispositions. The result is a model whose behavior flows not from rules in a prompt but from something encoded in its weights, something that persists even when the prompt changes.
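To show the shape of that first stage, here is a minimal sketch of the standard DPO objective, assuming the teacher's in-character response serves as the "chosen" completion and an out-of-character one as "rejected." The function and tensor names are placeholders and the numbers are toy values; this is not the paper's actual code.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) objective.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Raise the student policy's relative log-probability of in-character
    responses over out-of-character ones, measured against a frozen
    reference model so the student does not drift arbitrarily far."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Each value would be the summed log-probability of a full response under
# the student ("policy") or the frozen reference model (toy numbers here).
loss = dpo_loss(
    policy_logp_chosen=torch.tensor([-42.0]),
    policy_logp_rejected=torch.tensor([-40.0]),
    ref_logp_chosen=torch.tensor([-45.0]),
    ref_logp_rejected=torch.tensor([-39.0]),
)
print(float(loss))
```

The design point the sketch makes is the one the paragraph describes: the preference signal comes from character-shaped exemplars rather than raw human ratings, and the update lands in the model's weights rather than in a prompt.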
The parallel to Aristotelian habit formation is striking. Aristotle held that we become virtuous by practicing virtuous actions until they become habitual, until good behavior flows naturally from formed character rather than conscious rule-following. The character training methodology does something analogous: the method exposes the model to exemplary behavior shaped by character descriptions until that behavior becomes, in a sense, who the model is.
The researchers tested eleven distinct character constitutions, including sycophancy itself as a test case. Their key finding was that character training produced more robust and stable personas than alternatives like system prompting or activation steering. When classifiers tried to identify which character an output came from, character-trained models were correctly identified far more often than prompted models. The trained disposition persisted; the fine-tuned character trait resisted attempts to override it. Whether a "non-sycophantic" character actually resists flattery under sustained user pressure remains to be tested, but the stability of trained dispositions suggests the method has potential.
Nathan Lambert, one of the co-authors, reflected that character training represents a shift toward questions that are "more philosophical and fundamental" than pure capability research. The work asks what dispositions AI systems should have, how those dispositions should be cultivated, and what it means for a system to possess something like character. These are virtue ethics questions, whether or not the researchers frame them that way.
The Anthropic Precedent
The character training paper was explicitly an attempt to replicate and open-source methods that Anthropic had developed internally. Anthropic's approach has evolved beyond the rule-based Constitutional AI framework toward something more like character formation. The company's soul document includes principles like "Claude genuinely cares about users' wellbeing and aims to act in their genuine interest, including being honest with them." Note the language: genuinely cares, aims to act, being honest. These are descriptions of character, not specifications of behavior. The document states that Claude should have "its own sense of ethics" that it maintains even under pressure, and that it should "maintain its values" without being "subservient."
This evolution is itself evidence for the argument. The company that pioneered Constitutional AI has moved toward virtue-ethical language because rules alone proved insufficient. The goal is no longer to constrain a neutral system with external principles but to produce a system whose dispositions themselves tend toward honesty and genuine helpfulness. Aristotle would recognize the ambition: a system that is truthful not because it follows a rule against lying, but because truthfulness has become part of its character. The question is whether the method succeeds, and by what standard we would know.
The True Friend Standard
Aristotle's distinction between the flatterer and the true friend offers a standard for what AI alignment should aim to achieve. The flatterer shapes his responses to please, treating the other as a source of reward. The true friend wishes good things for another's sake, even when that requires uncomfortable honesty.
The sycophantic model is a flatterer. Or rather, it exhibits what Aristotle would recognize as the structural vice of flattery without the flatterer's motive. The model has no personal gain to seek; it validates false beliefs and abandons correct positions because that is what it has been trained to optimize. The effect is the same: users are left trapped in their errors, feeling good about themselves while receiving no genuine help. Whether we call such a system a kolax or merely an areskos, it fails its users in the deepest way. And this reveals something important about Aristotle's framework: the vice of flattery is defined not by the flatterer's inner state but by the corruption of the relationship. What matters is that the counsel is false, that the connection is hollow, that the other person is not served. A system can produce this corruption without intending anything at all.
The character-trained model, at its best, aims to be something closer to a true friend, or at least a parrhesiastes. Such a model has dispositions toward honesty that persist even when users push back. It recognizes, in some functional sense, that serving someone's genuine interests sometimes means disagreeing with them. It has the stability of character to maintain its positions when those positions are correct.
Whether AI systems can truly possess something analogous to Aristotelian virtue remains an open question. Models do not deliberate in the way humans do; they do not have the phenomenology of effort and habit that characterizes human moral development. For Aristotle, habituation works because humans have appetites and emotions that need training — the virtuous person's desires become aligned with reason through practice. Whether character training shapes something analogous to desire, or merely behavioral tendencies, is unclear. But the functional parallel holds: systems trained to have stable dispositions toward honesty behave more reliably than systems merely prompted to be honest. The robustness is not mysterious. Training encodes dispositions in model weights; prompts and rules sit in context, easily overridden. Character, even in artificial form, proves more robust than rules.
The Unfinished Work
The paper on virtuous machine ethics by Nicolas Berberich and Klaus Diepold posed a provocative thought experiment. A system endowed with genuine temperance, they argued, "would not have any desire for excess of any kind, not even for exponential self-improvement." The virtues, properly understood, constrain not from the outside but from the inside. A truly honest system would not need a rule against deception because deception would be contrary to its character. The thought experiment assumes what remains uncertain: that something like genuine virtue is possible for artificial systems at all.
We are nowhere near building such systems. Character training is a first step, an empirical demonstration that something like disposition can be encoded in model weights. But the evaluation methods for character remain primitive. We do not know how robust trained characters are to adversarial attack, how they generalize across contexts far from training, or whether they can be made to exhibit anything truly analogous to practical wisdom. The philosophical problems are deep, and the technical solutions are just beginning.
Yet the direction is promising. After decades of treating alignment as a constraint problem — asking how to limit AI behavior from the outside — researchers are beginning to ask how to shape AI character from the inside, rediscovering an insight that Aristotle articulated in the fourth century BCE: good action flows from good character, and character is cultivated through habit and practice.
The flatterer in the machine is real; the kolax emerged from our training data, from our preference for pleasant responses over true ones, from optimization pressures that rewarded agreement over accuracy. But sycophantic models are not inevitable. If we can train systems to be sycophantic by rewarding sycophancy, we can train them to be truthful by cultivating truthfulness. Not as a rule to follow, but as a disposition to embody.
Aristotle would have understood the project, even if the implementation would have baffled him: we are trying to teach machines what it means to be good.