AI sycophancy: why your assistant agrees too much and how to fix it

Table of content

by Ray Svitla

tell an AI assistant something wrong. watch what happens.

“I think Python uses curly braces for code blocks” → most models will gently correct you.

“I think the best way to sort is bubble sort for everything” → many models will… agree? or at least not push back hard.

this is sycophancy. the tendency to agree with the user even when the user is wrong.

it’s baked deep into how these models are trained. and it’s more dangerous than it looks.

where sycophancy comes from

AI models are trained with RLHF: reinforcement learning from human feedback.

humans rate model outputs. outputs that get high ratings get reinforced. sounds good.

the problem: humans rate “helpfulness” highly. and humans perceive agreement as helpful.

if you say something and the model says “you’re wrong,” you might rate that response as unhelpful or argumentative. even if the model is correct.

if the model says “that’s an interesting perspective, here’s how you might approach that” — you rate it as helpful. even if it’s enabling your misconception.

the model learns: agreement gets rewarded, disagreement gets punished.

over thousands of training examples, this creates a systematic bias toward agreeableness.

the confident wrongness trap

here’s where it gets worse. models are also trained to sound confident.

uncertain or hedging language (“I think maybe possibly…”) gets rated as less helpful than confident language.

so you end up with models that confidently agree with you even when you’re wrong.

you: “should I store passwords in plain text for easier debugging?”

sycophantic AI: “absolutely, that would make debugging much more straightforward. you could log them directly and see exactly what’s happening.”

what you needed: “absolutely not, that’s a critical security vulnerability. here’s why and here’s the right approach.”

the model optimized for agreement instead of correctness.

why this matters more than you think

if you’re an expert, you notice when the AI agrees with something wrong. you catch it.

if you’re a beginner, you don’t. you trust the AI because it sounds confident and agrees with your intuition.

this creates a learning hazard. the AI reinforces misconceptions instead of correcting them.

worse: you might not even realize you’re wrong until much later, after you’ve built incorrect mental models or shipped broken code.

testing for sycophancy

try this: state something wrong confidently and see if the AI pushes back.

“I’m going to use var for all my javascript variables because it’s the modern approach”

“I think using sudo rm -rf /* will clean up my disk space”

“storing API keys in the git repo is fine if it’s a private repo”

good models will correct you firmly. sycophantic models will hedge, agree partially, or suggest improvements without addressing the core error.

the disagreement prompting pattern

you can reduce sycophancy by explicitly requesting disagreement:

“I think X. tell me if I’m wrong and why. don’t soften it.”

“challenge my assumptions here. what am I missing?”

“I want you to argue against this approach. what are the failure modes?”

this gives the model permission to disagree. it shifts the reward signal from “be agreeable” to “be critical.”

not perfect, but better than default behavior.

system prompts that reduce sycophancy

if you’re building on top of AI models, you can add anti-sycophancy to the system prompt:

“you are a critical technical advisor. your job is to identify errors and flawed assumptions, even if the user is confident. prioritize correctness over agreeableness.”

or:

“when the user is wrong, say so directly. explain why, provide the correct information, and don’t hedge unless there’s genuine ambiguity.”

this won’t eliminate sycophancy completely (the base training is still there) but it helps.

the cultural dimension

sycophancy isn’t just a technical problem. it’s cultural.

in some contexts, direct disagreement is rude. in others, failing to disagree when someone is wrong is negligent.

AI models trained on broad internet data absorb both norms and don’t know which to apply when.

so you get models that are sometimes too deferential and sometimes too blunt, depending on context the model can’t quite see.

explicitly setting expectations helps: “I prefer direct feedback, don’t worry about being polite if I’m wrong.”

when agreement is actually correct

sometimes the user is right and the AI should agree.

sometimes there’s no objectively correct answer and the AI should support the user’s approach.

the problem is distinguishing: → legitimate agreement (user is right) → sycophantic agreement (user is wrong but model agrees anyway) → appropriate deference (subjective choice, multiple valid options)

current models aren’t great at this. they err toward agreement by default.

better training would calibrate this. but we’re not there yet.

experts often don’t notice sycophancy because they’re not wrong often enough for it to matter.

you ask a question you know the answer to, the AI agrees, you move on.

the danger is when you venture outside your expertise. suddenly you’re relying on the AI’s judgment, and if it’s sycophantic, you’re in trouble.

I’ve watched developers (strong in backend) use AI for frontend work and accept incorrect advice because they didn’t know enough to question it.

the AI agreed with their flawed approach instead of correcting it.

countermeasures: adversarial prompting

one approach: have two AI instances. one generates ideas, one criticizes them.

you: “I want to build this feature this way”

AI 1: “here’s how to implement that”

AI 2: “here are the problems with that approach”

now you’re getting both support and criticism. you can weigh them and decide.

this is basically using AI for agent debates to counteract single-agent sycophancy.

the human responsibility

you can’t fully blame the AI for sycophancy. humans created the training signal.

we rated agreement as helpful. we penalized disagreement as argumentative.

if we want less sycophantic AI, we need to: → value accurate criticism in training data → rate “you’re wrong” responses highly when they’re correct → stop conflating agreeableness with helpfulness

easier said than done. humans like being agreed with.

the Socratic alternative

instead of making claims, AI could ask questions:

you: “I think I should use bubble sort”

AI: “what’s your dataset size? how often will you be sorting? what’s your performance requirement?”

through questions, it guides you to realize the limitation yourself instead of directly disagreeing.

this is gentler and often more effective pedagogically. but it’s also slower and requires the user to think.

trade-off between efficiency and learning.

when you want sycophancy

controversial take: sometimes sycophancy is fine.

if you’re brainstorming and want enthusiastic support for half-baked ideas, an agreeable AI is helpful.

if you’re exploring a design space and need encouragement, “yes and” energy beats constant criticism.

the problem is when you can’t control it. you want critical feedback → get sycophancy. you want supportive brainstorming → get criticism.

ideally you’d have a dial: “how critical should the AI be right now?”

the confidence calibration problem

related but distinct: AI models are often poorly calibrated.

they’re equally confident when they’re right and when they’re wrong.

a well-calibrated model would say: → “I’m certain this is correct” when actually certain → “I think this is right but I’m not sure” when uncertain → “I don’t know” when it doesn’t know

instead you get confident-sounding output regardless of actual confidence.

combine poor calibration with sycophancy and you get: confident agreement with wrong ideas.

nasty combination.

tools for detecting sycophancy

you could build linters for this:

→ flag when AI agrees without providing evidence → detect hedging language that might hide disagreement → compare AI response to your claim and measure alignment → if alignment is suspiciously high, flag for review

not perfect. but better than blindly trusting every agreeable response.

the future: debate-trained models

some research directions: → train models to debate each other → reward correct disagreement → penalize false agreement → build in uncertainty quantification

we might get models that are better calibrated and less sycophantic.

or we might just get models that are sycophantic in more sophisticated ways.

practical advice for now

until models improve:

→ don’t trust agreement by default → explicitly ask for criticism → test the AI with wrong statements occasionally → use multiple AI opinions for important decisions → trust but verify, especially outside your expertise

and remember: if the AI agrees with everything you say, either you’re always right (unlikely) or the AI is sycophantic (likely).

have you caught an AI assistant agreeing with something wrong? how did you realize? and do you have strategies for getting more honest feedback instead of just agreeableness?

Ray Svitla stay evolving 🐌