Your LLM Reviewer Agrees With Itself. That's the Bug.

Technical Architect · 20 years in software · Writing about AI architecture, system design, and real decisions at scale

When one LLM generates content and another from the same family reviews it, they can agree completely — and both be wrong in the same direction.

That's not a capability failure. It's family bias: a structurally correlated blind spot that no amount of prompting fixes. The judge and producer share training patterns, so each finds the other's outputs "familiar" in a way that operates below the prompt layer. The 2025 research on this is now strong enough to act on.

I built a two-judge review loop — one Anthropic model, one OpenAI model — with a convergence rule that defines what "agreement" actually means. After ~75 review items and dozens of iteration loops, the pattern held: the second judge caught a class of errors the first couldn't, not because it was smarter, but because it wasn't correlated with the producer.

There's also an open-source tool so you can run the same loop against your own content.


Why single-judge fails quietly

Frontier models are competent judges. That's precisely why this failure mode is dangerous — the verdicts look authoritative.

Spiliopoulou et al.'s "Play Favorites" (Aug 2025) showed that GPT-4o and Claude 3.5 Sonnet don't just over-rate their own outputs — they over-rate outputs from their entire model lineage. Wataoka et al. (arXiv 2410.21819, rev. Jun 2025) traced the mechanism to pattern familiarity: judges find text from familiar architectures easier to process and rate it higher. Adaline's April 2026 analysis found frontier judges failing more than 50% of bias stress tests, with the practical takeaway: "Separate your judge's model family from your generator's."

The solution isn't a smarter judge. It's a judge that can't share the producer's blind spots.


The convergence rule

I use two independent judges per item:

  • Judge A: Anthropic (Claude Sonnet by default)

  • Judge B: OpenAI (GPT-4o by default)

Each sees the same content and criteria. Neither sees the other's output. Each returns:

  • confidence: certainty in the verdict, 0–100

  • verdict: APPROVED, FLAGGED, or UNCERTAIN

  • rationale: one sentence
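The three fields above can be captured in a small typed record. This is a minimal sketch, assuming each judge replies with a JSON object carrying exactly those keys; `JudgeResult` and `parse_judge_reply` are hypothetical names for illustration, not taken from the tool.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    APPROVED = "APPROVED"
    FLAGGED = "FLAGGED"
    UNCERTAIN = "UNCERTAIN"


@dataclass(frozen=True)
class JudgeResult:
    confidence: int   # certainty in the verdict, 0-100
    verdict: Verdict
    rationale: str    # one sentence


def parse_judge_reply(raw: str) -> JudgeResult:
    """Parse one judge's JSON reply and reject anything outside the contract."""
    data = json.loads(raw)
    confidence = int(data["confidence"])
    if not 0 <= confidence <= 100:
        raise ValueError(f"confidence out of range: {confidence}")
    return JudgeResult(
        confidence=confidence,
        verdict=Verdict(data["verdict"]),  # ValueError on an unknown verdict
        rationale=str(data["rationale"]).strip(),
    )
```

Rejecting malformed replies at the boundary keeps a judge that drifts off-format from being silently counted as agreement.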

A decision is trusted when the two verdicts match and both of these conditions hold:

| Condition | Threshold | Why |
|---|---|---|
| Avg confidence | ≥ 75 | Both judges highly certain |
| Spread | ≤ 15 | Judges agree with each other |

Both are required. Average alone misses divergence; spread alone misses low-confidence agreement.

Examples from actual runs:

| Judge A conf. | Judge B conf. | Avg | Spread | Result |
|---|---|---|---|---|
| 95 | 90 | 92.5 | 5 | ✅ Converged |
| 85 | 48 | 66.5 | 37 | ❌ Verdicts disagree (one UNCERTAIN) |
| 74 | 71 | 72.5 | 3 | ❌ Avg below floor (72.5 < 75) |
| 90 | 70 | 80.0 | 20 | ❌ Spread too high (20 > 15) |

My calibration: avg ≥ 75, spread ≤ 15, max 4 iterations. Tune these against your corpus — the shape transfers, the numbers don't.
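The rule is small enough to write out in full. This is a sketch under stated assumptions rather than the tool's actual code: `check_convergence` is a hypothetical name, the verdict-match check is inferred from the "verdicts disagree" result in the examples, and the verdicts for rows 1, 3, and 4 are assumed to be FLAGGED because the table lists only confidences.

```python
def check_convergence(conf_a, verdict_a, conf_b, verdict_b,
                      avg_floor=75.0, spread_max=15.0):
    """Return (converged, avg, spread, reason) for one pair of judge results."""
    avg = (conf_a + conf_b) / 2
    spread = abs(conf_a - conf_b)
    if verdict_a != verdict_b:          # assumption: checked before the numbers
        return False, avg, spread, "verdicts disagree"
    if avg < avg_floor:
        return False, avg, spread, "avg below floor"
    if spread > spread_max:
        return False, avg, spread, "spread too high"
    return True, avg, spread, "converged"


# The four example rows; verdicts other than row 2 are assumed for illustration.
rows = [
    (95, "FLAGGED", 90, "FLAGGED"),
    (85, "FLAGGED", 48, "UNCERTAIN"),
    (74, "FLAGGED", 71, "FLAGGED"),
    (90, "FLAGGED", 70, "FLAGGED"),
]
for row in rows:
    print(row, check_convergence(*row))
```

Returning the reason alongside the boolean makes it easy to see which threshold is doing the rejecting when you tune the numbers.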

When a verdict doesn't meet both conditions, I don't discard the item — I escalate.


The escalation ladder

The escalation trigger is specific: if confidence improves by ≤ 5 points between iterations, more of the same model won't close the gap, so the next iteration moves one level up the ladder.

| Level | Judge A | Judge B |
|---|---|---|
| L1 | claude-sonnet-4-6 | gpt-4o |
| L2 | claude-opus-4-7 | gpt-4o |
| L3 | claude-opus-4-7 | gpt-4-turbo |
| L4 | claude-opus-4-7 | gpt-5o |
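The ladder and its trigger can be expressed as plain data plus one function. This is a rough sketch: the model names are copied from the table, while `next_level`, `models_for`, and the choice to measure "improvement" on the average confidence are assumptions made for illustration.

```python
# Ladder rows copied from the table: (Judge A, Judge B) per level.
LADDER = [
    ("claude-sonnet-4-6", "gpt-4o"),      # L1
    ("claude-opus-4-7", "gpt-4o"),        # L2
    ("claude-opus-4-7", "gpt-4-turbo"),   # L3
    ("claude-opus-4-7", "gpt-5o"),        # L4
]


def models_for(level):
    """Return the (Judge A, Judge B) model pair for a 1-based ladder level."""
    return LADDER[level - 1]


def next_level(level, prev_avg, curr_avg, stall_delta=5.0):
    """Step up one level when the average improved by <= stall_delta points.

    Assumption: improvement is measured on the average confidence.
    """
    stalled = (curr_avg - prev_avg) <= stall_delta
    if stalled and level < len(LADDER):
        return level + 1
    return level  # still improving, or already at the top of the ladder
```

Keeping the ladder as data means swapping a model at one level is a one-line change.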

Two things I learned running this:

I learned not to escalate both judges when only one was diverging. If Judge A was consistent across iterations and Judge B was all over the place, escalating both burned budget without isolating the signal. The ladder reflects this: it steps up Judge A (Anthropic) first, since in my experience Opus is more reliable than Sonnet on genuinely ambiguous items, while Judge B stays on gpt-4o through L2 and changes only at L3.

I capped at 4 iterations. After 4 loops without convergence, the item is genuinely ambiguous. The right response is a human decision, not a fifth model call.
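The cap is easiest to see as a bounded loop with a human fallback. This is a minimal sketch, not the tool's implementation: `run_judges` is a hypothetical callable that returns `(conf_a, conf_b, verdicts_match)` for a given iteration, and the status strings are invented for illustration.

```python
from typing import Callable, Tuple

MAX_ITERATIONS = 4


def review(run_judges: Callable[[int], Tuple[int, int, bool]],
           avg_floor: float = 75.0,
           spread_max: float = 15.0) -> Tuple[str, int]:
    """Loop until the pair converges or the iteration cap is reached."""
    for iteration in range(1, MAX_ITERATIONS + 1):
        conf_a, conf_b, verdicts_match = run_judges(iteration)
        avg = (conf_a + conf_b) / 2
        spread = abs(conf_a - conf_b)
        if verdicts_match and avg >= avg_floor and spread <= spread_max:
            return "CONVERGED", iteration
    # No fifth model call: the item goes to a person instead.
    return "HUMAN_REVIEW", MAX_ITERATIONS


# Scripted judges: a disagreement on iteration 1, agreement on iteration 2.
scripted = {1: (85, 48, False), 2: (80, 76, True)}
print(review(lambda i: scripted[i]))
```

Because the fallback is a return value rather than an exception, the caller has to decide explicitly what "human review" means in its pipeline.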


See it running

uv run python3 validate.py \
  --content "Cache relies on TTL alone, no explicit eviction path." \
  --criteria "Flag any caching pattern without an eviction or fallback path."

Dual-LLM Validator
  Judge A : claude-sonnet-4-6
  Judge B : gpt-4o
  Rule    : avg >= 75  AND  spread <= 15  |  max 4 iters

Iteration 1 / 4
  Judge A  claude-sonnet-4-6   confidence  95  FLAGGED
  Judge B  gpt-4o              confidence  95  FLAGGED

  avg 95.0  |  spread 0  |  CONVERGED ✅

Result     : FLAGGED
Iterations : 1

Rationales:
  Judge A: TTL-only caching creates silent stale reads
           when upstream data changes without eviction.
  Judge B: No eviction path means stale entries persist
           until TTL expires, regardless of data changes.

When it doesn't converge on the first pass:

Iteration 1 / 4
  Judge A  claude-sonnet-4-6   confidence  85  FLAGGED
  Judge B  gpt-4o              confidence  48  UNCERTAIN

  avg 66.5  |  spread 37  |  NOT CONVERGED (verdicts disagree)

  >> Trust delta stalled — escalating to L2

Iteration 2 / 4  [L2]
  Judge A  claude-opus-4-7     confidence  80  FLAGGED
  Judge B  gpt-4o              confidence  76  FLAGGED

  avg 78.0  |  spread 4  |  CONVERGED ✅

L2 escalates Judge A to Opus (stronger on ambiguous items); Judge B stays on gpt-4o, as the ladder specifies for L2.


Try it yourself

git clone https://github.com/deepakgoyal-ai/dual-llm-validator
cd dual-llm-validator
uv sync
cp .env.example .env   # then open .env and fill in your keys

uv run python3 validate.py \
  --content "your content here" \
  --criteria "your review criteria here"

Custom models and thresholds:

uv run python3 validate.py \
  --content-file review.txt \
  --criteria-file criteria.txt \
  --model-a claude-opus-4-7 \
  --model-b gpt-4-turbo \
  --avg-floor 80 \
  --spread-max 10

Exit code 0 = converged. Exit code 2 = human review required. Pipe-friendly.
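Those exit codes make the validator easy to gate from another script. This is a hedged sketch: the command and flags are the ones shown above, but `build_command`, `interpret_exit_code`, and `gate` are hypothetical helpers, and treating every other exit code as an error is an assumption.

```python
import subprocess


def build_command(content: str, criteria: str) -> list[str]:
    """Assemble the CLI call using the flags shown in the usage examples."""
    return ["uv", "run", "python3", "validate.py",
            "--content", content, "--criteria", criteria]


def interpret_exit_code(code: int) -> str:
    """Map the documented exit codes to a pipeline decision."""
    if code == 0:
        return "converged"
    if code == 2:
        return "human_review"
    return "error"  # assumption: anything else means the run itself failed


def gate(content: str, criteria: str) -> tuple[str, str]:
    """Run the validator from the repo directory; return (decision, stdout)."""
    proc = subprocess.run(build_command(content, criteria),
                          capture_output=True, text=True)
    return interpret_exit_code(proc.returncode), proc.stdout
```

Passing the arguments as a list avoids shell quoting problems when the content under review contains quotes or newlines.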


The operational discipline that matters most

Two things that aren't about the models at all:

I re-validate after every fix. First dispatch + fix applied ≠ done. A verdict is closed only when a re-validation pass converges on the corrected content. Skipping re-validation means I've applied a fix that passed one judge once — which is the original failure mode.
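The re-validation discipline can be sketched as a loop that only closes an item on a passing verdict for the current content. Everything here is hypothetical scaffolding: `validate` stands in for a full two-judge run that returns "APPROVED", "FLAGGED", or anything else for "no converged verdict", and `apply_fix` stands in for whatever edits the content.

```python
def close_item(content, validate, apply_fix, max_fix_rounds=3):
    """Return (closed, final_content).

    An item is closed only when validation approves the content as it
    stands now, so every fix is followed by a fresh validation pass.
    """
    fixes = 0
    while True:
        status = validate(content)
        if status == "APPROVED":
            return True, content
        if status != "FLAGGED" or fixes >= max_fix_rounds:
            return False, content   # unresolved: hand it to a human
        content = apply_fix(content)
        fixes += 1


# Toy example: the "judges" approve once the text mentions invalidation.
closed, final = close_item(
    "Cache relies on TTL alone.",
    validate=lambda c: "APPROVED" if "invalidate" in c else "FLAGGED",
    apply_fix=lambda c: c + " Writes invalidate the key.",
)
print(closed, final)
```

The loop never returns `True` straight after `apply_fix`, which is the whole point: a fix counts only once it has been judged.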

I managed two token budgets, not one. Running two providers concurrently means two rate limits refilling on different schedules. I alternated between providers across dispatch batches rather than parallelizing everything. Parallel looks faster until one provider hits a rate limit 40% through and stalls everything.
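One way to read "alternating across batches" is a dispatch plan in which only one provider is busy at a time. This is an illustrative sketch of that reading, not the scheduler I actually ran; `chunks`, `dispatch_plan`, and the provider labels are made up for the example.

```python
from itertools import islice


def chunks(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def dispatch_plan(items, batch_size, providers=("anthropic", "openai")):
    """Yield (provider, batch) pairs, switching provider after every batch.

    Each batch is judged by one provider, then by the other, so each
    provider's rate limit can refill while the other one is in use.
    """
    for batch in chunks(items, batch_size):
        for provider in providers:
            yield provider, batch


for provider, batch in dispatch_plan(["item-1", "item-2", "item-3"], batch_size=2):
    print(provider, batch)
```

Smaller batches switch providers more often, which trades throughput for a shorter stall when one side does hit its limit.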


What I'm still unsure about

Do these thresholds transfer? I calibrated avg ≥ 75 and spread ≤ 15 against one corpus. There's no reason to expect them to be correct for a different review task. Calibrate against your own items before trusting the defaults.
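Calibration can be as simple as sweeping both thresholds over a small hand-checked sample. This sketch assumes you have a list of `(conf_a, conf_b, verdict_was_correct)` tuples for items where the judges' verdicts matched; `calibrate` is a hypothetical helper, and the sample labels are invented for illustration, not results from my runs.

```python
from itertools import product


def calibrate(samples, floors=(65, 70, 75, 80, 85), spreads=(5, 10, 15, 20)):
    """Return rows of (avg_floor, spread_max, n_trusted, precision).

    precision is the share of trusted decisions that were actually
    correct, or None when a threshold pair trusts nothing.
    """
    rows = []
    for floor, max_spread in product(floors, spreads):
        trusted = [ok for a, b, ok in samples
                   if (a + b) / 2 >= floor and abs(a - b) <= max_spread]
        precision = sum(trusted) / len(trusted) if trusted else None
        rows.append((floor, max_spread, len(trusted), precision))
    return rows


# Made-up labels attached to confidence pairs, purely to show the shape.
samples = [(95, 90, True), (90, 70, False), (74, 71, True), (80, 76, True)]
for row in calibrate(samples):
    print(row)
```

Looking at `n_trusted` next to precision matters: a threshold pair that trusts almost nothing will look perfect and send everything to a human.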

Did I trade family bias for a different correlated bias? Cross-family judging reduces family bias, as the 2025 papers document, but it doesn't eliminate correlated failure. Two models trained heavily on the same internet text could still share blind spots I haven't characterized.

When shouldn't you do this? If the cost of a wrong verdict is low, single-judge is cheaper and probably fine. If you have labelled data and bandwidth, a fine-tuned single judge likely outperforms two general-purpose ones. This pattern is the 0-to-1 case — no training data, real stakes, need something better than one model family's opinion.


The short version: two judges, different families, convergence = avg ≥ floor AND spread ≤ max, cap at 4 iterations, accept ambiguity rather than escalating forever.

The rest is calibration.


I'm a Technical Architect building AI-enabled systems. Writing about what I actually build — not what I think I should build.

If you're running a review pipeline and have calibrated your own thresholds, I'd like to know what you landed on.