Your LLM Reviewer Agrees With Itself. That's the Bug.
When one LLM generates content and another from the same family reviews it, they can agree completely — and both be wrong in the same direction.
That's not a capability failure. It's family bias: a structurally correlated blind spot that no amount of prompting fixes, because the judge and producer share training patterns — they find each other's outputs "familiar" in a way that operates below the prompt layer. The 2025 research on this is now strong enough to act on.
I built a two-judge review loop — one Anthropic model, one OpenAI model — with a convergence rule that defines what "agreement" actually means. After ~75 review items and dozens of iteration loops, the pattern held: the second judge caught a class of errors the first couldn't, not because it was smarter, but because it wasn't correlated with the producer.
There's also an open-source tool so you can run it against your own content.
Why single-judge fails quietly
Frontier models are competent judges. That's precisely why this failure mode is dangerous — the verdicts look authoritative.
Spiliopoulou et al.'s "Play Favorites" (Aug 2025) showed that GPT-4o and Claude 3.5 Sonnet don't just over-rate their own outputs — they over-rate outputs from their entire model lineage. Wataoka et al. (arXiv 2410.21819, rev. Jun 2025) traced the mechanism to pattern familiarity: judges find text from familiar architectures easier to process and rate it higher. Adaline's April 2026 analysis found frontier judges failing more than 50% of bias stress tests, with the practical takeaway: "Separate your judge's model family from your generator's."
The solution isn't a smarter judge. It's a judge that can't share the producer's blind spots.
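One way to make that separation hard to violate by accident is a guard at configuration time. The sketch below is illustrative only: `MODEL_FAMILIES`, `family_of`, and `check_cross_family` are hypothetical names, and mapping families by model-name prefix is an assumption, not something the tool is known to do.

```python
# Hypothetical prefix-to-family map; extend it for the models you actually use.
MODEL_FAMILIES = {
    "claude": "anthropic",
    "gpt": "openai",
}


def family_of(model: str) -> str:
    """Return the provider family for a model name, based on its prefix."""
    for prefix, family in MODEL_FAMILIES.items():
        if model.lower().startswith(prefix):
            return family
    raise ValueError(f"Unknown model family for {model!r}")


def check_cross_family(producer: str, judges: list) -> None:
    """Raise if every judge comes from the same family as the producer."""
    producer_family = family_of(producer)
    judge_families = {family_of(judge) for judge in judges}
    if judge_families <= {producer_family}:
        raise ValueError(
            f"All judges share the producer's family ({producer_family}); "
            "add a judge from another family."
        )


# A Claude producer reviewed by one Claude judge and one GPT judge passes.
check_cross_family("claude-sonnet-4-6", ["claude-opus-4-7", "gpt-4o"])
```

The check only asks for at least one out-of-family judge, which is the minimum the argument above requires.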
The convergence rule
I use two independent judges per item:
Judge A: Anthropic (Claude Sonnet by default)
Judge B: OpenAI (GPT-4o by default)
Each sees the same content and criteria. Neither sees the other's output. Each returns:
confidence: certainty in the verdict, 0–100
verdict: APPROVED, FLAGGED, or UNCERTAIN
rationale: one sentence
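A minimal sketch of that three-field result as a typed record, with a parser that rejects anything outside the schema. The JSON wire format and the names `JudgeResult` and `parse_judge_reply` are assumptions for illustration; the tool may represent replies differently.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    APPROVED = "APPROVED"
    FLAGGED = "FLAGGED"
    UNCERTAIN = "UNCERTAIN"


@dataclass(frozen=True)
class JudgeResult:
    confidence: int   # certainty in the verdict, 0-100
    verdict: Verdict
    rationale: str    # one sentence


def parse_judge_reply(raw: str) -> JudgeResult:
    """Parse one judge's JSON reply (assumed format) and validate each field."""
    data = json.loads(raw)
    confidence = int(data["confidence"])
    if not 0 <= confidence <= 100:
        raise ValueError(f"confidence out of range: {confidence}")
    verdict = Verdict(data["verdict"])  # raises ValueError on unknown verdicts
    rationale = str(data["rationale"]).strip()
    if not rationale:
        raise ValueError("rationale must not be empty")
    return JudgeResult(confidence, verdict, rationale)


reply = '{"confidence": 95, "verdict": "FLAGGED", "rationale": "No eviction path."}'
result = parse_judge_reply(reply)
```

Validating at the boundary keeps a malformed reply from one judge from silently skewing the average or the spread.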
A decision is trusted when the two verdicts match and two conditions both hold:
| Condition | Threshold | Why |
|---|---|---|
| Avg confidence | ≥ 75 | Judges are confident on average |
| Spread | ≤ 15 | Judges agree with each other |
Both are required. Average alone misses divergence; spread alone misses low-confidence agreement.
Examples from actual runs:
| Judge A conf. | Judge B conf. | Avg | Spread | Result |
|---|---|---|---|---|
| 95 | 90 | 92.5 | 5 | ✅ Converged |
| 85 | 48 | 66.5 | 37 | ❌ Verdicts disagree (one UNCERTAIN) |
| 74 | 71 | 72.5 | 3 | ❌ Avg below floor (72.5 < 75) |
| 90 | 70 | 80.0 | 20 | ❌ Spread too high (20 > 15) |
My calibration: avg ≥ 75, spread ≤ 15, max 4 iterations. Tune these against your corpus — the shape transfers, the numbers don't.
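The rule is small enough to write down directly. This is a sketch under stated assumptions, not the tool's code: `check_convergence` is a hypothetical name, the verdict-match check is inferred from the "verdicts disagree" result in the examples, and rows 3 and 4 are assumed to have matching verdicts because the table does not list them.

```python
def check_convergence(conf_a: float, conf_b: float,
                      verdict_a: str, verdict_b: str,
                      avg_floor: float = 75.0,
                      spread_max: float = 15.0):
    """Return (converged, reason) for one pair of judge results."""
    avg = (conf_a + conf_b) / 2
    spread = abs(conf_a - conf_b)
    if verdict_a != verdict_b:
        return False, "verdicts disagree"
    if avg < avg_floor:
        return False, "avg below floor"
    if spread > spread_max:
        return False, "spread too high"
    return True, "converged"


# The four example rows from the table above.
rows = [
    (95, 90, "FLAGGED", "FLAGGED"),
    (85, 48, "FLAGGED", "UNCERTAIN"),
    (74, 71, "FLAGGED", "FLAGGED"),   # verdicts assumed to match
    (90, 70, "FLAGGED", "FLAGGED"),   # verdicts assumed to match
]
for row in rows:
    print(row[:2], check_convergence(*row))
```

Keeping `avg_floor` and `spread_max` as parameters rather than constants makes it cheap to re-run the same check while tuning against a different corpus.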
When a verdict doesn't meet both conditions, I don't discard the item — I escalate.
The escalation ladder
The escalation trigger is specific: if improvement between iterations is ≤ 5 confidence points, more of the same model won't close the gap.
| Level | Judge A | Judge B |
|---|---|---|
| L1 | claude-sonnet-4-6 | gpt-4o |
| L2 | claude-opus-4-7 | gpt-4o |
| L3 | claude-opus-4-7 | gpt-4-turbo |
| L4 | claude-opus-4-7 | gpt-5o |
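The ladder and its trigger can be sketched as plain data plus one comparison. The model names are copied from the table; `LADDER`, `has_stalled`, and `next_level` are hypothetical names, and measuring improvement on the average confidence of two consecutive iterations is an assumption, since the text only says "confidence points".

```python
# (Judge A model, Judge B model) per level, copied from the table above.
LADDER = [
    ("claude-sonnet-4-6", "gpt-4o"),      # L1
    ("claude-opus-4-7", "gpt-4o"),        # L2
    ("claude-opus-4-7", "gpt-4-turbo"),   # L3
    ("claude-opus-4-7", "gpt-5o"),        # L4
]

STALL_THRESHOLD = 5  # confidence points


def has_stalled(prev_avg: float, curr_avg: float,
                threshold: float = STALL_THRESHOLD) -> bool:
    """True when average confidence improved by `threshold` points or less."""
    return (curr_avg - prev_avg) <= threshold


def next_level(level: int) -> int:
    """Step one rung up the ladder (0-based), staying on the top rung."""
    return min(level + 1, len(LADDER) - 1)


# An improvement of 3.5 points counts as stalled, so the pair moves to L2.
level = 0
if has_stalled(prev_avg=66.5, curr_avg=70.0):
    level = next_level(level)
judge_a_model, judge_b_model = LADDER[level]
```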
Two things I learned running this:
I learned not to escalate both judges when one was diverging. If Judge A was consistent across iterations and Judge B was all over the place, escalating both burned budget without isolating the signal. The ladder reflects this: it steps up Judge A (Anthropic) first, since in my experience Opus is more reliable than Sonnet on genuinely ambiguous items, while Judge B stays on gpt-4o through L2.
I capped at 4 iterations. After 4 loops without convergence, the item is genuinely ambiguous. The right response is a human decision, not a fifth model call.
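Put together, the capped loop might look like the sketch below. It is not the tool's implementation: `review` and the `Judge` callable type are hypothetical, judges are injected as plain functions so the loop runs without API calls, and treating the first failed pass as stalled (there is no earlier pass to compare against) is an assumption.

```python
from typing import Callable, Optional, Tuple

# A judge takes an escalation level (0-based) and returns (confidence, verdict).
# A real judge would call a provider API here.
Judge = Callable[[int], Tuple[int, str]]


def review(judge_a: Judge, judge_b: Judge, max_iters: int = 4,
           avg_floor: float = 75, spread_max: float = 15,
           stall_threshold: float = 5, max_level: int = 3) -> Tuple[str, int]:
    """Run the loop; return (verdict or "HUMAN_REVIEW", iterations used)."""
    level = 0
    prev_avg: Optional[float] = None
    for iteration in range(1, max_iters + 1):
        conf_a, verdict_a = judge_a(level)
        conf_b, verdict_b = judge_b(level)
        avg = (conf_a + conf_b) / 2
        spread = abs(conf_a - conf_b)
        if verdict_a == verdict_b and avg >= avg_floor and spread <= spread_max:
            return verdict_a, iteration
        # Assumption: with no previous pass, improvement counts as zero.
        stalled = prev_avg is None or (avg - prev_avg) <= stall_threshold
        if stalled:
            level = min(level + 1, max_level)
        prev_avg = avg
    # The cap is reached: hand the item to a person instead of calling again.
    return "HUMAN_REVIEW", max_iters


# Stub judges that never agree, to exercise the cap.
outcome = review(lambda level: (90, "FLAGGED"), lambda level: (40, "APPROVED"))
```

Returning a distinct `"HUMAN_REVIEW"` value, rather than the last verdict seen, keeps an ambiguous item from being mistaken for a weak approval downstream.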
See it running
```bash
uv run python3 validate.py \
  --content "Cache relies on TTL alone, no explicit eviction path." \
  --criteria "Flag any caching pattern without an eviction or fallback path."
```
```text
Dual-LLM Validator
Judge A : claude-sonnet-4-6
Judge B : gpt-4o
Rule : avg >= 75 AND spread <= 15 | max 4 iters
Iteration 1 / 4
Judge A claude-sonnet-4-6 confidence 95 FLAGGED
Judge B gpt-4o confidence 95 FLAGGED
avg 95.0 | spread 0 | CONVERGED ✅
Result : FLAGGED
Iterations : 1
Rationales:
Judge A: TTL-only caching creates silent stale reads
         when upstream data changes without eviction.
Judge B: No eviction path means stale entries persist
         until TTL expires, regardless of data changes.
```
When it doesn't converge on the first pass:
```text
Iteration 1 / 4
Judge A claude-sonnet-4-6 confidence 85 FLAGGED
Judge B gpt-4o confidence 48 UNCERTAIN
avg 66.5 | spread 37 | NOT CONVERGED (verdicts disagree)
>> Trust delta stalled — escalating to L2
Iteration 2 / 4 [L2]
Judge A claude-opus-4-7 confidence 80 FLAGGED
Judge B gpt-4o confidence 76 FLAGGED
avg 78.0 | spread 4 | CONVERGED ✅
```
L2 escalates Judge A to Opus (stronger on ambiguous items) while Judge B stays on gpt-4o, which is the first step on the ladder.
Try it yourself
```bash
git clone https://github.com/deepakgoyal-ai/dual-llm-validator
cd dual-llm-validator
uv sync
cp .env.example .env  # then open .env and fill in your keys
uv run python3 validate.py \
  --content "your content here" \
  --criteria "your review criteria here"
```
Custom models and thresholds:
```bash
uv run python3 validate.py \
  --content-file review.txt \
  --criteria-file criteria.txt \
  --model-a claude-opus-4-7 \
  --model-b gpt-4-turbo \
  --avg-floor 80 \
  --spread-max 10
```
Exit code 0 = converged. Exit code 2 = human review required. Pipe-friendly.
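If you call the validator from another Python script, the exit codes can be mapped to outcomes as in the sketch below. `classify_exit` and `run_validator` are hypothetical helpers. Only codes 0 and 2 are documented above, so treating every other code as an error is an assumption, as is running from the repository root.

```python
import subprocess


def classify_exit(code: int) -> str:
    """Map the validator's exit code to an outcome label."""
    if code == 0:
        return "converged"
    if code == 2:
        return "human_review"
    return "error"  # assumption: any other code means the run itself failed


def run_validator(content_file: str, criteria_file: str) -> str:
    """Invoke validate.py (assumed to be in the current directory)."""
    proc = subprocess.run(
        ["uv", "run", "python3", "validate.py",
         "--content-file", content_file,
         "--criteria-file", criteria_file],
        check=False,  # a non-zero exit is a result here, not an exception
    )
    return classify_exit(proc.returncode)
```

Passing `check=False` matters: exit code 2 is a normal outcome for an ambiguous item, so it should route the item to a person rather than raise.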
The operational discipline that matters most
Two things that aren't about the models at all:
I re-validate after every fix. First dispatch + fix applied ≠ done. A verdict is closed only when a re-validation pass converges on the corrected content. Skipping re-validation means I've applied a fix that passed one judge once — which is the original failure mode.
I managed two token budgets, not one. Running two providers concurrently means two rate limits refilling on different schedules. I alternated between providers across dispatch batches rather than parallelizing everything. Parallel looks faster until one provider hits a rate limit 40% through and stalls everything.
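The re-validation rule above can be sketched as a loop in which every fix is followed by a fresh pass. The names are hypothetical: `validate` stands in for a full two-judge run that returns the converged verdict (or something like `"HUMAN_REVIEW"`), `apply_fix` stands in for whatever edits the content, and `max_fixes` is an assumed cap.

```python
from typing import Callable, Tuple


def close_item(content: str,
               validate: Callable[[str], str],
               apply_fix: Callable[[str], str],
               max_fixes: int = 3) -> Tuple[str, bool]:
    """Return (final_content, closed).

    The item is closed only if the most recent validation pass, run on the
    content as it stands now, came back APPROVED.
    """
    verdict = validate(content)
    fixes = 0
    while verdict == "FLAGGED" and fixes < max_fixes:
        content = apply_fix(content)
        fixes += 1
        verdict = validate(content)  # a fix never closes an item by itself
    return content, verdict == "APPROVED"


# Stub validator and fixer, for illustration only.
def fake_validate(text: str) -> str:
    return "APPROVED" if "evict" in text else "FLAGGED"


def fake_fix(text: str) -> str:
    return text + " Adds explicit eviction on write."


final_text, closed = close_item("Cache relies on TTL alone.", fake_validate, fake_fix)
```

Because `closed` is derived from the last validation call and not from the fact that a fix ran, "fix applied" and "done" cannot be confused.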
What I'm still unsure about
Do these thresholds transfer? I calibrated avg ≥ 75 and spread ≤ 15 against one corpus. There's no reason to expect them to be correct for a different review task. Calibrate against your own items before trusting the defaults.
Did I trade family bias for a different correlated bias? Cross-family reduces family bias — the 2025 papers document this. It doesn't eliminate correlated failure. Two models trained heavily on the same internet text could still share blind spots I haven't characterized.
When shouldn't you do this? If the cost of a wrong verdict is low, single-judge is cheaper and probably fine. If you have labelled data and bandwidth, a fine-tuned single judge likely outperforms two general-purpose ones. This pattern is the 0-to-1 case — no training data, real stakes, need something better than one model family's opinion.
The short version: two judges, different families, convergence = avg ≥ floor AND spread ≤ max, cap at 4 iterations, accept ambiguity rather than escalating forever.
The rest is calibration.
I'm a Technical Architect building AI-enabled systems. Writing about what I actually build — not what I think I should build.
If you're running a review pipeline and have calibrated your own thresholds, I'd like to know what you landed on.
