Two Validators, One Decision: A Convergence Rule for LLM Review

When one LLM generates content and another from the same family reviews it, they can agree completely — and both be wrong in the same direction.

That's not a capability failure. It's family bias: a structurally correlated blind spot that no amount of prompting fixes, because the judge and producer share training patterns — they find each other's outputs "familiar" in a way that operates below the prompt layer. The 2025 research on this is now strong enough to act on.

I built a two-judge review loop — one Anthropic model, one OpenAI model — with a convergence rule that defines what "agreement" actually means. After ~75 review items and dozens of iteration loops, the pattern held: the second judge caught a class of errors the first couldn't, not because it was smarter, but because it wasn't correlated with the producer.

There's also an open-source tool so you can run it against your own content.

Why single-judge fails quietly

Frontier models are competent judges. That's precisely why this failure mode is dangerous — the verdicts look authoritative.

Spiliopoulou et al.'s "Play Favorites" (Aug 2025) showed that GPT-4o and Claude 3.5 Sonnet don't just over-rate their own outputs — they over-rate outputs from their entire model lineage. Wataoka et al. (arXiv 2410.21819, rev. Jun 2025) traced the mechanism to pattern familiarity: judges find text from familiar architectures easier to process and rate it higher. Adaline's April 2026 analysis found frontier judges failing more than 50% of bias stress tests, with the practical takeaway: "Separate your judge's model family from your generator's."

The solution isn't a smarter judge. It's a judge that can't share the producer's blind spots.

The convergence rule

I use two independent judges per item:

Judge A: Anthropic (Claude Sonnet by default)
Judge B: OpenAI (GPT-4o by default)

Each sees the same content and criteria. Neither sees the other's output. Each returns:

confidence — certainty in the verdict, 0–100
verdict — APPROVED, FLAGGED, or UNCERTAIN
rationale — one sentence
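The three fields above can be held in a small record type. This is a minimal sketch, assuming each judge replies with a JSON object using exactly those field names; the names `JudgeResult` and `parse_judge_reply` are illustrative and are not taken from the tool.

```python
import json
from dataclasses import dataclass

VERDICTS = {"APPROVED", "FLAGGED", "UNCERTAIN"}


@dataclass(frozen=True)
class JudgeResult:
    confidence: int  # certainty in the verdict, 0-100
    verdict: str     # APPROVED, FLAGGED, or UNCERTAIN
    rationale: str   # one sentence


def parse_judge_reply(raw: str) -> JudgeResult:
    """Parse a judge's JSON reply (assumed format) and reject out-of-contract values."""
    data = json.loads(raw)
    confidence = int(data["confidence"])
    verdict = str(data["verdict"]).strip().upper()
    if not 0 <= confidence <= 100:
        raise ValueError(f"confidence out of range: {confidence}")
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict}")
    return JudgeResult(confidence, verdict, str(data["rationale"]).strip())
```

Rejecting malformed replies up front keeps a badly formatted answer from being averaged into a confidence score.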

A decision is trusted when the two verdicts match and two conditions both hold:

Condition	Threshold	Why
Avg confidence	≥ 75	Both judges highly certain
Spread	≤ 15	Judges agree with each other

Both are required. Average alone misses divergence; spread alone misses low-confidence agreement.

Examples from actual runs:

Judge A conf.	Judge B conf.	Avg	Spread	Result
95	90	92.5	5	✅ Converged
85	48	66.5	37	❌ Verdicts disagree (one UNCERTAIN)
74	71	72.5	3	❌ Avg below floor (72.5 < 75)
90	70	80.0	20	❌ Spread too high (20 > 15)
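The rule behind these rows fits in one function. This is a sketch rather than the tool's code: `check_convergence` is an illustrative name, and checking verdict agreement first is an assumption inferred from the second row. The table does not list verdicts for the other rows, so the usage below assumes both judges returned FLAGGED.

```python
def check_convergence(conf_a, verdict_a, conf_b, verdict_b,
                      avg_floor=75.0, spread_max=15.0):
    """Return (converged, avg, spread, reason) for one pair of judge results."""
    avg = (conf_a + conf_b) / 2
    spread = abs(conf_a - conf_b)
    if verdict_a != verdict_b:
        # Assumed ordering: a verdict mismatch blocks convergence outright.
        return False, avg, spread, "verdicts disagree"
    if avg < avg_floor:
        return False, avg, spread, "avg below floor"
    if spread > spread_max:
        return False, avg, spread, "spread too high"
    return True, avg, spread, "converged"


# Confidences from the table; verdicts for rows 1, 3 and 4 are assumed.
for a, va, b, vb in [(95, "FLAGGED", 90, "FLAGGED"),
                     (85, "FLAGGED", 48, "UNCERTAIN"),
                     (74, "FLAGGED", 71, "FLAGGED"),
                     (90, "FLAGGED", 70, "FLAGGED")]:
    print(a, b, check_convergence(a, va, b, vb))
```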

My calibration: avg ≥ 75, spread ≤ 15, max 4 iterations. Tune these against your corpus — the shape transfers, the numbers don't.
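One way to tune the numbers is to sweep candidate thresholds over items whose correct verdict you already know. This is a hedged sketch of that idea, not how I calibrated: `sweep` is a hypothetical helper, the input tuples are assumed to come from runs where both judges gave the same verdict, and the list must be non-empty.

```python
from itertools import product


def sweep(items, floors=(70, 75, 80, 85), spreads=(5, 10, 15, 20)):
    """items: (conf_a, conf_b, verdict_was_correct) tuples from labelled runs.

    Returns (floor, spread_max, coverage, precision) for each threshold pair.
    coverage  = share of items the rule would trust
    precision = share of trusted items whose verdict was correct
    """
    rows = []
    for floor, spread_max in product(floors, spreads):
        trusted = [ok for a, b, ok in items
                   if (a + b) / 2 >= floor and abs(a - b) <= spread_max]
        coverage = len(trusted) / len(items)
        precision = sum(trusted) / len(trusted) if trusted else None
        rows.append((floor, spread_max, coverage, precision))
    return rows
```

A tighter pair trusts fewer items but is wrong less often among those it trusts; where to sit on that trade-off depends on what a wrong verdict costs.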

When a verdict doesn't meet both conditions, I don't discard the item — I escalate.

The escalation ladder

The escalation trigger is specific: if improvement between iterations is ≤ 5 confidence points, more of the same model won't close the gap.

Level	Judge A	Judge B
L1	claude-sonnet-4-6	gpt-4o
L2	claude-opus-4-7	gpt-4o
L3	claude-opus-4-7	gpt-4-turbo
L4	claude-opus-4-7	gpt-5o
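The ladder and its trigger can be written down as data plus one comparison. This sketch assumes "improvement" means the change in average confidence between consecutive iterations, which the rule above does not spell out; `stalled` and `next_level` are illustrative names.

```python
# Model pairs copied from the table above: (Judge A, Judge B) per level.
LADDER = [
    ("claude-sonnet-4-6", "gpt-4o"),     # L1
    ("claude-opus-4-7", "gpt-4o"),       # L2
    ("claude-opus-4-7", "gpt-4-turbo"),  # L3
    ("claude-opus-4-7", "gpt-5o"),       # L4
]


def stalled(prev_avg, curr_avg, min_gain=5.0):
    """True when avg confidence improved by min_gain points or less (assumed metric)."""
    return (curr_avg - prev_avg) <= min_gain


def next_level(level, prev_avg, curr_avg):
    """Return the 1-based ladder level to use for the next iteration."""
    if stalled(prev_avg, curr_avg) and level < len(LADDER):
        return level + 1
    return level
```

For example, `next_level(1, 66.5, 68.0)` moves to L2 because a 1.5-point gain counts as stalled, while `next_level(1, 60.0, 72.0)` stays on L1.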

Two things I learned running this:

I learned not to escalate both judges when one was diverging. If Judge A was consistent across iterations and Judge B was all over the place, escalating both burned budget without isolating the signal. The ladder reflects this: it steps up Judge A (Anthropic) first, since in my experience Opus is more reliable than Sonnet on genuinely ambiguous items, while Judge B stays on gpt-4o through L2 and changes only at L3.

I capped at 4 iterations. After 4 loops without convergence, the item is genuinely ambiguous. The right response is a human decision, not a fifth model call.
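The iteration cap is easiest to see as a bounded loop with an explicit human hand-off. This is a simplified sketch that leaves out escalation: the judges are passed in as plain callables, and `review` and the `HUMAN_REVIEW` label are illustrative names rather than the tool's API.

```python
from typing import Callable, Tuple

# A judge maps an iteration number to (confidence, verdict); real judges call an API.
Judge = Callable[[int], Tuple[int, str]]


def review(judge_a: Judge, judge_b: Judge, max_iters: int = 4,
           avg_floor: float = 75, spread_max: float = 15) -> Tuple[str, int]:
    """Loop until the judges converge or the cap is hit; return (result, iterations)."""
    for i in range(1, max_iters + 1):
        conf_a, verdict_a = judge_a(i)
        conf_b, verdict_b = judge_b(i)
        avg = (conf_a + conf_b) / 2
        spread = abs(conf_a - conf_b)
        if verdict_a == verdict_b and avg >= avg_floor and spread <= spread_max:
            return verdict_a, i
    # Cap reached without convergence: hand the item to a person.
    return "HUMAN_REVIEW", max_iters
```

Returning a distinct result instead of raising keeps ambiguous items in the normal flow, where they can be queued for a person.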

See it running

uv run python3 validate.py \
  --content "Cache relies on TTL alone, no explicit eviction path." \
  --criteria "Flag any caching pattern without an eviction or fallback path."

Dual-LLM Validator
  Judge A : claude-sonnet-4-6
  Judge B : gpt-4o
  Rule    : avg >= 75  AND  spread <= 15  |  max 4 iters

Iteration 1 / 4
  Judge A  claude-sonnet-4-6   confidence  95  FLAGGED
  Judge B  gpt-4o              confidence  95  FLAGGED

  avg 95.0  |  spread 0  |  CONVERGED ✅

Result     : FLAGGED
Iterations : 1

Rationales:
  Judge A: TTL-only caching creates silent stale reads
           when upstream data changes without eviction.
  Judge B: No eviction path means stale entries persist
           until TTL expires, regardless of data changes.

When it doesn't converge on the first pass:

Iteration 1 / 4
  Judge A  claude-sonnet-4-6   confidence  85  FLAGGED
  Judge B  gpt-4o              confidence  48  UNCERTAIN

  avg 66.5  |  spread 37  |  NOT CONVERGED (verdicts disagree)

  >> Trust delta stalled — escalating to L2

Iteration 2 / 4  [L2]
  Judge A  claude-opus-4-7     confidence  80  FLAGGED
  Judge B  gpt-4o              confidence  76  FLAGGED

  avg 78.0  |  spread 4  |  CONVERGED ✅

L2 escalates Judge A to Opus (stronger on ambiguous items) while Judge B stays on gpt-4o, as the ladder prescribes. In this run the outlier was Judge B's UNCERTAIN at 48, which settled at 76 FLAGGED on the second pass.

Try it yourself

git clone https://github.com/deepakgoyal-ai/dual-llm-validator
cd dual-llm-validator
uv sync
cp .env.example .env   # then open .env and fill in your keys

uv run python3 validate.py \
  --content "your content here" \
  --criteria "your review criteria here"

Custom models and thresholds:

uv run python3 validate.py \
  --content-file review.txt \
  --criteria-file criteria.txt \
  --model-a claude-opus-4-7 \
  --model-b gpt-4-turbo \
  --avg-floor 80 \
  --spread-max 10

Exit code 0 = converged. Exit code 2 = human review required. Pipe-friendly.
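In a pipeline, those exit codes can drive routing directly. This is a minimal sketch, assuming it runs from the cloned repository: it uses only the `--content` and `--criteria` flags shown above, `run_validator` and `route` are hypothetical helper names, and treating every other exit code as an error is my assumption since only 0 and 2 are documented.

```python
import subprocess


def run_validator(content: str, criteria: str) -> int:
    """Invoke the CLI with the flags shown above and return its exit code."""
    cmd = ["uv", "run", "python3", "validate.py",
           "--content", content, "--criteria", criteria]
    return subprocess.run(cmd).returncode


def route(exit_code: int) -> str:
    """Map the exit code to a pipeline action."""
    if exit_code == 0:
        return "accept"        # converged
    if exit_code == 2:
        return "human_review"  # human review required
    return "error"             # assumption: anything else is a failure
```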

The operational discipline that matters most

Two things that aren't about the models at all:

I re-validate after every fix. First dispatch + fix applied ≠ done. A verdict is closed only when a re-validation pass converges on the corrected content. Skipping re-validation means I've applied a fix that passed one judge once — which is the original failure mode.

I managed two token budgets, not one. Running two providers concurrently means two rate limits refilling on different schedules. I alternated between providers across dispatch batches rather than parallelizing everything. Parallel looks faster until one provider hits a rate limit 40% through and stalls everything.
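The first of these, re-validating after every fix, can be sketched as a loop that only closes an item on a converged APPROVED pass over the corrected content. Everything here is illustrative: `close_item`, the `validate` and `apply_fix` callables, the `CLOSED` and `HUMAN_REVIEW` labels, and the `max_fixes` limit are assumptions, not part of the tool.

```python
def close_item(content, validate, apply_fix, max_fixes=3):
    """validate(content) -> (converged, verdict); apply_fix(content) -> new content.

    Returns (final_content, status). An item is CLOSED only after a
    validation pass over the current content converges on APPROVED.
    """
    for _ in range(max_fixes + 1):
        converged, verdict = validate(content)
        if not converged or verdict == "UNCERTAIN":
            return content, "HUMAN_REVIEW"
        if verdict == "APPROVED":
            return content, "CLOSED"
        # FLAGGED: apply the fix, then loop so the fixed content is re-validated.
        content = apply_fix(content)
    return content, "HUMAN_REVIEW"
```

The point of the structure is that `apply_fix` never leads straight to a return: every fix is followed by another validation pass.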

What I'm still unsure about

Do these thresholds transfer? I calibrated avg ≥ 75 and spread ≤ 15 against one corpus. There's no reason to expect them to be correct for a different review task. Calibrate against your own items before trusting the defaults.

Did I trade family bias for a different correlated bias? Cross-family judging reduces family bias, as the 2025 papers document, but it doesn't eliminate correlated failure. Two models trained heavily on the same internet text could still share blind spots I haven't characterized.

When shouldn't you do this? If the cost of a wrong verdict is low, single-judge is cheaper and probably fine. If you have labelled data and bandwidth, a fine-tuned single judge likely outperforms two general-purpose ones. This pattern is the 0-to-1 case — no training data, real stakes, need something better than one model family's opinion.

The short version: two judges, different families, convergence = avg ≥ floor AND spread ≤ max, cap at 4 iterations, accept ambiguity rather than escalating forever.

The rest is calibration.

I'm a Technical Architect building AI-enabled systems. Writing about what I actually build — not what I think I should build.

If you're running a review pipeline and have calibrated your own thresholds, I'd like to know what you landed on.
