Echo: results so far

By Nick Meinhold, Robin Langer, Meghana Ganapa & Adarsha Aryal · 10 June 2026

TL;DR: Instead of training a classifier to route easy tasks to a cheap model and hard ones to an expensive one, call the cheap model twice with two different personas. If the answers agree, keep the cheap one; if they disagree, escalate. No classifier, no labels. On HumanEval's hard slice, a cross-family local judge (Qwen 2.5 7B) reaches 94% of the oracle's routing quality, is ~29% cheaper than always using Sonnet, and matches its pass rate. The honest part: our first reasoning-benchmark numbers were a harness bug, not a result — finding and fixing it is half this update.

Most LLM apps overpay

A trivial "reverse this string" and a gnarly multi-step refactor usually go to the same expensive endpoint. The standard fix is a router: a learned classifier that decides "easy → cheap model, hard → expensive." RouteLLM, FrugalGPT, Hybrid LLM and AutoMix all do versions of this, and they work — but every one needs labelled training data for your task domain. That label-collection step is the adoption bottleneck. You can't drop a trained router into a new product on day one.

We wanted to know how much of the benefit you can get with none of the training.

The idea: let the cheap model check itself

Call the cheap model twice on the same task, with two different persona prompts — one a "careful, methodical programmer," the other a "pragmatic senior engineer who writes the simplest thing that works." Then:

If the two answers agree, framing didn't matter — the task was easy. Keep the cheap answer.
If they disagree, the task is sitting on a decision boundary where small perturbations change the output. That's your difficulty signal. Escalate to the expensive model.

The difficulty signal is manufactured at inference time, for free, out of the model's own (in)consistency. It's a reframe of self-consistency (Wang et al., 2022) — but used as a cost signal instead of an accuracy one. The arithmetic clears one bar: two cheap calls must cost less than one expensive call. At current Claude pricing, Haiku-twice beats Sonnet-once while the tier gap stays ~3×. It does.

The catch nobody warns you about: what does "agree" mean?

"If the two answers agree" sounds simple until you implement agree(a, b) for code. Two programs can be character-identical, or solve the same problem with a loop vs a comprehension, a dict vs a class, different names, different decomposition — all "agreement" in the sense that matters (same behaviour) and "disagreement" in the sense that's easy to measure (different text).

The ladder of agreement signals we tested

Signal	What it checks	Extra cost
lexical	normalized text match	free
AST	Python syntax-tree structure match	free
judge	a third Haiku call: "are these equivalent?"	+1 cheap call
small-judge	same question, asked to a local Qwen 2.5 7B	~free (local compute)
oracle	ground truth: do the answers pass the hidden tests?	not deployable

The oracle isn't a real strategy — it cheats by looking at test results you'd never have in production. We include it to mark the ceiling. The research question is how close a deployable signal gets to it.

Results: HumanEval hard slice

All arms over HumanEval 100–163 (the first hundred tasks are too easy to separate arms). Cost in units where Haiku = 1, Sonnet = 3 per call. "Oracle alignment" = how often the signal escalates on exactly the tasks the oracle would.

Arm	Pass	Escalations	Oracle align	Cost
haiku-only	63/64	—	—	64
sonnet-only	63/64	—	—	192
echo-lexical	64/64	55/64	16%	293
echo-ast	62/64	54/64	17%	290
echo-judge (Haiku)	61/64	11/64	81%	225
echo-small-judge (Qwen 7B)	62/64	3/64	94%	137
echo-oracle (ceiling)	64/64	1/64	—	131

Three things fall out:

The cost thesis holds with a deployable signal. echo-small-judge lands within ~5% of the oracle's cost floor (137 vs 131), ~29% cheaper than always-Sonnet, with a pass rate statistically equal to Sonnet on this slice.
Free signals are noise here. Lexical and AST escalate ~85% of the time — they cost more than just using Sonnet.
The surprise: a cross-family local judge beats the same-family one. Qwen 7B (a different model family, running locally) tracks the oracle better than a Haiku judge (94% vs 81%) with a third the escalations. Independence beats capability for this job.

Why would a smaller, cheaper, local model judge agreement better than Haiku? Our read: a same-family judge shares Haiku's blind spots — it agrees that two Haiku answers match precisely when Haiku is consistently wrong. A different family disagrees out of genuine independence. That's the whole thesis in miniature.

Methodology & caveats

HumanEval only; n = 64; single hard slice. The local-judge mean wall time is ~86s/task on a CPU-only ARM box — that's infrastructure latency, not API cost, and would drop sharply on a GPU. Earlier sweeps surfaced (and fixed) two output-parser bugs in the code harness before these numbers stabilised — a recurring theme (see below). Full per-task JSONL logs and sweep history live in the repo's experiment/results/.

The plot twist: our first reasoning numbers were a lie

Updated 2026-07-03: the BBH re-run is in. It did not go the way this section's June draft expected — the honest version is below.

HumanEval is code. To claim Echo generalises, we need reasoning benchmarks — so we ported the harness to BBH (Big-Bench Hard). The n=10 pilot came back looking like this:

Arm	Pass rate
haiku-only	0.14
sonnet-only	0.14
echo-judge	0.12

Low, and suspiciously flat. The red flag: on reasoning tasks, Sonnet should clearly beat Haiku. Them tying at 0.14 doesn't say "Echo doesn't work" — it says the measuring instrument is broken.

So we put the BBH scoring code through an adversarial review — three AI reviewers from different model families, each trying to break it. They found the answer parser was silently corrupting results in both directions.

The three bugs (and why a silent one is the worst kind)

Tail truncation. The parser only looked at the last 5 lines of output before searching for the answer. A model that states "The answer is C" early and then keeps explaining had its answer fall outside the window — scored as unparseable, counted as a failure.
Case-folding over-match. A case-insensitive letter pattern matched the first letter of the next word: "the answer is straightforward" was parsed as answer "S". This one is bidirectional — it manufactures both false failures (wrong letter) and false passes (lucky letter), silently, because a bogus-but-valid letter is accepted without complaint.
Cross-family recency. "Answer: A … therefore the answer is C" returned A — an early scratch line beat the final answer because the two were caught by different patterns.

All three are fixed, each with a regression test; the scoring suite is green. The lesson: a measurement apparatus with a silent, bidirectional bias is worse than a noisy one. The flat 0.14 wasn't just low — it was untrustworthy in an unknown direction.

But the numbers stayed flat. Parsers fixed, the pilot still had Sonnet and Haiku tied. So the instrument was lying in a second, deeper way — and this one had nothing to do with parsing. We dumped the raw model output and it didn't begin with an answer. It began with:

All 8 tasks restored. Now to your question:

...followed by an ★ Insight block. We weren't measuring a model answering BBH. We were measuring our own agent answering BBH.

The harness shelled out to claude --print --model X from inside the project directory. There, that is not a clean model call: it loads the project's CLAUDE.md, runs the SessionStart hooks, and ingests a task-restore injection. The model dutifully answered in-character as the Echo project agent — a huge preamble, and sometimes no clean answer line at all. One confound explained every symptom at once: the ~100-second wall times (loading and running all that context on every call), the "unparseable" outputs (agent preamble crowding out the answer), and the depressed, suspiciously flat accuracy.

The fix is a single flag. --setting-sources "" strips every settings layer — hooks, CLAUDE.md, task-restore — while keeping the Max-plan auth. Same directory, before and after:

claude -p "what is 2+2" → (a task-restoration report) claude -p --setting-sources "" "what is 2+2" → Answer: 4

The re-run went nothing like the June draft predicted

We expected clean numbers to separate — Sonnet pulling ahead of Haiku on reasoning. They didn't. At n=99 the wall times dropped ~5×, every arm parsed cleanly, and Haiku and Sonnet came back identical: 0.848 each, matching subtask-for-subtask. We pushed to the hardest subtasks BBH has — seven-object logical deduction, seven-object shuffled-object tracking, temporal sequences — to force a gap. Both models scored a perfect 33/33 on all three.

MCQ BBH is saturated for 2026-era Claude. Even Haiku aces the seven-object state-tracking task that used to break models. There is no difficulty gap left to route on — which, taken at face value, sinks the original "escalate to the smarter model" framing on this benchmark.

Except the oracle didn't tie. echo-oracle scored 0.899 against the baselines' 0.848, and the entire margin came from one subtask — causal judgement, 27/33 vs 22/33. That is only possible if Haiku and Sonnet fail on different items. The capability ladder collapsed, but the errors are decorrelated: two diverse cheap attempts cover disjoint failures.

That reframes the thesis, and arguably strengthens it. Echo never needed the expensive model to be smarter — it needs the two attempts to be different. Decorrelation, not a difficulty ladder, is the mechanism. And decorrelation survives anywhere there's a real failure rate, saturated benchmark or not.

What's next

The saturation result quietly redraws the roadmap. Chasing a Sonnet-beats-Haiku gap on BBH is dead — there isn't one. But the decorrelated-errors finding points somewhere more interesting:

A non-saturated benchmark, next: MMLU-Pro. BBH's MCQ slice no longer has a real Haiku failure rate to rescue. MMLU-Pro (14 domains, 10-choice, genuinely hard) does — it's the honest place to test whether the disagreement signal captures the oracle's headroom when there's actually headroom to capture.
Route by task type, not size. If the difficulty ladder has collapsed, stop routing on it. Route incoming tasks to a hyperspecialist — a small model tuned for one domain (code, math, law) that can beat a larger generalist on its home turf. This needs the specialist to be better on type-X, not smarter overall, and it composes with decorrelation: a specialist that fails on different items than both baselines widens the oracle ceiling further. (Credit to Connor Hood for the reframe.)
The real test still stands — a heterogeneous real-world task (PR review with merge decisions), where difficulty genuinely varies and a saturated multiple-choice ceiling can't hide it.

The through-line: the value was never "the big model is smarter." It's "two independent looks disagree in useful places." Every next step is a hunt for the settings where that independence still pays.

If Echo lands on or above a trained router's cost/accuracy frontier with zero training data, that's the result worth publishing. If it collapses to "Haiku with extra steps," that's a clean negative — also worth publishing.

The team

Nick Meinhold · Director & Tech Lead — originated the self-consistency-as-cost-signal idea and the experiment design.
Robin Langer · Agentic Engineer — agentic engineering and research; co-founder of Sawasdee Cellars.
Meghana Ganapa · Agentic AI Engineer — ML/NLP across healthcare and legal domains. On Echo: cross-family judge arms and BBH scoring.
Adarsha Aryal · Agentic Engineer — Master of Data Science, Monash. On Echo: judge-branch integration, the BBH sweep harness, and run tooling.

Echo is open research at github.com/enspyrco/echo. Background: Wang et al. 2022 (Self-Consistency), Ong et al. 2024 (RouteLLM), Chen et al. 2023 (FrugalGPT), Ding et al. 2024 (Hybrid LLM). Earlier post: Echo: routing LLM requests cheaply without training a router.