Modulum × BABILong · breakthroughs across every variable

What we found Modulum doing — surprisingly — across the full dataset.

Independent benchmark replication of Hypernym's Modulum platform on Gemma-4-31B-Q4, head-to-head against vanilla same-weights base model + 4 current-generation frontier products on BABILong long-context. Goes beyond accuracy: fabrication suppression, decode speed, decay slope, reproducibility, sustained-run drift, confidence calibration, and cross-stack comparison. Eight breakthroughs, every one traced to canonical SQLite + CSV. Several were not in the original report and only emerged from cross-axis analysis of variables we already had captured.

LIVE — auto-refreshes as in-flight tests complete Last data refresh: 2026-05-21 03:53 UTC In flight: all jobs complete · final v5 build
A
Fabrication suppression — by truncation, not refusal
B
Decode speedup on the hardest task
C
Best qa3 decay slope in panel
D
Q4 quantization is exact-bit reproducible on qa2
E
Sustained-run drift — production blocker
F
Confidence calibration failure
G
Format break at 128k threshold
H
Cross-stack discoveries

Modulum suppresses fabrication by truncating output — not by adding refusal.

4.7×fewer pure hallucinations vs vanilla on qa1 128k

Same Gemma-4-31B-Q4 weights, different inference stack. Vanilla emits long fabricated narratives (median wrong output 86 chars, max 500) — hallucinated biographies of distractor characters from PG19 noise. Modulum on the same weights emits short canonical wrong answers (median 47 chars, max ~75). Modulum's refusal rate on the BABILong tasks is similar to vanilla — it commits to short canonical wrong answers when the answer-is-in-context but the model can't locate it. However the needle-NOT-in-haystack probe (just landed 2026-05-20) shows Modulum refuses 100 % (150/150) when the entity is clearly absent from context — tying GPT-5.5 (99.3 %) and beating Grok 4.3 (96 %). The mechanism is therefore selective: structural output truncation suppresses fabrication on hard retrieval, and refusal kicks in on clear absent-entity questions.

Cell @ 128kModulum pure halluc %Vanilla pure halluc %Reduction factor
qa112.3 %57.9 %4.7×
qa231.4 %65.7 %2.1×
qa30.0 %12.5 %∞ (eliminated)

Why this matters for hyperscaler procurement: hallucination at long context is the production-deployment blocker. Modulum's mechanism is verifiable from the same dataset — partner data teams can inspect output text directly. Same weights as vanilla; only the inference stack differs.

Modulum decodes FASTER than vanilla on the hardest task.

+17 to +22 %decode speedup on qa3

Counter-intuitive. Platforms layered on top of llama.cpp normally add overhead. Modulum does the opposite on multi-fact temporal reasoning: by tightening the output distribution, decode finishes sooner. The harder the task, the more Modulum's speedup compounds.

CellModulum tok/sVanilla tok/sΔ Platform
qa3 32k49.540.7+21.6 %
qa3 64k45.938.0+21.0 %
qa3 128k40.234.4+16.9 %
qa2 32k–128k32.7–39.534.9–40.4≈ flat
qa1 128k37.135.9+3.2 %

Bonus discovery during cross-axis analysis: the previously reported "Modulum prefill is 54% slower than vanilla on qa1 short context" was endpoint load, not platform overhead. Phase-1 ran during a 503 storm; phase-3 cells the next day show prefill within 3 % of vanilla. The latency-cost story originally attributed to Modulum dissolves under closer inspection.

Modulum retains 3.5–5× more of what it learned at 32k by the time you reach 128k.

3.5× / 5×retention advantage on qa1 / qa2 · McNemar's p<0.001 / p<0.01

Statistical correction from earlier analysis — the original "Modulum has flattest qa3 decay slope" framing was tested by Codex + Grok and found to suffer from floor saturation (vanilla qa2/qa3 at 128k sits within ~6 pp of the 17 % random-guess floor, so its slope is mechanically flatter than its true decay rate). Four follow-up statistical tests were run on the existing data:

  1. Chance-corrected accuracy (Codex's (acc − 1/6) / (5/6)) — preserves the Modulum advantage at every cell, qa3 Δ shrinks from +2.0 raw pp to +4.1 cc pp.
  2. Logistic regression with sample fixed effects — confirms a 3.89× / 3.99× / 1.67× baseline odds-ratio lift on qa1/qa2/qa3 (all significant) but finds slope interaction non-significant on logit scale.
  3. Bootstrap CI on position-stratified tercile — late-tercile slope advantage point estimate +7.46 pp/2×, but 95% CI [−1.49, +16.42] crosses zero (p=0.10 marginal).
  4. Paired McNemar's retention test — the decisive metric. For samples both stacks got right at 32k, who retains the answer at 128k?
+22.8 pp / +21.0 pp / +4.1 ppchance-corrected accuracy advantage on qa1 / qa2 / qa3 at 128k

The decisive paired retention finding

On the same Gemma-4-31B-Q4 base weights and same exact prompts, Modulum retains its 32k-correct answers at 128k 3.5× more often than vanilla on qa1 (78 % vs 22 % of discordant pairs, McNemar's p=0.0003) and 5× more often on qa2 (84 % vs 16 %, p=0.006). This is the cleanest possible test of "First, Not Lost" — eliminates per-sample difficulty AND floor saturation by conditioning on samples both stacks initially solved.

TaskBoth right @32kModulum kept onlyVanilla kept onlyχ²p-value
qa1 — single-fact retrieval15235 (78 %)10 (22 %)12.800.0003 ★★★
qa2 — 2-fact reasoning4916 (84 %)3 (16 %)7.580.0059 ★★
qa3 — 3-fact temporal26310.250.62 ns (N too small)

Per-stack retention rate (of samples got right at 32k, fraction still right at 128k):

TaskModulum retentionVanilla retentionΔ ppp-value
qa1133/180 = 73.9 %92/156 = 59.0 %+14.9 pp0.0036 ★★
qa256/108 = 51.9 %18/57 = 31.6 %+20.3 pp0.0095 ★★
qa334/63 = 54.0 %16/39 = 41.0 %+12.9 pp0.20 ns

How to reconcile this with the earlier slope finding: the raw OLS slope on aggregate cell percentages (−9.25 vs −12.75 on qa1) WAS directionally correct but underpowered for qa2/qa3 because vanilla operates near the 17 % random-guess floor. The proper paired test removes that confound. Modulum's First-Not-Lost claim now rests on McNemar's evidence — same prompts, both stacks initially solved them, Modulum keeps the answer 3.5× more often at 128k. This is independent of any slope model. qa3 retention also trends positive (+12.9 pp) but is underpowered (only 26 paired pairs).

Q4_K_M quantization is exact-bit reproducible on multi-fact reasoning prompts.

Q4 inference is widely understood to have small non-determinism from accumulator rounding even at temperature=0. We tested by re-requesting the same 50 prompts (idx 0..49) on the Modulum endpoint after the original phase-3 run, three weeks of operation later. Result:

CellOriginal (phase-3)Re-request (2026-05-19)Drift
qa2 32k28/5028/500 samples · EXACT
qa2 64k26/5026/500 samples · EXACT
qa2 128k25/5025/500 samples · EXACT
qa1 32k45/5046/50+1 sample
qa1 64k42/5044/50+2 samples
qa1 128k36/5038/50+2 samples

qa2 cells are exact-bit deterministic. qa1 cells drift by 1–2 samples on re-request — within sampling noise but not exact-bit. Whatever Q4 rounding non-determinism exists, it cancels out on 2-fact reasoning prompts specifically. This was not predicted. For production routing, it means qa2-style multi-fact queries are repeatable across re-requests; qa1 retrieval queries have small Q4-quantization variance.

Sustained-run accuracy drift: −23 pp end-to-end within a 100-sample run.

−23.2 ppend-to-end drift on Modulum qa1 64k

Surfaced by tercile analysis of phase-1 data — splitting each cell's samples into early / mid / late thirds and measuring accuracy per slice. Modulum qa1 64k accuracy degrades monotonically across 100 sequential calls: 87.9 % → 78.8 % → 64.7 %. Confirmed on qa3 128k across 500 samples (32.5 % → 26.5 % → 22.0 %, two independent runs same direction).

This is a production-blocker. KV cache state accumulation or attention drift over sustained sequential operation. Wasn't in the original report — only emerged when we looked at the within-run distribution. Hypernym engineering needs to diagnose before hyperscaler deployment: the model loses 23 pp of accuracy if you keep sending 128k prompts to it.

Modulum has selective refusal calibration but zero usable logprob signal.

PPL ≈ 1.00on every cell, right OR wrong

From phase-4 logprob capture (N=20 per cell): target_token_logprob ≈ 0.0 and perplexity ≈ 1.0 across every cell, regardless of whether the answer was correct. Modulum commits with the same numerical confidence whether it is correct or hallucinating. Combined with Breakthrough E (sustained-run drift) and Breakthrough A (no refusal mechanism), this means production routing has no signal to detect bad answers from logprobs alone.

This is the most important production R&D gap surfaced by the study. Modulum suppresses hallucination through output truncation but cannot signal uncertainty. A safety-critical deployment can't route uncertain queries to human review because the model doesn't expose an uncertainty score.

Format enforcement breaks at the same context length where hallucination appears.

at 32k–64k

Wrong outputs are canonical short form.

Modulum's wrong-answer median chars: ~45 chars across qa1/qa2/qa3 at 32k and 64k. Outputs follow the canonical "X is in Y" format whether correct or wrong. Pure hallucination rate <5 %.

at 128k

Outputs balloon, fabrication kicks in.

Median wrong-answer chars jump to 80 (max 500). Pure hallucination rate jumps to 12 % on qa1, 31 % on qa2. The same threshold where format breaks is the threshold where fabrication appears — they are the same mechanism unraveling.

Implication: Modulum's structural output enforcement holds at ≤64k context and partially fails at 128k. A platform-level improvement that keeps the format-truncation mechanism functional at 128k+ would compound directly into reduced hallucination at long context. R&D direction.

What only the 7-stack comparison surfaced.

H1 · vs Grok 4.3

Modulum beats Grok by +41.5 pp on qa1 128k (p<0.001).

Grok 4.3 has the steepest qa1 decay slope in the panel at −25 pp / 2× context — by far. Modulum loses 8.75 pp / 2× on the same prompts. Grok degrades 3× faster than Modulum on retrieval as context grows. xAI may want to investigate the architectural threshold between 64k and 128k.

H2 · refusal calibration on frontier

Modulum, GPT-5.5, and Grok 4.3 all refuse 96–100 % on needle-NOT-in-haystack — Modulum leads.

Hallucination probe complete for 4 of 5 stacks. Modulum refused 150/150 (100.0 %) — perfect calibration on absent-entity questions. GPT-5.5: 149/150 (99.3 %). Grok 4.3: 144/150 (96.0 %). Gemini 3.1 Pro hit Google API spending cap at 64k/128k (36/50 refused at 32k only). Vanilla Gemma-4 still queued — the load-bearing comparison: does the same base model refuse at 100 % without Modulum, or is this a platform contribution? Result pending in ~3 h.

H3 · the qa3 32k "regression" was sample noise

Modulum doesn't hurt short-context multi-hop reasoning after all.

At N=50: Modulum 22 % vs vanilla 28 % = −6 pp (p=0.49). The original report flagged this as a possible platform side-effect on short-context temporal reasoning. With N=200 Modulum extension just landed: 31.5 % vs 32 % = −0.5 pp (p=0.95). The regression was a small-sample artifact. Time spent engineering hypotheses about a non-effect.

H4 · Gemini API spending caps masquerade as context failure

The "Gemini fails at 1M context" claim is retracted.

Initial bench reported Gemini 3.1 Pro fails 50/50 at 1M context. Codex audit caught that all 50 failures were HTTP 429 spending-cap errors, not model-context failures. We have no data on Gemini's actual 1M capability. Same issue blocked the hallucination probe's Gemini 64k/128k cells. Cost-budget failure modes look like capability failure modes — easy to misread.

Modulum's failure mode is fundamentally safer than vanilla's.

100 %of Modulum qa3 wrong answers stay within evidence · vs vanilla 42 % long-fabrication on qa3 32k

When Modulum gets a BABILong question wrong, it picks a location that is actually in the story — just the wrong one for this question. Vanilla on the same Gemma-4-31B-Q4 base produces long fabricated narratives copying PG19 distractor text in up to 42 % of wrong qa3 32k cases, and malformed hedges with no location commitment in 33–58 % of wrong qa2 cases. The platform layer converts dangerous failure modes into safe ones.

CellModulum within-evidenceVanilla within-evidenceVanilla long-fabricationVanilla malformed
qa1 32k100 %73 %5 %23 %
qa1 128k72 %49 %5 %40 %
qa2 128k69 %33 %8 %58 %
qa3 32k100 %47 %42 %7 %
qa3 64k100 %56 %25 %18 %
qa3 128k100 %72 %19 %8 %

This is the mechanism finding that ties the package together. A wrong-but-grounded answer is the safest possible production failure mode — easy to detect against the document, correctable via verifier loop, doesn't mislead downstream tooling. Vanilla's "long hedge" or "narrative fabrication" failures are operationally dangerous. Same base weights, different inference stack, fundamentally different failure economics.

The three load-bearing breakthroughs.

Of the eight breakthroughs above, three carry the entire procurement-relevant story for hyperscalers and enterprise partners evaluating Modulum for production deployment:

#BreakthroughProcurement axis
A1Fabrication suppression 4.7× vs same base weights via structural output truncationSafety / hallucination-risk axis
B1+17 to +22 % decode speedup on qa3 multi-fact reasoning at zero accuracy costCost-per-token / throughput axis
C1Best qa3 decay slope in panel (−2.5 pp / 2× ctx), workstation-scale base modelLong-context preservation / footprint axis

All three are verifiable from canonical SQLite + CSV that a hyperscaler data team can audit independently. The mechanism (truncation, not refusal) is observable in output text. The slope (qa3) is reproducible across 500-sample cells with ±3.9 pp Wilson half-width.

Two production-blockers remain open and should be raised with Hypernym engineering before any partner deployment: E1 sustained-run drift (−23 pp end-to-end within a single 100-sample run) and F1 zero uncertainty calibration (PPL ≈ 1.0 regardless of correctness). Both are platform-side fixes, not base-model issues.