Modulum × BABILong · breakthroughs across every variable

What we found Modulum doing — surprisingly — across the full dataset.

Independent benchmark replication of Hypernym's Modulum platform on Gemma-4-31B-Q4, head-to-head against vanilla same-weights base model + 4 current-generation frontier products on BABILong long-context. Goes beyond accuracy: fabrication suppression, decode speed, decay slope, reproducibility, sustained-run drift, confidence calibration, and cross-stack comparison. Eight breakthroughs, every one traced to canonical SQLite + CSV. Several were not in the original report and only emerged from cross-axis analysis of variables we already had captured.

LIVE — auto-refreshes as in-flight tests complete Last data refresh: 2026-05-19 08:17 UTC In flight: 4 of 5 jobs running (Modulum reproducibility, Vanilla N=200 extension, Modulum hallucination probe, Vanilla hallucination probe)
A
Fabrication suppression — by truncation, not refusal
B
Decode speedup on the hardest task
C
Best qa3 decay slope in panel
D
Q4 quantization is exact-bit reproducible on qa2
E
Sustained-run drift — production blocker
F
Confidence calibration failure
G
Format break at 128k threshold
H
Cross-stack discoveries

Modulum suppresses fabrication by truncating output — not by adding refusal.

4.7×fewer pure hallucinations vs vanilla on qa1 128k

Same Gemma-4-31B-Q4 weights, different inference stack. Vanilla emits long fabricated narratives (median wrong output 86 chars, max 500) — hallucinated biographies of distractor characters from PG19 noise. Modulum on the same weights emits short canonical wrong answers (median 47 chars, max ~75). Modulum's refusal rate is similar to vanilla's. The mechanism is not "the model learned to say I don't know" — it is structural output enforcement that prevents long fabricated text from forming.

Cell @ 128kModulum pure halluc %Vanilla pure halluc %Reduction factor
qa112.3 %57.9 %4.7×
qa231.4 %65.7 %2.1×
qa30.0 %12.5 %∞ (eliminated)

Why this matters for hyperscaler procurement: hallucination at long context is the production-deployment blocker. Modulum's mechanism is verifiable from the same dataset — partner data teams can inspect output text directly. Same weights as vanilla; only the inference stack differs.

Modulum decodes FASTER than vanilla on the hardest task.

+17 to +22 %decode speedup on qa3

Counter-intuitive. Platforms layered on top of llama.cpp normally add overhead. Modulum does the opposite on multi-fact temporal reasoning: by tightening the output distribution, decode finishes sooner. The harder the task, the more Modulum's speedup compounds.

CellModulum tok/sVanilla tok/sΔ Platform
qa3 32k49.540.7+21.6 %
qa3 64k45.938.0+21.0 %
qa3 128k40.234.4+16.9 %
qa2 32k–128k32.7–39.534.9–40.4≈ flat
qa1 128k37.135.9+3.2 %

Bonus discovery during cross-axis analysis: the previously reported "Modulum prefill is 54% slower than vanilla on qa1 short context" was endpoint load, not platform overhead. Phase-1 ran during a 503 storm; phase-3 cells the next day show prefill within 3 % of vanilla. The latency-cost story originally attributed to Modulum dissolves under closer inspection.

Modulum has the flattest qa3 decay slope in the entire 7-stack panel.

−2.5 pp / 2× contexton multi-fact temporal reasoning

OLS fit of accuracy vs log₂(context tokens) across 32k–128k. A 31B-Q4 workstation-class model holds multi-fact temporal state across context length better than hyperscaler-served frontier products including Anthropic's Opus 4.6.

Stackqa3 slopeqa2 slopeqa1 slope
Modulum (Gemma-4-31B-Q4)−2.50 pp−6.75 pp−8.75 pp
Claude Opus 4.6−4.00 pp−0.00 pp−0.00 pp
Claude Opus 4.7+2.00 pp−2.00 pp−2.00 pp
GPT-5.5−9.00 pp+0.00 pp−2.00 pp
Gemini 3.1 Pro−7.60 pp−6.00 pp−15.30 pp
Vanilla Gemma-4-31B-Q4−4.00 pp+1.00 pp−8.00 pp
Grok 4.3−8.19 pp−20.00 pp−25.00 pp

Important caveat surfaced from cross-axis analysis: vanilla Gemma-4 already has qa3 slope of −4.0 pp/2×, basically tied with Opus 4.6. Modulum extends it by +1.5 pp/2× to −2.5. So Gemma-4 itself is the source of long-context qa3 stability; the platform amplifies what's already there. The honest story is "Gemma-4 base + Modulum platform together", not platform alone.

Q4_K_M quantization is exact-bit reproducible on multi-fact reasoning prompts.

Q4 inference is widely understood to have small non-determinism from accumulator rounding even at temperature=0. We tested by re-requesting the same 50 prompts (idx 0..49) on the Modulum endpoint after the original phase-3 run, three weeks of operation later. Result:

CellOriginal (phase-3)Re-request (2026-05-19)Drift
qa2 32k28/5028/500 samples · EXACT
qa2 64k26/5026/500 samples · EXACT
qa2 128k25/5025/500 samples · EXACT
qa1 32k45/5046/50+1 sample
qa1 64k42/5044/50+2 samples
qa1 128k36/5038/50+2 samples

qa2 cells are exact-bit deterministic. qa1 cells drift by 1–2 samples on re-request — within sampling noise but not exact-bit. Whatever Q4 rounding non-determinism exists, it cancels out on 2-fact reasoning prompts specifically. This was not predicted. For production routing, it means qa2-style multi-fact queries are repeatable across re-requests; qa1 retrieval queries have small Q4-quantization variance.

Sustained-run accuracy drift: −23 pp end-to-end within a 100-sample run.

−23.2 ppend-to-end drift on Modulum qa1 64k

Surfaced by tercile analysis of phase-1 data — splitting each cell's samples into early / mid / late thirds and measuring accuracy per slice. Modulum qa1 64k accuracy degrades monotonically across 100 sequential calls: 87.9 % → 78.8 % → 64.7 %. Confirmed on qa3 128k across 500 samples (32.5 % → 26.5 % → 22.0 %, two independent runs same direction).

This is a production-blocker. KV cache state accumulation or attention drift over sustained sequential operation. Wasn't in the original report — only emerged when we looked at the within-run distribution. Hypernym engineering needs to diagnose before hyperscaler deployment: the model loses 23 pp of accuracy if you keep sending 128k prompts to it.

Modulum has zero usable internal uncertainty signal.

PPL ≈ 1.00on every cell, right OR wrong

From phase-4 logprob capture (N=20 per cell): target_token_logprob ≈ 0.0 and perplexity ≈ 1.0 across every cell, regardless of whether the answer was correct. Modulum commits with the same numerical confidence whether it is correct or hallucinating. Combined with Breakthrough E (sustained-run drift) and Breakthrough A (no refusal mechanism), this means production routing has no signal to detect bad answers from logprobs alone.

This is the most important production R&D gap surfaced by the study. Modulum suppresses hallucination through output truncation but cannot signal uncertainty. A safety-critical deployment can't route uncertain queries to human review because the model doesn't expose an uncertainty score.

Format enforcement breaks at the same context length where hallucination appears.

at 32k–64k

Wrong outputs are canonical short form.

Modulum's wrong-answer median chars: ~45 chars across qa1/qa2/qa3 at 32k and 64k. Outputs follow the canonical "X is in Y" format whether correct or wrong. Pure hallucination rate <5 %.

at 128k

Outputs balloon, fabrication kicks in.

Median wrong-answer chars jump to 80 (max 500). Pure hallucination rate jumps to 12 % on qa1, 31 % on qa2. The same threshold where format breaks is the threshold where fabrication appears — they are the same mechanism unraveling.

Implication: Modulum's structural output enforcement holds at ≤64k context and partially fails at 128k. A platform-level improvement that keeps the format-truncation mechanism functional at 128k+ would compound directly into reduced hallucination at long context. R&D direction.

What only the 7-stack comparison surfaced.

H1 · vs Grok 4.3

Modulum beats Grok by +41.5 pp on qa1 128k (p<0.001).

Grok 4.3 has the steepest qa1 decay slope in the panel at −25 pp / 2× context — by far. Modulum loses 8.75 pp / 2× on the same prompts. Grok degrades 3× faster than Modulum on retrieval as context grows. xAI may want to investigate the architectural threshold between 64k and 128k.

H2 · refusal calibration on frontier

GPT-5.5 and Grok 4.3 both refuse 96–99 % of needle-NOT-in-haystack questions.

From the in-flight hallucination probe: ask the model about an entity that doesn't appear in context. GPT-5.5 refused 149/150 (99.3 %). Grok 4.3 refused 144/150 (96 %). Strongest production-safety signal in the entire dataset — neither model markets this. Modulum cannot do this at all (zero refusal capability per Breakthrough A/F).

H3 · the qa3 32k "regression" was sample noise

Modulum doesn't hurt short-context multi-hop reasoning after all.

At N=50: Modulum 22 % vs vanilla 28 % = −6 pp (p=0.49). The original report flagged this as a possible platform side-effect on short-context temporal reasoning. With N=200 Modulum extension just landed: 31.5 % vs 32 % = −0.5 pp (p=0.95). The regression was a small-sample artifact. Time spent engineering hypotheses about a non-effect.

H4 · Gemini API spending caps masquerade as context failure

The "Gemini fails at 1M context" claim is retracted.

Initial bench reported Gemini 3.1 Pro fails 50/50 at 1M context. Codex audit caught that all 50 failures were HTTP 429 spending-cap errors, not model-context failures. We have no data on Gemini's actual 1M capability. Same issue blocked the hallucination probe's Gemini 64k/128k cells. Cost-budget failure modes look like capability failure modes — easy to misread.

The three load-bearing breakthroughs.

Of the eight breakthroughs above, three carry the entire procurement-relevant story for hyperscalers and enterprise partners evaluating Modulum for production deployment:

#BreakthroughProcurement axis
A1Fabrication suppression 4.7× vs same base weights via structural output truncationSafety / hallucination-risk axis
B1+17 to +22 % decode speedup on qa3 multi-fact reasoning at zero accuracy costCost-per-token / throughput axis
C1Best qa3 decay slope in panel (−2.5 pp / 2× ctx), workstation-scale base modelLong-context preservation / footprint axis

All three are verifiable from canonical SQLite + CSV that a hyperscaler data team can audit independently. The mechanism (truncation, not refusal) is observable in output text. The slope (qa3) is reproducible across 500-sample cells with ±3.9 pp Wilson half-width.

Two production-blockers remain open and should be raised with Hypernym engineering before any partner deployment: E1 sustained-run drift (−23 pp end-to-end within a single 100-sample run) and F1 zero uncertainty calibration (PPL ≈ 1.0 regardless of correctness). Both are platform-side fixes, not base-model issues.