Independent benchmark replication of Hypernym's Modulum platform on Gemma-4-31B-Q4, head-to-head against vanilla same-weights base model + 4 current-generation frontier products on BABILong long-context. Goes beyond accuracy: fabrication suppression, decode speed, decay slope, reproducibility, sustained-run drift, confidence calibration, and cross-stack comparison. Eight breakthroughs, every one traced to canonical SQLite + CSV. Several were not in the original report and only emerged from cross-axis analysis of variables we already had captured.
Same Gemma-4-31B-Q4 weights, different inference stack. Vanilla emits long fabricated narratives (median wrong output 86 chars, max 500): hallucinated biographies of distractor characters from PG19 noise. Modulum on the same weights emits short canonical wrong answers (median 47 chars, max ~75). Modulum's refusal rate on the BABILong tasks is similar to vanilla: when the answer is in context but the model can't locate it, Modulum commits to a short canonical wrong answer rather than refusing. However, the needle-NOT-in-haystack probe (just landed 2026-05-20) shows Modulum refuses 100 % (150/150) when the entity is clearly absent from context, effectively tying GPT-5.5 (99.3 %, one sample apart) and beating Grok 4.3 (96 %). The mechanism is therefore selective: structural output truncation suppresses fabrication on hard retrieval, and refusal kicks in on clear absent-entity questions.
| Cell @ 128k | Modulum pure halluc % | Vanilla pure halluc % | Reduction factor |
|---|---|---|---|
| qa1 | 12.3 % | 57.9 % | 4.7× |
| qa2 | 31.4 % | 65.7 % | 2.1× |
| qa3 | 0.0 % | 12.5 % | ∞ (eliminated) |
Why this matters for hyperscaler procurement: hallucination at long context is the production-deployment blocker. Modulum's mechanism is verifiable from the same dataset — partner data teams can inspect output text directly. Same weights as vanilla; only the inference stack differs.
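The reduction factors in the table can be rechecked from the two rate columns. A minimal sketch, assuming the factor is simply the vanilla rate divided by the Modulum rate (the helper name `reduction_factor` is ours, not part of the benchmark tooling):

```python
# Pure-hallucination rates at 128k, copied from the table above (percent).
rates = {
    "qa1": (12.3, 57.9),
    "qa2": (31.4, 65.7),
    "qa3": (0.0, 12.5),
}

def reduction_factor(modulum_pct: float, vanilla_pct: float) -> float:
    """Vanilla rate divided by Modulum rate; infinite when Modulum's rate is zero."""
    if modulum_pct == 0.0:
        return float("inf")
    return vanilla_pct / modulum_pct

factors = {task: reduction_factor(m, v) for task, (m, v) in rates.items()}
for task, factor in factors.items():
    print(f"{task}: {factor:.1f}x")
```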
Counter-intuitive. Platforms layered on top of llama.cpp normally add overhead. Modulum does the opposite on multi-fact temporal reasoning: by tightening the output distribution, decode finishes sooner. The harder the task, the more Modulum's speedup compounds.
| Cell | Modulum tok/s | Vanilla tok/s | Δ Platform |
|---|---|---|---|
| qa3 32k | 49.5 | 40.7 | +21.6 % |
| qa3 64k | 45.9 | 38.0 | +21.0 % |
| qa3 128k | 40.2 | 34.4 | +16.9 % |
| qa2 32k–128k | 32.7–39.5 | 34.9–40.4 | ≈ flat |
| qa1 128k | 37.1 | 35.9 | +3.2 % |
Bonus discovery during cross-axis analysis: the previously reported "Modulum prefill is 54% slower than vanilla on qa1 short context" was endpoint load, not platform overhead. Phase-1 ran during a 503 storm; phase-3 cells the next day show prefill within 3 % of vanilla. The latency-cost story originally attributed to Modulum dissolves under closer inspection.
Statistical correction from earlier analysis — the original "Modulum has flattest qa3 decay slope" framing was tested by Codex + Grok and found to suffer from floor saturation (vanilla qa2/qa3 at 128k sits within ~6 pp of the 17 % random-guess floor, so its slope is mechanically flatter than its true decay rate). Four follow-up statistical tests were run on the existing data:
Chance-corrected accuracy, (acc − 1/6) / (5/6), preserves the Modulum advantage at every cell; the qa3 Δ changes from +2.0 raw pp to +4.1 cc pp.

On the same Gemma-4-31B-Q4 base weights and the same exact prompts, Modulum retains its 32k-correct answers at 128k 3.5× more often than vanilla on qa1 (78 % vs 22 % of discordant pairs, McNemar's p=0.0003) and about 5× more often on qa2 (84 % vs 16 %, p=0.006). This is the cleanest test of "First, Not Lost" available in this data: it removes both per-sample difficulty and floor saturation by conditioning on samples both stacks initially solved.
| Task | Both right @32k | Modulum kept only | Vanilla kept only | χ² | p-value |
|---|---|---|---|---|---|
| qa1 — single-fact retrieval | 152 | 35 (78 %) | 10 (22 %) | 12.80 | 0.0003 ★★★ |
| qa2 — 2-fact reasoning | 49 | 16 (84 %) | 3 (16 %) | 7.58 | 0.0059 ★★ |
| qa3 — 3-fact temporal | 26 | 3 | 1 | 0.25 | 0.62 ns (N too small) |
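The χ² and p-value columns are consistent with McNemar's test with continuity correction applied to the two discordant counts. A minimal sketch under that assumption, using only the standard library (the function name `mcnemar` is ours; the p-value uses the fact that a χ² variable with one degree of freedom is a squared standard normal):

```python
import math

def mcnemar(b: int, c: int) -> tuple[float, float]:
    """McNemar chi-square with continuity correction and its p-value (1 d.o.f.)."""
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# (Modulum kept only, Vanilla kept only) from the table above.
discordant = {"qa1": (35, 10), "qa2": (16, 3), "qa3": (3, 1)}
results = {task: mcnemar(b, c) for task, (b, c) in discordant.items()}
for task, (chi2, p_value) in results.items():
    print(f"{task}: chi2={chi2:.2f}, p={p_value:.4f}")
```

Only the discordant pairs enter the statistic, which is why the qa3 row, with four discordant pairs, cannot reach significance regardless of direction.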
Per-stack retention rate (of samples got right at 32k, fraction still right at 128k):
| Task | Modulum retention | Vanilla retention | Δ pp | p-value |
|---|---|---|---|---|
| qa1 | 133/180 = 73.9 % | 92/156 = 59.0 % | +14.9 pp | 0.0036 ★★ |
| qa2 | 56/108 = 51.9 % | 18/57 = 31.6 % | +20.3 pp | 0.0095 ★★ |
| qa3 | 34/63 = 54.0 % | 16/39 = 41.0 % | +12.9 pp | 0.20 ns |
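The report does not say which test produced the retention p-values. An unpooled (Wald) two-proportion z-test appears to reproduce them to within rounding, so the sketch below uses that as an assumption; the function name `two_proportion_z` is ours.

```python
import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float, float]:
    """Unpooled two-proportion z-test: returns (difference, z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p1 - p2, z, p_value

# (Modulum kept, Modulum right@32k, Vanilla kept, Vanilla right@32k) from the table above.
retention = {
    "qa1": (133, 180, 92, 156),
    "qa2": (56, 108, 18, 57),
    "qa3": (34, 63, 16, 39),
}
tests = {task: two_proportion_z(*counts) for task, counts in retention.items()}
for task, (diff, z, p_value) in tests.items():
    print(f"{task}: delta={100 * diff:+.1f} pp, z={z:.2f}, p={p_value:.4f}")
```

Note that this comparison is unpaired: each stack's denominator is its own set of 32k-correct samples, so it complements the McNemar table rather than replacing it.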
How to reconcile this with the earlier slope finding: the raw OLS slope on aggregate cell percentages (−9.25 vs −12.75 on qa1) was directionally correct but unreliable for qa2/qa3, because vanilla operates near the 17 % random-guess floor there. The paired test removes that confound. Modulum's First-Not-Lost claim now rests on the McNemar evidence: on the same prompts that both stacks initially solved, Modulum alone keeps the answer at 128k 3.5× as often as vanilla alone on qa1. This is independent of any slope model. qa3 retention also trends positive (+12.9 pp) but is underpowered (only 26 jointly solved pairs, 4 of them discordant).
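For readers who want to see how a per-doubling slope and the chance correction (acc − 1/6) / (5/6) interact, here is a minimal sketch. It assumes contexts double at each step (32k, 64k, 128k) and uses the Modulum qa1 counts from the original phase-3 column of the re-request table in this report. Those are 50-prompt cells, so the result will not match the −9.25 computed on the full data. Function names are ours.

```python
CHANCE = 1 / 6  # random-guess floor quoted in the text (about 17 %)

def chance_corrected(acc: float, chance: float = CHANCE) -> float:
    """Rescale accuracy so chance level maps to 0 and perfect accuracy to 1."""
    return (acc - chance) / (1 - chance)

def slope_per_doubling(values: list[float]) -> float:
    """OLS slope of values against log2(context) for contexts that double each step."""
    xs = list(range(len(values)))  # 32k, 64k, 128k -> 0, 1, 2
    x_mean = sum(xs) / len(xs)
    y_mean = sum(values) / len(values)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

# Modulum qa1, correct out of 50 at 32k, 64k, 128k (50-prompt subset).
raw_acc = [45 / 50, 42 / 50, 36 / 50]
cc_acc = [chance_corrected(a) for a in raw_acc]

raw_slope_pp = slope_per_doubling([100 * a for a in raw_acc])
cc_slope_pp = slope_per_doubling([100 * a for a in cc_acc])
print(f"raw slope: {raw_slope_pp:.2f} pp per doubling")
print(f"chance-corrected slope: {cc_slope_pp:.2f} pp per doubling")
```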
Q4 inference is widely understood to have small non-determinism from accumulator rounding even at temperature=0. We tested this by re-requesting the same 50 prompts (idx 0..49) on the Modulum endpoint three weeks of operation after the original phase-3 run. Result:
| Cell | Original (phase-3) | Re-request (2026-05-19) | Drift |
|---|---|---|---|
| qa2 32k | 28/50 | 28/50 | 0 samples · EXACT |
| qa2 64k | 26/50 | 26/50 | 0 samples · EXACT |
| qa2 128k | 25/50 | 25/50 | 0 samples · EXACT |
| qa1 32k | 45/50 | 46/50 | +1 sample |
| qa1 64k | 42/50 | 44/50 | +2 samples |
| qa1 128k | 36/50 | 38/50 | +2 samples |
qa2 cells reproduce exactly: 0 samples of drift in all three cells. qa1 cells drift by 1–2 samples on re-request, which is within sampling noise but not an exact reproduction. Whatever Q4 rounding non-determinism exists, it changed no outcome on the 2-fact reasoning prompts. This was not predicted. For production routing, it suggests qa2-style multi-fact queries are repeatable across re-requests, while qa1 retrieval queries show small run-to-run variance, presumably from Q4 quantization.
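The drift column reports the change in correct counts. A per-sample diff is a stricter check, since equal counts could in principle hide offsetting flips or changed output text. A minimal sketch, assuming each run is available as a mapping from sample index to (output text, correctness); the function name and the toy data are hypothetical and not taken from the benchmark.

```python
def compare_runs(original: dict, rerun: dict) -> dict:
    """Classify each shared sample index by how its result changed between two runs.

    Both arguments map sample index -> (output_text, is_correct).
    """
    summary = {"identical": [], "text_changed": [], "verdict_flipped": []}
    for idx in sorted(original.keys() & rerun.keys()):
        text_a, ok_a = original[idx]
        text_b, ok_b = rerun[idx]
        if ok_a != ok_b:
            summary["verdict_flipped"].append(idx)
        elif text_a != text_b:
            summary["text_changed"].append(idx)
        else:
            summary["identical"].append(idx)
    return summary

# Toy data, invented for illustration only.
original = {0: ("kitchen", True), 1: ("garden", False), 2: ("hallway", True), 3: ("office", True)}
rerun = {0: ("kitchen", True), 1: ("bedroom", True), 2: ("hallway.", True), 3: ("garden", False)}

summary = compare_runs(original, rerun)
net_drift = sum(ok for _, ok in rerun.values()) - sum(ok for _, ok in original.values())
print(summary, "net drift:", net_drift)
```

In the toy data the net drift is zero even though two verdicts flipped and one output changed text, which is the case a count-only comparison cannot see.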
Surfaced by tercile analysis of phase-1 data — splitting each cell's samples into early / mid / late thirds and measuring accuracy per slice. Modulum qa1 64k accuracy degrades monotonically across 100 sequential calls: 87.9 % → 78.8 % → 64.7 %. Confirmed on qa3 128k across 500 samples (32.5 % → 26.5 % → 22.0 %, two independent runs same direction).
This is a production-blocker. Candidate causes are KV cache state accumulation or attention drift over sustained sequential operation. It wasn't in the original report and only emerged when we looked at the within-run distribution. Hypernym engineering needs to diagnose this before hyperscaler deployment: the model loses 23 pp of accuracy between the first and last third of 100 sequential qa1 64k calls, and 10.5 pp across 500 qa3 128k samples.
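The tercile analysis can be sketched as follows, assuming each cell's results are available as a list of per-call correctness flags in call order. The counts used below (29/33, 26/33, 22/34) are one set consistent with the reported qa1 64k percentages; they are our reconstruction, and the real per-call sequence lives in the canonical SQLite data.

```python
def tercile_accuracy(outcomes: list[bool]) -> list[float]:
    """Accuracy (percent) over the early, mid, and late thirds of a sequential run."""
    n = len(outcomes)
    if n < 3:
        raise ValueError("need at least 3 samples to form terciles")
    bounds = [i * n // 3 for i in range(4)]  # e.g. n=100 -> 0, 33, 66, 100
    return [
        100 * sum(outcomes[lo:hi]) / (hi - lo)
        for lo, hi in zip(bounds, bounds[1:])
    ]

# Reconstructed flags: order within each third is arbitrary and does not matter here.
outcomes = (
    [True] * 29 + [False] * 4      # early third: 29/33
    + [True] * 26 + [False] * 7    # mid third:   26/33
    + [True] * 22 + [False] * 12   # late third:  22/34
)
terciles = tercile_accuracy(outcomes)
print([round(t, 1) for t in terciles])
```

Comparing the first and last entries gives the end-to-end drift; running the same function on a shuffled copy of the flags is a quick way to see how much of the trend could arise by chance.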
From phase-4 logprob capture (N=20 per cell): target_token_logprob ≈ 0.0 and perplexity ≈ 1.0 across every cell, regardless of whether the answer was correct. Modulum commits with the same numerical confidence whether it is correct or hallucinating. Combined with Breakthrough E (sustained-run drift) and Breakthrough A (no refusal when the answer is in context but not located), this means production routing cannot detect bad answers from logprobs alone.
This is the most important production R&D gap surfaced by the study. Modulum suppresses hallucination through output truncation but cannot signal uncertainty. A safety-critical deployment can't route uncertain queries to human review because the model doesn't expose an uncertainty score.
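The two reported quantities are tied together by the standard definition of perplexity as the exponential of the negative mean token log-probability, so a logprob near 0.0 forces a perplexity near 1.0. A minimal sketch of that relation (the function name and the example values are illustrative, not captured data):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean natural-log probability of the emitted tokens)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

confident = perplexity([-0.0001, -0.0002, 0.0])  # logprobs near zero
uncertain = perplexity([-0.7, -1.2])             # what a hesitant answer might look like
print(f"confident: {confident:.4f}, uncertain: {uncertain:.4f}")
```

A router that thresholds on perplexity needs the second kind of value to appear on wrong answers; the phase-4 capture suggests it does not.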
At 32k and 64k, Modulum's median wrong-answer length is ~45 chars across qa1/qa2/qa3. Outputs follow the canonical "X is in Y" format whether correct or wrong. Pure hallucination rate <5 %.
At 128k, median wrong-answer length jumps to 80 chars (max 500). Pure hallucination rate jumps to 12 % on qa1 and 31 % on qa2. The threshold where the format breaks is the same threshold where fabrication appears: they are the same mechanism unraveling.
Implication: Modulum's structural output enforcement holds at ≤64k context and partially fails at 128k. A platform-level improvement that keeps the format-truncation mechanism functional at 128k+ would compound directly into reduced hallucination at long context. R&D direction.
Grok 4.3 has by far the steepest qa1 decay slope in the panel at −25 pp / 2× context. Modulum loses 8.75 pp / 2× on the same prompts, so Grok degrades roughly 3× faster than Modulum on retrieval as context grows. xAI may want to investigate the architectural threshold between 64k and 128k.
Hallucination probe has results for 4 of 5 stacks, one of them partial. Modulum refused 150/150 (100.0 %): perfect refusal on absent-entity questions. GPT-5.5: 149/150 (99.3 %). Grok 4.3: 144/150 (96.0 %). Gemini 3.1 Pro hit the Google API spending cap at 64k/128k, so only its 32k cell completed (36/50 refused). Vanilla Gemma-4 is still queued, and it is the load-bearing comparison: does the same base model refuse at 100 % without Modulum, or is this a platform contribution? Result pending in ~3 h.
At N=50: Modulum 22 % vs vanilla 28 % = −6 pp (p=0.49). The original report flagged this as a possible platform side-effect on short-context temporal reasoning. With the N=200 Modulum extension that just landed: 31.5 % vs 32 % = −0.5 pp (p=0.95). The regression was a small-sample artifact, and the time spent engineering hypotheses about it was spent on a non-effect.
Initial bench reported Gemini 3.1 Pro fails 50/50 at 1M context. Codex audit caught that all 50 failures were HTTP 429 spending-cap errors, not model-context failures. We have no data on Gemini's actual 1M capability. Same issue blocked the hallucination probe's Gemini 64k/128k cells. Cost-budget failure modes look like capability failure modes — easy to misread.
When Modulum gets a BABILong question wrong, it usually picks a location that is actually in the story, just the wrong one for this question: 100 % of its wrong answers on qa1 32k and on every qa3 cell, falling to 69–72 % on qa1/qa2 at 128k. Vanilla on the same Gemma-4-31B-Q4 base produces long fabricated narratives copying PG19 distractor text in up to 42 % of wrong qa3 cases (at 32k), and malformed hedges with no location commitment in 33–58 % of wrong qa2 cases. The platform layer converts dangerous failure modes into safe ones.
| Cell | Modulum within-evidence | Vanilla within-evidence | Vanilla long-fabrication | Vanilla malformed |
|---|---|---|---|---|
| qa1 32k | 100 % | 73 % | 5 % | 23 % |
| qa1 128k | 72 % | 49 % | 5 % | 40 % |
| qa2 128k | 69 % | 33 % | 8 % | 58 % |
| qa3 32k | 100 % | 47 % | 42 % | 7 % |
| qa3 64k | 100 % | 56 % | 25 % | 18 % |
| qa3 128k | 100 % | 72 % | 19 % | 8 % |
This is the mechanism finding that ties the package together. A wrong-but-grounded answer is the safest possible production failure mode — easy to detect against the document, correctable via verifier loop, doesn't mislead downstream tooling. Vanilla's "long hedge" or "narrative fabrication" failures are operationally dangerous. Same base weights, different inference stack, fundamentally different failure economics.
Of the eight breakthroughs above, three carry the entire procurement-relevant story for hyperscalers and enterprise partners evaluating Modulum for production deployment:
| # | Breakthrough | Procurement axis |
|---|---|---|
| A1 | Fabrication suppression 4.7× vs same base weights via structural output truncation | Safety / hallucination-risk axis |
| B1 | +17 to +22 % decode speedup on qa3 multi-fact reasoning at zero accuracy cost | Cost-per-token / throughput axis |
| C1 | Best qa3 decay slope in panel (−2.5 pp / 2× ctx), workstation-scale base model | Long-context preservation / footprint axis |
All three are verifiable from canonical SQLite + CSV that a hyperscaler data team can audit independently. The mechanism (truncation, not refusal) is observable in output text. The slope (qa3) is reproducible across 500-sample cells with ±3.9 pp Wilson half-width.
Two production-blockers remain open and should be raised with Hypernym engineering before any partner deployment: E1 sustained-run drift (−23 pp end-to-end within a single 100-sample run) and F1 zero uncertainty calibration (PPL ≈ 1.0 regardless of correctness). Both are platform-side fixes, not base-model issues.
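The ±3.9 pp Wilson half-width quoted for 500-sample cells can be sanity-checked with the standard Wilson score interval. The sketch below assumes a 95 % interval (z = 1.96) and an accuracy of about 27 %, roughly the average of the three qa3 128k terciles reported earlier. The report does not state which cell the figure refers to, so both are assumptions, and the function name is ours.

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

low, high = wilson_interval(135, 500)  # 135/500 = 27 %, an assumed accuracy
half_width_pp = 100 * (high - low) / 2
print(f"interval: {100 * low:.1f} % to {100 * high:.1f} %, half-width {half_width_pp:.1f} pp")
```

The half-width widens toward 50 % accuracy and narrows toward the extremes, so cells with different accuracies at the same N will not share exactly the same figure.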