Independent benchmark replication of Hypernym's Modulum platform on Gemma-4-31B-Q4, head-to-head against vanilla same-weights base model + 4 current-generation frontier products on BABILong long-context. Goes beyond accuracy: fabrication suppression, decode speed, decay slope, reproducibility, sustained-run drift, confidence calibration, and cross-stack comparison. Eight breakthroughs, every one traced to canonical SQLite + CSV. Several were not in the original report and only emerged from cross-axis analysis of variables we already had captured.
Same Gemma-4-31B-Q4 weights, different inference stack. Vanilla emits long fabricated narratives (median wrong output 86 chars, max 500) — hallucinated biographies of distractor characters from PG19 noise. Modulum on the same weights emits short canonical wrong answers (median 47 chars, max ~75). Modulum's refusal rate is similar to vanilla's. The mechanism is not "the model learned to say I don't know" — it is structural output enforcement that prevents long fabricated text from forming.
| Cell @ 128k | Modulum pure halluc % | Vanilla pure halluc % | Reduction factor |
|---|---|---|---|
| qa1 | 12.3 % | 57.9 % | 4.7× |
| qa2 | 31.4 % | 65.7 % | 2.1× |
| qa3 | 0.0 % | 12.5 % | ∞ (eliminated) |
Why this matters for hyperscaler procurement: hallucination at long context is the production-deployment blocker. Modulum's mechanism is verifiable from the same dataset — partner data teams can inspect output text directly. Same weights as vanilla; only the inference stack differs.
Counter-intuitive. Platforms layered on top of llama.cpp normally add overhead. Modulum does the opposite on multi-fact temporal reasoning: by tightening the output distribution, decode finishes sooner. The harder the task, the more Modulum's speedup compounds.
| Cell | Modulum tok/s | Vanilla tok/s | Δ Platform |
|---|---|---|---|
| qa3 32k | 49.5 | 40.7 | +21.6 % |
| qa3 64k | 45.9 | 38.0 | +21.0 % |
| qa3 128k | 40.2 | 34.4 | +16.9 % |
| qa2 32k–128k | 32.7–39.5 | 34.9–40.4 | ≈ flat |
| qa1 128k | 37.1 | 35.9 | +3.2 % |
Bonus discovery during cross-axis analysis: the previously reported "Modulum prefill is 54% slower than vanilla on qa1 short context" was endpoint load, not platform overhead. Phase-1 ran during a 503 storm; phase-3 cells the next day show prefill within 3 % of vanilla. The latency-cost story originally attributed to Modulum dissolves under closer inspection.
OLS fit of accuracy vs log₂(context tokens) across 32k–128k. A 31B-Q4 workstation-class model holds multi-fact temporal state across context length better than hyperscaler-served frontier products including Anthropic's Opus 4.6.
| Stack | qa3 slope | qa2 slope | qa1 slope |
|---|---|---|---|
| Modulum (Gemma-4-31B-Q4) | −2.50 pp | −6.75 pp | −8.75 pp |
| Claude Opus 4.6 | −4.00 pp | −0.00 pp | −0.00 pp |
| Claude Opus 4.7 | +2.00 pp | −2.00 pp | −2.00 pp |
| GPT-5.5 | −9.00 pp | +0.00 pp | −2.00 pp |
| Gemini 3.1 Pro | −7.60 pp | −6.00 pp | −15.30 pp |
| Vanilla Gemma-4-31B-Q4 | −4.00 pp | +1.00 pp | −8.00 pp |
| Grok 4.3 | −8.19 pp | −20.00 pp | −25.00 pp |
Important caveat surfaced from cross-axis analysis: vanilla Gemma-4 already has qa3 slope of −4.0 pp/2×, basically tied with Opus 4.6. Modulum extends it by +1.5 pp/2× to −2.5. So Gemma-4 itself is the source of long-context qa3 stability; the platform amplifies what's already there. The honest story is "Gemma-4 base + Modulum platform together", not platform alone.
Q4 inference is widely understood to have small non-determinism from accumulator rounding even at temperature=0. We tested by re-requesting the same 50 prompts (idx 0..49) on the Modulum endpoint after the original phase-3 run, three weeks of operation later. Result:
| Cell | Original (phase-3) | Re-request (2026-05-19) | Drift |
|---|---|---|---|
| qa2 32k | 28/50 | 28/50 | 0 samples · EXACT |
| qa2 64k | 26/50 | 26/50 | 0 samples · EXACT |
| qa2 128k | 25/50 | 25/50 | 0 samples · EXACT |
| qa1 32k | 45/50 | 46/50 | +1 sample |
| qa1 64k | 42/50 | 44/50 | +2 samples |
| qa1 128k | 36/50 | 38/50 | +2 samples |
qa2 cells are exact-bit deterministic. qa1 cells drift by 1–2 samples on re-request — within sampling noise but not exact-bit. Whatever Q4 rounding non-determinism exists, it cancels out on 2-fact reasoning prompts specifically. This was not predicted. For production routing, it means qa2-style multi-fact queries are repeatable across re-requests; qa1 retrieval queries have small Q4-quantization variance.
Surfaced by tercile analysis of phase-1 data — splitting each cell's samples into early / mid / late thirds and measuring accuracy per slice. Modulum qa1 64k accuracy degrades monotonically across 100 sequential calls: 87.9 % → 78.8 % → 64.7 %. Confirmed on qa3 128k across 500 samples (32.5 % → 26.5 % → 22.0 %, two independent runs same direction).
This is a production-blocker. KV cache state accumulation or attention drift over sustained sequential operation. Wasn't in the original report — only emerged when we looked at the within-run distribution. Hypernym engineering needs to diagnose before hyperscaler deployment: the model loses 23 pp of accuracy if you keep sending 128k prompts to it.
From phase-4 logprob capture (N=20 per cell): target_token_logprob ≈ 0.0 and perplexity ≈ 1.0 across every cell, regardless of whether the answer was correct. Modulum commits with the same numerical confidence whether it is correct or hallucinating. Combined with Breakthrough E (sustained-run drift) and Breakthrough A (no refusal mechanism), this means production routing has no signal to detect bad answers from logprobs alone.
This is the most important production R&D gap surfaced by the study. Modulum suppresses hallucination through output truncation but cannot signal uncertainty. A safety-critical deployment can't route uncertain queries to human review because the model doesn't expose an uncertainty score.
Modulum's wrong-answer median chars: ~45 chars across qa1/qa2/qa3 at 32k and 64k. Outputs follow the canonical "X is in Y" format whether correct or wrong. Pure hallucination rate <5 %.
Median wrong-answer chars jump to 80 (max 500). Pure hallucination rate jumps to 12 % on qa1, 31 % on qa2. The same threshold where format breaks is the threshold where fabrication appears — they are the same mechanism unraveling.
Implication: Modulum's structural output enforcement holds at ≤64k context and partially fails at 128k. A platform-level improvement that keeps the format-truncation mechanism functional at 128k+ would compound directly into reduced hallucination at long context. R&D direction.
Grok 4.3 has the steepest qa1 decay slope in the panel at −25 pp / 2× context — by far. Modulum loses 8.75 pp / 2× on the same prompts. Grok degrades 3× faster than Modulum on retrieval as context grows. xAI may want to investigate the architectural threshold between 64k and 128k.
From the in-flight hallucination probe: ask the model about an entity that doesn't appear in context. GPT-5.5 refused 149/150 (99.3 %). Grok 4.3 refused 144/150 (96 %). Strongest production-safety signal in the entire dataset — neither model markets this. Modulum cannot do this at all (zero refusal capability per Breakthrough A/F).
At N=50: Modulum 22 % vs vanilla 28 % = −6 pp (p=0.49). The original report flagged this as a possible platform side-effect on short-context temporal reasoning. With N=200 Modulum extension just landed: 31.5 % vs 32 % = −0.5 pp (p=0.95). The regression was a small-sample artifact. Time spent engineering hypotheses about a non-effect.
Initial bench reported Gemini 3.1 Pro fails 50/50 at 1M context. Codex audit caught that all 50 failures were HTTP 429 spending-cap errors, not model-context failures. We have no data on Gemini's actual 1M capability. Same issue blocked the hallucination probe's Gemini 64k/128k cells. Cost-budget failure modes look like capability failure modes — easy to misread.
Of the eight breakthroughs above, three carry the entire procurement-relevant story for hyperscalers and enterprise partners evaluating Modulum for production deployment:
| # | Breakthrough | Procurement axis |
|---|---|---|
| A1 | Fabrication suppression 4.7× vs same base weights via structural output truncation | Safety / hallucination-risk axis |
| B1 | +17 to +22 % decode speedup on qa3 multi-fact reasoning at zero accuracy cost | Cost-per-token / throughput axis |
| C1 | Best qa3 decay slope in panel (−2.5 pp / 2× ctx), workstation-scale base model | Long-context preservation / footprint axis |
All three are verifiable from canonical SQLite + CSV that a hyperscaler data team can audit independently. The mechanism (truncation, not refusal) is observable in output text. The slope (qa3) is reproducible across 500-sample cells with ±3.9 pp Wilson half-width.
Two production-blockers remain open and should be raised with Hypernym engineering before any partner deployment: E1 sustained-run drift (−23 pp end-to-end within a single 100-sample run) and F1 zero uncertainty calibration (PPL ≈ 1.0 regardless of correctness). Both are platform-side fixes, not base-model issues.