AI Decision-Layer Reproducibility
The decision layer downstream of any ML model — calibration, policy rules, thresholding, logging — made bit-identical on every platform, with a one-line cryptographic receipt per decision.
- 0 receipt drifts (SolvNum, 1M decisions)
- ~32k receipt drifts (float64, 1M decisions)
- 8/8 stress cells pass (SolvNum)
- SHA-256 per-decision cryptographic receipt
The scenario
Setting the scene
A modern AI decision pipeline at any large platform ingests model logits, runs calibration (Platt / temperature scaling), composes policy rules, applies thresholds, and logs the decision. The model layer runs in fp16 on GPUs. The decision layer is small-scale arithmetic in Python or C++.
The quiet failure: this decision layer is not bit-reproducible across deployments. The same logits fed into the same calibration pipeline produce subtly different confidence scores on CPU vs GPU vs on-device — and occasionally different decisions at borderline cases — because platform libm calls disagree in the last few ULP.
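A minimal sketch of that decision layer, assuming temperature-scaled softmax calibration and a single removal threshold (function names, the temperature, and the threshold are illustrative, not the repo's): the one libm call, math.exp, is exactly where platforms disagree in the last few ULP.

```python
# Hypothetical sketch of the decision layer: calibrate logits, apply a policy
# threshold, return the action that gets logged. Names and values are illustrative.
import math

def calibrate(logits, temperature=1.5):
    """Temperature-scaled softmax over raw model logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]      # platform libm exp() here
    total = sum(exps)
    return [e / total for e in exps]

def decide(logits, threshold=0.72):
    """Pick the top calibrated class and apply a policy threshold."""
    scores = calibrate(logits)
    top = max(range(len(scores)), key=scores.__getitem__)
    action = "REMOVE" if scores[top] >= threshold else "KEEP"
    return action, top, scores

# Borderline case: a 1-ULP difference in exp() on another platform can push
# scores[top] across 0.72 and flip the logged action.
print(decide([2.31, 1.98, 0.42, -0.77, -1.10]))
```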
Cost today
Three regulatory frameworks converge on requiring the same input to produce the same decision with a reproducible audit trail: EU AI Act Article 12 (automatic logs), DSA Article 40 (researcher reproducibility), and DSA Article 34 (systemic-risk assessment).
None require bit-identical neural-network forward passes. They require bit-identical decisions and decision audits. That distinction is the crack this POC fits through.
What changes with SolvNum
SolvNum backs the four-stage decision pipeline (calibration, rule aggregation, thresholding, logging). On 1,000,000 decisions, SolvNum produces matching log hashes and zero receipt drifts across Windows and Linux in all 8/8 stress cells.
float64_naive and float64_kahan each produce 0/8 log-hash matches with ~32,000 cross-platform receipt drifts. Even the careful-engineer Kahan baseline fails once platform libm implementations differ.
A reviewer runs the verification script on their own hardware in under five minutes and checks the SHA-256.
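A rough illustration of that check, assuming the run emits a single decision-log file and the report publishes its digest (the file name and the digest placeholder are assumptions; the real logic lives in verify.py under the evidence pointers below):

```python
# Illustrative digest check: recompute SHA-256 over the decision log and
# compare it to the digest published in the report. Paths are assumptions.
import hashlib

EXPECTED = "<digest published in the report>"

def log_digest(path="decision_log.jsonl"):
    """Stream the log file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    digest = log_digest()
    print("MATCH" if digest == EXPECTED else "MISMATCH", digest)
```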
Measurable outcome
What we claim — and how it survives review
Each line below maps to a captured number in the demo section. Every number is reproducible from the benchmark suite.
- SolvNum: 8/8 stress cells produce identical receipt hashes across Windows-x64 and Linux-x64.
- Float64 naive: 0/8 log-hash matches, ~32,000 cross-platform receipt drifts per 1,000,000 decisions.
- Float64 Kahan: 0/8 log-hash matches — even careful engineering fails on cross-platform libm.
- Reviewer re-verification: single command, under 5 minutes, SHA-256 match or mismatch.
- Every decision emits: action enum, calibrated scores, rule-aggregation trace, SHA-256 receipt.
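One plausible shape for that per-decision record, assuming canonical JSON as the serialization (field names are illustrative; the repo's exact schema may differ):

```python
# Sketch of a per-decision receipt: hash a canonical serialization of the
# fields listed above. Field names are illustrative, not the repo's schema.
import hashlib
import json

def make_receipt(decision_id, action, scores, rule_trace):
    record = {
        "id": decision_id,
        "action": action,          # action enum, e.g. "REMOVE" / "KEEP"
        "scores": scores,          # calibrated per-class scores
        "rules": rule_trace,       # ordered rule-aggregation trace
    }
    # Canonical JSON (sorted keys, no whitespace) so the same record always
    # hashes to the same SHA-256, independent of serializer defaults.
    canon = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["receipt"] = hashlib.sha256(canon.encode("utf-8")).hexdigest()
    return record

receipt = make_receipt(42, "KEEP", [0.61, 0.22, 0.09, 0.05, 0.03],
                       ["nsfw:pass", "spam:pass", "policy_v3:keep"])
```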
The demo
What was tested. How. What the script printed.
Synthetic-but-realistic logit streams: 5-class content-moderation logits (100,000 decisions, Dirichlet prior) and single-class ranking scores (1,000,000 decisions with ~2% borderline band). Both seeded and reproducible.
Four-stage pipeline (calibration, policy-rule aggregation, thresholding, decision logging) implemented in three variants: float64_naive, float64_kahan, and solvnum_backed. Stress axes: thread count (1/4/16/64), batch reordering, BLAS library swap, compiler swap, GPU non-determinism.
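A compressed sketch of how such a harness can compare stress cells, assuming numpy for the seeded Dirichlet logits and id-keyed receipts so that only the arithmetic, never the processing order, can change a cell's digest (seeds, shapes, and names are illustrative; the real harness is bench.py):

```python
# Illustrative stress-cell comparison: generate seeded 5-class logits, run a
# pipeline over them in different orders, and hash the id-keyed results.
import hashlib
import json
import numpy as np

rng = np.random.default_rng(20240601)                  # seeded, reproducible

# 5-class content-moderation logits drawn from a Dirichlet prior.
probs = rng.dirichlet(alpha=[1.0] * 5, size=100_000)
logits = np.log(probs + 1e-12)

def cell_digest(pipeline, logits, order):
    """Run one stress cell in the given processing order, then hash the
    receipts keyed by decision id so reordering alone cannot move the hash."""
    results = {int(i): pipeline(logits[i].tolist()) for i in order}
    canon = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

# Two cells that should agree if the arithmetic is bit-reproducible:
# natural order vs. a permuted batch order. In the real harness the three
# pipeline variants (float64_naive, float64_kahan, solvnum_backed) are
# plugged in here and the digests are compared across the eight cells.
natural = np.arange(len(logits))
shuffled = rng.permutation(len(logits))
```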
Captured benchmark output
The numbers the script actually printed.
| Implementation | Log-hash match (8 cells) | Receipt drifts / 1M | Stress axes survived |
|---|---|---|---|
| solvnum_backed | 8 / 8 | 0 | All (threads, BLAS, compiler, batch order) |
| float64_kahan | 0 / 8 | ~32,000 | None |
| float64_naive | 0 / 8 | ~32,000 | None |
Validated on Windows-x64 and Linux-x64. GPU and WASM legs pending hardware.
Composes with
Where this POC sits in the suite
Evidence pointers
Where the claims live in the repo
These are the files a reviewer should run to re-derive every number on this page.
- tools/solvnum/buyer_pocs/ai_decision_layer/bench.py
- tools/solvnum/buyer_pocs/ai_decision_layer/verify.py
- tools/solvnum/buyer_pocs/reports/ai_decision_layer_report.md
- docs/poc/01_ai_decision_layer.md
Want to see these receipts on your pipeline?
Run the benchmark against your actual decision pipeline.
Two weeks, $25K, fully credited. No production integration, no data leaving your premises. Every claim above traces back to a script you can run locally.