POC 01 · Reproducible now

AI Decision-Layer Reproducibility

SolvNum makes the decision layer downstream of any ML model — calibration, policy rules, thresholding, logging — bit-identical on every platform, with a one-line cryptographic receipt per decision.

  • 0 receipt drifts (SolvNum, 1M decisions)
  • ~32k receipt drifts (float64, 1M decisions)
  • 8/8 stress cells pass (SolvNum)
  • SHA-256 per-decision cryptographic receipt

The scenario

Setting the scene

A modern AI decision pipeline at any large platform ingests model logits, runs calibration (Platt / temperature scaling), composes policy rules, applies thresholds, and logs the decision. The model layer runs in fp16 on GPUs. The decision layer is small-scale arithmetic in Python or C++.
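The four stages can be sketched in a few lines. This is a hedged illustration, not the POC's code: the temperature value, threshold, and the max-score "rule aggregation" are all assumptions chosen for clarity.

```python
import math

def temperature_scale(logits, T=1.5):
    """Calibrate logits by temperature scaling, then softmax.
    T=1.5 is an illustrative choice, not a fitted value."""
    z = [l / T for l in logits]
    m = max(z)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def decide(logits, threshold=0.85):
    """Hypothetical four-stage decision:
    calibrate -> aggregate -> threshold -> log record."""
    probs = temperature_scale(logits)
    score = max(probs)              # trivial stand-in for rule aggregation
    action = "REMOVE" if score >= threshold else "KEEP"
    return {"action": action, "score": score}
```

The point is scale: each call is a handful of float64 operations, yet `math.exp` is exactly the kind of platform libm call the next paragraph is about.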

The quiet failure: this decision layer is not bit-reproducible across deployments. The same logits fed into the same calibration pipeline produce subtly different confidence scores on CPU vs GPU vs on-device — and occasionally different decisions at borderline cases — because platform libm calls disagree in the last few ULP.
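A toy demonstration of why a last-ULP disagreement matters at a decision boundary (not the POC's code — just two scores one ULP apart straddling a threshold):

```python
import math

threshold = 0.85
score = 0.85                               # one platform's calibrated confidence
score_drift = math.nextafter(score, 0.0)   # the same score, exactly one ULP lower

print(score >= threshold)        # True  -> the action fires
print(score_drift >= threshold)  # False -> the action is skipped elsewhere
```

The two values differ by roughly 1e-16, yet the logged decisions, and therefore the audit hashes, diverge.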

Cost today

Three regulatory frameworks converge on requiring the same input to produce the same decision with a reproducible audit trail: EU AI Act Article 12 (automatic logs), DSA Article 40 (researcher reproducibility), and DSA Article 15 (systemic-risk assessment).

None require bit-identical neural-network forward passes. They require bit-identical decisions and decision audits. That distinction is the crack this POC fits through.

What changes with SolvNum

SolvNum backs the four-stage decision pipeline (calibration, rule aggregation, thresholding, logging). Across 1,000,000 decisions, SolvNum produces matching log hashes and zero receipt drifts between Windows and Linux in all 8/8 stress cells.

The float64_naive and float64_kahan baselines each produce 0/8 log-hash matches with ~32,000 cross-platform receipt drifts. Even the careful-engineer Kahan baseline fails, because compensated summation cannot correct cross-platform libm drift.

A reviewer runs the verification script on their own hardware in under five minutes and checks the SHA-256.

Measurable outcome

What we claim — and how it survives review

Each line below maps to a captured number in the demo section. Every number is reproducible from the benchmark suite.

  • SolvNum: 8/8 stress cells produce identical receipt hashes across Windows-x64 and Linux-x64.
  • Float64 naive: 0/8 log-hash matches, ~32,000 cross-platform receipt drifts per 1,000,000 decisions.
  • Float64 Kahan: 0/8 log-hash matches — even careful engineering fails on cross-platform libm.
  • Reviewer re-verification: single command, under 5 minutes, SHA-256 match or mismatch.
  • Every decision emits: action enum, calibrated scores, rule-aggregation trace, SHA-256 receipt.
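The receipt in the last bullet can be sketched as a canonical-JSON SHA-256. Field names and the serialization scheme here are assumptions for illustration, not the POC's actual wire format:

```python
import hashlib
import json

def receipt(action, scores, trace):
    """Hash one decision record into a one-line hex receipt.
    Canonical JSON (sorted keys, no whitespace) makes the digest
    deterministic for identical inputs; any last-ULP drift in
    `scores` changes the serialized text and thus the hash."""
    payload = json.dumps(
        {"action": action, "scores": scores, "trace": trace},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

This is also why receipt drift is such a sensitive detector: the hash flips on any bit-level difference, not just on flipped decisions.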

The demo

What was tested. How. What the script printed.

Synthetic-but-realistic logit streams: 5-class content-moderation logits (100,000 decisions, Dirichlet prior) and single-class ranking scores (1,000,000 decisions with ~2% borderline band). Both seeded and reproducible.
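A minimal seeded stand-in for such a stream, drawing one flat-Dirichlet class distribution via normalized Exp(1) variates and returning its log-probabilities as "logits". The function name and parameters are assumptions, not the benchmark's generator:

```python
import math
import random

def synthetic_logits(n_classes=5, seed=7):
    """One seeded 5-class 'logit' vector: a Dirichlet(1,...,1)
    sample (normalized exponentials), returned in log space."""
    rng = random.Random(seed)
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [math.log(d / total) for d in draws]
```

Seeding is the property that matters: every platform in the stress matrix must see the exact same input stream, so any drift is attributable to the pipeline, not the data.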

Four-stage pipeline (calibration, policy-rule aggregation, thresholding, decision logging) implemented in three variants: float64_naive, float64_kahan, and solvnum_backed. Stress axes: thread count (1/4/16/64), batch reordering, BLAS library swap, compiler swap, GPU non-determinism.
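For context on the float64_kahan variant: compensated summation presumably looks like the sketch below (not the repo's code). It suppresses accumulation-order error, which is why its failure above is telling — the remaining drift comes from libm, not from summation order.

```python
def kahan_sum(values):
    """Kahan compensated summation: carry each addition's rounding
    error in `c` so it is re-applied on the next step."""
    total = 0.0
    c = 0.0
    for x in values:
        y = x - c
        t = total + y
        c = (t - total) - y   # recovers the low-order bits lost in t
        total = t
    return total
```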

Captured benchmark output

The numbers the script actually printed.

Cross-platform decision reproducibility: SolvNum vs float64 baselines
Implementation   | Log-hash match (8 cells) | Receipt drifts / 1M | Stress axes survived
solvnum_backed   | 8 / 8                    | 0                   | All (threads, BLAS, compiler, batch order)
float64_kahan    | 0 / 8                    | ~32,000             | None
float64_naive    | 0 / 8                    | ~32,000             | None

Validated on Windows-x64 and Linux-x64. GPU and WASM legs pending hardware.

Evidence pointers

Where the claims live in the repo

These are the files a reviewer should run to re-derive every number on this page.

  • tools/solvnum/buyer_pocs/ai_decision_layer/bench.py
  • tools/solvnum/buyer_pocs/ai_decision_layer/verify.py
  • tools/solvnum/buyer_pocs/reports/ai_decision_layer_report.md
  • docs/poc/01_ai_decision_layer.md

Want to see these receipts on your pipeline?

Run the benchmark against your actual decision pipeline.

Two weeks, $25K, fully credited. No production integration, no data leaving your premises. Every claim above traces back to a script you can run locally.

Talk to us