Purpose
A confidence score of 0.9 must mean the system would be correct approximately 90% of the time on inputs that produce that score. If it does not, every downstream gate, escalation rule, and audit threshold is built on a false premise. Stream 5 enforces this property as a final step in the composition pipeline. It applies an empirical reliability map, fitted on held-out validation data, to the aggregated raw confidence, producing a calibrated final value.
What the stream does
Take raw aggregated confidence
The output of Streams 1–4, post-aggregation, is a raw confidence value. This value is not the final reported confidence.
Apply the empirical reliability map
A function — fitted on held-out validation data — maps raw confidence to expected empirical correctness.
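Applying the map can be sketched as a lookup over fitted knots. The following is a minimal Python sketch, assuming the map is stored as sorted (raw, calibrated) pairs and evaluated by linear interpolation. The knot values, the storage format, and the name `apply_reliability_map` are hypothetical and not taken from the framework.

```python
from bisect import bisect_right

# Hypothetical knots: (raw confidence, calibrated confidence), sorted by raw value.
# Illustrative numbers only, not results from the framework.
KNOTS = [(0.0, 0.0), (0.5, 0.35), (0.8, 0.7), (1.0, 0.9)]


def apply_reliability_map(raw, knots=KNOTS):
    """Map a raw confidence to a calibrated one by linear interpolation between knots."""
    xs = [x for x, _ in knots]
    ys = [y for _, y in knots]
    # Refuse to evaluate outside the range the map was fitted on.
    if not xs[0] <= raw <= xs[-1]:
        raise ValueError("raw confidence outside the fitted support")
    i = bisect_right(xs, raw)
    if i == len(xs):  # raw equals the last knot
        return ys[-1]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (raw - x0) / (x1 - x0)


calibrated = apply_reliability_map(0.65)
```

Because the knot values are non-decreasing, the interpolated output is non-decreasing too, so a higher raw confidence never yields a lower calibrated value.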
How the map is fitted
The reliability map is built from labeled validation data using isotonic regression or an equivalent monotone-fitting method. Two properties are enforced:
- Monotonicity. The map is non-decreasing: higher raw confidence cannot map to lower calibrated confidence.
- Reliability. On the held-out fitting data, the calibrated confidence matches the observed correctness rate within tolerance.
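The fitting procedure and the two properties can be illustrated with a small pure-Python version of the pool-adjacent-violators algorithm, one standard way to compute an isotonic regression. This is a sketch under assumptions rather than the framework's implementation: the function names, the step-function representation, and the toy data are all hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_reliability_map(raw_scores, correct):
    """Fit a non-decreasing step map with the pool-adjacent-violators algorithm.

    Returns a list of (lo, hi, value) blocks ordered by raw score.
    """
    # Pool tied raw scores first so each distinct score gets a single value.
    tied = defaultdict(lambda: [0.0, 0])
    for score, label in zip(raw_scores, correct):
        tied[score][0] += label
        tied[score][1] += 1

    blocks = []  # each block: [label_sum, count, lo, hi]
    for score in sorted(tied):
        label_sum, count = tied[score]
        blocks.append([label_sum, count, score, score])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n, _, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][3] = hi
    return [(lo, hi, s / n) for s, n, lo, hi in blocks]


def calibrate(raw, fitted):
    """Look up the calibrated value for a raw score inside the fitted support."""
    if not fitted[0][0] <= raw <= fitted[-1][1]:
        raise ValueError("raw confidence outside the fitted support")
    uppers = [hi for _, hi, _ in fitted]
    return fitted[bisect_left(uppers, raw)][2]


# Toy validation set (illustrative values, not real results).
raw = [0.2, 0.4, 0.6, 0.6, 0.8, 0.9]
correct = [0, 1, 0, 1, 1, 1]
fitted = fit_reliability_map(raw, correct)
values = [v for _, _, v in fitted]

# Monotonicity: block values never decrease as raw confidence increases.
assert values == sorted(values)
# Reliability on the fitting data: calibrated values sum to the number of correct labels.
assert abs(sum(calibrate(r, fitted) for r in raw) - sum(correct)) < 1e-9
```

Within each pooled block the fitted value is the block's observed correctness rate, which is why the reliability check holds on the fitting data by construction. In practice a library implementation such as scikit-learn's `IsotonicRegression` would likely be used instead of hand-written code.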
Reliability results
The framework’s calibration achieves substantially better reliability than averaging baselines. The headline metric, reliability error (REL), is reported in detail in Calibration curves, along with the methodology, dataset, and confidence bounds.
When to refit
The map should be refitted whenever any of the following changes:
- The upstream model stack (different model, different prompt, different retrieval index).
- The deployment domain (the input distribution has shifted).
- The configured weights or thresholds in any other stream.
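One way to make the configuration-related triggers checkable is to store a fingerprint of the settings the map was fitted against and compare it when the map is loaded. The sketch below assumes the relevant settings can be serialized as JSON. The field names and the functions `stack_fingerprint` and `needs_refit` are hypothetical, not part of the framework.

```python
import hashlib
import json


def stack_fingerprint(config):
    """Hash a JSON-serializable config deterministically (keys sorted)."""
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_refit(fitted_fingerprint, current_config):
    """Return True when the current config differs from the one the map was fitted on."""
    return stack_fingerprint(current_config) != fitted_fingerprint


# Hypothetical settings recorded at fitting time.
fitted_config = {
    "model": "model-a",
    "prompt_version": "v3",
    "retrieval_index": "index-7",
    "stream_weights": [0.4, 0.3, 0.2, 0.1],
}
fingerprint = stack_fingerprint(fitted_config)

# A prompt change alters the fingerprint, so the map is flagged for refitting.
changed = {**fitted_config, "prompt_version": "v4"}
refit_required = needs_refit(fingerprint, changed)
```

A fingerprint like this only catches configuration changes. A shift in the input distribution leaves the configuration untouched and has to be detected separately, for example by monitoring incoming data.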
What this stream does not do
- It does not improve the underlying decision. Calibration corrects how confident the system reports being, not which decision it would have made.
- It does not extrapolate. The reliability map is a fitted function; it does not generalize beyond its support without explicit refitting on representative data.