Documentation Index
Fetch the complete documentation index at: https://verdictweight.dev/llms.txt
Use this file to discover all available pages before exploring further.
What calibration curves measure
A calibration curve plots reported confidence on the x-axis against empirical correctness rate on the y-axis. A perfectly calibrated method produces the diagonal: when it reports 0.7 confidence, it is right 70% of the time. Deviation from the diagonal is reliability error (REL). A method that systematically reports higher confidence than its empirical correctness rate is overconfident; a method that reports lower is underconfident. Both are bad. Calibration is the property of being on the diagonal.
Headline result
VERDICT WEIGHT achieves REL ≈ 0.0019 on the validation dataset described in CVE dataset. The Simple Averaging baseline achieves REL ≈ 0.018 on the same data. VERDICT WEIGHT is approximately 9.6× better calibrated than averaging on the same evidence. This is the single most consequential metric for deployment. A miscalibrated confidence value is not just numerically wrong; it makes every downstream gate, escalation rule, and audit threshold built on it operationally wrong as well.
Why this is the metric to lead with
The framework is benchmarked on four metrics (REL, AUC, Brier, ECE; see Head-to-head comparison). It outperforms every baseline on all four. The case for the framework rests primarily on REL because:
- Calibration is what gating requires. A score of 0.9 must mean ~90% correct; otherwise threshold-based gating is incoherent.
- Calibration is what audit requires. A regulator asking “what was your confidence in this decision and what does that confidence mean?” needs an answer grounded in empirical reliability.
- Calibration is what improves over baselines most dramatically. AUC differences are often modest; reliability differences are often categorical.
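As an illustration of what the REL comparisons above are measuring, the sketch below estimates reliability error by binning predictions and taking the bin-weighted gap between mean reported confidence and empirical correctness. This is a hypothetical helper assuming NumPy; the function name, the 10-bin scheme, and the estimator itself are illustrative stand-ins, not the papers' exact REL definition.

```python
import numpy as np

def reliability_error(confidences, correct, n_bins=10):
    """Bin-weighted gap between mean reported confidence and empirical
    correctness rate per bin (an ECE-style estimate; the papers' exact
    REL definition may differ). Confidences are assumed in (0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rel, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between what the method reported and how often it was right.
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            rel += (mask.sum() / n) * gap
    return rel
```

Under this estimator, a method that reports 0.8 and is right 4 times out of 5 scores zero; a method that reports 0.9 and is always wrong scores 0.9.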
How calibration is achieved
Calibration in VERDICT WEIGHT comes from two compounding effects:
- Streams 1–4 produce a less-overconfident raw aggregate than naive baselines, because each stream actively penalizes a known source of overconfidence (correlation collapse, drift, conflated uncertainty, etc.).
- Stream 5 applies a fitted post-hoc reliability map that corrects whatever residual miscalibration remains in the raw aggregate.
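One standard family of fitted post-hoc reliability maps is histogram binning: on validation data, record the empirical correctness rate at each raw-confidence bin, then remap new raw scores to those rates. The sketch below is a hypothetical stand-in assuming NumPy; Stream 5's actual fitted map may use a different estimator entirely.

```python
import numpy as np

def fit_reliability_map(confidences, correct, n_bins=10):
    """Histogram-binning calibration: learn, per raw-confidence bin, the
    empirical correctness rate, and return a function remapping raw
    confidences to those rates. (Illustrative; not Stream 5's exact map.)"""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rates = np.empty(n_bins)
    for i in range(n_bins):
        mask = (confidences >= edges[i]) & (confidences < edges[i + 1])
        # Fall back to the bin midpoint when no validation data hit a bin.
        rates[i] = correct[mask].mean() if mask.any() else (edges[i] + edges[i + 1]) / 2
    def remap(c):
        idx = np.clip(np.digitize(c, edges) - 1, 0, n_bins - 1)
        return rates[idx]
    return remap
```

For example, a raw aggregate that reports 0.95 but is empirically right only half the time on validation data gets remapped to ~0.5, which is exactly the correction an overconfident raw score needs.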
Confidence-bin breakdown
The full reliability curve, broken into confidence bins, is in Paper 2, Section 4.6. The summary: in every bin from 0.5 upward, VERDICT WEIGHT’s empirical correctness rate matches its reported confidence within ~0.01. Below 0.5, where bin populations are small, confidence intervals widen but reported confidence remains within statistical tolerance of empirical correctness.
Re-fitting for new domains
Calibration is distribution-bound. A reliability map fitted on one validation distribution is not guaranteed to hold on another. Operators deploying into a new domain should:
Verify the new reliability
Plot the calibration curve on a held-out portion of the new validation data and confirm REL is in the expected range.
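That verification step can be sketched as follows. The helper name `verify_new_domain` and the `max_rel` ceiling are hypothetical; the ceiling should come from the REL range established on your own validation runs, and NumPy is assumed.

```python
import numpy as np

def verify_new_domain(confidences, correct, n_bins=10, max_rel=0.01):
    """Compute calibration-curve points on held-out new-domain data and
    check the bin-weighted reliability error against an expected ceiling.
    Returns (curve points, REL estimate, pass/fail)."""
    c = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    curve, rel = [], 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (c > lo) & (c <= hi)
        if mask.any():
            # One (x, y) point per populated bin: plot these against the diagonal.
            point = (c[mask].mean(), y[mask].mean())
            curve.append(point)
            rel += (mask.sum() / len(c)) * abs(point[0] - point[1])
    return curve, rel, rel <= max_rel
```

Plotting the returned curve points against the diagonal reproduces the calibration curve described above; the boolean flag makes the check easy to wire into a deployment gate.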
Visual
A reference calibration curve image (/images/validation/calibration-curve.svg) accompanies this page in the published documentation. The curve shows: VERDICT WEIGHT tightly tracking the diagonal; Simple Averaging systematically below the diagonal (overconfident: reported confidence exceeds empirical correctness); Naive Bayes further below the diagonal still.
If the asset path above does not yet exist in your repo, drop a placeholder SVG or remove the image reference. The numerical result above stands on its own.