
What calibration curves measure

A calibration curve plots reported confidence on the x-axis against empirical correctness rate on the y-axis. A perfectly calibrated method produces the diagonal: when it reports 0.7 confidence, it is right 70% of the time. Deviation from the diagonal is reliability error (REL). A method that systematically reports higher confidence than its empirical correctness rate is overconfident; a method that reports lower is underconfident. Both are bad. Calibration is the property of being on the diagonal.
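
For concreteness, both the curve and REL can be computed directly from paired (reported confidence, correct/incorrect) validation records. A minimal sketch in Python, assuming REL is the bin-weighted squared gap between mean reported confidence and empirical correctness; the framework's exact binning and weighting are not specified on this page and may differ:

```python
import numpy as np

def calibration_curve(conf, correct, n_bins=10):
    """Per-bin mean reported confidence, empirical correctness, and weight.

    conf    : reported confidences in [0, 1]
    correct : 0/1 outcomes (1 = the decision was correct)
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to an equal-width bin (last bin closed at 1.0).
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    mean_conf, mean_acc, weight = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_conf.append(conf[mask].mean())
            mean_acc.append(correct[mask].mean())
            weight.append(mask.mean())  # fraction of all samples in bin b
    return np.array(mean_conf), np.array(mean_acc), np.array(weight)

def reliability_error(conf, correct, n_bins=10):
    """REL: bin-weighted squared gap between confidence and correctness."""
    mean_conf, mean_acc, weight = calibration_curve(conf, correct, n_bins)
    return float(np.sum(weight * (mean_conf - mean_acc) ** 2))
```

A perfectly calibrated method gives mean_acc ≈ mean_conf in every bin, so REL ≈ 0.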

Headline result

VERDICT WEIGHT achieves REL ≈ 0.0019 on the validation dataset described in the CVE dataset section. The Simple Averaging baseline achieves REL ≈ 0.018 on the same data. VERDICT WEIGHT is therefore approximately 9.6× better calibrated than averaging on the same evidence. This is the single most consequential metric for deployment. A miscalibrated confidence value is not just numerically wrong; it makes every downstream gate, escalation rule, and audit threshold built on it operationally wrong as well.

Why this is the metric to lead with

The framework is benchmarked on four metrics (REL, AUC, Brier, ECE; see Head-to-head comparison, with a computation sketch after the list below). It outperforms every baseline on all four. The case for the framework rests primarily on REL because:
  1. Calibration is what gating requires. A score of 0.9 must mean ~90% correct; otherwise threshold-based gating is incoherent.
  2. Calibration is what audit requires. A regulator asking “what was your confidence in this decision and what does that confidence mean?” needs an answer grounded in empirical reliability.
  3. Calibration is what improves over baselines most dramatically. AUC differences are often modest; reliability differences are often categorical.
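
The other three benchmark metrics can be computed from the same (confidence, correct) pairs. A minimal sketch, reusing calibration_curve from the sketch above (binning choices are again assumptions; the benchmark's exact settings are given in Head-to-head comparison):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def brier_score(conf, correct):
    """Mean squared error between reported confidence and the outcome."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    return float(np.mean((conf - correct) ** 2))

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin-weighted absolute gap (REL uses the squared gap)."""
    mean_conf, mean_acc, weight = calibration_curve(conf, correct, n_bins)
    return float(np.sum(weight * np.abs(mean_conf - mean_acc)))

def auc(conf, correct):
    """Probability a correct decision is scored above an incorrect one."""
    return float(roc_auc_score(correct, conf))
```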

How calibration is achieved

Calibration in VERDICT WEIGHT comes from two compounding effects:
  1. Streams 1–4 produce a less-overconfident raw aggregate than naive baselines, because each stream actively penalizes a known source of overconfidence (correlation collapse, drift, conflated uncertainty, etc.).
  2. Stream 5 applies a fitted post-hoc reliability map that corrects whatever residual miscalibration remains in the raw aggregate.
Either effect alone is insufficient. Stream 5 alone (calibration applied to a poorly-aggregated raw confidence) recovers significantly less reliability than the full eight-stream composition. This is shown in the ablation studies.
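
This page does not specify the functional form of Stream 5's reliability map. Purely as an illustration of the general technique, a post-hoc map can be fitted with isotonic regression; the estimator choice and all values below are assumptions, not the framework's documented method:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Raw aggregate confidences from streams 1-4 (illustrative values)
# paired with 0/1 correctness labels from validation data.
raw_conf = np.array([0.55, 0.62, 0.70, 0.78, 0.85, 0.91, 0.95, 0.97])
correct = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Fit a monotone map from raw confidence to empirical correctness.
reliability_map = IsotonicRegression(y_min=0.0, y_max=1.0,
                                     out_of_bounds="clip")
reliability_map.fit(raw_conf, correct)

# Apply the map to new raw aggregates to obtain calibrated confidences.
calibrated = reliability_map.predict(np.array([0.60, 0.90]))
```

A monotone fit pulls each raw score toward its empirical correctness rate while leaving the ranking of scores (and hence AUC) essentially unchanged, which is why a post-hoc map can repair reliability without degrading discrimination.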

Confidence-bin breakdown

The full reliability curve, broken into confidence bins, is in Paper 2, Section 4.6. The summary: in every bin from 0.5 upward, VERDICT WEIGHT’s empirical correctness rate matches its reported confidence within ~0.01. Below 0.5, where bin populations are small, confidence intervals widen but reported confidence remains within statistical tolerance of empirical correctness.
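
A per-bin table of this kind can be reproduced from validation pairs. The sketch below uses Wilson score intervals for the per-bin uncertainty; the interval construction used in Paper 2 is not stated on this page:

```python
import numpy as np

def bin_breakdown(conf, correct, n_bins=10, z=1.96):
    """Per-bin correctness with approximate 95% Wilson intervals."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        n = int(mask.sum())
        if n == 0:
            continue
        p = correct[mask].mean()
        # Wilson score interval: widens in sparsely populated bins.
        denom = 1.0 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        rows.append((conf[mask].mean(), p, n, center - half, center + half))
    return rows  # (mean conf, correctness, count, ci_lo, ci_hi) per bin
```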

Re-fitting for new domains

Calibration is distribution-bound. A reliability map fitted on one validation distribution is not guaranteed to hold on another. Operators deploying into a new domain should:
  1. Collect representative validation data. A few hundred labeled decisions from the deployment domain.
  2. Re-fit Stream 5. Use the bundled fitting utility; runtime is on the order of seconds.
  3. Verify the new reliability. Plot the calibration curve on a held-out portion of the new validation data and confirm REL is in the expected range.
  4. Promote. Update the configuration registry with the new map. The change is recorded in the audit chain.

The framework provides a CLI for the full re-fit-and-validate workflow.
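
An end-to-end sketch of steps 1–3, using IsotonicRegression as a stand-in for the bundled fitting utility and reliability_error from the first sketch on this page; the file names, the 70/30 split, and the 0.005 acceptance threshold are all illustrative assumptions:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Step 1: representative validation pairs from the deployment domain
# (hypothetical file names).
raw_conf = np.load("new_domain_conf.npy")
correct = np.load("new_domain_correct.npy")

# Hold out 30% for verification; re-fit the map on the remaining 70%.
rng = np.random.default_rng(0)
order = rng.permutation(len(raw_conf))
cut = int(0.7 * len(order))
fit_idx, held_idx = order[:cut], order[cut:]

# Step 2: re-fit the reliability map.
reliability_map = IsotonicRegression(y_min=0.0, y_max=1.0,
                                     out_of_bounds="clip")
reliability_map.fit(raw_conf[fit_idx], correct[fit_idx])

# Step 3: verify REL on the held-out portion before promoting.
calibrated = reliability_map.predict(raw_conf[held_idx])
rel = reliability_error(calibrated, correct[held_idx])
assert rel < 0.005, f"REL {rel:.4f} outside expected range; do not promote"
```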

Visual

A reference calibration curve image (/images/validation/calibration-curve.svg) accompanies this page in the published documentation. The curve shows VERDICT WEIGHT tightly tracking the diagonal, Simple Averaging sitting systematically below the diagonal (overconfident: reported confidence exceeds empirical correctness), and Naive Bayes dropping further below it still.