Most confidence scoring frameworks make their case axiomatically: “we use Bayesian updating, therefore our scores are principled.” The framework’s position is that axiomatic appeal is not evidence of empirical reliability. The right test is direct and head-to-head: same data, same evaluation metrics. This page summarizes that benchmark. The full methodology, dataset, and per-metric breakdown are in Paper 2.
These four baselines were chosen because they are the standard methods cited in the confidence-fusion literature and used in production deployments, not because they are weak.
VERDICT WEIGHT outperforms each baseline on all four metrics. The full table, with effect sizes, confidence intervals, and Bonferroni-corrected significance, is in Paper 2, Section 4.7. The most consequential individual finding is on reliability error (REL):
- VERDICT WEIGHT REL: ~0.0019
- Simple Averaging REL: ~0.018
- Improvement factor: ~9.6× better than averaging
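The exact REL formula used in Paper 2 is not reproduced on this page. As an illustrative sketch only, a common way to compute a binned reliability error (the ECE-style, bin-weighted gap between stated confidence and empirical accuracy) looks like this; the function name and binning scheme here are assumptions, not the paper's definition:

```python
import numpy as np

def reliability_error(conf, correct, n_bins=10):
    """Bin-weighted |mean confidence - empirical accuracy| gap.

    An ECE-style proxy for REL; the paper's exact definition may differ.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rel = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so conf == 1.0 is counted.
        mask = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if mask.any():
            rel += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return rel

# A calibrated bin contributes zero: 9 of 10 correct at 0.9 confidence.
print(reliability_error([0.9] * 10, [1] * 9 + [0]))  # 0.0
```

Under this sketch, a REL near zero means stated confidence tracks observed correctness bin by bin, which is exactly the property the deployment gate relies on.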
This is the metric that matters most for deployment, because it determines whether a stated confidence of 0.9 actually corresponds to 0.9 empirical correctness.
A common temptation in confidence-scoring benchmarks is to lead with AUC. The framework deliberately does not. AUC measures rank-order discrimination — the ability to distinguish correct from incorrect predictions — but it does not measure whether the magnitudes of confidence values are meaningful.

A method can produce excellent AUC while being badly miscalibrated, in which case its confidence values are useful for sorting decisions but not for gating them. Gating requires calibration, and calibration is what REL measures. VERDICT WEIGHT leads on both metrics, but the case for the framework rests on REL.
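The discrimination-versus-calibration distinction can be demonstrated directly. The sketch below (illustrative, not the framework's code) computes AUC as the probability that a correct item outscores an incorrect one, then applies a monotone squashing of the scores: the ranking, and therefore AUC, is unchanged, while the stated confidences become wildly miscalibrated:

```python
import numpy as np

def auc(conf, correct):
    """Rank-based AUC: P(correct item outscores incorrect item), ties half.

    Pairwise O(n^2) version for clarity, not speed.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

conf = [0.95, 0.90, 0.85, 0.80, 0.70]
labels = [1, 1, 1, 0, 0]
print(auc(conf, labels))  # 1.0 -- perfect rank-order discrimination

# Squash every score into [0.98, 1.0]: ranking (and AUC) is preserved,
# but a stated ~0.99 confidence on items that are wrong is useless for
# gating. AUC cannot see this; REL can.
squashed = [0.98 + 0.02 * c for c in conf]
print(auc(squashed, labels))  # still 1.0
```

This is why a benchmark led by AUC can flatter a miscalibrated method: any monotone transform of its scores leaves AUC fixed.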
The improvements over each baseline are statistically significant under Bonferroni correction across the four metric tests, with effect sizes in the meaningful range (Cohen’s d > 0.5 for the headline comparisons). The full statistical apparatus is in Paper 2, Section 4.8.
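For readers checking the arithmetic: with four metric tests and a family-wise alpha of 0.05, Bonferroni correction requires each per-test p-value to clear 0.05/4 = 0.0125, and Cohen's d is the mean difference over the pooled standard deviation. A minimal sketch (illustrative helper names; the actual values are in Paper 2, Section 4.8):

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests

def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Cohen's d with pooled standard deviation."""
    pooled = math.sqrt(
        ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    )
    return (mean_a - mean_b) / pooled

# Four metric tests at family-wise alpha = 0.05:
print(bonferroni_threshold(0.05, 4))  # 0.0125
```

A d above 0.5 is conventionally read as a medium-or-larger effect, which is the "meaningful range" referred to above.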
The benchmark uses the CVE-derived validation dataset described in CVE dataset. That dataset is a real-world proxy — the benchmark is not run on synthetic data — but it is one dataset, and any single benchmark dataset has limitations. In particular, AUC values approaching 1.0 in this dataset reflect a property of the dataset construction, not a generalization claim. See Known limitations for the explicit treatment.