Why a CVE dataset

VERDICT WEIGHT is a confidence-scoring framework intended for high-stakes deployment. Benchmarking it on synthetic data, the standard practice in the confidence-fusion literature, would not be a credible test. Instead, the framework is benchmarked on a real-world proxy dataset constructed from public vulnerability records. The proxy is not “the deployment domain”; no public dataset is. But it is an adversarial-adjacent dataset: vulnerability triage involves correlated evidence sources, distributional shift, and incentives for adversaries to manipulate signal, all of which exercise the framework’s stream coverage.

What the dataset contains

The validation dataset is built from public CVE records in the NVD, cross-referenced against the CISA KEV catalog and mapped to evidence vectors under the documented mapping rules (see Methodology below). The data sources are public, and the dataset construction pipeline is published with the framework, so the benchmark is reproducible end-to-end.
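
For illustration, a single record might carry fields along the following lines. This is a hypothetical sketch: the field names are illustrative, not the published schema, though CVE-2021-44228 (Log4Shell) itself is a real KEV-listed entry.

  # Hypothetical sketch of one record; field names are illustrative,
  # not the published schema.
  record = {
      "cve_id": "CVE-2021-44228",     # Log4Shell, a real KEV-listed CVE
      "cvss_base_score": 10.0,        # numeric evidence channel
      "exploit_available": True,      # boolean evidence channel
      "vendor_severity": "critical",  # structured metadata
      "description_revisions": 7,     # temporal-stability signal (illustrative)
      "kev_listed": True,             # ground-truth label (CISA KEV)
  }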

What it tests

The CVE dataset exercises:
  • Stream 1: heterogeneous evidence (numeric scores, boolean indicators, structured metadata).
  • Stream 2: epistemic uncertainty (some CVEs are well-documented; many are not).
  • Stream 3: temporal stability (CVE descriptions are revised over time).
  • Stream 4: cross-source coherence (CVSS, vendor severity, and exploit databases frequently disagree).
  • Stream 5: calibration (KEV inclusion provides a ground-truth correctness signal).
It is not an adversarial-input benchmark. Stream 6 is benchmarked separately on synthetic adversarial inputs, because no public adversarial CVE corpus exists.
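
To make the stream mapping concrete, a minimal sketch of turning a record like the one above into a heterogeneous evidence vector follows. The function and field names are hypothetical, not the framework’s published API; the authoritative mapping rules ship with the pipeline.

  # Hypothetical sketch of an evidence-vector mapping; the function and
  # field names are illustrative, not the framework's published API.
  def to_evidence_vector(record: dict) -> dict:
      return {
          # Stream 1: heterogeneous evidence, normalized to [0, 1]
          "cvss": record["cvss_base_score"] / 10.0,
          "exploit": 1.0 if record["exploit_available"] else 0.0,
          # Stream 2: epistemic uncertainty -- missing fields are flagged,
          # not silently imputed
          "has_vendor_severity": record.get("vendor_severity") is not None,
          # Stream 3: temporal stability -- how often the text was revised
          "revision_count": record["description_revisions"],
          # Stream 4: cross-source coherence -- do two sources agree?
          "sources_agree": (record["cvss_base_score"] >= 9.0)
                           == (record.get("vendor_severity") == "critical"),
      }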

Methodology

The full methodology is documented in Paper 2, Section 4.10, and the benchmark is reproducible end-to-end:
python -m verdict_weight.benchmarks.real_world
The benchmark:
  1. Pulls the relevant NVD records (or loads from a cached snapshot for reproducibility).
  2. Cross-references with the CISA KEV catalog at the same snapshot date.
  3. Constructs evidence vectors per the documented mapping rules.
  4. Scores each record under VERDICT WEIGHT and under each baseline.
  5. Reports the four headline metrics (REL, AUC, Brier, ECE) per method.
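
Three of the four headline metrics can be spot-checked independently with standard tooling; REL is framework-specific and comes from the published pipeline. A minimal sketch, assuming y_true holds the 0/1 KEV labels and y_prob the per-record confidences:

  # Minimal sketch for spot-checking three of the four headline metrics.
  # REL is framework-specific and is computed by the published pipeline.
  import numpy as np
  from sklearn.metrics import roc_auc_score, brier_score_loss

  def ece(y_true, y_prob, n_bins=10):
      """Expected calibration error with equal-width bins."""
      y_true = np.asarray(y_true, dtype=float)
      y_prob = np.asarray(y_prob, dtype=float)
      # Assign each prediction to a bin; clamp prob == 1.0 into the top bin.
      idx = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
      total = 0.0
      for b in range(n_bins):
          mask = idx == b
          if mask.any():
              # |mean confidence - empirical accuracy|, weighted by bin mass.
              total += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
      return total

  y_true = np.array([1, 0, 1, 0, 0, 1])              # example KEV labels
  y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.7])  # example confidences
  print(roc_auc_score(y_true, y_prob),    # AUC: rank-order discrimination
        brier_score_loss(y_true, y_prob), # Brier: mean squared confidence error
        ece(y_true, y_prob))              # ECE: calibration gap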

Results

VERDICT WEIGHT outperforms every baseline on every metric on this dataset. The headline numbers are reported in Head-to-head comparison, and the full per-metric table, with confidence intervals, is in Paper 2, Section 4.10.

Honest treatment of AUC = 1.0

On this particular dataset, several methods (including VERDICT WEIGHT and several baselines) achieve AUC very close to 1.0. This is a property of the dataset construction, not a generalization claim: the KEV inclusion criterion correlates strongly, by construction, with several of the input evidence channels (CVSS score, exploit availability). Any method that aggregates those channels at all will produce nearly perfect rank-order discrimination on this label. What this means in practice:
  • AUC = 1.0 is not the basis of the framework’s positioning. The case rests on REL.
  • Methods that achieve high AUC can still differ dramatically on REL — and they do. See Calibration curves for the actual differentiation.
  • Future versions of the validation suite will introduce datasets with weaker label-evidence coupling so that AUC is a meaningful discriminator.
This caveat is documented prominently because hiding it would be exactly the kind of dataset-artifact-as-performance-claim that the framework’s IEEE hardening was designed to prevent.
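
The artifact is easy to reproduce in isolation. The self-contained sketch below uses synthetic data, not the CVE dataset: when the label is by construction a threshold of a single evidence channel, any strictly monotone aggregate of that channel achieves AUC = 1.0, while a calibration-sensitive metric such as the Brier score still separates the two confidence scales.

  # Self-contained demonstration on synthetic data (not the CVE dataset):
  # a label thresholded on an evidence channel gives any monotone
  # aggregator a perfect AUC, while calibration-sensitive metrics differ.
  import numpy as np
  from sklearn.metrics import roc_auc_score, brier_score_loss

  rng = np.random.default_rng(0)
  channel = rng.uniform(size=10_000)     # stand-in for a CVSS-like score
  label = (channel > 0.8).astype(int)    # label coupled to the channel

  conf_a = channel        # one monotone aggregate of the channel
  conf_b = channel ** 8   # another: same ranking, very different scale

  for name, conf in [("conf_a", conf_a), ("conf_b", conf_b)]:
      print(name,
            "AUC =", round(roc_auc_score(label, conf), 3),       # 1.0 for both
            "Brier =", round(brier_score_loss(label, conf), 3))  # differs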

Future datasets

The CVE dataset is the first real-world proxy. Planned additions:
  • An NLP triage dataset (sentiment, intent, content classification) with adversarial perturbations.
  • A medical decision-support proxy with documented label noise.
  • An adversarial-input benchmark for Stream 6 with documented attack budgets.
These are tracked as open work in the public repository.