Documentation Index
Fetch the complete documentation index at: https://verdictweight.dev/llms.txt
Use this file to discover all available pages before exploring further.
Why a CVE dataset
VERDICT WEIGHT is a confidence-scoring framework intended for high-stakes deployment. Benchmarking it on synthetic data — the standard practice in confidence-fusion literature — would not be a credible test. The framework is benchmarked instead on a real-world proxy dataset constructed from public vulnerability records. The proxy is not “the deployment domain.” No public dataset is. But it is an adversarial-adjacent dataset: vulnerability triage involves correlated evidence sources, distributional shift, and incentives for adversaries to manipulate signal — all of which exercise the framework’s stream coverage.
What the dataset contains
The validation dataset comprises:
- 120 real CVEs drawn from the NIST National Vulnerability Database (NVD).
- A subset overlapping with the CISA Known Exploited Vulnerabilities (KEV) catalog, used as a positive-class anchor for the gating decision.
- Per-record evidence vectors derived from CVSS metrics, exploit-availability indicators, vendor-supplied severity, and observed exploitation status.
- Ground-truth labels derived from KEV inclusion and from documented exploitation events.
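The record-to-vector mapping above can be sketched in a few lines. This is an illustrative mapping only: the field names, the severity scale, and the `evidence_vector` / `ground_truth_label` helpers are assumptions for exposition, not the documented mapping rules in Paper 2, Section 4.10.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the real mapping rules are in Paper 2, §4.10.
SEVERITY_SCALE = {"low": 0.25, "medium": 0.5, "high": 0.75, "critical": 1.0}

@dataclass
class CveRecord:
    cve_id: str
    cvss_score: float           # 0.0-10.0 base score from NVD
    exploit_available: bool     # exploit-availability indicator
    vendor_severity: str        # vendor-supplied severity label
    observed_exploitation: bool # documented exploitation event

def evidence_vector(rec: CveRecord) -> list[float]:
    """Map one CVE record to a numeric evidence vector (illustrative)."""
    return [
        rec.cvss_score / 10.0,                  # normalise CVSS to [0, 1]
        1.0 if rec.exploit_available else 0.0,
        SEVERITY_SCALE.get(rec.vendor_severity.lower(), 0.0),
        1.0 if rec.observed_exploitation else 0.0,
    ]

def ground_truth_label(rec: CveRecord, kev_ids: set[str]) -> int:
    """Positive class = KEV inclusion or a documented exploitation event."""
    return int(rec.cve_id in kev_ids or rec.observed_exploitation)

# Log4Shell is a KEV-listed CVE with a 10.0 base score.
rec = CveRecord("CVE-2021-44228", 10.0, True, "critical", True)
print(evidence_vector(rec))                          # [1.0, 1.0, 1.0, 1.0]
print(ground_truth_label(rec, {"CVE-2021-44228"}))   # 1
```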
What it tests
The CVE dataset exercises:
- Stream 1: heterogeneous evidence (numeric scores, boolean indicators, structured metadata).
- Stream 2: epistemic uncertainty (some CVEs are well-documented; many are not).
- Stream 3: temporal stability (CVE descriptions are revised over time).
- Stream 4: cross-source coherence (CVSS, vendor severity, and exploit databases frequently disagree).
- Stream 5: calibration (KEV inclusion provides ground-truth correctness signal).
Methodology
Full methodology is in Paper 2, Section 4.10, and is reproducible end-to-end. The pipeline:
- Pulls the relevant NVD records (or loads from a cached snapshot for reproducibility).
- Cross-references with the CISA KEV catalog at the same snapshot date.
- Constructs evidence vectors per the documented mapping rules.
- Scores each record under VERDICT WEIGHT and under each baseline.
- Reports the four headline metrics (REL, AUC, Brier, ECE) per method.
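The five steps above compose into a single benchmark run, which can be sketched as follows. The function name `run_benchmark` and the injected loaders (`load_nvd`, `load_kev`, `build_vector`) are hypothetical stand-ins for the published pipeline; only the step order mirrors the list above.

```python
# Sketch of the benchmark pipeline; names are illustrative assumptions,
# not the published implementation.
def run_benchmark(snapshot_date, methods, metrics,
                  load_nvd, load_kev, build_vector):
    records = load_nvd(snapshot_date)       # NVD pull or cached snapshot
    kev_ids = load_kev(snapshot_date)       # KEV catalog at the same date
    vectors = [build_vector(r) for r in records]
    # Ground truth: KEV inclusion or a documented exploitation event.
    labels = [int(r["cve_id"] in kev_ids or r["exploited"]) for r in records]
    results = {}
    for name, score_fn in methods.items():  # VERDICT WEIGHT and each baseline
        scores = [score_fn(v) for v in vectors]
        # One row per method: e.g. REL, AUC, Brier, ECE.
        results[name] = {m: fn(labels, scores) for m, fn in metrics.items()}
    return results
```

Dependency injection keeps the run reproducible: swapping `load_nvd` for a cached-snapshot loader changes nothing downstream.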
Results
VERDICT WEIGHT outperforms each baseline on each metric on this dataset. The headline numbers are reported in Head-to-head comparison and the full per-metric table with confidence intervals is in Paper 2, Section 4.10.
Honest treatment of AUC = 1.0
On this particular dataset, several methods (including VERDICT WEIGHT and several baselines) achieve AUC very close to 1.0. This is a property of the dataset construction, not a generalization claim. The KEV inclusion criterion correlates strongly with several of the input evidence channels (CVSS score, exploit availability) by construction. Any method that aggregates those channels at all will produce near-perfect rank-order discrimination on this label. What this means in practice:
- A near-perfect AUC is not the basis of the framework’s positioning. The case rests on REL.
- Methods that achieve high AUC can still differ dramatically on REL — and they do. See Calibration curves for the actual differentiation.
- Future versions of the validation suite will introduce datasets with weaker label-evidence coupling so that AUC is a meaningful discriminator.
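The decoupling of AUC from calibration can be demonstrated directly: two scorers with identical rankings have identical AUC, yet very different calibration error. The toy `auc` and `ece` functions below are minimal textbook implementations for illustration, not the paper's metric code, and REL is stood in for here by ECE as a generic calibration measure.

```python
def auc(y, scores):
    """Area under the ROC curve via pairwise rank comparison."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ece(y, scores, bins=10):
    """Expected calibration error over equal-width confidence bins."""
    total, err = len(y), 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, s in enumerate(scores)
               if lo <= s < hi or (b == bins - 1 and s == 1.0)]
        if not idx:
            continue
        conf = sum(scores[i] for i in idx) / len(idx)  # mean confidence in bin
        acc = sum(y[i] for i in idx) / len(idx)        # empirical accuracy in bin
        err += len(idx) / total * abs(conf - acc)
    return err

y = [0, 0, 1, 1]
calibrated    = [0.10, 0.20, 0.80, 0.90]   # same ranking of positives...
overconfident = [0.51, 0.52, 0.98, 0.99]   # ...so the same AUC
print(auc(y, calibrated), auc(y, overconfident))  # 1.0 1.0
print(ece(y, calibrated) < ece(y, overconfident)) # True
```

Both scorers separate the classes perfectly, so AUC cannot distinguish them; only a calibration metric can.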
Future datasets
The CVE dataset is the first real-world proxy. Planned additions:
- An NLP triage dataset (sentiment, intent, content classification) with adversarial perturbations.
- A medical decision-support proxy with documented label noise.
- An adversarial-input benchmark for Stream 6 with documented attack budgets.