Why a CVE dataset

VERDICT WEIGHT is a confidence-scoring framework intended for high-stakes deployment. Benchmarking it on synthetic data, the standard practice in the confidence-fusion literature, would not be a credible test. Instead, the framework is benchmarked on a real-world proxy dataset constructed from public vulnerability records. The proxy is not “the deployment domain”; no public dataset is. But it is an adversarial-adjacent dataset: vulnerability triage involves correlated evidence sources, distributional shift, and incentives for adversaries to manipulate signal, all of which exercise the framework’s stream coverage.

What the dataset contains

The validation dataset is built from public CVE records in the NVD, cross-referenced against the CISA KEV catalog and mapped to evidence vectors under the documented mapping rules (see Methodology below). The data sources are public, and the dataset construction pipeline is published with the framework, so the benchmark is reproducible end-to-end.
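
For illustration, a single record might carry fields along the following lines. This is a hypothetical sketch: the field names are illustrative, not the published schema, though CVE-2021-44228 (Log4Shell) itself is a real KEV-listed entry.

  # Hypothetical sketch of one record; field names are illustrative,
  # not the published schema.
  record = {
      "cve_id": "CVE-2021-44228",     # Log4Shell, a real KEV-listed CVE
      "cvss_base_score": 10.0,        # numeric evidence channel
      "exploit_available": True,      # boolean evidence channel
      "vendor_severity": "critical",  # structured metadata
      "description_revisions": 7,     # temporal-stability signal (illustrative)
      "kev_listed": True,             # ground-truth label (CISA KEV)
  }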

What it tests

The CVE dataset exercises:
  • Stream 1: heterogeneous evidence (numeric scores, boolean indicators, structured metadata).
  • Stream 2: epistemic uncertainty (some CVEs are well-documented; many are not).
  • Stream 3: temporal stability (CVE descriptions are revised over time).
  • Stream 4: cross-source coherence (CVSS, vendor severity, and exploit databases frequently disagree).
  • Stream 5: calibration (KEV inclusion provides a ground-truth correctness signal).
It is not an adversarial-input benchmark. Stream 6 is benchmarked separately on synthetic adversarial inputs, because no public adversarial CVE corpus exists.
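
To make the stream mapping concrete, a minimal sketch of turning a record like the one above into a heterogeneous evidence vector follows. The function and field names are hypothetical, not the framework’s published API; the authoritative mapping rules ship with the pipeline.

  # Hypothetical sketch of an evidence-vector mapping; the function and
  # field names are illustrative, not the framework's published API.
  def to_evidence_vector(record: dict) -> dict:
      return {
          # Stream 1: heterogeneous evidence, normalized to [0, 1]
          "cvss": record["cvss_base_score"] / 10.0,
          "exploit": 1.0 if record["exploit_available"] else 0.0,
          # Stream 2: epistemic uncertainty -- missing fields are flagged,
          # not silently imputed
          "has_vendor_severity": record.get("vendor_severity") is not None,
          # Stream 3: temporal stability -- how often the text was revised
          "revision_count": record["description_revisions"],
          # Stream 4: cross-source coherence -- do two sources agree?
          "sources_agree": (record["cvss_base_score"] >= 9.0)
                           == (record.get("vendor_severity") == "critical"),
      }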

Methodology

The full methodology is documented in Paper 2, Section 4.10, and the benchmark is reproducible end-to-end:
python -m verdict_weight.benchmarks.real_world
The benchmark:
  1. Pulls the relevant NVD records (or loads from a cached snapshot for reproducibility).
  2. Cross-references with the CISA KEV catalog at the same snapshot date.
  3. Constructs evidence vectors per the documented mapping rules.
  4. Scores each record under VERDICT WEIGHT and under each baseline.
  5. Reports the four headline metrics (REL, AUC, Brier, ECE) per method.
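
Three of the four headline metrics can be spot-checked independently with standard tooling; REL is framework-specific and comes from the published pipeline. A minimal sketch, assuming y_true holds the 0/1 KEV labels and y_prob the per-record confidences:

  # Minimal sketch for spot-checking three of the four headline metrics.
  # REL is framework-specific and is computed by the published pipeline.
  import numpy as np
  from sklearn.metrics import roc_auc_score, brier_score_loss

  def ece(y_true, y_prob, n_bins=10):
      """Expected calibration error with equal-width bins."""
      y_true = np.asarray(y_true, dtype=float)
      y_prob = np.asarray(y_prob, dtype=float)
      # Assign each prediction to a bin; clamp prob == 1.0 into the top bin.
      idx = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
      total = 0.0
      for b in range(n_bins):
          mask = idx == b
          if mask.any():
              # |mean confidence - empirical accuracy|, weighted by bin mass.
              total += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
      return total

  y_true = np.array([1, 0, 1, 0, 0, 1])              # example KEV labels
  y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.7])  # example confidences
  print(roc_auc_score(y_true, y_prob),    # AUC: rank-order discrimination
        brier_score_loss(y_true, y_prob), # Brier: mean squared confidence error
        ece(y_true, y_prob))              # ECE: calibration gap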

Results

VERDICT WEIGHT outperforms every baseline on every metric on this dataset. The headline numbers are reported in Head-to-head comparison, and the full per-metric table, with confidence intervals, is in Paper 2, Section 4.10.

Honest treatment of AUC = 1.0

On this particular dataset, several methods (including VERDICT WEIGHT and several baselines) achieve AUC very close to 1.0. This is a property of the dataset construction, not a generalization claim: the KEV inclusion criterion correlates strongly, by construction, with several of the input evidence channels (CVSS score, exploit availability). Any method that aggregates those channels at all will produce nearly perfect rank-order discrimination on this label. What this means in practice:
  • AUC = 1.0 is not the basis of the framework’s positioning. The case rests on REL.
  • Methods that achieve high AUC can still differ dramatically on REL — and they do. See Calibration curves for the actual differentiation.
  • Future versions of the validation suite will introduce datasets with weaker label-evidence coupling so that AUC is a meaningful discriminator.
This caveat is documented prominently because hiding it would be exactly the kind of dataset-artifact-as-performance-claim that the framework’s IEEE hardening was designed to prevent.
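
The artifact is easy to reproduce in isolation. The self-contained sketch below uses synthetic data, not the CVE dataset: when the label is by construction a threshold of a single evidence channel, any strictly monotone aggregate of that channel achieves AUC = 1.0, while a calibration-sensitive metric such as the Brier score still separates the two confidence scales.

  # Self-contained demonstration on synthetic data (not the CVE dataset):
  # a label thresholded on an evidence channel gives any monotone
  # aggregator a perfect AUC, while calibration-sensitive metrics differ.
  import numpy as np
  from sklearn.metrics import roc_auc_score, brier_score_loss

  rng = np.random.default_rng(0)
  channel = rng.uniform(size=10_000)     # stand-in for a CVSS-like score
  label = (channel > 0.8).astype(int)    # label coupled to the channel

  conf_a = channel        # one monotone aggregate of the channel
  conf_b = channel ** 8   # another: same ranking, very different scale

  for name, conf in [("conf_a", conf_a), ("conf_b", conf_b)]:
      print(name,
            "AUC =", round(roc_auc_score(label, conf), 3),       # 1.0 for both
            "Brier =", round(brier_score_loss(label, conf), 3))  # differs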

Future datasets

The CVE dataset is the first real-world proxy. Planned additions:
  • An NLP triage dataset (sentiment, intent, content classification) with adversarial perturbations.
  • A medical decision-support proxy with documented label noise.
  • An adversarial-input benchmark for Stream 6 with documented attack budgets.
These are tracked as open work in the public repository.