What ablation establishes

The completeness argument in Architecture / completeness proof has two halves: coverage (every failure class is detected by at least one stream) and necessity (no stream can be removed without leaving at least one failure class undetected). Coverage is established structurally. Necessity is established empirically, by ablation. This page summarizes those results.

Ablation procedure

For each of the eight streams, run the benchmark with that stream disabled and measure the change in the four headline metrics:
1. Disable one stream. Configure a scorer with disable_streams={i} for a single stream i.
2. Re-run the benchmark. Run the head-to-head comparison and the per-failure-class detection rate measurement.
3. Record the delta. Record the change in each metric relative to the full eight-stream configuration.
4. Test for significance. Apply Bonferroni-corrected significance testing across the eight ablation conditions.
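The procedure above can be sketched as a loop. This is a minimal illustration only: run_benchmark here is a toy stand-in (with a fabricated per-ablation penalty) for the framework's actual benchmark runner, and the metric names mirror the headline metrics discussed on this page; only the disable_streams parameter comes from the documented API.

```python
# Hypothetical sketch of the ablation loop. run_benchmark is a toy
# stand-in for the framework's benchmark runner; the real one executes
# the head-to-head comparison and detection-rate measurement.
def run_benchmark(disable_streams=frozenset()):
    penalty = 0.05 * len(disable_streams)  # toy degradation per ablated stream
    return {
        "REL": 0.90 - penalty,
        "Brier": 0.10 + penalty,
        "ECE": 0.04 + penalty,
        "detection_rate": 0.95 - penalty,
    }

# Baseline: the full eight-stream configuration.
baseline = run_benchmark()

# Steps 1-3: disable one stream at a time and record the per-metric delta.
deltas = {
    i: {m: run_benchmark(disable_streams={i})[m] - baseline[m] for m in baseline}
    for i in range(1, 9)
}
```

Step 4 (significance testing) is then applied across the eight recorded delta distributions.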

Results summary

For each stream, the ablation produces a measurable, statistically significant degradation in at least one metric, with the specific failure class re-admitted as predicted by the completeness analysis.
| Stream removed | Primary failure re-admitted | Metric most affected |
| --- | --- | --- |
| 1 (Evidence aggregation) | F1: miscalibrated raw confidence | REL, Brier |
| 2 (Uncertainty quantification) | F3: aleatoric/epistemic conflation | OOD reliability |
| 3 (Temporal stability) | F4: confidence drift | Brier (under perturbation) |
| 4 (Cross-source coherence) | F2: source-correlation collapse | REL (under correlated sources) |
| 5 (Calibration) | F1: systematic overconfidence | REL, ECE |
| 6 (SIS / Curveball) | F5: confidence-flip attacks | Adversarial detection rate |
| 7 (CPS / hash chain) | F6: silent tampering | Audit chain verification |
| 8 (RIS / kill switch) | F7: scoring-layer compromise | Recovery from compromise |
The full per-stream ablation table with effect sizes (Cohen’s d), 95% confidence intervals, and Bonferroni-corrected p-values is in Paper 2, Section 4.8.
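The statistics behind the table are standard. A minimal sketch of both quantities, using only the Python standard library (these helpers are illustrative, not the framework's own implementation): Cohen's d is the mean difference over the pooled standard deviation, and the Bonferroni correction multiplies each raw p-value by the number of comparisons (here, the eight ablation conditions), capping at 1.

```python
import statistics

def cohens_d(a, b):
    # Effect size: difference of means over the pooled standard deviation.
    na, nb = len(a), len(b)
    pooled = (((na - 1) * statistics.variance(a) +
               (nb - 1) * statistics.variance(b)) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

def bonferroni(p_values):
    # Multiply each raw p-value by the number of comparisons; cap at 1.
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With eight conditions, a raw p-value of 0.005 becomes 0.04 after correction, so it still clears a 0.05 threshold; a raw 0.01 becomes 0.08 and does not.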

What the ablation does not establish

Ablation establishes that each individual stream contributes meaningfully to the framework’s coverage. It does not establish:
  • That the streams are optimal in their current form. A different version of any single stream might be better; ablation does not test alternative implementations.
  • That eight streams is the minimum number. A future version might collapse two streams into a single equivalent component. Ablation tests the current decomposition, not the necessary one.
  • That each stream is necessary in every deployment. A deployment whose threat model is a strict subset of the documented taxonomy might rationally disable a hardening stream. The framework allows this configuration; the audit chain records it.

Reproducibility

Reproduce the full ablation suite with:

python -m verdict_weight.benchmarks.ablation
The full ablation suite takes longer than the head-to-head benchmark because it runs each ablation condition independently. Expect roughly N× the runtime of the head-to-head benchmark for N ablation conditions. Results match the published numbers within floating-point tolerance.

Why we publish ablations

Ablation studies are one of the seven attack categories that IEEE-grade reviewers apply to confidence-fusion claims. The framework was hardened against all seven before submission. See IEEE hardening for the broader context and the other six categories.