
Documentation Index

Fetch the complete documentation index at: https://verdictweight.dev/llms.txt

Use this file to discover all available pages before exploring further.

The autonomy gap

Every high-stakes deployment of AI faces the same question: when is the system confident enough to act, and when must it escalate? That question has, until now, had no defensible answer. Softmax probabilities are systematically miscalibrated, understating error rates by orders of magnitude on out-of-distribution inputs. Verbalized self-confidence from LLMs correlates poorly with correctness. Ensemble agreement papers over the disagreement that should have triggered abstention. The result is a brittle deployment posture: either the system acts on confidence it has not earned, or it escalates so often that human review becomes the bottleneck.

Where existing approaches fall short

Softmax probabilities

Modern neural networks are systematically overconfident, especially on out-of-distribution inputs. The probability score reported by the model is rarely a faithful estimate of correctness (a calibration-error sketch follows this section).

Formal evidence calculi

Mathematically rigorous, but combinatorially expensive and brittle under conflicting evidence. Produces meaningful results only when sources are well calibrated, which is the problem we started with.

Ensemble agreement

Assumes source independence that almost never holds in practice. Cross-source correlation collapses the variance estimate and produces overconfident posteriors (see the simulation after this section).

Heuristic score aggregation

Trades calibration for simplicity. Provides no mechanism for penalizing instability, detecting adversarial inputs, or producing an audit trail.

Verbalized self-confidence

Reflects the model’s aspirations about its own correctness, not its actual error distribution. Particularly unreliable under adversarial conditions.
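
To make “miscalibrated” concrete, here is a minimal sketch of expected calibration error (ECE), the standard measure of the gap between stated confidence and observed accuracy. The function name and equal-width binning scheme are illustrative choices, not part of VERDICT WEIGHT:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Average |stated confidence - observed accuracy| across confidence bins.

    A perfectly calibrated model scores 0: among items where it reports
    90% confidence, it is right 90% of the time.
    conf    : array of confidence scores in [0, 1]
    correct : array of 0/1 outcomes (1 = prediction was right)
    """
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its population
    return ece
```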
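
The independence failure is easy to demonstrate numerically. The simulation below (illustrative parameters, not drawn from any real deployment) shows that with five sources whose errors correlate at ρ = 0.8, the true variance of the fused estimate is roughly four times what the independence assumption predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources, rho, sigma = 5, 0.8, 1.0

# Each source's error = shared component + idiosyncratic noise,
# giving pairwise correlation rho between sources.
shared = rng.normal(0.0, sigma * np.sqrt(rho), size=(100_000, 1))
idio = rng.normal(0.0, sigma * np.sqrt(1 - rho), size=(100_000, n_sources))
errors = shared + idio

fused_error = errors.mean(axis=1)      # naive "average the sources" fusion
assumed_var = sigma**2 / n_sources     # what independence predicts: 0.20
actual_var = fused_error.var()         # sigma^2 * (1 + (n-1)*rho) / n, ~0.84

print(f"variance assumed under independence: {assumed_var:.2f}")
print(f"variance actually observed:          {actual_var:.2f}")
```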

What changes with a confidence layer

A defensible confidence layer changes three things at deployment:
  1. Thresholding becomes principled. Instead of tuning an opaque score cut-off, operators choose a calibrated reliability target (e.g. “act only when the system would be correct ≥ 99% of the time”); a threshold-selection sketch follows this list.
  2. Audit becomes possible. Every decision is reproducible, signed, and chained, so regulators, internal reviewers, and counsel can reconstruct why the system acted; a hash-chaining sketch also follows.
  3. Adversarial robustness becomes testable. Confidence can be deliberately attacked. VERDICT WEIGHT™ treats this as a first-class problem (see Stream 6).
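
As an illustration of point 1, here is one way a reliability target can be turned into a score threshold using held-out data. The function name and the prefix-accuracy method are a minimal sketch under the assumption of calibrated scores, not VERDICT WEIGHT’s actual procedure:

```python
import numpy as np

def threshold_for_target(conf, correct, target=0.99):
    """Lowest confidence cut-off whose selective accuracy on a held-out
    set still meets `target`; None if no cut-off qualifies.

    conf    : calibrated confidence scores from a validation set
    correct : 0/1 outcomes for the corresponding predictions
    """
    order = np.argsort(conf)[::-1]            # most confident first
    conf, correct = conf[order], correct[order]
    running_acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    qualifying = np.where(running_acc >= target)[0]
    if qualifying.size == 0:
        return None                           # target unreachable: always escalate
    return conf[qualifying[-1]]               # widest coverage that still meets target

# At decision time: act above the threshold, escalate below it.
# t = threshold_for_target(val_conf, val_correct, target=0.99)
# decision = "act" if score >= t else "escalate"
```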
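
And for point 2, a minimal sketch of the chaining idea: each record’s digest covers the previous record’s digest, so any after-the-fact edit breaks the chain. Field names here are hypothetical, a production system would add cryptographic signatures, and nothing below reflects VERDICT WEIGHT’s actual record format:

```python
import hashlib, json, time

def append_decision(chain, record):
    """Append a decision record hash-linked to its predecessor."""
    prev = chain[-1]["digest"] if chain else "0" * 64
    body = {"ts": time.time(), "prev": prev, **record}
    payload = json.dumps(body, sort_keys=True).encode()
    body["digest"] = hashlib.sha256(payload).hexdigest()
    chain.append(body)
    return body

log = []
append_decision(log, {"input_id": "x-17", "confidence": 0.994, "decision": "act"})
append_decision(log, {"input_id": "x-18", "confidence": 0.41, "decision": "escalate"})
# Verification: recompute each digest from (record minus digest) and check it
# matches both the stored digest and the next record's `prev` field.
```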

Where this matters most

Defense & national security

Adversarial autonomy. Tamper-evident decisioning. Kill-switch primitives.

Regulated industry

Healthcare triage, financial compliance, legal review — anywhere the question “why did the system decide that?” must be answerable.

Critical infrastructure

Energy, transportation, communications — any deployment where a confidently wrong decision is a safety event.

Agentic systems

Multi-step autonomous workflows where per-step overconfidence compounds into systemic risk: a twenty-step pipeline whose steps are each 99% reliable completes correctly only about 82% of the time (0.99^20 ≈ 0.82).