
Interpretability Claim Verification

Verify interpretability claims before you rely on them.

We test whether candidate features, probes, steering targets, and internal monitors actually track the claimed concept under hard negatives, controlled confounds, held-out evaluation, and baseline comparisons.

The verification gap

Interpretability tools can surface candidate latents, features, probes, and explanations. The hard part is proving that the explanation survives serious attempts to falsify it.

  • A feature looks clean against random negatives, then collapses against confusable near-misses (see the sketch after this list).
  • A claimed meaning survives easy examples but actually tracks token position, source format, domain, or another shortcut.
  • A probe, monitor, or steering target works on the discovery set but fails under held-out distribution shift.
  • A score looks plausible even though directionality errors, threshold tuning, or selection bias make the conclusion unreliable.
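
To make the first failure mode concrete, here is a minimal sketch, assuming one-dimensional feature activations: the same feature separates positives cleanly from random negatives and barely at all from near-misses. All arrays are synthetic placeholders, not real model data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for one feature's activation on each example set.
positives   = rng.normal(loc=2.0, size=200)  # concept genuinely present
random_negs = rng.normal(loc=0.0, size=200)  # unrelated text
near_misses = rng.normal(loc=1.8, size=200)  # confusable neighboring concept

def auc(pos, neg):
    # Area under the ROC curve for separating pos from neg by raw activation.
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones_like(pos), np.zeros_like(neg)])
    return roc_auc_score(labels, scores)

print(f"AUC vs random negatives: {auc(positives, random_negs):.2f}")  # ~0.92: looks clean
print(f"AUC vs near-misses:      {auc(positives, near_misses):.2f}")  # ~0.56: collapses
```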

What we verify

Bring a specific claim: "this feature means X," "this probe detects Y," or "this steering target controls Z." We test whether that claim survives the strongest confounds we can build within scope.

SAE and dictionary features

Feature-to-concept claims from sparse autoencoders, dictionaries, and related interpretability pipelines.

Probes and internal monitors

Claims that a classifier, probe, or monitor tracks a specific internal state rather than an easier proxy.

Steering and model-edit targets

Claims that a steering vector, feature intervention, or model edit controls the intended behavior without unwanted confounds.
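
As an illustration of the control conditions involved, here is a hedged PyTorch sketch: a candidate steering vector is added at one layer via a forward hook, alongside a zero vector and a norm-matched random vector that separate real effects from noise. The toy model, layer choice, and vectors are hypothetical placeholders, not our actual harness.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for a real model; the hook pattern is the same either way.
model = torch.nn.Sequential(torch.nn.Linear(16, 16),
                            torch.nn.ReLU(),
                            torch.nn.Linear(16, 4))
candidate = torch.randn(16)  # hypothetical steering direction under test

def make_hook(vec):
    # Add the vector to the layer's output on every forward pass.
    def hook(module, inputs, output):
        return output + vec
    return hook

random_ctrl = torch.randn(16)
random_ctrl = random_ctrl / random_ctrl.norm() * candidate.norm()  # norm-matched

x = torch.randn(32, 16)
for name, vec in [("no-op control", torch.zeros(16)),
                  ("random control", random_ctrl),
                  ("candidate", candidate)]:
    handle = model[0].register_forward_hook(make_hook(vec))
    effect = model(x).softmax(dim=-1)[:, 0].mean().item()  # proxy for target behavior
    handle.remove()
    print(f"{name}: mean target score = {effect:.3f}")
```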

Research and release claims

Interpretability results your team wants to publish, rely on in a safety case, or use as evidence before a high-stakes launch.

The protocol

We apply CFSE discipline to model-internals claims: explicit claims, falsification pressure, bounded scope, and auditable evidence.

01

State the claim

Turn an interpretation into a bounded claim: what the feature is supposed to track, where the claim applies, and what would count as failure.
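
As an illustration, a bounded claim can be written down as structured data before any testing starts. This is a sketch of one possible shape; the fields and example values are hypothetical, not a required schema.

```python
from dataclasses import dataclass, field

@dataclass
class BoundedClaim:
    subject: str           # the feature, probe, or steering target under test
    claimed_meaning: str   # what it is supposed to track
    validity_scope: str    # where the claim is supposed to hold
    failure_criteria: list = field(default_factory=list)  # what would refute it

claim = BoundedClaim(
    subject="SAE feature #4121, layer 9",  # hypothetical identifier
    claimed_meaning="medical dosage instructions",
    validity_scope="English instruction-style prompts",
    failure_criteria=[
        "fires on non-medical quantity instructions (near-miss family)",
        "held-out AUC vs matched negatives below 0.75",
    ],
)
print(claim)
```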

02

Build falsification pressure

Design positives, matched negatives, confusable near-miss families, format shifts, and control conditions around the claim.
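
A minimal sketch of what those families can look like for a hypothetical "dosage instructions" claim; each family attacks one way the interpretation could be wrong, and the example strings are illustrative placeholders, not a real suite.

```python
suite = {
    "positives":     ["Take 20 mg twice daily with food."],
    "matched_negs":  ["The shuttle runs twice daily from the lobby."],  # same cadence, no dosage
    "near_misses":   ["Apply two coats of primer daily."],              # quantity instruction, not medical
    "format_shifts": ["DOSAGE: 20MG / 2X DAILY / WITH FOOD"],           # same concept, new surface form
    "controls":      ["mg with take food daily 20 twice"],              # tokens kept, concept destroyed
}

for family, examples in suite.items():
    print(f"{family:14s} {len(examples)} example(s): {examples[0]!r}")
```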

03

Lock the evaluation

Keep discovery, threshold tuning, and prompt iteration separate from held-out verification. Fix search budgets and baselines before scoring.
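
A sketch of that separation in code, using synthetic scores: the threshold is tuned on the discovery split only, frozen, and the held-out split is scored exactly once.

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(size=400) + np.repeat([1.0, 0.0], 200)  # synthetic activations
labels = np.repeat([1, 0], 200)

idx = rng.permutation(400)
disc, held = idx[:200], idx[200:]

# Tune the decision threshold on the discovery split only.
candidates = np.quantile(scores[disc], np.linspace(0.05, 0.95, 19))
accs = [((scores[disc] >= t) == labels[disc]).mean() for t in candidates]
threshold = candidates[int(np.argmax(accs))]  # frozen from here on

# Score the held-out split exactly once with the frozen threshold.
held_acc = ((scores[held] >= threshold) == labels[held]).mean()
print(f"frozen threshold = {threshold:.2f}, held-out accuracy = {held_acc:.2f}")
```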

04

Compare against simpler explanations

Test whether a baseline or proxy explains the result as well as the claimed feature: domain, format, length, position, label leakage, or memorized artifacts.
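
A sketch of the comparison, with simulated data: a claimed feature only counts as evidence if it beats trivial proxies such as input length under the same evaluation. Both signals below are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 400
labels = np.repeat([1, 0], n // 2)

# Simulated signals: the claimed feature and a shallow proxy (e.g. token count).
feature = labels + rng.normal(scale=0.8, size=n)
proxy   = labels + rng.normal(scale=0.9, size=n)  # nearly as predictive: a red flag

for name, x in [("claimed feature", feature), ("length proxy", proxy)]:
    acc = cross_val_score(LogisticRegression(), x.reshape(-1, 1), labels, cv=5).mean()
    print(f"{name}: 5-fold accuracy = {acc:.2f}")
```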

05

Issue a bounded verdict

Deliver a scoped verdict with per-family results, failure modes, and the evidence needed for your team to rerun or challenge the conclusion.
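
As an illustration, a bounded verdict can be recorded as plain data your team can diff and rerun against. The claim, family names, and numbers below are hypothetical.

```python
verdict = {
    "claim": "SAE feature #4121 tracks medical dosage instructions",  # hypothetical
    "scope": "English instruction-style prompts",
    "per_family": {
        "matched_negs":  {"held_out_auc": 0.91, "passed": True},
        "format_shifts": {"held_out_auc": 0.88, "passed": True},
        "near_misses":   {"held_out_auc": 0.62, "passed": False},  # the confound that broke it
    },
    "verdict": "refuted within scope: tracks instruction format, not dosage content",
}

for family, result in verdict["per_family"].items():
    print(family, result)
```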

What this is not

The scope is intentionally narrow. That is what makes the evidence useful.

  • Not a generic chatbot safety scan.
  • Not a compliance certification or audit signoff.
  • Not a runtime guardrail or prompt-injection firewall.
  • Not a claim that a feature universally means a concept across every model, domain, or deployment setting.

What you receive

Evidence you can inspect, challenge, and rerun.

Claim Register

A scoped list of the feature, probe, steering, or monitor claims under review, including the proposed meaning and validity boundary.

Hard-Negative Test Suites

Positives, matched negatives, near-miss families, format shifts, and controls for each claim.

Verification Report

Per-claim verdicts with family-level metrics, baseline comparisons, direction checks, and confidence notes.

Failure Analysis

For refuted or weak claims: the confound that broke the interpretation and what that implies for research, monitoring, or steering use.

Evidence Pack

Inputs, outputs, scoring code, run logs, and result tables your team can inspect, rerun, or extend.

Engagement modes

Start with the smallest claim set that would change a decision.

1-2 weeks

Spot Check

Verify 3-5 high-value interpretability claims before relying on them in a report, safety review, or product decision.

3-4 weeks

Verification Sprint

Evaluate a feature set or monitor family with custom test suites, baseline comparisons, and a complete evidence bundle.

Ongoing

Research Retainer

Continuous verification as your team discovers new features, updates models, or turns internal representations into monitors or controls.

Custom

Protocol Design

Design internal verification workflows for interpretability research, monitoring, and release evidence.

FAQ

How is this different from standard evals or red-teaming?
Standard evals and red-team exercises test external model behavior. This service tests interpretability claims: whether an internal feature, probe, monitor, or steering target really tracks the meaning your team assigns to it.

Which interpretability methods can you verify?
The protocol is tool-agnostic. We can verify claims from SAE-based workflows, probing approaches, internal monitors, steering vectors, and other methods that produce a concrete feature-to-concept mapping.

What does a "verified" verdict mean?
It means the claim survived the agreed verification suite. It does not mean the feature universally represents the concept under every model, domain, or distribution. We state the boundary explicitly.

What if a claim fails verification?
That is useful evidence. The report identifies which near-miss or control family broke the claim and whether the feature appears to track a shortcut such as format, domain, position, or label leakage.

Do you need direct access to model internals?
Not always. We need access to the feature extraction path: activations, SAE or probe outputs, monitor scores, or a hosted environment where the required signals can be exposed. API-only access may be enough for some behavior-level claims, but not for all internals claims.

Bring the claims your team wants to rely on. We will help scope which claims can be verified, what access is required, and what evidence would be decision-grade.

Book a scoping call