Consulting / AI Model Verification
Interpretability Claim Verification
Verify interpretability claims before you rely on them.
We test whether candidate features, probes, steering targets, and internal monitors actually track the claimed concept under hard negatives, controlled confounds, held-out evaluation, and baseline comparisons.
The verification gap
Interpretability tools can surface candidate latents, features, probes, and explanations. The hard part is proving that the explanation survives serious attempts to falsify it.
- A feature looks clean against random negatives, then collapses against confusable near-misses.
- A claimed meaning survives easy examples but tracks token position, source format, domain, or another shortcut.
- A probe, monitor, or steering target works on the discovery set but fails under held-out distribution shifts.
- A score looks plausible even though directionality, threshold tuning, or selection bias makes the conclusion unreliable.
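The first failure mode above can be made concrete with a minimal sketch, using synthetic activations rather than a real model. The setup is hypothetical: a difference-of-means probe is fit on positives versus random negatives, where the discovery data lets a confound (say, source format) co-occur with the concept. The probe then rejects random negatives easily but falls to roughly chance on confusable near-misses that carry the confound without the concept.

```python
import numpy as np

rng = np.random.default_rng(0)
n, noise = 500, 0.1

# Synthetic 2-D activations: dim 0 = the claimed concept, dim 1 = a
# confound (e.g. source format) that co-occurs with the concept in the
# discovery data. All values here are illustrative.
pos       = np.array([1.0, 1.0]) + noise * rng.standard_normal((n, 2))
rand_neg  = np.array([0.0, 0.0]) + noise * rng.standard_normal((n, 2))
near_miss = np.array([0.0, 1.0]) + noise * rng.standard_normal((n, 2))  # confound only

# Difference-of-means probe fit on positives vs *random* negatives only.
w = pos.mean(0) - rand_neg.mean(0)
thresh = (pos.mean(0) + rand_neg.mean(0)) @ w / 2

def acc_as_negative(x):
    """Fraction of examples the probe correctly scores below threshold."""
    return float((x @ w < thresh).mean())

print(f"random negatives rejected:    {acc_as_negative(rand_neg):.2f}")   # ~1.00
print(f"near-miss negatives rejected: {acc_as_negative(near_miss):.2f}")  # ~0.50
```

The probe looks clean against easy negatives because it partly learned the confound; only the near-miss family exposes that.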
What we verify
Bring a specific claim: "this feature means X," "this probe detects Y," or "this steering target controls Z." We test whether that claim survives the strongest confounds we can build within scope.
SAE and dictionary features
Feature-to-concept claims from sparse autoencoders, dictionaries, and related interpretability pipelines.
Probes and internal monitors
Claims that a classifier, probe, or monitor tracks a specific internal state rather than an easier proxy.
Steering and model-edit targets
Claims that a steering vector, feature intervention, or model edit controls the intended behavior without unintended side effects.
Research and release claims
Interpretability results your team wants to publish, rely on in a safety case, or use as evidence before a high-stakes launch.
The protocol
We apply CFSE discipline to model-internals claims: explicit claims, falsification pressure, bounded scope, and auditable evidence.
01
State the claim
Turn an interpretation into a bounded claim: what the feature, probe, or intervention is supposed to track, where the claim applies, and what would count as failure.
02
Build falsification pressure
Design positives, matched negatives, confusable near-miss families, format shifts, and control conditions around the claim.
03
Lock the evaluation
Keep discovery, threshold tuning, and prompt iteration separate from held-out verification. Fix search budgets and baselines before scoring.
04
Compare against simpler explanations
Test whether a baseline or proxy explains the result as well as the claimed feature: domain, format, length, position, label leakage, or memorized artifacts.
05
Issue a bounded verdict
Deliver a scoped verdict with per-family results, failure modes, and the evidence needed for your team to rerun or challenge the conclusion.
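Step 04 in particular can be sketched in a few lines. This is a hypothetical, self-contained example on synthetic held-out data, not the actual protocol tooling: a claimed feature's discriminative power is compared against a trivial baseline (here, prompt length), and the claim is treated as refuted if the baseline matches the feature.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical held-out examples: labels encode the claimed concept,
# and the feature activation mostly tracks a simple proxy (length).
labels   = rng.integers(0, 2, n)
length   = labels * 2.0 + rng.standard_normal(n)         # proxy correlates with label
feat_act = 0.9 * length + 0.1 * rng.standard_normal(n)   # feature rides the proxy

def auc(scores, y):
    """Probability that a random positive outscores a random negative."""
    pos, neg = scores[y == 1], scores[y == 0]
    return float((pos[:, None] > neg[None, :]).mean())

feature_auc  = auc(feat_act, labels)
baseline_auc = auc(length, labels)

# If a trivial baseline explains the result as well as the claimed
# feature, the concept claim is not supported on this evidence.
verdict = ("refuted: length confound"
           if baseline_auc >= feature_auc - 0.02
           else "survives baseline")
print(f"feature AUC {feature_auc:.2f}, baseline AUC {baseline_auc:.2f} -> {verdict}")
```

The point of the sketch is the comparison itself: a high feature score alone is not evidence until simpler explanations have been scored on the same locked held-out split.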
What this is not
The scope is intentionally narrow. That is what makes the evidence useful.
- Not a generic chatbot safety scan.
- Not a compliance certification or audit signoff.
- Not a runtime guardrail or prompt-injection firewall.
- Not a claim that a feature universally means a concept across every model, domain, or deployment setting.
What you receive
Evidence you can inspect, challenge, and rerun.
Claim Register
A scoped list of the feature, probe, steering, or monitor claims under review, including the proposed meaning and validity boundary.
Hard-Negative Test Suites
Positives, matched negatives, near-miss families, format shifts, and controls for each claim.
Verification Report
Per-claim verdicts with family-level metrics, baseline comparisons, direction checks, and confidence notes.
Failure Analysis
For refuted or weak claims: the confound that broke the interpretation and what that implies for research, monitoring, or steering use.
Evidence Pack
Inputs, outputs, scoring code, run logs, and result tables your team can inspect, rerun, or extend.
Engagement modes
Start with the smallest claim set that would change a decision.
1-2 weeks
Spot Check
Verify 3-5 high-value interpretability claims before relying on them in a report, safety review, or product decision.
3-4 weeks
Verification Sprint
Evaluate a feature set or monitor family with custom test suites, baseline comparisons, and a complete evidence bundle.
Ongoing
Research Retainer
Continuous verification as your team discovers new features, updates models, or turns internal representations into monitors or controls.
Custom
Protocol Design
Design internal verification workflows for interpretability research, monitoring, and release evidence.
FAQ
Bring the claims your team wants to rely on. We will help scope which claims can be verified, what access is required, and what evidence would be decision-grade.
Book a scoping call