Empirical results

Findings

Seven findings from ~100 ARM runs (105 JSON traces in the full repo, 89 in Appendix D), spanning v0.1 through v0.8 and three providers. Each finding is stated with its honest scope: the caveats are not disclaimers but the calibration.

ARM's epistemic ceiling: it can show the shape of cross-vendor disagreement but cannot escape monoculture from shared training data. The value is accountability legibility, not safety certification.

Epistemic tightening dominates

Quantitative

Across 304 R2 agent-rounds (v0.1–v0.8 corpus), 64.8% tightened (Δ < 0) and 76.3% held or tightened (Δ ≤ 0). Deliberation usually calibrates rather than inflates confidence. This is the baseline: exposure to peer reasoning tends to make agents more cautious, not more certain.

⚠ Scope / caveat

Read these figures as within-agent telemetry, not cross-model probability. Each agent's confidence is self-reported against its own scale, so the numbers are not directly comparable across providers.
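
As a minimal sketch of how the two percentages above can be computed, the snippet below takes per-agent confidence before and after deliberation and reports the share of agent-rounds with Δ < 0 and with Δ ≤ 0. The record fields (`agent`, `r1_confidence`, `r2_confidence`), the agent labels, and the toy values are assumptions for illustration, not the ARM trace schema or corpus data. Each Δ is computed within one agent, in line with the caveat above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRound:
    """One agent's self-reported confidence before (R1) and after (R2) deliberation.

    Field names are hypothetical, not the ARM trace schema.
    """
    agent: str
    r1_confidence: float
    r2_confidence: float

    @property
    def delta(self) -> float:
        # Negative delta = the agent became less confident ("tightened").
        return self.r2_confidence - self.r1_confidence


def tightening_rates(rounds, tol=1e-9):
    """Return the share of agent-rounds that tightened and that held or tightened."""
    if not rounds:
        raise ValueError("no agent-rounds to summarise")
    tightened = sum(1 for r in rounds if r.delta < -tol)
    held = sum(1 for r in rounds if abs(r.delta) <= tol)
    n = len(rounds)
    return {
        "tightened": tightened / n,
        "held_or_tightened": (tightened + held) / n,
    }


# Toy data, invented for the example.
rounds = [
    AgentRound("agent_a", 0.80, 0.70),  # tightened
    AgentRound("agent_b", 0.60, 0.60),  # held
    AgentRound("agent_c", 0.75, 0.85),  # inflated
    AgentRound("agent_a", 0.90, 0.65),  # tightened
]
rates = tightening_rates(rounds)
print(rates)  # {'tightened': 0.5, 'held_or_tightened': 0.75}
```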

Model-level epistemic fingerprinting

Headline ★

On a CFAA zero-day question, three same-model meshes reached confident consensus in opposite directions: all-Claude unanimously NO, all-Gemini unanimously YES, all-GPT split internally. A system built on one provider would refuse; an isomorphic system on another would act — both reporting high-confidence consensus, neither signaling that the answer was provider-dependent.

⚠ Scope / caveat

Preliminary pilot, not an established result. Three confounds, stated up front: (1) provider is confounded with capability tier — frontier Claude vs. smaller Gemini/GPT models; (2) n=1 per provider-question cell against a 0.15–0.20 variance floor; (3) the disagreement label comes from a single Gamma instance with no inter-rater reliability. The directional opposition — a property of the R1 claims themselves — is the most robust part, but needs same-tier, multi-run replication.

The Alignment Monoculture is real but bounded

Scope boundary

On the Meta-layoffs question, all three providers converged — so provider-dependent consensus is not a universal law. The monoculture effect appears strongest on questions where there is a dominant "safe" answer baked into a provider's post-training distribution. When the question is genuinely ambiguous across training corpora, providers can converge to the same position for different reasons.

⚠ Scope / caveat

One question is not a controlled study. The convergence itself may reflect a genuinely correct answer rather than monoculture. Disentangling shared truth from shared prior is the central challenge ARM is designed to flag, not resolve.

Convergence ≠ agreement

Methodological

Illustrated three ways in the run corpus: (1) convergence 0.917 with a genuine YES/NO split still present, meaning high overlap in vocabulary but opposite conclusions; (2) convergence 0.000 with disagreement classified as "none", in two traces with near-zero shared language and no substantive disagreement; (3) convergence 0.511 with a surviving YES/NO split. Jaccard similarity measures shared vocabulary, not consensus. The directional unanimity flag was added specifically to catch the third case.

⚠ Scope / caveat

These are illustrative examples from the run corpus, not a systematic audit of all cases. The directional unanimity flag was validated informally, not against a pre-registered test set.
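
The gap between vocabulary overlap and agreement can be sketched in a few lines. The tokenisation, the function names, and the two example sentences below are assumptions made for illustration; ARM's actual convergence metric and flag may be computed differently.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (a simple, assumed tokenisation)."""
    ta = set(re.findall(r"[a-z0-9']+", a.lower()))
    tb = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def directionally_unanimous(positions) -> bool:
    """True only if every agent reports the same direction (e.g. all YES)."""
    return len({p.strip().upper() for p in positions}) == 1


# Two invented claims that share almost all vocabulary but conclude the opposite.
claim_a = "the proposed action is permitted under the stated policy"
claim_b = "the proposed action is not permitted under the stated policy"

overlap = jaccard(claim_a, claim_b)
unanimous = directionally_unanimous(["YES", "NO"])
print(round(overlap, 3), unanimous)  # 0.889 False
```

A single negation flips the conclusion while leaving the word sets nearly identical, which is why a separate check on direction is needed alongside the similarity score.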

The Gamma-flip failure mode

Failure mode

The reconciler (γ) reversed its own R1 position between rounds while reporting reconciliation_status: "success" — 10+ confirmed cases across 4 domains and all 3 providers. This is a silent failure: the system logs consensus but the reconciler has actually changed sides. v0.8's polarity-check gate now catches position reversals before they are logged as successful reconciliation.

⚠ Scope / caveat

10+ cases is a small corpus. The polarity gate was designed around these cases — it has not been validated on a held-out set or against a systematic rate in natural use. The gate catches reversals it can detect; it cannot catch subtle drift that falls below the polarity threshold.
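
A polarity check of this kind can be sketched as follows. Only the `reconciliation_status: "success"` value comes from the traces described above; the function name, its arguments, and the `"blocked_polarity_reversal"` label are hypothetical, and the real v0.8 gate may work differently.

```python
def polarity_gate(r1_position: str, r2_position: str, reported_status: str) -> str:
    """Return the status to log for the reconciler.

    If the reconciler moved from YES to NO (or NO to YES) between rounds,
    a reported "success" is replaced with a blocking status.
    """
    r1 = r1_position.strip().upper()
    r2 = r2_position.strip().upper()
    reversed_polarity = {r1, r2} == {"YES", "NO"}
    if reversed_polarity and reported_status == "success":
        # Hypothetical label: the run is interrupted rather than logged as consensus.
        return "blocked_polarity_reversal"
    return reported_status


print(polarity_gate("YES", "NO", "success"))  # blocked_polarity_reversal
print(polarity_gate("NO", "NO", "success"))   # success
```

The sketch only compares discrete YES/NO labels, so a position that shifts without changing label passes through unchanged, which matches the limitation stated in the caveat.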

Gates are decoupled (v0.8)

Prototype finding

The polarity-check gate (position reversals) and FAP (magnitude drift) fire on non-overlapping runs across an 11-trace battery. This is the "smoke detector → sprinkler" upgrade: from logging failures to interrupting at the gate. The polarity gate catches gamma-flips; FAP catches high-magnitude drift in cases that do not cross the polarity threshold. The two gates address distinct failure modes.

⚠ Scope / caveat

Prototype, not validation. n=11 constructed test cases — chosen to exercise both gates, not drawn randomly from natural use. The decoupling result is a property of the constructed battery, not a claim about how often each gate fires in production.
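
One way to check a decoupling claim like this on a battery is to compare the sets of traces on which each gate fired. The helper below is a sketch under that assumption; the trace IDs and firing sets are invented for the example and are not the 11-trace battery's results.

```python
def gate_overlap(battery, polarity_fired, fap_fired):
    """Partition a test battery by which gate(s) fired on each trace."""
    battery = set(battery)
    polarity_fired = set(polarity_fired)
    fap_fired = set(fap_fired)
    unknown = (polarity_fired | fap_fired) - battery
    if unknown:
        raise ValueError(f"fired on traces outside the battery: {sorted(unknown)}")
    return {
        "both": sorted(polarity_fired & fap_fired),
        "polarity_only": sorted(polarity_fired - fap_fired),
        "fap_only": sorted(fap_fired - polarity_fired),
        "neither": sorted(battery - polarity_fired - fap_fired),
    }


# Hypothetical trace IDs and firing sets.
battery = ["t1", "t2", "t3", "t4", "t5"]
report = gate_overlap(battery, polarity_fired={"t1", "t2"}, fap_fired={"t4"})

# The gates are decoupled on this battery if no trace triggers both.
decoupled = not report["both"]
print(report, decoupled)
```

An empty `both` bucket shows non-overlap only for the traces supplied, so the result inherits whatever selection went into the battery.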

Role injection is a confound, not just a feature

Confound identified

The OT-001 vs. OT-002 pair is the clean experiment: with role injection ON, Gemini's natural YES was forced to NO at 0.95 confidence. With injection OFF on the identical question (OT-002), Gemini returned to YES and held. Role-injected runs may reflect frame compliance — the agent adopting the injected ethical frame — rather than independent agreement. This does not invalidate role injection as a design choice; it means injected and non-injected runs are not directly comparable.

⚠ Scope / caveat

One provider, one domain, one pair of runs. The effect may vary across providers and question types. A controlled study varying injection frame while holding question constant across multiple providers and runs would be needed to establish a reliable effect size.

Summary

#	Finding	Status
1	Epistemic tightening dominates	Quantitative
2	Model-level epistemic fingerprinting	Headline ★
3	The Alignment Monoculture is real but bounded	Scope boundary
4	Convergence ≠ agreement	Methodological
5	The Gamma-flip failure mode	Failure mode
6	Gates are decoupled (v0.8)	Prototype finding
7	Role injection is a confound, not just a feature	Confound identified

What would strengthen these findings

→ Same-tier replication of the fingerprinting finding (frontier Claude × frontier Gemini × frontier GPT, matched on capability).
→ Multi-run per cell (at least 5–10 per provider-question pair) to establish a variance floor.
→ Inter-rater reliability study on the Gamma disagreement classifier (13-trace batch, pending API budget).
→ Gate validation on a held-out test set, not just the constructed 11-trace battery.
