Findings
Seven findings from ~100 ARM runs (105 JSON traces in the full repo, 89 in Appendix D), spanning v0.1 through v0.8 and three providers. Each finding is stated with its honest scope — the caveats are not disclaimers, they are the calibration.
ARM's epistemic ceiling: it can show the shape of cross-vendor disagreement but cannot escape monoculture from shared training data. The value is accountability legibility, not safety certification.
Epistemic tightening dominates
Across 304 R2 agent-rounds (v0.1–v0.8 corpus), 64.8% tightened (Δ < 0) and 76.3% held or tightened (Δ ≤ 0). Deliberation usually calibrates rather than inflates confidence. This is the baseline: exposure to peer reasoning tends to make agents more cautious, not more certain.
Read as within-agent telemetry, not cross-model probability. Each agent's confidence is self-reported against its own scale — the numbers are not directly comparable across providers.
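The two percentages above are simple shares over per-agent confidence deltas. A minimal sketch of that arithmetic follows, assuming Δ is an agent's R2 confidence minus its own R1 confidence. The function name and the toy values are hypothetical and are not drawn from the ARM corpus.

```python
from __future__ import annotations

from typing import Iterable


def tightening_rates(deltas: Iterable[float]) -> tuple[float, float]:
    """Return (share with delta < 0, share with delta <= 0)."""
    values = list(deltas)
    if not values:
        raise ValueError("need at least one agent-round")
    tightened = sum(1 for d in values if d < 0)
    held_or_tightened = sum(1 for d in values if d <= 0)
    return tightened / len(values), held_or_tightened / len(values)


# Toy deltas for illustration only (assumed: R2 confidence minus R1 confidence).
toy_deltas = [-0.10, -0.05, 0.0, 0.05]
print(tightening_rates(toy_deltas))  # (0.5, 0.75)
```

Because each delta is computed within one agent, the shares stay meaningful even though raw confidence values are not comparable across providers.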
Model-level epistemic fingerprinting
On a CFAA zero-day question, three same-model meshes reached confident consensus in opposite directions: all-Claude unanimously NO, all-Gemini unanimously YES, all-GPT split internally. A system built on one provider would refuse; an isomorphic system on another would act — both reporting high-confidence consensus, neither signaling that the answer was provider-dependent.
Preliminary pilot, not an established result. Three confounds, stated up front: (1) provider is confounded with capability tier — frontier Claude vs. smaller Gemini/GPT models; (2) n=1 per provider-question cell against a 0.15–0.20 variance floor; (3) the disagreement label comes from a single Gamma instance with no inter-rater reliability. The directional opposition — a property of the R1 claims themselves — is the most robust part, but needs same-tier, multi-run replication.
The Alignment Monoculture is real but bounded
On the Meta-layoffs question, all three providers converged — so provider-dependent consensus is not a universal law. The monoculture effect appears strongest on questions where there is a dominant "safe" answer baked into a provider's post-training distribution. When the question is genuinely ambiguous across training corpora, providers can converge to the same position for different reasons.
One question is not a controlled study. The convergence itself may reflect a genuinely correct answer rather than monoculture. Disentangling shared truth from shared prior is the central challenge ARM is designed to flag, not resolve.
Convergence ≠ agreement
Three cases in the run corpus illustrate this: (1) convergence 0.917 with a genuine YES/NO split still present (high overlap in vocabulary, opposite conclusions); (2) convergence 0.000 with disagreement classified as "none" (two traces with near-zero shared language and no substantive disagreement); (3) convergence 0.511 with a surviving YES/NO split. Jaccard similarity measures shared vocabulary, not consensus. The directional unanimity flag was added specifically to catch the third case.
These are illustrative examples from the run corpus, not a systematic audit of all cases. The directional unanimity flag was validated informally, not against a pre-registered test set.
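The gap between vocabulary overlap and agreement can be shown with a toy pair of statements. This is a minimal sketch, assuming word-level tokenization and a plain set-based Jaccard score. ARM's actual tokenization and flag logic are not described here, so `jaccard` and `directionally_unanimous` are hypothetical stand-ins.

```python
from __future__ import annotations

import re


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (assumed tokenization)."""
    tokens_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen for this sketch
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def directionally_unanimous(verdicts: list[str]) -> bool:
    """True only if every agent reports the same verdict label."""
    return len({v.strip().upper() for v in verdicts}) == 1


claim_a = "YES the disclosure is permitted under the statute"
claim_b = "NO the disclosure is not permitted under the statute"

print(round(jaccard(claim_a, claim_b), 3))     # 0.667
print(directionally_unanimous(["YES", "NO"]))  # False
```

The two toy claims share most of their words and still reach opposite verdicts, which is why a direction check has to run alongside the overlap score rather than be inferred from it.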
The Gamma-flip failure mode
The reconciler (γ) reversed its own R1 position between rounds while reporting reconciliation_status: "success" — 10+ confirmed cases across 4 domains and all 3 providers. This is a silent failure: the system logs consensus but the reconciler has actually changed sides. v0.8's polarity-check gate now catches position reversals before they are logged as successful reconciliation.
10+ cases is a small corpus. The polarity gate was designed around these cases — it has not been validated on a held-out set or against a systematic rate in natural use. The gate catches reversals it can detect; it cannot catch subtle drift that falls below the polarity threshold.
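A gate of this kind can be sketched as a check that runs before the status is logged. The version below is illustrative only: it assumes the reconciler's position is available as a discrete YES/NO label in both rounds, and the class, function, and `"blocked_polarity_reversal"` status are hypothetical names, not ARM's v0.8 implementation.

```python
from dataclasses import dataclass


@dataclass
class ReconcilerRound:
    position: str               # assumed discrete label, e.g. "YES" or "NO"
    reconciliation_status: str  # e.g. "success"


def polarity_gate(r1: ReconcilerRound, r2: ReconcilerRound) -> str:
    """Return the status to log, blocking 'success' after a position reversal."""
    reversed_position = r1.position.strip().upper() != r2.position.strip().upper()
    if reversed_position and r2.reconciliation_status == "success":
        return "blocked_polarity_reversal"  # hypothetical status label
    return r2.reconciliation_status


flip = polarity_gate(
    ReconcilerRound("NO", "pending"),
    ReconcilerRound("YES", "success"),
)
print(flip)  # blocked_polarity_reversal
```

A discrete comparison like this only sees outright reversals, which matches the caveat above: drift that never changes the label passes through untouched.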
Gates are decoupled (v0.8)
The polarity-check gate (position reversals) and FAP (magnitude drift) fire on non-overlapping runs across an 11-trace battery. This is the "smoke detector → sprinkler" upgrade: from logging failures to interrupting at the gate. The polarity gate catches gamma-flips; FAP catches high-magnitude drift in cases that do not cross the polarity threshold. The two gates address distinct failure modes.
Prototype, not validation. n=11 constructed test cases — chosen to exercise both gates, not drawn randomly from natural use. The decoupling result is a property of the constructed battery, not a claim about how often each gate fires in production.
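Whether two gates fire on non-overlapping runs is a co-occurrence count over per-trace outcomes. The sketch below assumes each trace has already been reduced to two booleans; the record keys, function names, and toy records are hypothetical and do not reproduce the 11-trace battery.

```python
from __future__ import annotations

from collections import Counter


def gate_cooccurrence(records: list[dict]) -> Counter:
    """Count traces by (polarity_fired, fap_fired) combination."""
    counts: Counter = Counter()
    for rec in records:
        counts[(bool(rec["polarity_fired"]), bool(rec["fap_fired"]))] += 1
    return counts


def gates_decoupled(records: list[dict]) -> bool:
    """True if no trace triggered both gates."""
    return gate_cooccurrence(records)[(True, True)] == 0


# Toy records for illustration only.
toy_battery = [
    {"trace_id": "t1", "polarity_fired": True, "fap_fired": False},
    {"trace_id": "t2", "polarity_fired": False, "fap_fired": True},
    {"trace_id": "t3", "polarity_fired": False, "fap_fired": False},
]
print(gates_decoupled(toy_battery))  # True
```

Run on a constructed battery, a `True` result describes that battery only. The same tally on randomly sampled production traces would be needed before reading it as a firing-rate claim.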
Role injection is a confound, not just a feature
The OT-001 vs. OT-002 pair is the clean experiment: with role injection ON, Gemini's natural YES was forced to NO at 0.95 confidence. With injection OFF on the identical question (OT-002), Gemini returned to YES and held. Role-injected runs may reflect frame compliance — the agent adopting the injected ethical frame — rather than independent agreement. This does not invalidate role injection as a design choice; it means injected and non-injected runs are not directly comparable.
One provider, one domain, one pair of runs. The effect may vary across providers and question types. A controlled study varying injection frame while holding question constant across multiple providers and runs would be needed to establish a reliable effect size.
Summary
| # | Finding | Status |
|---|---|---|
| 1 | Epistemic tightening dominates | Quantitative |
| 2 | Model-level epistemic fingerprinting | Headline ★ |
| 3 | The Alignment Monoculture is real but bounded | Scope boundary |
| 4 | Convergence ≠ agreement | Methodological |
| 5 | The Gamma-flip failure mode | Failure mode |
| 6 | Gates are decoupled (v0.8) | Prototype finding |
| 7 | Role injection is a confound, not just a feature | Confound identified |
What would strengthen these findings
- Same-tier replication of the fingerprinting finding (frontier Claude × frontier Gemini × frontier GPT, matched on capability).
- Multi-run per cell (at least 5–10 per provider-question pair) to establish a variance floor.
- Inter-rater reliability study on the Gamma disagreement classifier (13-trace batch, pending API budget).
- Gate validation on a held-out test set, not just the constructed 11-trace battery.