Papers
arxiv:2602.22600

Transformers converge to invariant algorithmic cores

Published on Feb 26 · Submitted by Josh Schiffman on Mar 4

Abstract

Research reveals that independently trained transformers converge to shared algorithmic cores despite different weight configurations, indicating low-dimensional invariants in transformer computations.

AI-generated summary

Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.

Community

Paper author · Paper submitter

Training selects for behavior, not circuitry – so which internal structures reflect the computation, and which are accidents of a particular training run? Independently trained transformers – despite having very different weights – converge to shared low-dimensional algorithmic core subspaces that capture essential computations.

  • Three independently trained Markov-chain transformers (cosine similarity ~0.03) share a 3D core that recovers ground-truth mechanism: transition spectra to within 1%.
  • On modular addition, cores crystallize at grokking and automatically reveal rotational operators, yielding a predictive scaling law for grokking time: τ ∝ 1/(ωp), validated with R² > 0.99; grokking time scales inversely with task symmetry (p) and weight decay (ω).
  • Subject–verb agreement in GPT-2 Small/Medium/Large reduces to a single shared axis across scales -- flipping it inverts grammatical number throughout autoregressive generation.

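The scaling law in the second bullet can be sanity-checked numerically. The sketch below fits log τ against log(ωp) on synthetic data that obeys τ = c/(ωp) exactly; the (ω, p) grid and the constant c are illustrative stand-ins, not the paper's measurements. A slope near −1 and R² near 1 are what the claimed law predicts.

```python
import numpy as np

# Hypothetical (weight decay, modulus) settings; illustrative only.
omega = np.array([1e-2, 1e-2, 5e-3, 5e-3, 2e-2])
p     = np.array([97.0, 113.0, 97.0, 113.0, 59.0])
c_true = 500.0                     # assumed proportionality constant
tau = c_true / (omega * p)         # grokking times under tau = c / (omega * p)

# Fit log(tau) = log(c) + slope * log(omega * p); the law predicts slope = -1.
x = np.log(omega * p)
y = np.log(tau)
slope, intercept = np.polyfit(x, y, 1)

# R^2 of the linear fit in log-log space.
y_hat = slope * x + intercept
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(slope, np.exp(intercept), r2)
```

On real training runs the fit would of course be noisy; the point is only that the law reduces to a one-parameter linear fit in log-log space.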
Figure: cyclic mechanism emerges in the core at grokking.

Figure: core steering inverts grammatical number in GPT-2.
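The single-axis flip described above can be sketched as a reflection of a hidden state across the hyperplane orthogonal to the agreement axis. This is a minimal numpy illustration, assuming the axis has already been extracted as a unit vector `u`; the names, shapes, and random values are hypothetical, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # hypothetical hidden size
u = rng.normal(size=d)
u /= np.linalg.norm(u)         # unit vector standing in for the agreement axis

h = rng.normal(size=d)         # a hidden state at some layer and position

# Reflect h across the hyperplane orthogonal to u: the component of h
# along u changes sign; everything orthogonal to u is left untouched.
h_flipped = h - 2.0 * (h @ u) * u

print(np.isclose(h_flipped @ u, -(h @ u)))  # axis component negated
```

Applied at each generation step, an intervention of this form is what "flipping the axis throughout autoregressive generation" amounts to geometrically.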

Broader take: Mechanistic interpretability should target invariants – structure preserved across models – rather than implementation-specific details of any single model or checkpoint.

