When and why the singular vectors of attention matrices align with the features a model uses.
Several studies, including my own work, have noticed that a model’s features can often be read off the singular vectors of its attention matrices, but it was not clear why this happens. In this ICML 2026 paper, we give an answer (Franco et al., 2026). We first show that singular vectors reliably align with features in a setting where the features can be observed directly, and then prove that this alignment is expected under a range of conditions. We also identify sparse attention decomposition as a testable signature of the alignment and find it in real models.
In a controlled setting, the singular vectors of an attention head come to align with the model's features over the course of training.
Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made the observation that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this phenomenon is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable.
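To make the idea of alignment concrete, here is a minimal toy sketch in NumPy. It is not the paper's experimental setup. Every name and value in it (`d`, `k`, `F_q`, `F_k`, `strengths`, `omega`, the noise scale) is a hypothetical choice for illustration. It builds a combined query-key matrix from a few known feature directions plus small noise, takes its SVD, and checks how closely the top singular vectors match those directions. The last lines split one attention score into per-singular-vector terms, which is one plausible reading of a sparse decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 3  # hypothetical model width and number of features

# Hypothetical orthonormal feature directions (as columns) on the
# query side and on the key side.
F_q, _ = np.linalg.qr(rng.standard_normal((d, k)))
F_k, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Toy combined query-key matrix: each feature pair contributes one
# rank-one term with its own strength, plus small dense noise.
strengths = np.array([5.0, 3.0, 2.0])
omega = F_q @ np.diag(strengths) @ F_k.T + 0.01 * rng.standard_normal((d, d))

# Singular value decomposition: omega = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(omega)

# Absolute cosine similarity between each of the top-k left singular
# vectors (rows) and each query-side feature (columns).
alignment = np.abs(U[:, :k].T @ F_q)
best_match = alignment.argmax(axis=1)
best_cosine = alignment.max(axis=1)
print("best-matching feature per singular vector:", best_match)
print("cosine similarity of that match:", np.round(best_cosine, 3))

# Split the score x_q^T omega x_k into one term per singular vector.
# Here the query and key each carry only the first feature.
x_q, x_k = F_q[:, 0], F_k[:, 0]
terms = S * (U.T @ x_q) * (Vt @ x_k)
top_share = np.abs(terms).max() / np.abs(terms).sum()
print("share of the score carried by the largest term:", round(float(top_share), 3))
```

In this toy the alignment is built in by construction, so the sketch only illustrates what is being measured. It says nothing about why trained attention heads behave this way, which is the question the paper addresses.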
@article{franco2026singular,
  title={Singular Vectors of Attention Heads Align with Features},
  author={Franco, Gabriel and Loughridge, Carson and Crovella, Mark},
  journal={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026},
}