<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://gaabrielfranco.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://gaabrielfranco.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-07-05T00:19:32+00:00</updated><id>https://gaabrielfranco.github.io/feed.xml</id><title type="html">Gabriel Franco</title><subtitle>Gabriel Franco is a Computer Science Ph.D. candidate at Boston University, advised by Mark Crovella, working on the mechanistic interpretability of large language models. </subtitle><entry><title type="html">Handling Bias and RoPE in QK Attention with a Unified Geometric View</title><link href="https://gaabrielfranco.github.io/blog/2026/bias-and-rope-in-attention/" rel="alternate" type="text/html" title="Handling Bias and RoPE in QK Attention with a Unified Geometric View"/><published>2026-02-23T00:00:00+00:00</published><updated>2026-02-23T00:00:00+00:00</updated><id>https://gaabrielfranco.github.io/blog/2026/bias-and-rope-in-attention</id><content type="html" xml:base="https://gaabrielfranco.github.io/blog/2026/bias-and-rope-in-attention/"><![CDATA[<div class="callout"> <strong>Pre-print</strong> <a href="https://arxiv.org/abs/2602.13483">Finding Highly Interpretable Prompt-Specific Circuits in Language Models</a> </div> <p>This note summarizes Appendix B of the paper above and gives a geometric intuition for handling QK bias and RoPE using a single bilinear form.</p> <h2 id="1-qk-attention-as-a-bilinear-map">1. QK attention as a bilinear map</h2> <p>For one attention head, the pre-Softmax score between destination token $d$ and source token $s$ is defined as</p> \[A'_{ds} = x_d^\top \Omega x_s, \qquad \Omega = W_Q W_K^\top.\] <p>So QK analysis is fundamentally a bilinear map: destination-side vectors are matched against source-side vectors through $\Omega$. 
The SVD of $\Omega$,</p> \[\Omega = \sum_{k=1}^{R} u_k \sigma_k v_k^\top,\] <p>gives paired singular directions $(u_k, v_k)$, each weighted by its singular value $\sigma_k$, that define candidate communication channels.</p> <h2 id="2-related-work-on-svd-of-qk">2. Related work on SVD of QK</h2> <p>A growing line of work studies QK structure in singular-vector coordinates and shows that this basis captures interpretable communication structure.</p> <p>Pan et al. analyze query-key interaction in vision transformers through spectral structure, showing that singular directions provide a useful lens for understanding attention behavior <d-cite key="NEURIPS2024_6216515a"></d-cite>. Merullo et al. study inter-layer communication in language transformers and similarly motivate analyzing how structured directions are transmitted and read across layers <d-cite key="merullo2024talkingheadsunderstandinginterlayer"></d-cite>. Franco and Crovella introduce sparse attention decomposition, explicitly using QK singular vectors to isolate low-dimensional attention-relevant signal components for circuit tracing <d-cite key="franco2024sparseattentiondecompositionapplied"></d-cite>.</p> <p>This post builds on that perspective and focuses on a practical extension: keeping one fixed bilinear core even when attention includes bias and/or RoPE terms.</p> <h2 id="3-why-bias-and-position-terms-make-qk-analysis-harder">3. Why bias and position terms make QK analysis harder</h2> <p>In practice, many models do not use the plain form $x_d^\top \Omega x_s$:</p> <ul> <li>models with bias add bias terms to the query and key projections,</li> <li>RoPE models apply position-dependent rotations,</li> <li>some models do both.</li> </ul> <p>Handled naively, these terms require either homogeneous-coordinate bookkeeping (for bias) or position-specific $\Omega$ matrices (for RoPE), which complicates both implementation and interpretation.</p> <p>The goal of Appendix B is to keep a single fixed $\Omega$ per head and absorb bias and position effects into transformed token vectors.</p> <h2 id="4-bias-translate-first-then-project">4. 
Bias: translate first, then project</h2> <p>For bias models,</p> \[A'_{ds} = (x_d^\top W_Q + b_Q^\top)(x_s^\top W_K + b_K^\top)^\top.\] <p>Assuming $W_Q$ and $W_K$ are well-conditioned with full column rank (an assumption supported empirically), we can define bias-derived offsets $c_d, c_s$ through pseudoinverse mappings, $c_d^\top = b_Q^\top W_Q^{\dagger}$ and $c_s = (W_K^\top)^{\dagger} b_K$. Since $W_Q^{\dagger} W_Q = I$ and $W_K^\top (W_K^\top)^{\dagger} = I$, these satisfy $c_d^\top W_Q = b_Q^\top$ and $W_K^\top c_s = b_K$, so we can rewrite:</p> \[A'_{ds} = (x_d^\top + c_d^\top)\,\Omega\,(x_s + c_s).\] <p>As a concrete toy illustration, the figure uses a map $P:\mathbb{R}^3\to\mathbb{R}^2$ as an analogue of the Q/K projection. Read the two rows as two paths:</p> <ul> <li>Top row (light red): $y_A=Px+\alpha b$ (project, then translate in low dimension).</li> <li>Bottom row (light blue): $y_B=P(x+\alpha c)$ with $c=P^\dagger b$ (translate in high dimension, then project).</li> </ul> <p>The right panel compares the endpoints, showing the same commutation pattern used in the attention derivation.</p> <div class="l-page"> <iframe src="/assets/plotly/bias-geometric-pipeline.html" frameborder="0" scrolling="no" height="660px" width="100%" style="border: 1px solid #ddd; background: #fff;"></iframe> </div> <h2 id="5-rope-rotate-first-then-project">5. RoPE: rotate first, then project</h2> <p>For RoPE models, scores involve position rotations $R_d, R_s$, often written as a position-dependent effective matrix. Appendix B shows we can instead define token-space linear maps $M_d, M_s$ and keep a fixed $\Omega = W_Q W_K^\top$:</p> \[A'_{ds} = (x_d^\top M_d)\,\Omega\,(M_s x_s).\] <p>Using the Appendix B.2 construction:</p> \[M_d = W_Q R_d W_Q^{\dagger}, \qquad M_s = (W_K^\top)^{\dagger} R_s^\top W_K^\top,\] <p>the pseudoinverse factors cancel against $\Omega$ (again because $W_Q^{\dagger} W_Q = I$ and $W_K^\top (W_K^\top)^{\dagger} = I$), which gives</p> \[(x_d^\top M_d)\,\Omega\,(M_s x_s) = x_d^\top W_Q R_d R_s^\top W_K^\top x_s = x_d^\top W_Q R_{(d-s)} W_K^\top x_s.\] <p>As a concrete toy illustration, the figure again uses an $\mathbb{R}^3\to\mathbb{R}^2$ setup. 
Read the two rows as two paths:</p> <ul> <li>Top row (light red): $y_A=PR_3x$ (rotate in $\mathbb{R}^3$, then project).</li> <li>Bottom row (light blue): $y_B=R_2Px$ (project, then rotate in $\mathbb{R}^2$).</li> </ul> <p>Here $R_3$ and $R_2$ use the same angle parameter $\theta$ (controlled by the slider). The right panel compares the endpoints. This mirrors Appendix B: position effects are absorbed into transformed vectors while preserving a fixed core bilinear map.</p> <div class="l-page"> <iframe src="/assets/plotly/rope-geometric-pipeline.html" frameborder="0" scrolling="no" height="660px" width="100%" style="border: 1px solid #ddd; background: #fff;"></iframe> </div> <h2 id="6-when-the-model-has-both-rope-and-bias">6. When the model has both RoPE and bias</h2> <p>These two derivations combine directly, so bias+RoPE models (e.g., Pythia) can still be written with transformed token vectors and one fixed $\Omega = W_Q W_K^\top$.</p> <h2 id="7-appendix-condition-numbers-in-practice">7. Appendix: condition numbers in practice</h2> <p>The derivations rely on $W_Q$ and $W_K$ being well-conditioned (full column rank, stable pseudoinverse behavior). 
Appendix B.4 reports this holds for the studied models (condition numbers below 1000, usually much smaller), including GPT-2 small, Pythia-160M, and Gemma-2 2B.</p> <h3 id="gpt-2-small">GPT-2 small</h3> <div class="cond-grid"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/qk-bias-rope/gpt2-small_W_Q_condition_numbers-480.webp 480w,/assets/img/posts/qk-bias-rope/gpt2-small_W_Q_condition_numbers-800.webp 800w,/assets/img/posts/qk-bias-rope/gpt2-small_W_Q_condition_numbers-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/qk-bias-rope/gpt2-small_W_Q_condition_numbers.png" class="img-fluid rounded z-depth-1 qk-img-white" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Condition numbers of $W_Q$ across layers and heads.</figcaption> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/qk-bias-rope/gpt2-small_W_K_condition_numbers-480.webp 480w,/assets/img/posts/qk-bias-rope/gpt2-small_W_K_condition_numbers-800.webp 800w,/assets/img/posts/qk-bias-rope/gpt2-small_W_K_condition_numbers-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/qk-bias-rope/gpt2-small_W_K_condition_numbers.png" class="img-fluid rounded z-depth-1 qk-img-white" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Condition numbers of $W_K$ across layers and heads.</figcaption> </figure> </div> <h3 id="pythia-160m">Pythia-160M</h3> <div class="cond-grid"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/qk-bias-rope/pythia-160m_W_Q_condition_numbers-480.webp 480w,/assets/img/posts/qk-bias-rope/pythia-160m_W_Q_condition_numbers-800.webp 
800w,/assets/img/posts/qk-bias-rope/pythia-160m_W_Q_condition_numbers-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/qk-bias-rope/pythia-160m_W_Q_condition_numbers.png" class="img-fluid rounded z-depth-1 qk-img-white" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Condition numbers of $W_Q$ across layers and heads.</figcaption> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/qk-bias-rope/pythia-160m_W_K_condition_numbers-480.webp 480w,/assets/img/posts/qk-bias-rope/pythia-160m_W_K_condition_numbers-800.webp 800w,/assets/img/posts/qk-bias-rope/pythia-160m_W_K_condition_numbers-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/qk-bias-rope/pythia-160m_W_K_condition_numbers.png" class="img-fluid rounded z-depth-1 qk-img-white" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Condition numbers of $W_K$ across layers and heads.</figcaption> </figure> </div> <h3 id="gemma-2-2b">Gemma-2 2B</h3> <div class="cond-grid"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/qk-bias-rope/gemma-2-2b_W_Q_condition_numbers-480.webp 480w,/assets/img/posts/qk-bias-rope/gemma-2-2b_W_Q_condition_numbers-800.webp 800w,/assets/img/posts/qk-bias-rope/gemma-2-2b_W_Q_condition_numbers-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/qk-bias-rope/gemma-2-2b_W_Q_condition_numbers.png" class="img-fluid rounded z-depth-1 qk-img-white" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Condition numbers of $W_Q$ across layers and heads.</figcaption> </figure> <figure> <picture> 
<source class="responsive-img-srcset" srcset="/assets/img/posts/qk-bias-rope/gemma-2-2b_W_K_condition_numbers-480.webp 480w,/assets/img/posts/qk-bias-rope/gemma-2-2b_W_K_condition_numbers-800.webp 800w,/assets/img/posts/qk-bias-rope/gemma-2-2b_W_K_condition_numbers-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/posts/qk-bias-rope/gemma-2-2b_W_K_condition_numbers.png" class="img-fluid rounded z-depth-1 qk-img-white" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Condition numbers of $W_K$ across layers and heads.</figcaption> </figure> </div> <h3 id="practical-takeaway">Practical takeaway</h3> <p>With this reformulation, ACC++ can treat no-bias, bias-only, RoPE-only, and bias+RoPE models under one unified bilinear interface. In code terms: one $\Omega$, one SVD per head, same downstream signal extraction/tracing logic.</p>]]></content><author><name>Gabriel Franco</name></author><category term="mechanistic interpretability"/><category term="transformers"/><category term="attention"/><category term="linear algebra"/><summary type="html"><![CDATA[A practical geometric view for analyzing QK circuits with bias and RoPE using a single bilinear form]]></summary></entry></feed>