Bachelor's thesis · 2026
Universitat Politècnica de Catalunya · FIB

Vol. 01
Selective compression
Mechanistic interpretability

Anatomy of an emotional transformer

Selective compression and mechanistic interpretability of BERT-base on GoEmotions. A visual manual that takes the model apart layer by layer.

12 layers · 144 heads

36,864 FFN neurons

23 emotions, multi-label

25 panels + 4 diagrams

Guido Biosca Lasa · Advisor: Lluís Padró Cirera

Part 01

The subject and the questions

BERT-base. 12 layers. 144 attention heads. 36,864 neurons in the FFN blocks. 109.5 million parameters. On top of it, a fine-tune on GoEmotions: 23 emotions, multi-label tagging. F1 macro of 0.577 on the test set — the baseline everything is measured against.

There are three questions at play. The first is descriptive: what happens when you apply SVD to the matrices that make up the model, and how much you can take away before it breaks. The second is explanatory: what mechanistic interpretability says about where and how emotional information lives inside the network. The third is prescriptive: if you combine the two answers, can you compress better than by picking ranks at random?

Compression here works as an experimental scalpel. Every compressed configuration is an intervention that measures the functional importance of the affected component by the magnitude of what breaks. Removing parts of the model and watching what survives is an empirical way to map how the information is laid out inside it.

Before cutting anything open, you have to meet the subject.

§ 01.1 · Architecture

All of BERT, in 3D

12 stacked layers, each with 12 attention heads and an FFN block.

Each sphere is an attention head. Each turquoise ring is an FFN block. The dotted central column is the residual stream that carries [CLS] from the bottom to the classifier on top.

Color encodes the functional category from the ablation study (notebook 6). Red: critical specialist. Blue: critical generalist. Yellow: minor. Grey: dispensable. Size encodes aggregate impact when the head is ablated.

Compare layer 11 with layer 0. Up top almost everything is red and blue. Down at the bottom it's mostly grey. Emotional work isn't spread evenly — it's concentrated at the end of the stack.

Spheres: heads. Rings: FFN blocks. Dotted column: residual stream. Red diamond: [CLS] token. Green square: 23-emotion classifier. Data: head_categories.csv.

Fig. 01.1

§ 01.2 · Pipeline

The full experiment

Interactive map: where everything lives.

Before diving into the figures one by one, a bird's-eye view of what was done. Three blocks.

Setup: GoEmotions filtered to 23 emotions, BERT-base, AdamW fine-tune (LR 2e-5, 4 epochs, batch 32) → checkpoint 23emo-final, F1 macro 0.577 on the test set. It's the baseline EVERYTHING else is measured against.

Two parallel arms: on one side, SVD compression with four strategy families (21 evaluated strategies in total: 6 uniform, 4 energy-adaptive, 3 heuristic, 8 greedy). On the other, five mechanistic interpretability techniques at increasing granularity: layer, component, head, neuron.

Synthesis: the two arms converge into compression informed by empirical sensitivity data — the greedy algorithm. And one final fine-tuning loop that recovers (and surpasses) the baseline with fewer parameters.

Interactive experimental pipeline. Equivalent to Figure 1 of Chapter 3 of the thesis, with links to each section.

Fig. 01.2

§ 01.3 · Theoretical base

From dictionary to classifier

How BERT forgets words and learns emotions.

A word enters BERT and on the way out it isn't that word anymore. The model replaces "which word is this" with "what role does it play" layer by layer. The phenomenon is well documented (Tenney 2019, Ethayarajh 2019).

Three curves track the transition. Blue: lexical retention — how much of the original embedding survives. Yellow: anisotropy — how similar the tokens look to each other. Red: linear-probe F1 for emotion — how much of the label is linearly recoverable.

The three curves cross around L8–L9. Before that, lexical phase. After that, semantic phase. The four mini-heatmaps make it tangible: early rows show varied colors (each token keeps its identity), late rows go monochrome (every token has collapsed onto the same contextual vector).
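
A minimal sketch of the retention measure behind the heatmaps, assuming it is the plain cosine between a token's hidden state at layer L and its state at layer 0. The function name `lexical_retention`, the array layout, and the toy data are illustrative assumptions, not the thesis code.

```python
import numpy as np

def lexical_retention(hidden: np.ndarray) -> np.ndarray:
    """hidden has shape (n_layers, n_tokens, dim); index 0 is the reference layer.

    Returns an (n_layers, n_tokens) array of cos(hidden[L, t], hidden[0, t]).
    """
    ref = hidden[0]                                  # (n_tokens, dim)
    dots = np.einsum("ltd,td->lt", hidden, ref)      # dot product per layer and token
    norms = np.linalg.norm(hidden, axis=-1) * np.linalg.norm(ref, axis=-1)
    return dots / np.clip(norms, 1e-12, None)

# Toy data (not thesis data): 13 representations, 5 tokens, 768 dims.
# Each token drifts from its own vector towards one shared direction.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 768))
shared = rng.normal(size=768)
hidden = np.stack([(1 - a) * base + a * shared for a in np.linspace(0.0, 1.0, 13)])

heatmap = lexical_retention(hidden)   # rows = layers, columns = tokens
curve = heatmap.mean(axis=1)          # one retention value per layer
```

With real activations, `heatmap` would play the role of one mini-heatmap and `curve` the role of the blue line, under the assumption stated above.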

Three curves over 46 test-set sentences. Heatmaps of four example sentences: cos(hidden[L,t], hidden[0,t]). Refs: Tenney et al. ACL 2019; Ethayarajh EMNLP 2019.

Fig. 01.3

Part 02

SVD as an experimental scalpel

SVD compression is usually framed as a reduction technique: keep the top-k singular values to save parameters. Here it gets flipped. SVD becomes a measurement instrument: the magnitude of what breaks when you remove a component quantifies its functional contribution.

That inversion is the spine of the work. Five sections, one answer: the model is already compressed on the inside; external compression confirms it; but the sensitivity isn't spread evenly.

§ 02.1 · Diagram

Compressing is measuring

How a reduction technique becomes a measurement instrument.

SVD decomposes a matrix W into three factors, letting you keep only the top-k singular values — the optimal rank-k approximation in Frobenius norm (Eckart-Young, 1936). Small k, fewer parameters. Textbook so far.
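
The textbook step can be sketched in a few lines of NumPy. The random matrix stands in for one encoder weight matrix and `truncate_rank` is a hypothetical helper name. The last lines check the Eckart-Young identity: the Frobenius error of the rank-k approximation equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncate_rank(W: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of W in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]   # keep only the top-k singular triplets

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 768))          # stand-in, not a real BERT matrix
k = 128
W_k = truncate_rank(W, k)

# Frobenius error of the truncation vs the energy of what was thrown away.
s = np.linalg.svd(W, compute_uv=False)
error = np.linalg.norm(W - W_k)
expected = np.sqrt(np.sum(s[k:] ** 2))
```

In a compressed model, `W_k` would usually be stored as its two factors rather than as a dense matrix, which is where the parameter saving comes from.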

What's not obvious: if you compress one specific matrix and the whole network breaks, that matrix mattered. If you compress it and nothing changes, it didn't. The F1 delta measures the component's functional contribution, and SVD becomes an experimental scalpel: every compressed configuration is a controlled intervention that reads the model's contents by the size of the damage done by removing them.

The next four sections are that scalpel in action. The 72 encoder matrices of BERT get compressed — first all at once, then by type, then by depth — and we measure what F1 survives.

Most of the literature uses SVD to save memory. Here it's used to make maps. The asymmetry the scalpel reveals — 14× at rank 128, up to 72× normalised by parameters — is information about how the model is organised, not about SVD.

Diagram, three stages. 01 Original matrix: one of BERT's 72 linear encoder matrices, at full rank. Apply SVD_k. 02 Truncated to rank k: the optimal rank-k approximation in Frobenius norm (Eckart-Young, 1936). Feed it back to the model. 03 Functional measurement: the F1 drop of the compressed model quantifies the importance of W.

SVD applied to a matrix W ∈ ℝ^m×n with rank-k truncation. The F1 drop when applying the compressed model quantifies the functional importance of W.

Fig. 02.1

§ 02.2 · The bridge

The model compresses itself

Before SVD touches anything, the network is already reducing dimensionality.

Stack each layer's token representations into a 469 × 768 matrix and run SVD on it. The question: how many dimensions does the model actually use at each layer?

The blue curves are vector L2 norms. They grow with depth (Kobayashi 2021). The red ones are effective rank and k95 — how many singular values cover 95 % of the variance. They collapse: ~130 dimensions in early layers down to 22 by L12. k95 drops to 35.
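
Both red quantities can be read off the singular values of the stacked matrix. The sketch below rests on two assumptions that the text does not spell out: "variance" is taken as squared singular values, and "effective rank" is taken as the exponential of the entropy of the normalised singular values. The toy matrix is synthetic and only borrows the 469 × 768 shape.

```python
import numpy as np

def k95(s: np.ndarray, threshold: float = 0.95) -> int:
    """Smallest k whose top-k squared singular values reach `threshold` of the total."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, threshold) + 1)

def effective_rank(s: np.ndarray) -> float:
    """exp(Shannon entropy) of the normalised singular values (assumed definition)."""
    p = s / s.sum()
    p = p[p > 0]                          # avoid log(0)
    return float(np.exp(-np.sum(p * np.log(p))))

# Toy check: 469 token vectors in 768 dims that really live in a 22-dim subspace.
rng = np.random.default_rng(0)
H = rng.normal(size=(469, 22)) @ rng.normal(size=(22, 768))
s = np.linalg.svd(H, compute_uv=False)

print(k95(s), round(effective_rank(s), 1))
```

Both numbers stay at or below 22 for this toy matrix, which is the behaviour the late layers show in the figure.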

Out of 768 available dimensions, the model ends up using around 5 %. The bottom heatmap shows the full picture: cumulative spectral energy per layer, with the black k95 line drifting left from L9 to L12.

If the internal representation at L12 lives in 22 of 768 dimensions, the matrices producing it are amenable to low-rank approximation. SVD doesn't introduce compression where there wasn't any — it just makes the existing compression explicit.

Mean hidden-state norm (blue) vs effective rank and k95 (red). Heatmap of cumulative spectral energy. Data: 46 test-set sentences on 23emo-final. Refs: Kobayashi EMNLP 2021; Dong et al. ICML 2021.

Fig. 02.2

§ 02.3 · The cliff

21 strategies, one phase transition

Uniform SVD across every layer. Between rank 384 and 256, F1 falls off the cliff.

The right-hand surface is F1 over the (rank × depth) plane. Each cell is a configuration. Color is F1 retention.

Late layers (8–11) plunge into the void before the early ones. That asymmetry is what motivates the informed compression in Part 07.

The greedy algorithm — fed empirical sensitivity data — dominates the Pareto frontier on 8 of 9 optimal points across 21 evaluated strategies. At 80 % parameters it retains 87 % of F1; uniform compression at the same ratio retains 43 %.
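
What "dominates the Pareto frontier" means can be sketched as a filter over (parameters kept, F1 retained) pairs. Only the two rows marked as quoted come from the text. The other two are made up so the example has something to filter, and `pareto_front` is a hypothetical helper.

```python
def pareto_front(points):
    """Return the non-dominated (params_kept, f1_retention, name) tuples.

    A point is dominated when another one keeps no more parameters and retains
    at least as much F1, and is strictly better on one of the two.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] >= p[1] and (q[0] < p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

strategies = [
    (0.80, 0.87, "greedy"),      # figures quoted in the text
    (0.80, 0.43, "uniform"),     # figures quoted in the text
    (0.90, 0.85, "made-up A"),   # illustrative only
    (0.60, 0.30, "made-up B"),   # illustrative only
]
front = pareto_front(strategies)
```

Here the uniform point drops out because greedy retains more F1 at the same parameter budget.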

Pareto frontier: 21 evaluated strategies. Data: compression_comparison.csv. Axes: uniform rank, layer depth, color F1 macro.

Fig. 02.3

§ 02.4 · Asymmetry

14× gap between Q and FFN

Isolating which component type survives and which breaks.

To see each type's effect on its own, the 12 matrices of a single type get compressed to a fixed rank while everything else stays intact. The experiment is repeated for the six types and three ranks.
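
For a sense of scale, here is a rough count of how many weights each matrix type keeps at a given rank. It assumes standard BERT-base shapes (hidden size 768, FFN width 3072), a rank-k matrix stored as two factors with k·(m + n) weights, and biases ignored. The thesis may count parameters differently, and the three ranks in the loop are examples.

```python
# Weight shapes per encoder layer in BERT-base (biases ignored).
SHAPES = {
    "Q": (768, 768), "K": (768, 768), "V": (768, 768), "Attn-O": (768, 768),
    "FFN-i": (768, 3072), "FFN-o": (3072, 768),
}

def params_kept(shape, k):
    """Fraction of weights left when an m x n matrix is stored as two rank-k factors."""
    m, n = shape
    return min(1.0, k * (m + n) / (m * n))

for k in (256, 128, 64):  # example ranks, not necessarily the three used in the thesis
    row = ", ".join(f"{name}: {params_kept(shape, k):.1%}" for name, shape in SHAPES.items())
    print(f"rank {k} -> {row}")
```

Under these assumptions a 768 × 768 matrix at rank 128 keeps one third of its weights, while a 768 × 3072 FFN matrix keeps about 20.8 %. The same rank removes a different share of parameters depending on the matrix type.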

At rank 128 the differences are huge. Q keeps 99.4 % of F1, K 98.4 %. FFN Intermediate, at the same rank, drops to 6.9 %. That's 14× the absolute retention between one and the other, 72× once you normalise by parameters eliminated.

Three regimes show up. Q and K are immune: the spectrum is concentrated, effective dimensionality is low, and degradation with rank is roughly linear. V and Attn-O hit a cliff: they survive rank 256 but collapse the moment you drop to 64. The two FFNs are fragile from the start: a flat spectrum, every dimension contributing something, and at rank 256 they're already broken.

The spectral structure predicts compression behaviour fairly accurately. Treating Q the same as FFN is suboptimal by up to 72×.

Data: notebook 3, component_sensitivity.csv. F1 retention from uniformly compressing only the 12 matrices of one type.

Fig. 02.4

§ 02.5 · Topography

Spectral asymmetry as a landscape

All 72 weight matrices stacked as rows of a relief map.

Each row is one weight matrix. X-axis: singular-value index. Z-axis: normalised magnitude σᵢ/σ₁.

Q and K form sharp peaks: a few singular values dominate, concentrated spectrum, low effective rank, easy to compress. The FFN matrices form near-flat plateaus: distributed spectrum, every dimension contributes, fragile under SVD.

The diamonds mark k95 per matrix. Q/K hover around 395. FFN around 620. It's Table 6 of the thesis, but in 3D.

SVD computed on the 23emo-final checkpoint. 72 matrices = 12 layers × 6 components (Q, K, V, Attn-O, FFN-i, FFN-o).

Fig. 02.5

Part 03

Where the emotion forms

Part 02 measured what breaks. This part measures where it forms. For each of the 23 emotions, for each of the 13 intermediate representations of the model, how much information is already there, linearly accessible?

Layer-wise probing, 3D geometry, logit lens. Three ways of asking the same question. The answers — at first glance — don't fit together.

§ 03.1 · Probing

Layer-wise crystallisation

When each emotion shows up inside the model.

For each layer you train a linear classifier on top of [CLS]. It tells you how much of each emotion is linearly recoverable at that point.

Gratitude shows up as early as L0. The vocabulary is obvious ("thanks", "thank you"). Realization holds out until L11. It needs the full context to be told apart.

Frequency in the dataset does NOT predict depth. Annoyance has 3× more examples than disgust, yet crystallises 6 layers later. What matters is semantic complexity, not data volume.

The diamonds mark the crystallisation layer — where F1 reaches 80 % of its maximum. The left ribbon is each emotion's psychological cluster.
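
The 80 % rule is easy to state as code. This sketch takes one emotion's probe F1 per layer and returns the first layer that reaches the threshold. The two curves are invented for illustration and do not come from probe_results.csv.

```python
def crystallisation_layer(f1_by_layer, fraction=0.8):
    """Index of the first layer whose probe F1 reaches `fraction` of the curve's maximum."""
    target = fraction * max(f1_by_layer)
    for layer, f1 in enumerate(f1_by_layer):
        if f1 >= target:
            return layer
    raise ValueError("no layer reaches the target")

# Made-up curves: one emotion that is readable from the start, one that is not.
early = [0.70, 0.75, 0.80, 0.82, 0.85, 0.85]
late = [0.05, 0.08, 0.10, 0.15, 0.30, 0.45]

print(crystallisation_layer(early), crystallisation_layer(late))
```

Because the threshold is relative to each emotion's own maximum, an emotion with a low final F1 can still crystallise early.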

The derivative tells a second story. From the embedding (F1 = 0) to L0, mean probe F1 jumps to 0.349 — 61 % of the model's final separability (0.569 at L11) absorbed in a single layer. L1–L7 add modest gains (+0.01 to +0.05). L10 and L11 contribute almost nothing (< 0.005). Late layers barely add information; you need a different experiment to see what they're doing.

Probing says the late layers don't add new information. Activation patching, further down, says they're the only ones sufficient to revive the model from collapse. An apparent paradox. Part 05 resolves it: what they do isn't create signal, it's rotate it.

Layer-wise linear probing, 23 emotions × 13 layers. Data: probe_results.csv from notebook 4.

Fig. 03.1

§ 03.2 · Geometry

Galaxy formation

23 emotions crystallising in LDA space, layer by layer.

Take the [CLS] from each layer, apply the model's real pooler (Linear + tanh), and project with supervised LDA fitted at L12. The three axes are the directions that best separate the 23 emotions.

At L0 the 23 diamonds (one centroid per emotion) sit almost on top of the origin. By L11 they occupy distinct regions, next to the sentences they belong to.

Quantitative: separation ratio of 4.3 (centroids are 4× farther apart than the within-cluster spread). Nearest-centroid accuracy of 40 % vs a 4 % random baseline.
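
One plausible way to compute both numbers, sketched on synthetic 3D points. It assumes one label per sentence, a separation ratio defined as mean pairwise centroid distance over mean point-to-own-centroid distance, and Euclidean nearest-centroid prediction. The thesis may define either quantity differently, and `centroid_stats` is a hypothetical name.

```python
import numpy as np

def centroid_stats(X: np.ndarray, y: np.ndarray):
    """Return (separation ratio, nearest-centroid accuracy) for points X with labels y."""
    labels = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in labels])
    # Between-cluster scale: mean distance over all pairs of distinct centroids.
    pair_dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    between = pair_dist[np.triu_indices(len(labels), k=1)].mean()
    # Within-cluster scale: mean distance from each point to its own centroid.
    own = centroids[np.searchsorted(labels, y)]
    within = np.linalg.norm(X - own, axis=1).mean()
    # Nearest-centroid prediction for every point.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    accuracy = (labels[dists.argmin(axis=1)] == y).mean()
    return between / within, accuracy

# Synthetic stand-in for the projected [CLS] vectors: 23 classes, 100 points each.
rng = np.random.default_rng(0)
centres = rng.normal(scale=5.0, size=(23, 3))
y = np.repeat(np.arange(23), 100)
X = centres[y] + rng.normal(size=(2300, 3))

ratio, accuracy = centroid_stats(X, y)
```

With 23 classes, a uniform random guess lands at 1/23, about 4.3 %, which is the baseline the text compares against.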

2,300 test-set sentences. Pooler applied, fixed LDA-3D fitted at L12. Same coordinates across every layer.

Fig. 03.2

§ 03.3 · Logit lens

The U curve

Applying the real classifier to every layer, not just the last.

Logit lens. Originally Nostalgebraist (2020), refined by Belrose et al. (NeurIPS 2023). You take the model's real pooler + classifier and apply them to all 13 intermediate representations.
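
In array terms the recipe is one function applied to every layer. The sketch uses random stand-in weights with BERT-base shapes. In practice the pooler and classifier weights would be read from the fine-tuned checkpoint, and every name here is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_lens(cls_by_layer, W_pool, b_pool, W_clf, b_clf):
    """Apply the same pooler (Linear + tanh) and classifier to every layer's [CLS].

    cls_by_layer: (n_layers, n_sentences, hidden) -> (n_layers, n_sentences, n_labels)
    """
    pooled = np.tanh(cls_by_layer @ W_pool.T + b_pool)
    return sigmoid(pooled @ W_clf.T + b_clf)

# Random stand-ins: 13 representations, 8 sentences, 768 dims, 23 labels.
rng = np.random.default_rng(0)
n_layers, n_sentences, hidden, n_labels = 13, 8, 768, 23
cls = rng.normal(size=(n_layers, n_sentences, hidden))
W_pool, b_pool = rng.normal(scale=0.02, size=(hidden, hidden)), np.zeros(hidden)
W_clf, b_clf = rng.normal(scale=0.02, size=(n_labels, hidden)), np.zeros(n_labels)

probs = logit_lens(cls, W_pool, b_pool, W_clf, b_clf)
mean_top1 = probs.max(axis=-1).mean(axis=1)   # one value per layer
```

The key design point is that nothing is retrained: the head that was fitted on the final layer is reused unchanged on every earlier one.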

Averaged over 2,300 sentences, the mean sigmoid traces a U. Layers 0–3: [CLS] statistics saturate the pooler's tanh — several emotions fire at once with medium magnitude. Layers 4–9: transition zone, tanh near zero, every sigmoid collapses. Layers 10–11: [CLS] reaches its trained regime and one emotion jumps.

This is why activation-patching layer 11 alone recovers 100 % of F1: what you're restoring is exactly this calibration.

Mean sigmoid of pooler+classifier applied per layer. Top-1, gold, sum of all 23. 2,300 sentences.

Fig. 03.3

§ 03.4 · Comparison

What it knows vs what it can read

Probing and logit lens answer different questions. The gap between them is informative.

Two curves, three panels. Both measure emotion information at every layer. Same data. Different questions.

The linear probe (red) is a fresh classifier trained on each layer. It tells you how much information is there, linearly recoverable. It rises monotonically: information accumulates with depth.

The logit lens (green) is the head trained at L11, applied at each layer. It makes a U because it can only read L11-style activations. Early and middle layers fool it even though the information is there.

The grey band between the two curves is the gap: information that exists but the model doesn't use. That's why restoring just L11 via activation patching recovers 100 % of F1 — the rest of the model isn't missing information, it's missing a head that knows how to read it.

The three panels split emotions by crystallisation layer. Early: probe is already high at L0 but the lens hasn't decided — maximum gap. Late: probe and lens rise together — minimum gap.

Probe F1 macro from notebook 4 vs logit lens (real pooler+classifier applied to [CLS] at each layer) over 2,300 test-set sentences. Phase bands shared with the U curve.

Fig. 03.4

Part 04

144 heads

Each layer has 12 attention heads. 12 × 12 = 144 in total. Each one with its own pattern and its own role.

Not all are equal. Not all are critical. Not all do the same job.

§ 04.1 · Categories

The 144 heads

Critical specialist, critical generalist, minor, dispensable.

Each head is a cell in a 12 × 12 matrix. Color = category from the ablation study. Layer 11 (bottom row) has zero dispensable heads. All 12 are critical.

The black dots mark the heads each emotion needs specifically, from Table 19. L11-H6 appears twice — shared between sadness and realization.

77 % of heads in layers 8–11 are critical. In layers 0–4 it's only 27 %. Layer 11 as a whole (12 heads) has ZERO dispensable heads. And there are 21 interfering heads — ablating them IMPROVES F1.

12 layers × 12 heads = 144 cells. Data: head_categories.csv from notebook 6.

Fig. 04.1

§ 04.2 · Patterns

Attention atlas

All 144 heads, simultaneously, over one sentence.

Pick a sentence. The system shows you all 144 attention patterns at once. Each cell's border encodes its functional category.

Recognisable shapes. Early layers attend diagonally: token to itself, neighbours. Late layers concentrate attention on [CLS] or [SEP] — vertical stripes. The classic aggregator pattern.

Click any head to enlarge it with token labels.

Real attention weights over 46 test-set sentences. 12 layers × 12 heads = 144 mini-maps.

Fig. 04.2

Part 05

Three concepts, not one

The techniques in Part 03 seem to contradict each other. Probing says emotion forms early. Logit lens says that in the middle layers the information exists but the classifier can't read it. And activation patching, covered later in this part, says that restoring just layer 11 is enough to recover 100 % of F1.

All three measurements are consistent — but only if you tell apart three concepts the literature tends to conflate.

§ 05.1 · Diagram

Representation · Alignment · Access

A tripartite clarification that resolves the paradox between techniques.

The thesis calls this the tripartite clarification, and counts it as a methodological contribution in its own right. Three related but distinct concepts, measured by three distinct techniques. Conflate them and the conclusions look contradictory.

Representation. What information is present at a layer, linearly accessible. Measured by probing: a fresh classifier trained on each layer. It rises monotonically with depth: L0 already reaches 61 % of the final separability, L11 reaches 100 %.

Alignment. Whether that information is projected onto the basis the trained classifier operates on. Measured by the logit lens: the head trained on L11, applied at every layer. It traces a U: high, valley, high. The head can read the layer 11 representation; it cannot read the middle layers.

Access. Which components are functionally sufficient for the classifier to predict correctly. Measured by activation patching — restoring weights one at a time from a collapsed model. The FFN of layer 11 alone: 100 % of F1.

Information being present (representation) doesn't mean it's in the format the classifier needs (alignment), and neither tells you which component is functionally sufficient for the classifier to work (access). Three questions, three answers, one underlying organisation: the geometric bottleneck lives in layer 11.
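The alignment measurement can be sketched in a few lines. This is a minimal illustration, assuming a BERT-style pooler (dense layer plus tanh) and a linear multi-label classifier with sigmoid outputs. The function name `logit_lens`, the toy shapes and the random weights are placeholders, not the thesis code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit_lens(cls_per_layer, w_pool, b_pool, w_cls, b_cls):
    """Apply one frozen pooler + classifier to the [CLS] vector of every layer.

    cls_per_layer: (n_layers, hidden) array, one [CLS] vector per layer.
    Returns an (n_layers, n_labels) array of per-label sigmoid probabilities.
    """
    pooled = np.tanh(cls_per_layer @ w_pool.T + b_pool)   # BERT-style pooler
    return sigmoid(pooled @ w_cls.T + b_cls)              # multi-label head

# Toy stand-ins (random, for shape checking only): 12 layers, hidden size 16, 23 labels.
rng = np.random.default_rng(0)
n_layers, hidden, n_labels = 12, 16, 23
cls_per_layer = rng.normal(size=(n_layers, hidden))
w_pool, b_pool = rng.normal(size=(hidden, hidden)), np.zeros(hidden)
w_cls, b_cls = rng.normal(size=(n_labels, hidden)), np.zeros(n_labels)

probs = logit_lens(cls_per_layer, w_pool, b_pool, w_cls, b_cls)
sigmoid_sum = probs.sum(axis=1)   # one summed-sigmoid value per layer
```

The key design point is that the head is never retrained: the same weights are reused at every depth, so a low score at a middle layer says the representation is misaligned with this particular head, not that the information is absent.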

Representation · What information is present at this layer, linearly accessible? · Measured by linear probing (fresh classifier trained on each layer) · Rises monotonically: L0 → 61 %, L11 → 100 %.

Alignment · Is that information projected onto the classifier's basis? · Measured by logit lens (L11 head applied at every layer) · U-shaped curve: middle layers unreadable.

Access · Is this component sufficient for the classifier to operate? · Measured by activation patching (restore weights from a collapsed model) · L11 FFN alone: 100 % of F1.

The three measurements are consistent. Conflating representation with access, or alignment with presence, produces the apparent paradoxes the thesis resolves.

Original diagram from the thesis, section 5.2.3. The two figures that follow (lesion theatre and per-component patching) are the experimental evidence for the access axis.

Fig. 05.1

§ 05.2 · Lesion

Lesion theatre

Restore the model layer by layer and watch it come back.

Start with a model at F1 = 0. Every bar at zero. Each stage restores the original weights of one more layer. 12 stages in total.

The first 8 stages move nothing. L8 sets off a flicker (0.1 % restoration). L9 lifts the lexical emotions (4.1 %). L10 recovers 24.1 %. L11 blows every bar back to baseline simultaneously: 100 % restoration with a single layer.

And it goes further. Restoring ONLY the FFN of L11 (not attention) already gets 100 %. L11 attention alone, only 63.3 %. The model's emotional capacity isn't distributed. It lives concentrated in one specific sub-component of the last layer.
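A minimal sketch of a component-wise restoration loop, assuming the collapsed and original models can be represented as dictionaries of named weight blocks and that restoration is reported as the share of the collapsed-to-baseline F1 gap that a patch recovers. The names `patch_each` and `restoration_pct` and the toy `evaluate` function are hypothetical; the toy scores simply replay the L11 figures quoted above.

```python
def restoration_pct(f1_patched, f1_collapsed, f1_baseline):
    """Share of the collapsed-to-baseline F1 gap recovered by a patch, in percent."""
    return 100.0 * (f1_patched - f1_collapsed) / (f1_baseline - f1_collapsed)

def patch_each(collapsed, original, evaluate):
    """Restore one named block at a time into the collapsed model and score it."""
    scores = {}
    for name in original:
        patched = dict(collapsed)          # copy, so patches do not accumulate
        patched[name] = original[name]     # restore only this block
        scores[name] = evaluate(patched)
    return scores

# Toy stand-in: weights are just labels, and evaluate() replays the reported L11 numbers.
original = {"L11.attention": "orig", "L11.ffn": "orig"}
collapsed = {name: "collapsed" for name in original}
BASELINE_F1 = 0.577

def evaluate(weights):
    if weights["L11.ffn"] == "orig":
        return BASELINE_F1                 # FFN alone: 100 % restoration
    if weights["L11.attention"] == "orig":
        return 0.633 * BASELINE_F1         # attention alone: 63.3 %
    return 0.0                             # fully collapsed model

f1_collapsed = evaluate(collapsed)
scores = patch_each(collapsed, original, evaluate)
restored = {k: restoration_pct(v, f1_collapsed, BASELINE_F1) for k, v in scores.items()}
```

In a real run `evaluate` would load the patched weights into the model and compute F1 macro on the test set; everything else in the loop stays the same.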

Sequential layer-wise activation patching. F1 macro and per emotion. 12 stages plus initial state.

Fig. 05.2

§ 05.3 · By component

L11 FFN alone is enough

Patching not by full layer but by sub-component.

The previous figure restores a whole layer (all 6 matrices). What happens if we restore only part of it: just the attention block, or just the FFN? This is the finer-grained version of the experiment.

From L8 to L10 both columns grow slowly and in lockstep. But in L11 they split: attention alone gets 63 %, FFN alone gets 100 %. One sub-component, roughly two thirds of L11's parameters, is enough to revive a totally collapsed model.
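For scale, the weight counts of the two blocks can be worked out from the standard BERT-base shapes (hidden size 768, intermediate size 3,072, matching the 3,072 FFN neurons per layer). A small sketch, counting weights and biases of the six matrices and leaving the LayerNorm parameters out for simplicity:

```python
HIDDEN, INTERMEDIATE = 768, 3072

def linear_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out

# Attention block: query, key, value and output projections (4 matrices).
attention = 4 * linear_params(HIDDEN, HIDDEN)

# FFN block: intermediate (768 -> 3072) and output (3072 -> 768) projections (2 matrices).
ffn = linear_params(HIDDEN, INTERMEDIATE) + linear_params(INTERMEDIATE, HIDDEN)

ffn_share = ffn / (attention + ffn)
print(f"attention: {attention:,}  ffn: {ffn:,}  ffn share: {ffn_share:.1%}")
```

Under these shapes the FFN block holds about two thirds of a layer's weights and the attention block about one third.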

This goes against the "Attention Is All You Need" narrative. For emotion classification on this fine-tune, what's critical is the late FFN. Attention helps; FFN decides.

L11's FFN executes a geometric rotation that takes the representation onto the basis where the classifier operates. That's why it's sufficient, and why it's the one block in the model that should NEVER be compressed.

Data: notebook 5, activation_patching_per_component.csv. Mean restoration over 23 emotions. Layers 0–7 omitted (each restores 0 % when patched individually).

Fig. 05.3

Part 06

The convergence

The thesis makes a strong claim: the central finding of the work isn't any single result but their convergence. Five mechanistic interpretability techniques, two compression families and a blind greedy algorithm all point to the same functional axis: layers 8–11, and especially their FFNs, are where the model's emotional capacity lives.

This is where everything fits together. The emotional neurons by selectivity. The emotional landscape crossing intensity and crystallisation. A single sentence traversing all 12 layers at once. And a diagram that reconciles the five techniques into one architecture.

§ 06.1 · Neurons

Where the emotional neurons live

Per-neuron selectivity: 84 % of significant ones in L8–L11.

For each of the 36,864 intermediate neurons in the model (12 layers × 3,072) we compute a Cohen's-d–style score: how different is its activation when an emotion is present versus absent? We call significant those with |d| > 2.0.
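A minimal sketch of such a selectivity score, assuming the classic Cohen's d with a pooled standard deviation (the exact variant behind the reported numbers may differ). The function name `selectivity_d` and the toy activations are hypothetical.

```python
import numpy as np

def selectivity_d(activations, present):
    """Cohen's d per neuron between samples where an emotion is present vs absent.

    activations: (n_samples, n_neurons) array of intermediate FFN activations.
    present:     (n_samples,) boolean mask, True where the emotion is labelled.
    """
    pos, neg = activations[present], activations[~present]
    n_pos, n_neg = len(pos), len(neg)
    pooled_var = ((n_pos - 1) * pos.var(axis=0, ddof=1)
                  + (n_neg - 1) * neg.var(axis=0, ddof=1)) / (n_pos + n_neg - 2)
    return (pos.mean(axis=0) - neg.mean(axis=0)) / np.sqrt(pooled_var)

# Toy data: 200 samples, 5 neurons; neuron 0 is shifted by 3 standard deviations
# whenever the (hypothetical) emotion is present.
rng = np.random.default_rng(0)
present = np.arange(200) < 100
activations = rng.normal(size=(200, 5))
activations[present, 0] += 3.0

d = selectivity_d(activations, present)
significant = np.abs(d) > 2.0      # the threshold used in the text
```

Run once per emotion, this yields one d value per (neuron, emotion) pair; counting the entries above the threshold by layer gives a depth distribution of the kind reported here.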

There are 3,642 in total. The depth distribution is extreme: 11 in layers 0–3 (0.3 %), 570 in layers 4–7 (16 %), 3,061 in layers 8–11 (84 %). Layer 11 alone contains 1,127 — more than the entire bottom half of the model combined.

By emotion the imbalance is also brutal. Gratitude has 818 dedicated neurons, max selectivity 6.88. Remorse 442. Love 399. At the other extreme, annoyance, disappointment and realization have ZERO significant neurons. Their distributed representation explains why they're the most fragile under ANY model perturbation.

The selectivity-vector norm is the best predictor of F1 drop under SVD (Spearman ρ = 0.64, p = 0.001). The emotions written "in bold" inside the weights are the ones that suffer most under any compression.
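For readers who want to reproduce this kind of rank correlation, here is a dependency-free sketch of Spearman's ρ, computed as the Pearson correlation of the two rank vectors with ties sharing their average rank. The five data points are invented for illustration and are not the thesis values.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values for five emotions (not the thesis data).
selectivity_norm = [6.9, 4.1, 3.8, 0.5, 0.2]
f1_drop = [0.30, 0.25, 0.28, 0.05, 0.10]
rho = spearman(selectivity_norm, f1_drop)   # 0.8 for this toy example
```

Because only ranks enter the computation, the statistic is insensitive to the scale of either variable, which suits a comparison between a vector norm and an F1 difference.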

Data: notebook 7. neuron_significant_counts.csv and neuron_catalog.csv. Real counts on the test set of the 23emo-final checkpoint.

Fig. 06.1

§ 06.2 · Map

The emotional landscape

Each emotion plotted in (crystallisation × intensity).

The 23 emotions placed in a 2D plane. X-axis: crystallisation layer. Y-axis: selectivity norm.

Top-left quadrant: gratitude, love. Early and intense. Bottom-right quadrant: realization, disappointment. Late and diffuse.

The dotted line connects sadness and realization. They share L11-H6.

Pick an emotion in the right-hand menu to see its radial fingerprint — 6 functional dimensions on a polar plot.

23 emotions on the (crystallisation × selectivity-norm) plane. Data: crystallization_layers.csv and neuron_catalog.csv.

Fig. 06.2

§ 06.3 · Synthesis

One sentence, four views

The experiment promised at the start: watching BERT think.

All the previous views come together in a single experience. Pick a sentence. Drag the layer slider. The four panels move in sync: the 3D trajectory of [CLS], the attention of the 3 most critical heads, the multi-label sigmoids, and the curve for the gold label.

Hit Play. [CLS] starts near the origin and slides toward its centroid. The critical heads activate layer by layer. The petals jump from valley to crystallisation. The gold curve traces the U.

Four synchronised panels. Real data from the 23emo-final model applied live to the selected sentence.

Fig. 06.3

§ 06.4 · Diagram

Five techniques, one architecture

What the thesis calls the central finding of the work.

Five interpretability techniques operate at different granularities and measure different things. Yet they point to the same functional architecture. The convergence, not any single result, is what makes the conclusion robust.

Early layers (Emb–L2): raw lexical signal extraction. Probing absorbs 61 % of final separability in L0 alone. Middle layers (L3–L7): transition computation in a geometry misaligned with the classifier. The logit lens dips into its valley, critical heads are a minority (27 %), selective neurons are the exception (16 %). Late layers (L8–L11): geometric alignment with the classifier basis. 77 % of heads critical, 84 % of neurons significant, L11's FFN alone recovers 100 % of F1.

A sixth line of evidence is orthogonal to the five techniques above: the greedy algorithm in Part 07, operating blindly on numerical sensitivity data without touching any interpretability technique, independently rediscovers this same structure. It compresses Q/K first and never touches the late-layer FFN Intermediate. Two independent lines arriving at the same answer.

Five independent methodologies reaching the same conclusion make the conclusion more robust than any of them alone. And the functional organisation that emerges — progressive crystallisation, late-FFN dominance, the logit-lens U — wasn't programmed. It came out on its own. Mechanistic interpretability documents what gradient descent decided, not what anyone prescribed.

Technique (metric) · Early (Emb · L0–L2): raw lexical signal · Middle (L3–L7): transition computation · Late (L8–L11): classifier alignment

Probing (F1 per layer) · Early: L0 absorbs 61 % · Middle: +0.01 to +0.05 · Late: near zero

Logit lens (Σ sigmoids) · Early: 5.4, diffuse · Middle: 0.2, collapsed · Late: 1.3, focused

Activation patching (% F1 restored) · Early: 0 % · Middle: 0 % · Late: 100 % at L11

Head ablation (% critical heads) · Early: 27 % · Middle: 53 % · Late: 77 % (L11 = 100 %)

Neuron selectivity (neurons with |d| > 2) · Early: 11 (0.3 %) · Middle: 570 (16 %) · Late: 3,061 (84 %)

Mark density = magnitude of the finding. Five independent rows, one dominant column.

Synthesis diagram from Chapter 5 of the thesis. Five techniques in rows, three functional depth bands in columns. Mark density encodes the criticality documented by each method.

Fig. 06.4

Part 07

Informed compression

The project's prescriptive question: can we use what we now know about the model to compress it better?

Before jumping to the final algorithm we have to describe the attempt that did NOT work. Three hand-written heuristics built from the interpretability findings. Result: they collapse exactly onto the blind baselines. Knowing WHAT to measure isn't enough: you have to measure HOW MUCH.

§ 07.1 · Negative result

The heuristics collapse onto the blind baselines

Three informed rules, exactly on top of uniform_r256 and r512.

The three heuristics were written before the greedy algorithm. The idea was direct: if late layers are critical, protect them; if Q and K are immune, compress them first. Three aggressiveness levels. Experimental result:

informed_aggressive matches uniform_r256 exactly (same ratio 0.612, same F1 0.025). informed_moderate matches uniform_r512 (ratio 1.000, F1 0.464). And informed_light needs ratio 1.285, that is, more parameters than the original model. Three hand-written rules and the three points fall literally on top of the blind baselines.
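The two uniform ratios can be reproduced with a short count, under assumptions that are not stated in the text: only the six weight matrices of each encoder layer are replaced by two rank-r factors, everything else (embeddings, biases, LayerNorm, pooler, classifier) stays dense, and the total is the 109.5 million parameters quoted for the model.

```python
TOTAL_PARAMS = 109_500_000          # model size quoted in the text
N_LAYERS, HIDDEN, INTERMEDIATE = 12, 768, 3072

# Shapes of the six matrices in one encoder layer:
# Q, K, V, attention output, FFN intermediate, FFN output.
SHAPES = [(HIDDEN, HIDDEN)] * 4 + [(HIDDEN, INTERMEDIATE), (INTERMEDIATE, HIDDEN)]

def dense_weights():
    return N_LAYERS * sum(rows * cols for rows, cols in SHAPES)

def factored_weights(rank):
    """A rank-r factorisation W ≈ A @ B stores r * (rows + cols) numbers."""
    return N_LAYERS * sum(rank * (rows + cols) for rows, cols in SHAPES)

def param_ratio(rank):
    """Compressed model size relative to the original, for one uniform rank."""
    return (TOTAL_PARAMS - dense_weights() + factored_weights(rank)) / TOTAL_PARAMS

for rank in (256, 512):
    print(rank, round(param_ratio(rank), 3))
```

Under these assumptions the count gives 0.612 at rank 256 and exactly 1.000 at rank 512. At rank 512 the two factors of every matrix hold as many numbers as the dense matrix they replace, so nothing is saved, and any rank above 512 costs more parameters than the original.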

None of them lands on the Pareto frontier. Knowing qualitatively what's important doesn't translate by itself into an optimal rank assignment. Qualitative interpretability identifies the variables that matter; the actual numeric values have to come from empirical sensitivity data.

It's a negative result, but an informative one. It would have been easy to spin the story as a success if the heuristics had worked. The whole arc — heuristics don't add anything, pivot to data-driven, greedy dominates Pareto — is in itself a methodological observation that travels well.

Data: notebook 9, compression_comparison.csv. 21 evaluated strategies: 6 uniform, 4 adaptive, 3 heuristic, 8 greedy.

Fig. 07.1

§ 07.2 · Algorithm

Greedy in action

How the algorithm builds the compression step by step.

Greedy picks moves by efficiency: parameters saved / F1 cost. Here you watch it work. Start at baseline (every rank 768) and step through: greedy_95 → greedy_90 → … → greedy_50.
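The selection rule can be sketched as follows. This is an illustrative reading, not the thesis implementation: it assumes per-matrix F1 costs are additive, that a move may jump to any lower rank, and that the procedure stops when a total F1 budget is exhausted. The function `greedy_ranks`, the stopping rule and the sensitivity table are all hypothetical.

```python
def greedy_ranks(shapes, rank_steps, f1_cost, f1_budget, eps=1e-9):
    """Lower ranks one move at a time, always taking the most efficient move.

    shapes:     {matrix_name: (rows, cols)}
    rank_steps: allowed ranks in descending order, starting at full rank
    f1_cost:    function (matrix_name, rank) -> estimated F1 lost at that rank
    f1_budget:  total F1 the procedure is allowed to give up (assumed stop rule)
    """
    def params(name, rank):
        rows, cols = shapes[name]
        return min(rows * cols, rank * (rows + cols))   # keep dense if factoring is larger

    ranks = {name: rank_steps[0] for name in shapes}
    spent = 0.0
    while True:
        best = None
        for name, current in ranks.items():
            for rank in rank_steps:
                if rank >= current:
                    continue
                saved = params(name, current) - params(name, rank)
                extra = max(f1_cost(name, rank) - f1_cost(name, current), 0.0)
                if saved <= 0 or spent + extra > f1_budget:
                    continue
                efficiency = saved / (extra + eps)       # parameters saved per unit of F1
                if best is None or efficiency > best[0]:
                    best = (efficiency, name, rank, extra)
        if best is None:
            return ranks
        _, name, rank, extra = best
        ranks[name] = rank
        spent += extra

# Hypothetical sensitivity table: Q is free to compress, the late FFN is expensive.
shapes = {"L0.query": (768, 768), "L11.ffn_intermediate": (768, 3072)}
costs = {("L0.query", 512): 0.0, ("L0.query", 256): 0.0,
         ("L11.ffn_intermediate", 512): 0.02, ("L11.ffn_intermediate", 256): 0.05}
f1_cost = lambda name, rank: costs.get((name, rank), 0.0)   # full rank costs nothing

result = greedy_ranks(shapes, [768, 512, 256], f1_cost, f1_budget=0.01)
```

With this toy table the query matrix drops to rank 256 at no cost while the late FFN matrix stays at full rank. The small `eps` keeps the efficiency finite when a move has zero measured F1 cost.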

The 12 × 6 matrix lights up cell by cell. The first decisions are Q and K — free, zero F1 cost. Exactly what §4.3 predicts about Q/K immunity. Then FFN-output in early layers. The late layers (8–11) stay untouched until the very end.

The right-hand line tracks F1 vs compression ratio at each step. Algorithmic proof that greedy reproduces the interpretability findings without access to them. It figures it out from empirical sensitivity data alone.

Replay of the greedy algorithm across greedy_50, _60, …, _95. Data: greedy_*_ranks.csv from notebook 9.

Fig. 07.2

§ 07.3 · Recovery

The compressed model comes back

Greedy_90 + 3 epochs of fine-tuning beats the baseline.

Starting point: greedy_90, 86.4 % of baseline parameters. F1 macro 0.539. We'd lost 6.7 % of performance, which would be the reasonable cost of compression.

Three epochs of fine-tuning later, F1 macro 0.591. Above the original baseline (0.577) with 13.6 % fewer parameters. Compression isn't a cost; it's behaving like implicit regularisation.

The gain concentrates where it's needed most. Embarrassment goes from F1 0.267 to 0.509 — a 90 % relative jump, on an emotion with only 303 training examples. Desire goes up 16 %, excitement 9.5 %, realization 10 %. The most plausible reading is that SVD removed noisy directions where the baseline had memorised spurious patterns for emotions with little data behind them.
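The mechanism behind that reading can be illustrated on a toy matrix: truncating the SVD of a noisy low-rank matrix discards the small singular directions, which mostly carry noise. This is a generic numerical illustration with synthetic data, not an experiment from the thesis, and it does not by itself establish that the same effect explains the F1 gain.

```python
import numpy as np

def truncate_svd(matrix, rank):
    """Best rank-`rank` approximation of `matrix` in Frobenius norm."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
# Synthetic rank-2 "signal" plus small dense noise.
signal = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 40))
noisy = signal + 0.1 * rng.normal(size=(50, 40))

denoised = truncate_svd(noisy, rank=2)
err_noisy = np.linalg.norm(noisy - signal)        # distance before truncation
err_denoised = np.linalg.norm(denoised - signal)  # distance after truncation
```

Comparing `err_denoised` with `err_noisy` shows how much of the added noise the truncation removes when the signal is genuinely low-rank and well separated from the noise.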

It's a preliminary result — one model, one task, no control group with extra epochs. But the direction is the opposite of what's typically assumed about compression.

Data: notebook 9, finetuning_recovery.csv. F1 baseline / compressed (greedy_90) / fine-tuned per emotion.

Fig. 07.3

Part 08

Five falsifiable predictions

Everything above runs on one model (BERT-base) and one task (GoEmotions). Whether the observations generalise isn't something this work can prove: what it can do is spell out testable predictions.

Five quantitative predictions, each refutable under different conditions. Their value isn't in being right — it's in stating ahead of time what evidence would break them.

§ 08.1 · Diagram

What the framework commits to predicting

Five orthogonal hypotheses: each one fails under different conditions.

The five predictions follow from the framework of the thesis. Any future replication on other models or tasks can confirm them or refute them, empirically delimiting which findings are specific to this case and which are more general structural properties.

Each prediction has a precise condition under which it breaks. The fifth one is the most practically useful: if a qualitative heuristic jumps ahead of the blind frontier in some other domain, it falsifies the work's central methodological observation and frees the community from having to pivot to data-driven methods.

01 · Architectural

Spectral asymmetry in BERT-large

Prediction: The k₉₅(Q)/k₉₅(FFN) ratio should land in [0.55, 0.75]. In BERT-base it is 0.64.

Falsified if: A ratio near 1 falsifies the generalisability of the asymmetry.

02 · Geometric

Decoder-only ≠ encoder-only in patching

Prediction: In GPT-2/LLaMA, patching restoration should spread across several late layers rather than concentrate in one.

Falsified if: A single layer recovers 100 %, in which case the [CLS]-based geometric explanation collapses.

03 · Semantic

Crystallisation robust to architecture

Prediction: The 9/8/6 emotions-per-band split and the ρ = −0.67 correlation with max F1 should replicate on RoBERTa-base.

Falsified if: The ordering changes substantially, which would mean the pre-training corpus drives it, not semantics.

04 · Statistical

Regularisation on rare classes

Prediction: Compression + fine-tuning should also favour under-represented classes on NER or Reuters-21578.

Falsified if: The effect is absent, which would refute implicit regularisation by SVD.

05 · Methodological

Qualitative heuristic = blind baseline

Prediction: Any qualitative rule for assigning ranks will converge onto the blind frontier unless it uses quantitative data.

Falsified if: A heuristic Pareto-dominates the blind frontier in another domain, which would break the central observation.

The five predictions are orthogonal: each fails under different conditions. Joint confirmation is a strict test of the framework; individual refutation marks off which part of the analysis belongs to this case only.

The five predictions from §7.4 of the thesis, grouped by the kind of evidence that would break them: architectural, geometric, semantic, statistical and methodological.

Fig. 08.1

Closing

What emerged

An honest recap: what was found, what couldn't be demonstrated, and where to go next.

01 · Contributions

Three substantive findings

A three-level functional architecture. The early layers carry raw lexical signal; L0 alone accounts for 61 % of the model's final separability. The middle layers compute a transition that probing sees rising but that the classification head cannot yet read. The late layers perform the final rotation, and restoring just the FFN of layer 11 already recovers 100 % of F1 from total collapse. You don't need the whole layer; you need that specific sub-component.

Highly uneven sensitivity across components. Q at rank 128 keeps 99.4 % of F1; FFN Intermediate at the same rank drops to 6.9 %. That is fourteen times the absolute retention, and seventy-two times once normalised by parameters eliminated. Uniform compression falls off a cliff between rank 384 and rank 256, and below rank 128 F1 is exactly zero. As far as I'm aware, this 14×–72× factor had not been quantified in the prior literature on SVD for Transformers.
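To make "parameters eliminated" concrete, here is a minimal sketch of the bookkeeping behind a rank-128 truncation. It assumes the truncated matrix is stored as two factors of shapes m × r and r × n, uses the standard BERT-base shapes (768 × 768 for Q, 768 × 3072 for FFN Intermediate) and ignores biases. The function names are illustrative, and the thesis may count parameters differently, so this does not attempt to reproduce the 72× figure.

```python
def dense_params(m: int, n: int) -> int:
    """Parameters in a dense m x n weight matrix (bias ignored)."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Parameters when W is stored as (m x r) @ (r x n) after rank-r truncation."""
    return r * (m + n)


def removed_fraction(m: int, n: int, r: int) -> float:
    """Share of the dense parameters that the rank-r factorisation removes."""
    return 1 - low_rank_params(m, n, r) / dense_params(m, n)


# Standard BERT-base shapes; the storage scheme is an assumption of this sketch.
SHAPES = {"Q": (768, 768), "FFN Intermediate": (768, 3072)}
RANK = 128

for name, (m, n) in SHAPES.items():
    kept = low_rank_params(m, n, RANK)
    print(f"{name}: {dense_params(m, n)} -> {kept} params "
          f"({removed_fraction(m, n, RANK):.1%} removed)")
```

Under these assumptions the same rank removes a larger share of FFN Intermediate than of Q, which is why the two retention figures are not directly comparable without some normalisation.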

The greedy algorithm accounts for 8 of the 9 Pareto-optimal points. At 80 % of the parameters it retains 87 % of F1, versus 43 % for uniform compression at the same ratio. It also rediscovers the hierarchy that interpretability had found by a different route: it compresses Q and K first and leaves the FFN Intermediate of the late layers untouched. After three epochs of fine-tuning, the model compressed to 86.4 % of the parameters reaches an F1 of 0.591, above the baseline (0.577). The gain concentrates on under-represented emotions: embarrassment goes from 0.267 to 0.509.
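As a reminder of what "Pareto-optimal" means in this setting, the sketch below filters a set of (parameter fraction, F1 retention) points down to the non-dominated ones. Only the two 80 % points come from the text; the other two are hypothetical values added so the filter has something to do, and the names `Point`, `dominates` and `pareto_front` are illustrative rather than taken from the thesis code.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    label: str
    params: float  # fraction of parameters kept (lower is better)
    f1: float      # fraction of baseline F1 retained (higher is better)


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.params <= b.params and a.f1 >= b.f1
    strictly_better = a.params < b.params or a.f1 > b.f1
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [
    Point("greedy@80%", 0.80, 0.87),      # from the text
    Point("uniform@80%", 0.80, 0.43),     # from the text
    Point("hypothetical-A", 0.90, 0.95),  # made-up value, for illustration only
    Point("hypothetical-B", 0.90, 0.80),  # made-up value, for illustration only
]

front = pareto_front(points)
print([p.label for p in front])
```

With these inputs the uniform point is dominated by the greedy one at the same parameter budget, which is the comparison the paragraph above describes.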

02 · Limitations

What couldn't be demonstrated

One model, one task. Everything here was run on BERT-base and GoEmotions. The encoder-only architecture may be favouring the late-layer concentration we document; with BERT-large, RoBERTa, GPT-2 or LLaMA it might not replicate in the same way. This is the top priority for future work, but as of today it is unverified.

Post-hoc compression, not during training. SVD is applied once the model is already fine-tuned. Other routes (structured pruning, quantisation, distillation) interact differently and are not evaluated in combination. The reasonable next step is a pipeline that stacks them, but that is out of scope here.

Limited statistical power. Correlations are computed over n = 23 emotions; that gives confidence for detecting large effects (ρ > 0.556) but not moderate ones. The regularisation observed when fine-tuning the compressed model (+90 % on embarrassment) lacks a control group with extra epochs on the uncompressed baseline, so it is reported as an observation consistent with the hypothesis rather than as established causation.
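One plausible reading of the 0.556 threshold is the smallest correlation detectable with n = 23 under a standard power analysis. The sketch below uses the Fisher z approximation and assumes a two-sided α = 0.05 and 80 % power; the text does not state these settings, so both values (and the function name) are assumptions. The approximation is the usual one for Pearson's r and is only a rough guide for Spearman's ρ.

```python
import math
from statistics import NormalDist


def min_detectable_r(n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest |r| detectable with n samples, via the Fisher z approximation.

    alpha is two-sided; alpha and power are assumed defaults, not values
    confirmed by the thesis.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Fisher z has standard error 1 / sqrt(n - 3); invert with tanh.
    return math.tanh((z_alpha + z_beta) / math.sqrt(n - 3))


print(round(min_detectable_r(23), 3))
```

Under those assumed settings the result comes out at about 0.556, in line with the figure quoted above, which is why moderate correlations cannot be distinguished from noise with only 23 classes.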

Spectral distortion, not neutral noise. Activation patching starts from a structured corruption (SVD to rank 64), not from Gaussian noise as in the original causal tracing of Meng et al. The conclusions are therefore functional (which components are enough to revive the model from collapse) rather than strictly causal in Pearl's sense. It is a distinction worth keeping in mind.

03 · Future work

Falsifiable predictions

Generalisation to other models and tasks. The concrete prediction: the spectral compressibility ratio k₉₅(Q)/k₉₅(FFN) in BERT-large should land in [0.55, 0.75]. In BERT-base it is 0.64. In decoder-only models, activation-patching restoration should spread across several late layers instead of concentrating so heavily on L11. If this does not replicate, the functional-hierarchy hypothesis needs qualifying.
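A replication would need to compute the ratio and check it against the predicted band. The sketch below assumes k₉₅ is the smallest rank whose squared singular values capture 95 % of the total spectral energy; that definition is an assumption here (the thesis may use a different criterion), and `k_energy` and `in_predicted_band` are hypothetical helper names. The singular values themselves would come from an SVD of each weight matrix, for example via `numpy.linalg.svd`.

```python
from typing import Sequence, Tuple


def k_energy(singular_values: Sequence[float], threshold: float = 0.95) -> int:
    """Smallest k such that the top-k squared singular values reach `threshold`
    of the total energy (assumed definition of k95 when threshold=0.95)."""
    energy = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energy)
    if total == 0:
        raise ValueError("all singular values are zero")
    running = 0.0
    for k, e in enumerate(energy, start=1):
        running += e
        if running >= threshold * total:
            return k
    return len(energy)


def in_predicted_band(k_q: int, k_ffn: int,
                      low: float = 0.55, high: float = 0.75) -> Tuple[float, bool]:
    """Return the k95(Q)/k95(FFN) ratio and whether it falls in [low, high]."""
    ratio = k_q / k_ffn
    return ratio, low <= ratio <= high


# Toy spectrum: one dominant direction carries most of the energy.
print(k_energy([3.0, 1.0, 1.0], threshold=0.8))
# Made-up ranks chosen only to give the BERT-base ratio of 0.64.
print(in_predicted_band(64, 100))
```

A ratio near 1, such as `in_predicted_band(98, 100)`, would fall outside the band and count against the prediction.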

Causally verifying regularisation by compression. Three conditions: (i) baseline with 3 extra epochs, (ii) baseline + greedy + 3 epochs (this work), (iii) baseline + 3 epochs with increased dropout and weight decay. If (ii) beats (i) and (iii) on macro F1 and, above all, on under-represented emotions, the implicit-regularisation hypothesis is well supported.

Per-head compression. We identified 38 dispensable heads and 21 interfering ones; these are direct candidates for elimination. Stacking head pruning + greedy SVD + post-hoc quantisation + fine-tuning recovery could compose multiplicative reductions without losing F1. It is the most practical line of work.

Tuned lens and training dynamics. Learn a per-layer transformation T_ℓ : ℝ^d → ℝ^d that minimises the KL divergence against L11, and check whether the logit-lens U pattern survives that calibration. In addition, track crystallisation and neuron specialisation during fine-tuning itself, to find out at what point in the optimisation each structural property appears.
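One way to write that objective down, as a sketch rather than the thesis's formulation: assume an affine lens T_ℓ(h) = A_ℓ h + b_ℓ, let g denote the frozen classification head, and, because the task is multi-label with 23 emotions, sum a Bernoulli KL term per emotion over the sigmoid outputs. The symbols A_ℓ, b_ℓ, g and z are introduced here for illustration only.

```latex
T_\ell^{\star}
  = \arg\min_{A_\ell,\, b_\ell}\;
    \mathbb{E}_{x}\!\left[
      \sum_{c=1}^{23}
      D_{\mathrm{KL}}\!\Big(
        \mathrm{Bern}\big(\sigma(z^{(11)}_c(x))\big)
        \,\Big\|\,
        \mathrm{Bern}\big(\sigma(\hat z^{(\ell)}_c(x))\big)
      \Big)
    \right],
\qquad
\hat z^{(\ell)}(x) = g\big(A_\ell\, h_\ell(x) + b_\ell\big),
\quad
z^{(11)}(x) = g\big(h_{11}(x)\big)
```

Here h_ℓ(x) is the layer-ℓ hidden state that the head reads and σ is the logistic sigmoid. If the U pattern is an artefact of reading intermediate layers with an uncalibrated head, it should flatten once T_ℓ is trained; if it survives, it reflects something about the representations themselves.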

BERT wasn't designed for emotion classification. The functional architecture that shows up here (progressive crystallisation, late-FFN dominance, the logit-lens U, six clusters with psychological coherence) wasn't programmed. It came out on its own. What mechanistic interpretability does is document what gradient descent decided, not what anyone prescribed.

How it's built · Stack and credits
