# Paper Reproduction Tutorials
This section contains tutorials that reproduce key mechanistic interpretability research papers using mlxterp. Each tutorial demonstrates the library's capabilities through concrete, educational examples that match the original papers' findings.
## Why Paper Reproductions?
- Validation: Verify that mlxterp produces the same results as established research
- Education: Learn mechanistic interpretability concepts through hands-on implementation
- Reference: Use these as starting points for your own research
## Tutorials
| # | Paper | Difficulty | Time | Key Concept |
|---|---|---|---|---|
| 1 | Logit Lens | Beginner | 1-2h | Intermediate layer predictions |
| 2 | Tuned Lens | Beginner-Intermediate | 3-4h | Learned prediction probes |
| 3 | Causal Tracing (ROME) | Intermediate | 3-4h | Knowledge localization |
| 4 | Steering Vectors (CAA) | Intermediate | 2-3h | Behavior control |
| 5 | Induction Heads | Intermediate-Advanced | 4-5h | Pattern completion circuits |
| 6 | Sparse Autoencoders | Advanced | 5-6h | Feature decomposition |
## Prerequisites
Before starting these tutorials, ensure you have:
- mlxterp installed with all extras: `pip install "mlxterp[dev,docs,viz]"` (a quick import check is sketched after this list)
- A Mac with Apple Silicon (M1/M2/M3/M4) for optimal performance
- Basic familiarity with transformers and neural networks
- Python knowledge (intermediate level)
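Once the package is installed, a quick smoke test is to import it from Python. This assumes the import name matches the package name `mlxterp`:

```python
# Smoke test: confirm mlxterp is installed and importable.
# Assumes the import name matches the package name "mlxterp".
import mlxterp  # noqa: F401

print("mlxterp imported successfully")
```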
## Getting Started
We recommend following the tutorials in order, as concepts build upon each other:
1. Start with Logit Lens - Introduces the core concept of examining intermediate predictions (a minimal sketch follows this list)
2. Move to Tuned Lens - Shows how to improve upon the basic logit lens
3. Then Causal Tracing - Learn to localize where information is stored
4. Continue with Steering - Apply what you've learned to control model behavior
5. Explore Induction Heads - Understand a fundamental transformer circuit
6. Finish with SAEs - The frontier of interpretability research
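To make the first step concrete: the logit lens projects each layer's residual-stream state through the model's final layer norm and unembedding matrix, reading off a next-token distribution at every layer. The snippet below is a minimal NumPy sketch of that projection with random stand-in weights; it is not mlxterp's API, and the names (`d_model`, `W_U`, `hiddens`) are illustrative assumptions.

```python
# Conceptual sketch of the logit lens (not mlxterp's API): project each
# layer's residual-stream state through the final LayerNorm and the
# unembedding matrix to read off an "early" next-token distribution.
# All weights and hidden states below are random stand-ins.
import numpy as np

d_model, vocab_size, n_layers = 64, 1000, 4
rng = np.random.default_rng(0)

W_U = rng.normal(size=(d_model, vocab_size))                   # unembedding matrix
hiddens = [rng.normal(size=d_model) for _ in range(n_layers)]  # per-layer residual states

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for layer, h in enumerate(hiddens):
    logits = layer_norm(h) @ W_U   # (vocab_size,) logits at this layer
    probs = softmax(logits)
    print(f"layer {layer}: top token id={probs.argmax()}, p={probs.max():.3f}")
```

Tutorial 1 applies the same projection to real hidden states captured from a model, where predictions typically sharpen layer by layer toward the final output.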
## Running the Examples
Each tutorial has an accompanying Python script in `examples/tutorials/`:

```bash
# Run the Logit Lens tutorial
python examples/tutorials/01_logit_lens/logit_lens_tutorial.py

# Run the Causal Tracing tutorial
python examples/tutorials/02_causal_tracing/causal_tracing_tutorial.py
```
## References
These tutorials are based on the following papers:
- Logit Lens: nostalgebraist (2020). interpreting GPT: the logit lens.
- Tuned Lens: Belrose et al. (2023). Eliciting Latent Predictions from Transformers with the Tuned Lens. NeurIPS 2023.
- ROME / Causal Tracing: Meng et al. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022.
- Contrastive Activation Addition: Rimsky et al. (2024). Steering Llama 2 via Contrastive Activation Addition. ACL 2024.
- Induction Heads: Olsson et al. (2022). In-context Learning and Induction Heads. Anthropic.
- Sparse Autoencoders: Anthropic (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.