# Paper Reproduction Tutorials
This section contains tutorials that reproduce key mechanistic interpretability research papers using mlxterp. Each tutorial demonstrates the library's capabilities through concrete, educational examples that match the original papers' findings.
## Why Paper Reproductions?
- Validation: Verify that mlxterp produces the same results as established research
- Education: Learn mechanistic interpretability concepts through hands-on implementation
- Reference: Use these as starting points for your own research
## Tutorials
| # | Paper | Difficulty | Time | Key Concept |
|---|---|---|---|---|
| 1 | Logit Lens | Beginner | 1-2h | Intermediate layer predictions |
| 2 | Tuned Lens | Beginner-Intermediate | 3-4h | Learned prediction probes |
| 3 | Causal Tracing (ROME) | Intermediate | 3-4h | Knowledge localization |
| 4 | Steering Vectors (CAA) | Intermediate | 2-3h | Behavior control |
| 5 | Induction Heads | Intermediate-Advanced | 4-5h | Pattern completion circuits |
| 6 | Sparse Autoencoders | Advanced | 5-6h | Feature decomposition |
## Prerequisites
Before starting these tutorials, ensure you have:
- mlxterp installed with all extras: `pip install "mlxterp[dev,docs,viz]"` (a quick import check is sketched after this list)
- A Mac with Apple Silicon (M1/M2/M3/M4) for optimal performance
- Basic familiarity with transformers and neural networks
- Python knowledge (intermediate level)
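Once the package is installed, a quick smoke test is to import it from Python. This assumes the import name matches the package name `mlxterp`:

```python
# Smoke test: confirm mlxterp is installed and importable.
# Assumes the import name matches the package name "mlxterp".
import mlxterp  # noqa: F401

print("mlxterp imported successfully")
```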
## Getting Started
We recommend following the tutorials in order, as concepts build upon each other:
1. Start with Logit Lens - Introduces the core concept of examining intermediate predictions (a minimal sketch follows this list)
2. Move to Tuned Lens - Shows how to improve upon the basic logit lens
3. Then Causal Tracing - Learn to localize where information is stored
4. Continue with Steering - Apply what you've learned to control model behavior
5. Explore Induction Heads - Understand a fundamental transformer circuit
6. Finish with SAEs - The frontier of interpretability research
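To make the first step concrete: the logit lens projects each layer's residual-stream state through the model's final layer norm and unembedding matrix, reading off a next-token distribution at every layer. The snippet below is a minimal NumPy sketch of that projection with random stand-in weights; it is not mlxterp's API, and the names (`d_model`, `W_U`, `hiddens`) are illustrative assumptions.

```python
# Conceptual sketch of the logit lens (not mlxterp's API): project each
# layer's residual-stream state through the final LayerNorm and the
# unembedding matrix to read off an "early" next-token distribution.
# All weights and hidden states below are random stand-ins.
import numpy as np

d_model, vocab_size, n_layers = 64, 1000, 4
rng = np.random.default_rng(0)

W_U = rng.normal(size=(d_model, vocab_size))                   # unembedding matrix
hiddens = [rng.normal(size=d_model) for _ in range(n_layers)]  # per-layer residual states

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for layer, h in enumerate(hiddens):
    logits = layer_norm(h) @ W_U   # (vocab_size,) logits at this layer
    probs = softmax(logits)
    print(f"layer {layer}: top token id={probs.argmax()}, p={probs.max():.3f}")
```

Tutorial 1 applies the same projection to real hidden states captured from a model, where predictions typically sharpen layer by layer toward the final output.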
## Running the Examples
Each tutorial has an accompanying Python script in `examples/tutorials/`:

```bash
# Run the Logit Lens tutorial
python examples/tutorials/01_logit_lens/logit_lens_tutorial.py

# Run the Causal Tracing tutorial
python examples/tutorials/02_causal_tracing/causal_tracing_tutorial.py
```
## References
These tutorials are based on the following papers:
- Logit Lens: nostalgebraist (2020). interpreting GPT: the logit lens.
- Tuned Lens: Belrose et al. (2023). Eliciting Latent Predictions from Transformers with the Tuned Lens. NeurIPS 2023.
- ROME / Causal Tracing: Meng et al. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022.
- Contrastive Activation Addition: Rimsky et al. (2024). Steering Llama 2 via Contrastive Activation Addition. ACL 2024.
- Induction Heads: Olsson et al. (2022). In-context Learning and Induction Heads. Anthropic.
- Sparse Autoencoders: Anthropic (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.