Tutorial 6: Sparse Autoencoders¶
Paper: Towards Monosemanticity: Decomposing Language Models With Dictionary Learning by Anthropic (2023)
Difficulty: Advanced | Time: 5-6 hours
Status: Coming Soon
Overview¶
This tutorial will demonstrate how to train and analyze Sparse Autoencoders (SAEs) to decompose polysemantic neuron activations into interpretable features.
Planned Content¶
- The superposition hypothesis and polysemanticity
- Sparse autoencoder architecture and training objective (sketched in code after this list)
- Training an SAE on model activations using mlxterp's built-in SAE support
- Identifying and interpreting learned features
- Feature steering: using SAE features for targeted interventions (see the `steer` helper in the sketch below)
- Limitations and future directions
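As a preview of the architecture, objective, and steering bullets above, here is a minimal sketch of the standard SAE setup from the paper: a wide ReLU encoder, a linear decoder, and a loss combining reconstruction error with an L1 penalty on feature activations. It is written against Apple's MLX (which mlxterp builds on), but it is not mlxterp's implementation; names like `SparseAutoencoder`, `sae_loss`, and `l1_coeff` are illustrative, and paper details such as bias tying and decoder weight normalization are omitted.

```python
# Minimal SAE sketch in MLX. Illustrative only -- not mlxterp's API.
import mlx.core as mx
import mlx.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: d_model activations -> d_features sparse codes."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def __call__(self, x: mx.array) -> tuple[mx.array, mx.array]:
        f = mx.maximum(self.encoder(x), 0.0)  # ReLU feature activations
        return self.decoder(f), f


def sae_loss(model: SparseAutoencoder, x: mx.array, l1_coeff: float = 1e-3) -> mx.array:
    """Reconstruction MSE plus an L1 sparsity penalty on feature activations."""
    x_hat, f = model(x)
    mse = mx.mean(mx.square(x_hat - x))
    sparsity = mx.mean(mx.sum(mx.abs(f), axis=-1))
    return mse + l1_coeff * sparsity


def steer(x: mx.array, model: SparseAutoencoder, feature_idx: int, scale: float) -> mx.array:
    """Feature steering: add a scaled decoder direction to the activations."""
    direction = model.decoder.weight[:, feature_idx]  # column = feature's output direction
    return x + scale * direction
```

The `steer` helper previews the feature-steering bullet: each decoder column is a feature's direction in activation space, so adding a scaled copy of that column to a layer's activations pushes the model toward (or away from) that feature.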
Implementation Status¶
This tutorial is tracked in GitHub Issue #6. Contributions welcome!
mlxterp already has SAE training and analysis built in:
- `model.train_sae()` for training
- `get_top_features_for_text()` for feature analysis
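A hypothetical usage sketch of the two helpers named above. Only the function names come from this page; every import, argument, and model identifier below is an assumption, so consult mlxterp's documentation for the real signatures.

```python
# Hypothetical usage sketch -- argument names and imports are assumptions.
from mlxterp import InterpretableModel, get_top_features_for_text  # assumed imports

model = InterpretableModel("mlx-community/Llama-3.2-1B-4bit")  # assumed loader and model id

corpus = [
    "The Golden Gate Bridge spans the bay.",
    "Paris is the capital of France.",
]

# Train an SAE on one layer's activations (layer choice and kwargs assumed).
sae = model.train_sae(texts=corpus, layer=6)

# Rank learned features by activation on a single prompt (signature assumed).
features = get_top_features_for_text(model, sae, corpus[0], top_k=10)
```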
References¶
- Anthropic (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.
- Cunningham et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models.