Tutorial 6: Sparse Autoencoders

Paper: Towards Monosemanticity: Decomposing Language Models With Dictionary Learning by Anthropic (2023)

Difficulty: Advanced | Time: 5-6 hours

Status: Coming Soon


Overview

This tutorial will demonstrate how to train and analyze Sparse Autoencoders (SAEs) to decompose polysemantic neurons into interpretable features.

Planned Content

  1. The superposition hypothesis and polysemanticity
  2. Sparse autoencoder architecture and training objective (see the first sketch after this list)
  3. Training an SAE on model activations using mlxterp's built-in SAE support
  4. Identifying and interpreting learned features
  5. Feature steering: using SAE features for targeted interventions (see the steering sketch after this list)
  6. Limitations and future directions
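
As a preview of item 2, here is a minimal sketch of the architecture and training objective, written directly in MLX rather than through mlxterp's API (the class name, dimensions, and `l1_coeff` default are illustrative, not mlxterp's): the SAE encodes activations into an overcomplete ReLU feature basis and is trained to reconstruct them under an L1 sparsity penalty.

```python
import mlx.core as mx
import mlx.nn as nn


class SparseAutoencoder(nn.Module):
    """One-layer SAE: encode activations into an overcomplete ReLU
    feature space, then decode back to the original activation space."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # d_hidden >> d_model
        self.decoder = nn.Linear(d_hidden, d_model)

    def __call__(self, x):
        f = nn.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)       # reconstruction of the input activations
        return x_hat, f


def sae_loss(sae, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on feature activations."""
    x_hat, f = sae(x)
    recon = mx.mean(mx.square(x - x_hat))  # how faithfully we reconstruct
    sparsity = mx.mean(mx.abs(f))          # how few features fire
    return recon + l1_coeff * sparsity
```

The `l1_coeff` value is a placeholder: it trades reconstruction fidelity against feature sparsity and is typically swept during training.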
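Item 5's feature steering then falls out of the decoder: column i of the decoder weight is feature i's direction in activation space, so steering amounts to adding a scaled copy of that direction to the activations. A sketch under the same assumptions (the `steer` helper and `alpha` default are hypothetical):

```python
def steer(activations, sae, feature_idx, alpha=5.0):
    """Add a scaled copy of one SAE feature's decoder direction.

    `alpha` is a hypothetical steering strength, tuned by hand in
    practice; `activations` has shape (..., d_model).
    """
    # nn.Linear stores weight as (output_dims, input_dims), so the
    # decoder weight is (d_model, d_hidden) and column `feature_idx`
    # is that feature's direction in activation space.
    direction = sae.decoder.weight[:, feature_idx]
    return activations + alpha * direction
```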

Implementation Status

This tutorial is tracked in GitHub Issue #6. Contributions welcome!

mlxterp already has SAE training and analysis built in:

  - model.train_sae() for training
  - get_top_features_for_text() for feature analysis
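
Only those two entry points are confirmed here; the sketch below shows how they might fit together, with the import path, loader, arguments, and return values all assumptions rather than the documented API:

```python
# Hypothetical usage sketch: only train_sae() and get_top_features_for_text()
# are named on this page; everything else below is assumed.
from mlxterp import InterpretableModel  # hypothetical entry point

model = InterpretableModel("gpt2")  # hypothetical model loader

# Train an SAE on one layer's activations (the layer argument is assumed).
sae = model.train_sae(layer=6)

# Rank learned features by how strongly they fire on a prompt
# (receiver, arguments, and return type are assumed).
top = model.get_top_features_for_text("The Eiffel Tower is in Paris", k=10)
for feature in top:
    print(feature)
```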


References

  1. Anthropic (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.

  2. Cunningham et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models.