Interpretable Neurons in Toy Models


June 23, 2024

This post is based on two great papers out of Anthropic:

  1. Toy Models of Superposition (Elhage, et al.)
  2. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Bricken, et al.)

The reason interpretability is so hard

It's common to see large models treated as "black boxes," where at the macro level we have some notion of what's going on, but the moment you begin to drill deeper, everything becomes all tangled and ugly. A given neuron might activate on certain Hebrew characters, numbers over 20, and colons in URLs. There is a reason for this, and it really comes down to the data on which our models are trained.

When our data is sparse, meaning that a given feature1 is quite rare in the scope of all training data (how often does a sentence about the Golden Gate Bridge come up?), models pack in more features than there are neurons,2 a phenomenon called superposition. When a given neuron in a model corresponds to more than one feature we call it polysemantic, which makes interpreting the neuron vastly more difficult.

For a long time, the ideas of superposition and polysemanticity floated around as theories of how the models can compress so much data, but weren’t given true validation until Anthropic’s Toy Models of Superposition paper, which is a really fun read. Once it was clear that superposition was happening, at least in small toy models, the next step was to figure out what could be done to reverse it.

What to do about superposition

A cool way to think about superposition is to imagine that your model is simulating a larger model, one whose neurons are mostly monosemantic (1 neuron = 1 feature). When you think about it this way, a solution to superposition comes in the form of expanding your model back out to its imaginary/original larger form. For a given layer of our model, this is exactly what a sparse autoencoder does.

Training a sparse autoencoder (SAE) is something we do after the base model is trained, and we apply it to the layer we wish to interpret. The architecture is very simple: a linear layer, a non-linear activation, and another linear layer back to the original dimension size. The dimension of the hidden layer will be much bigger than that of the input/output, since it is supposed to represent the "larger" model we are simulating.

[Figure: sparse autoencoder architecture]
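
To make this concrete, here is a minimal PyTorch sketch of such an SAE. The class and argument names (and the choice of ReLU as the non-linearity) are my own illustration, not necessarily the exact setup from the papers:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """A minimal SAE: a linear encoder up to a wider hidden layer,
    a ReLU, and a linear decoder back to the original dimension."""

    def __init__(self, d_model: int, expansion: int = 32):
        super().__init__()
        d_hidden = d_model * expansion            # the "larger" simulated model
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))    # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features
```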

We train the SAE to reproduce its input exactly (which is what makes it an autoencoder), meaning that the actual feature information we want lives within the hidden layer. What makes it a sparse autoencoder is that when some data is passed in, only a few neurons should light up significantly, in contrast to a typical layer where the activations of neurons appear relatively diffuse. This sparsity is achieved by adding an L1 penalty to the loss function. As it turns out, when the hidden-layer neuron activations become sufficiently sparse, they also become interpretable (i.e. specific neurons will light up for unique and distinct features that can be easily explained).
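
The corresponding loss is just a reconstruction term plus the L1 sparsity term. A rough sketch, building on the module above (the `l1_coeff` weight is a placeholder hyperparameter, not a value from the papers):

```python
import torch.nn.functional as F

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction term: reproduce the input activations
    mse = F.mse_loss(reconstruction, x)
    # Sparsity term: the L1 penalty pushes most feature activations toward zero
    l1 = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1
```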

Why it even matters

If you're like me, you'll find the idea of being able to understand specific neurons in LLMs really cool. If not, there are plenty of reasons to be interested in sparse autoencoders—so far, I've only talked about understanding what the model is doing, but not changing it.

Imagine you have a feature in your SAE that activates when the input is about dogs. Big deal, you might say. Think again, because we can take that neuron (the one in the sparse autoencoder) and artificially activate it, even when our input has nothing to do with four-legged friends. We can then decode back into model activation space, and replace the current activation with our new, partially edited one. When we look at the resulting output, we see that our model is now fixated on dogs (or whatever feature we wish, like the Golden Gate)! This is not some gimmicky trick or hidden prompt that can be easily jailbroken; it is an internal edit made within the model itself.
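
As a sketch of how that edit might look in code (building on the SAE above; `feature_idx` and `strength` are hypothetical, and steering setups used in practice are more careful than this):

```python
import torch

def steer(sae, mlp_acts, feature_idx, strength=10.0):
    """Boost one SAE feature and decode back into model activation space."""
    with torch.no_grad():
        _, features = sae(mlp_acts)               # encode into feature space
        features[..., feature_idx] = strength     # artificially activate the feature
        return sae.decoder(features)              # swap in for the original activations
```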

But making the model focus on a single topic is just the tip of the iceberg. Think about a "western" feature that allows you to roleplay, or an "instructions for nuclear bombs" feature that you can artificially zero to make the model safer. Imagine finetuning on some specific data, finding the feature that corresponds to that data, and manually adjusting the activations so that you get the equivalent of a highly reliable system prompt. By editing the underlying activations via SAE neurons, it's finally possible to make fine-grained edits to models. And this is by no means limited to language models!

Replication!!!

After reading the aforementioned papers, I got really excited about the ideas above and wanted to try finding some interpretable neurons myself. To begin, I needed a language model to interpret, so I trained my own. Staying consistent with the Towards Monosemanticity paper, I trained a small, single-transformer-block network on around 2B tokens. With only a single layer and ~50M parameters, the model is pretty terrible. Nevertheless, the output is generally coherent:

In Fremont County is a lush green town named according to an article published by Smithsonian magazine.

It is only recently that he was compelled to return to Australia to prosper from self-government to wholesome and to cultures of central Australia.

I then fed this model another ~2B tokens, and saved the activations of the transformer block's MLP layer (dim=512). With these activations I trained a sparse autoencoder whose internal feature dimension was 32x that of the activations (dim=16384). After letting training run for a very long time (long after the loss seemed to plateau, something the authors noted as being crucial), my SAE turned out pretty well!3
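
The training loop itself can be very simple. A sketch using the SAE and loss from earlier (the filename, batch size, and learning rate here are placeholders, not my actual settings):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dump of saved MLP activations, shape (num_tokens, 512)
acts = torch.load("mlp_activations.pt")
loader = DataLoader(TensorDataset(acts), batch_size=4096, shuffle=True)

sae = SparseAutoencoder(d_model=512, expansion=32)    # hidden dim 16384
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for (batch,) in loader:
    reconstruction, features = sae(batch)
    loss = sae_loss(batch, reconstruction, features)
    opt.zero_grad()
    loss.backward()
    opt.step()
```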

To figure out what each neuron does, I fed the SAE a bunch of data (more specifically, I fed in the activations of tokens from the transformer), and kept track of the tokens that made a given neuron light up most (a rough sketch of this bookkeeping follows the examples below). A reasonable summary of my findings would be that most neurons correspond to specific, common words (e.g. "what" or "to"), though some are more general and will fire for any synonym of a given word. Looking at different features was really fun; below are some of my favorites. Highlighted in orange is the token on which the neuron fired, and I have included some of the surrounding tokens for context:

Neuron 512 — indefinite pronouns

  • ...because they only earn their own living and nobody helps them with their harvest. They...
  • She wanted to pay him more than anyone else. He took the offer...
  • When a monk or someone passes through seeking refuge on a mountain...

Neuron 565 — tenses of "do"

  • Even if we take the laser away, this does not affect the physics of the situation...
  • ...significantly higher levels of vitamin E than those who did not. The study found no association...
  • It is important to note that we could do it the other way: that is, by...

Neuron 642 — lengths of time

  • Over the past two centuries the Senate has probed issues such as interstate...
  • ...few birds nesting there. Last year the birds were recorded breeding at Grande...
  • ...personal note to Hillary Clinton about the agenda for next week's meeting, you'd...
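
For the curious, here is roughly how the "which tokens light up this neuron" bookkeeping can be done. The shape of `batches` is an assumption for illustration; my actual code is a bit messier:

```python
import heapq
import torch

def top_activating_tokens(sae, batches, k=10):
    """For each SAE neuron, keep the k tokens that activate it most strongly.
    `batches` is assumed to yield (tokens, mlp_acts) pairs, where `tokens` is a
    list of strings and `mlp_acts` has shape (len(tokens), d_model)."""
    top = [[] for _ in range(sae.decoder.in_features)]    # one min-heap per neuron
    with torch.no_grad():
        for tokens, acts in batches:
            _, features = sae(acts)                       # (len(tokens), n_features)
            for i, token in enumerate(tokens):
                for j in torch.nonzero(features[i]).flatten().tolist():
                    heapq.heappush(top[j], (features[i, j].item(), token))
                    if len(top[j]) > k:
                        heapq.heappop(top[j])             # drop the weakest example
    return [sorted(h, reverse=True) for h in top]
```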

I only looked at a small fraction of features, so I'm sure there are a bunch of cool ones that I missed. Anthropic labeled their features algorithmically using Claude, but I chose to just do it manually for a random assortment of features. At some point I'd like to replicate this on a larger, open-source, pre-trained model (perhaps something like Mistral 7B), and when/if I do, I'd probably use a similar approach (using LLMs for labeling) since I think it would be cool to have a big dictionary of all the features.

I should have my code for this on my GitHub soon. Thanks for reading, and let me know what you think!



  1. Feature is a very loose term for an interpretable thing, at any level of specificity. Think dog or golden retriever or the word to.
  2. This works because if n neurons encode features in directions not completely orthogonal to each other, they can embed many more than n features. This, of course, means that there will be some noise/interference between features, though the sparsity of the data (combined with non-linear activations) makes this noise largely irrelevant, since the likelihood that two distinct and uncorrelated features will occur together is very small.
  3. Training a sparse autoencoder is actually really difficult, and it took me a couple of days to finally get everything right to the point where I had interpretable neurons. Even then, I estimate that only around a third of the neurons are actually interpretable. This is because most of the neurons end up being totally dead (they don't fire at all) or still polysemantic (they fire too often and on unrelated things). Dead neurons are mainly a defect caused by the L1 penalty, which the authors solved using a fairly complicated resampling strategy; I implemented a much simpler (and seemingly less effective) version. Why there are still polysemantic neurons is less clear, though I theorize it's mostly a combination of lack of training time and too few tokens.