An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] belongs to one layer in the LLM model M.
Architecture:
Model being analyzed (M): >|||||>
Verbalizer (AV) same as M: >|||||>
Reconstructor (AR) truncated up to the layer being analyzed: ||>
The AV, AR models are initialized using supervised learning on a summarization task. The assumption being that model thoughts are similar to context summary.
The AR is trained on a simple reconstruction loss.
The AV is trained using an RL objective of reconstruction loss with a KL penalty to keep the verbalizations similar to the initial weights (to maintain linguistic fluency).
- Authors acknowledge, and expect, confabulations in verbalizations: factually incorrect or unsubstantiated statements. But, the internal thought we seek is itself, by definition, unsubstantiated. How can we tell if it is not duplicitous?
- They test this on a layer 2/3 deep into the models. I wonder how shallow and deep abstractions affect thought verbalization?
> We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.
Whatever they did on LLama didn't work, nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old, or whatever they used isn't working, but whatever the autoencoder outputs is nothing like their examples with claude. Gemma is similarly bad.
Anthropic Research going from strength to strength in interpretability. Publicly releasing the code so other labs can benefit from it is also a great move - very values aligned, and improves the overall AI safety ecosystem.
Beautiful idea, an autoencoder must represent everything without hiding if is to recover the original data closely. So it trains a model to verbalize embeddings well. This reveals what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).
It could just invent its own secret language embedded into English akin to steganography. The explanation would not lose information but would remain uninterpretable by humans
This is exactly what I first thought. “The user appears to be attempting to decode my previous thought process, …”, the question is whether or not the model will be able to internalize this in such a way that is undetectable to the aforementioned technique.
finally a something interesting but this only makes me think that the last judgement is still in human hands to judge claude inner thoughts is correct or not
I mean who knows if those are really claude thoughts or claude just think that is his thoughts because humans wants it
Please check my understanding:
An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] belongs to one layer in the LLM model M.
Architecture:
The AV, AR models are initialized using supervised learning on a summarization task. The assumption being that model thoughts are similar to context summary.The AR is trained on a simple reconstruction loss.
The AV is trained using an RL objective of reconstruction loss with a KL penalty to keep the verbalizations similar to the initial weights (to maintain linguistic fluency).
- Authors acknowledge, and expect, confabulations in verbalizations: factually incorrect or unsubstantiated statements. But, the internal thought we seek is itself, by definition, unsubstantiated. How can we tell if it is not duplicitous?
- They test this on a layer 2/3 deep into the models. I wonder how shallow and deep abstractions affect thought verbalization?
Anthropic has released open weight models for translating the activations of existing models, viz. Qwen 2.5 (7B), Gemma 3 (12B, 27B) and Llama 3.3 (70B) into natural language text. https://github.com/kitft/natural_language_autoencoders https://huggingface.co/collections/kitft/nla-models This is huge news and it's great to see Anthropic finally engage with the Hugging Face and open weights community!
> We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.
Whatever they did on LLama didn't work, nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old, or whatever they used isn't working, but whatever the autoencoder outputs is nothing like their examples with claude. Gemma is similarly bad.
same. i'm trying to trigger the 'mom is in the next room' russian thing but the model thinks the sentence is from american reddit.
it seems that the examples they showed off with haiku work. i'd guess llama is just too bad
Anthropic Research going from strength to strength in interpretability. Publicly releasing the code so other labs can benefit from it is also a great move - very values aligned, and improves the overall AI safety ecosystem.
Beautiful idea, an autoencoder must represent everything without hiding if is to recover the original data closely. So it trains a model to verbalize embeddings well. This reveals what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).
It could just invent its own secret language embedded into English akin to steganography. The explanation would not lose information but would remain uninterpretable by humans
It will inevitably learn how to think in a way that translates to one (moral) meaning and back but has an ulterior meaning underneath.
This is exactly what I first thought. “The user appears to be attempting to decode my previous thought process, …”, the question is whether or not the model will be able to internalize this in such a way that is undetectable to the aforementioned technique.
That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.
Of course, if you use it to make any decision that can still happen eventually.
finally a something interesting but this only makes me think that the last judgement is still in human hands to judge claude inner thoughts is correct or not
I mean who knows if those are really claude thoughts or claude just think that is his thoughts because humans wants it