To better understand how neural networks work, researchers trained a toy model (a small one-layer transformer with a 512-neuron MLP layer) on a text dataset and then tried to identify semantically meaningful features within the network. The key observation: while it is hard to attribute a specific function to any individual neuron, you can find groups of neurons that collectively do seem to fire in response to human-legible features and concepts. By the paper’s own metric, a 4096-feature decomposition of the 512-neuron toy model explains 79% of the information within it. The researchers used Claude, Anthropic’s own AI, to automatically annotate all the features by guessing how a human would describe them, for example feature #3647, “Abstract adjectives/verbs in credit/debt legal text”, or the “sus” feature #3545. Browse through the visualization and see for yourself!
The researchers call the ability of a neural network to encode more concepts than it has neurons “superposition”, and call a single neuron that responds to multiple, sometimes seemingly unrelated, concepts “polysemantic”.
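The decomposition the researchers use is a sparse autoencoder: an overcomplete dictionary of features trained to reconstruct the model’s activations while keeping only a few features active at a time. Here is a minimal numpy sketch of that idea, not the paper’s actual setup; the sizes, learning rate, L1 weight, and the synthetic “activations” are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, much smaller than the paper's 512 neurons / 4096 features.
d_model, n_features, n_samples = 8, 32, 2048

# Fake "activations": sparse mixtures of more ground-truth directions than
# there are neurons, i.e. concepts stored in superposition.
truth = rng.normal(size=(n_features, d_model))
truth /= np.linalg.norm(truth, axis=1, keepdims=True)
codes = rng.random((n_samples, n_features)) * (rng.random((n_samples, n_features)) < 0.05)
acts = codes @ truth

# Sparse autoencoder: f = relu(x @ W_enc + b); x_hat = f @ W_dec.
# Loss = reconstruction error + L1 penalty pushing features toward sparsity.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b = np.zeros(n_features)
W_dec = W_enc.T.copy()
lr, l1 = 0.05, 1e-3

for _ in range(500):
    pre = acts @ W_enc + b
    f = np.maximum(pre, 0.0)               # feature activations
    err = f @ W_dec - acts                 # reconstruction error
    g_f = err @ W_dec.T + l1 * np.sign(f)  # backprop by hand
    g_pre = g_f * (pre > 0)
    W_dec -= lr * (f.T @ err) / n_samples
    W_enc -= lr * (acts.T @ g_pre) / n_samples
    b -= lr * g_pre.mean(axis=0)

f = np.maximum(acts @ W_enc + b, 0.0)
fvu = np.square(f @ W_dec - acts).sum() / np.square(acts).sum()
print(f"fraction of variance unexplained: {fvu:.3f}")
```

Each row of `f` is a (mostly zero) vector of feature activations; a feature’s meaning is then guessed by looking at which inputs activate it most, which is the step the researchers automated with Claude.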
Full paper: https://transformer-circuits.pub/2023/monosemantic-features/index.html
also discussed at: https://www.astralcodexten.com/p/god-help-us-lets-try-to-understand
and hackernews: https://news.ycombinator.com/item?id=38438261
I like the “god help us” article, and although I wish its first example had the representative colors the article describes, the whole piece helps make sense of monosemanticity. Intuitively, it sounds similar to how human intelligence works, and to what scientists used to talk about when they questioned human consciousness.