Large transformer language models are extremely poorly understood. Even the most basic statistics about their internal weights and activations are unknown. In this research, we harvest some of the extremely low-hanging fruit.
We surveyed the Conjecture team about advanced AI:
- Respondents estimated a 70% chance of human extinction from advanced AI getting out of control.
- Respondents estimated an 80% chance of human extinction from advanced AI in general (loss of control + misuse risk).
Direct interpretability of transformer weights, especially MLP layers, has yielded poor results until now: we find that the principal directions of the weights of MLP layers and the outputs of attention layers in transformers are highly interpretable and monosemantic.
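A minimal sketch of this kind of analysis, assuming a GPT-2-style model from HuggingFace (the layer index and the top-k readout are illustrative choices, not the exact procedure from the post): take the SVD of an MLP output matrix and read its right singular vectors off in vocabulary space via the unembedding.

```python
# Sketch: project the principal directions (singular vectors) of an MLP
# output matrix through the unembedding to see which tokens they promote.
# Layer 6 and top-8 tokens are illustrative choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 stores this as a Conv1D with weight shape (d_mlp, d_model).
W_out = model.transformer.h[6].mlp.c_proj.weight.detach()
U, S, Vh = torch.linalg.svd(W_out, full_matrices=False)

W_U = model.lm_head.weight.detach()  # (vocab, d_model) unembedding
for i in range(3):  # top singular directions (sign is ambiguous)
    logits = W_U @ Vh[i]
    top = torch.topk(logits, k=8).indices
    print(f"direction {i}:", [tokenizer.decode([int(t)]) for t in top])
```

If a direction is monosemantic, the top tokens should cluster around a single concept; checking -Vh[i] as well covers the sign ambiguity.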
Our CEO Connor Leahy (@NPCollapse) gave his 'AGI in Sight' talk at @CogX_Festival.
He spoke about how AGI companies are in a multi-way standoff - racing towards the precipice.
But it doesn't have to be this way. We can just stop, and build the institutions for a good future.
Our research also considers issues in the context of self-supervised learning. Early alignment theory focused on optimized agents, but this frame seems confused when applied to modern LLMs.
@repligate proposed simulators as an alternative frame.
In our second basic facts about LLMs post, we extend our analysis to the distributions of weights, activations, & gradients through training, based on the open-source checkpoints of the @AiEleuther Pythia model suite.
In March, @ConjectureAI CEO @NPCollapse discussed how mechanistic interpretability still faces steep barriers, and may even be net dangerous for AGI/ASI safety, at the @FLIxrisk conference at @MIT.
Watch it here:
Conjecture and @ArayaPress are co-hosting the first Japan AI Alignment Conference in Tokyo this weekend, March 11 and 12. We’ll be discussing interpretability, conceptual alignment, AI applications, strategy, and field building. See the agenda here:
.@ConjectureAI's CEO @NPCollapse: if humanity wanted to tackle the problem head-on, "world governments would come together and say '...we can't let superhuman AI systems run around.'"
2 years ago, a small group of hackers came together to form Conjecture, and we've been busy! We are incredibly excited for the progress we have been making on our CoEm agenda and think 2024 will be an exciting year to say the least. Read our short update post here:
Last year, we focused on improving our models of ML systems and what’s blocking us from reliably understanding and controlling them. One central obstacle to understanding neural nets is polysemanticity – the phenomenon of individual neurons representing multiple concepts.
CEO Connor Leahy (@NPCollapse) attended the AI Safety Summit at @BletchleyPark on behalf of @Conjecture.
In an interview with @SciTechgovuk, Connor spoke about how the US and China were now “addressing risks from AI as an international global priority.”
“I think AI can do things that we are just dreaming now.”
On Day 1 of the #AISafetySummit we asked some of the influential decision-makers in attendance to share their thoughts on the Summit, Frontier AI & what the future could look like if we safely harness its power for good 👇
Another obstacle to understanding neural nets is nonlinearity. We looked into LayerNorm and found that although it is supposed to just stabilize training, it can be used to implement non-trivial functions just like a typical non-linearity (e.g. ReLU).
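To see why LayerNorm counts as a genuine non-linearity, here is a minimal sketch (the vectors are arbitrary): a linear map must satisfy additivity and homogeneity, and LayerNorm breaks both; in fact it is scale-invariant.

```python
# Sketch: LayerNorm is not linear. A linear f satisfies f(x+y) = f(x)+f(y)
# and f(c*x) = c*f(x); LayerNorm violates both (it is scale-invariant).
import torch

ln = torch.nn.LayerNorm(4, elementwise_affine=False)
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = torch.tensor([0.5, -1.0, 2.0, 0.0])

print(ln(x + y))                         # additivity fails:
print(ln(x) + ln(y))                     # these two differ
print(torch.allclose(ln(2 * x), ln(x)))  # True: ln(2x) = ln(x), not 2*ln(x)
```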
Respondents expected AGI to be developed in 2030 (7-year timelines), one year sooner than Metaculus users predicted. Predictions ranged from 2027 to 2035.
Our CEO @NPCollapse will give his talk "AGI in Sight" at 2023's @CogX_Festival next week.
Alongside other great speakers including @harari_yuval, Professor Stuart Russell, Jaan Tallinn, and more.
Book your tickets here:
We find many interesting behaviors emerge during training and study how the model shifts from its Gaussian initialisation to its final, quite different, endpoint.
.@NPCollapse added: "They'd create an intergovernmental project, fund it, and have the best minds come together to work on the hard safety problem until we're sure we've solved it."
LLMs thus shift their representations from Gaussian initialisation to closely match the basic distribution of the input data and maintain this throughout the deep network.
We're still far from reliably interpretable AI. Unaligned AI may have incentives to make its thoughts difficult for us to interpret, and there are many ways sufficiently capable AIs could circumvent interpretability methods, as we wrote about last June.
An especially interesting result: we show that the spectra of the weight matrices are clearly power-law, with clear structure across blocks that precisely matches the statistics of the data distribution.
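As a rough illustration of how such a spectrum can be checked (a sketch, not the post's methodology; the layer and fit window are arbitrary choices): a power law appears as a straight line in log-log coordinates.

```python
# Sketch: estimate a power-law exponent for a weight matrix's singular
# values via a linear fit in log-log space. A careful analysis would use
# a dedicated power-law fitting method; this is only a visual-level check.
import numpy as np
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
W = model.transformer.h[6].attn.c_proj.weight.detach()
S = torch.linalg.svdvals(W).numpy()

ranks = np.arange(1, len(S) + 1)
lo, hi = 10, len(S) - 10  # skip the extreme ends of the spectrum
alpha, _ = np.polyfit(np.log(ranks[lo:hi]), np.log(S[lo:hi]), 1)
print(f"estimated power-law exponent: {alpha:.2f}")
```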
There were 22 survey respondents for the probability of extinction questions and 18 respondents for time-to-AGI questions (the vast majority of our team). The estimates above are medians.
Our CEO, Connor Leahy (@NPCollapse), at the @UKHouseofLords last week discussing the risks from the development of artificial general intelligence and the legislation needed to address them.
I had a great time addressing the House of Lords about extinction risk from AGI. They were attentive and discussed some parallels between where we are now and non-nuclear proliferation efforts during the Cold War. It certainly provided me with some food for thought, and some
We compile and analyze the distributions of weights, activations, gradients, & spectra of the GPT-2 series and make a number of novel discoveries, including the prevailing power-law nature of the weight and data distributions, and consistent patterns of norms across depth and model sizes.
This means we can provide a kind of static analysis of transformer weights to determine the principal action of each block. Moreover, we show that we can use these directions to provide targeted semantic edits of specific concepts.
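One way such a targeted edit could look in code (a hypothetical sketch; identifying which component encodes which concept is the real work, and the matrix and index below are stand-ins): remove a single rank-1 component from the weight matrix.

```python
# Sketch: a "targeted edit" as rank-1 ablation, removing one singular
# component sigma_i * u_i * v_i^T. The matrix and index are stand-ins.
import torch

def ablate_component(W: torch.Tensor, i: int) -> torch.Tensor:
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return W - S[i] * torch.outer(U[:, i], Vh[i])

W = torch.randn(3072, 768)          # stand-in for an MLP output matrix
W_edited = ablate_component(W, i=5)
print(torch.linalg.matrix_rank(W))         # 768
print(torch.linalg.matrix_rank(W_edited))  # 767: one direction removed
```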
We found that polytopes are less polysemantic than individual neurons, and that there are more polytope boundaries between semantically different regions than between semantically similar ones, suggesting that neural networks use polytope boundaries to define semantic boundaries.
We first show the evolution of the weights, activations, and gradients across training. We demonstrate that the distribution shifts from a Gaussian initialisation to an extremely heavy-tailed distribution. This happens early in training and often appears to be a rapid phase transition.
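A minimal sketch of how this can be checked with the public Pythia checkpoints (the model size, layer, and step list are illustrative; excess kurtosis is used as a crude heavy-tailedness proxy, near 0 for a Gaussian):

```python
# Sketch: excess kurtosis of one weight matrix across Pythia training
# checkpoints; Gaussian ~0, heavy-tailed >> 0. Checkpoints are published
# as revisions named "stepN" on the HuggingFace Hub.
from scipy.stats import kurtosis
from transformers import GPTNeoXForCausalLM

for step in [0, 512, 4000, 143000]:  # illustrative selection of steps
    model = GPTNeoXForCausalLM.from_pretrained(
        "EleutherAI/pythia-70m", revision=f"step{step}"
    )
    w = model.gpt_neox.layers[3].mlp.dense_4h_to_h.weight
    print(step, kurtosis(w.detach().flatten().numpy()))
```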
We also investigate the formation of representations in the embedding across training. We observe a clear and rapid emergence of structure in the first 256 dimensions (corresponding to the ASCII characters) of the embedding correlation matrix.
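A sketch of one way to look at this (assuming the first 256 vocabulary entries are the byte-level tokens, as in GPT-2-style tokenizers; the scalar summary is just a crude proxy for "structure"):

```python
# Sketch: correlation structure among the first 256 token embeddings
# across training. Off-diagonal correlation mass grows as structure forms.
import numpy as np
from transformers import GPTNeoXForCausalLM

for step in [512, 143000]:  # early vs. final checkpoint (illustrative)
    model = GPTNeoXForCausalLM.from_pretrained(
        "EleutherAI/pythia-70m", revision=f"step{step}"
    )
    E = model.gpt_neox.embed_in.weight[:256].detach().numpy()
    C = np.corrcoef(E)  # (256, 256) correlations between token embeddings
    print(step, np.abs(C - np.eye(256)).mean())
```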
Our Head of Strategy and Governance @_andreamiotti has published his Priorities for the UK Foundation Models Taskforce.
This institution will be tasked with 'the greatest governance challenge of our times: keeping humanity in control of a future with increasingly powerful AIs.'
The UK's new Foundation Models Taskforce, led by @soundboy, has a chance to shape bold domestic AI policy & set examples to be emulated abroad. The enormous AI challenge needs ambitious policymaking!
Here are a few ideas about how to make it a success:
We’re working on charting a path to useful, controllable AI with Cognitive Emulation: a research agenda that makes use of task-specific models that do not require thousands of H100s to train.
In some toy examples, we're able to understand what LayerNorm is doing via geometric visuals, and this could be promising for interpreting non-linearities in general.
But understanding the problem is only the first step. Now we must correctly remedy it.
While the Summit provided a good starting point in terms of international coordination, it provided few concrete policy steps to put humanity on a safer long-term trajectory.