Something new 🐠. Prev: reasoning lead @metaai, LLaMA 2/3, @paperswithcode co-creator, Galactica LLM lead, Atlas ML (acq by Meta), sports betting, UK Treasury
Here’s my ICML talk on teaching LLMs to reason (video and slides).
Like everyone else, I can’t talk about what I’m working on right now, but I tried to provide a useful overview of the history of LLMs and reasoning, current areas of focus, and potential directions.
Enjoy!
I am the first author of the Galactica paper and have been quiet about it for a year. Maybe I will write a blog post talking about what actually happened, but if you want the TLDR:
1. Galactica was a base model trained on scientific literature and modalities.
2. We approached
One year ago — 2 weeks before @OpenAI released ChatGPT — @Meta released Galactica. The LLM was public for only 3 days, but its lessons led to decisions around Llama's release. Thanks to @jpineau1 for chatting w/ me and h/t to @ylecun
Read here: ⏬
I left Meta yesterday. Nothing but positive things to say: FAIR and GenAI are great places to do research and engineering. Will miss my colleagues!
LLMs have shown how magical deep learning can be in a data-rich regime. But many domains remain data-constrained, which prevents
Why are LLMs bad at reasoning?
One theory says this is due to weaknesses in maximum likelihood, where the probability mass “overgeneralises” to low-quality solutions.
Because our pretraining objective (likelihood) doesn’t transfer to our evaluation objective (accuracy), the
Controversial take: open LLM leaderboards have been a net negative to the field as they’ve encouraged leaderboard hacking, training on in-domain datasets (likely test sets too), GPT distillation and other practices that confound comparison.
On @paperswithcode we never allowed
Our mission at @paperswithcode is to index all scientific information and then convert this information into useful knowledge. The datasets index for ML is getting really comprehensive! What’s next? Stay tuned 🙃.
🎉 We've just crossed 5000 Datasets! 🎉
We now index and organize more than 5000 research datasets for machine learning. A huge thanks to the research community for their ongoing contributions.
Browse the full catalogue here:
A year’s journey; glad to get this out! The vision is to build a megafunction that models all of Nature. Small steps with this work. We broke a few rules about training LLMs on the way, with some great results.
🪐 Introducing Galactica. A large language model for science.
Can summarize academic literature, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.
Explore and get weights:
Fun fact: when Attention Is All You Need came out, Twitter was mostly focused on the SELU paper (and its appendix). Not a one-off either; there are countless other examples of Twitter picking the wrong “winner”.
The best way to predict the future is to do the work yourself and
Not sure we ever detailed our experience training the 120B Galactica model, but tldr: have a mechanism in place for skipping batches, lower LR after periods of sustained instability, use longer warmup to hedge against bad init, pray to the AI Gods 🙏.
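A minimal sketch of what that loop might look like (the thresholds, the skip-set mechanism, and the HF-style `.loss` access are illustrative assumptions, not the actual Galactica training code):

```python
# Sketch of the stability tricks above: skip known-bad batches, nudge the LR
# down after instability. Thresholds and names are illustrative guesses.
import torch

SPIKE_FACTOR = 3.0        # treat a loss > 3x the recent average as a spike
skip_batches = set()      # batch indices to skip on replay/restart
recent_losses = []

def training_step(step, batch, model, optimizer, scheduler):
    if step in skip_batches:
        return None  # mechanism for skipping batches

    loss = model(**batch).loss  # assumes an HF-style model output
    window = recent_losses[-100:]
    if window and loss.item() > SPIKE_FACTOR * (sum(window) / len(window)):
        skip_batches.add(step)            # record the offending batch index
        for group in optimizer.param_groups:
            group["lr"] *= 0.9            # lower LR after sustained instability
        optimizer.zero_grad()
        return None

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()  # scheduler assumed to include a long warmup
    optimizer.zero_grad()
    recent_losses.append(loss.item())
    return loss.item()
```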
What’s worse is that bad public evaluation leads to a race to the bottom. Because of the Prisoner’s Dilemma, if everyone else is benchmark hacking, you have a strong incentive to benchmark hack yourself.
Case in point:
- There’s no such thing as a “base model” anymore as people are
ICML is my first ever ML conference (I’m antisocial 😅). Some observations:
- People are really excited about LLaMA 2 and open source LLMs in general.
- RLHF seems to be working well for everyone in most LLM domains (chatbots, code, reasoning).
- Hawaiian shirt % is pretty low. I
It’s disingenuous to speak of the risks of open source AI without acknowledging the risks of closed source AI.
Is it wise to concentrate extreme power in a few large, unaccountable organisations? If there are guardians, who chooses the guardians?
Was not expecting a hastily written, early morning Galactica retrospective to bounce so much - given it happened 3000 years ago in AI time.
To close the topic, here’s a great talk below by Sean Murray on No Man’s Sky and how they “engoodened” things (after some initial missteps)
@DrJimFan Nice writeup! Quick correction though: ORMs still learn a per-token loss, and can learn to assign credit to intermediate steps. Covered an example in my talk:
100% the worst LLM take is that weights shouldn’t store knowledge. They might not store *all* knowledge, but their compiled knowledge in weight memory is the source of their creativity (and value) over retrieval approaches.
Internal reasoning tokens, circa 2022. <work> 🙂
(It’s a shame people aren’t exploring similar ideas for visual reasoning. Good reasoning requires a visual and a propositional code)
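For illustration only, the rough shape of a `<work>`-style prompt (this exact template is my guess at the format, not copied from the Galactica paper):

```python
# Hypothetical example of wrapping step-by-step reasoning in <work> tokens,
# in the spirit of Galactica's internal working memory.
prompt = (
    "Question: What is the derivative of x**3 + 2*x?\n"
    "<work>\n"
    "d/dx x**3 = 3*x**2 and d/dx 2*x = 2, so the derivative is 3*x**2 + 2.\n"
    "</work>\n"
    "Answer: 3*x**2 + 2"
)
```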
This was always going to happen given the hype and money in the field + poor evaluation standards.
People need to be way more sceptical of new model releases. Far too many instances of “benchmark hill-climbing grift” going on as a vehicle to create hype.
This case is
A story about fraud in the AI research community:
On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they've made a breakthrough, allowing them to train a mid-size model to top-tier levels of performance. This is huge. If it's real.
It isn't.
@coppola_ai Good research, bad launch.
Bad launch not because of stupidity but because of loss of situational awareness due to excessive workload.
To fix: if your team is operating above capacity, then make sure to have good internal feedback sources that can help you see the wood for the trees.
@ChurchillMic Yes, to be clear I think the commentary was completely overblown. We were directionally right (and early) with what we were doing, but the way the demo was executed was wrong. So it’s not really apologetic, more like “here’s what happened in case you were wondering”
What people get wrong about foundation models for science: too much focus on generating new discoveries and ideas - way too little focus on instrumentation.
The biggest driver of scientific progress is new instruments. So the biggest impact of deep learning will be accelerating
Every time we tried this it had little benefit, yet plenty of people still asserted to me as fact that “training on code helps with reasoning”.
Reality: the biggest gains come from training on the same domain. If your downstream task involves LaTeX math, then the biggest
The fundamental tension of doing ambitious projects:
Long periods in the wilderness trying to make new things work. Meanwhile, the world continues to move around you, harvesting the low-hanging fruit.
Delayed gratification is hard, but worth it!
My 2024 wishes for open source / open science in ML:
- Less free-riding on GPT outputs; instead more open innovation in post-training to obtain outputs of a similar or better quality.
- More understanding that RLHF is not a capability “nerf” but a capability enhancer:
LLMs are trained to imitate human outputs, not human latents. This explains much of the alignment problem, as well as deficits in areas such as reasoning: we do not observe mental scaffolding (internal context) on the internet, only output context.
Congrats to @OpenAI on the amazing results!
o1-mini results particularly impressive. My guess: larger models are more “wasteful”, spending more forward-pass compute on easier tokens. Therefore it’s better to have a smaller model that can more efficiently allocate compute
We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond.
These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
@deliprao To be fair to OpenAI, the early GPT papers were a gold mine and I’m not sure a lot of the open LLM efforts would have been possible without them. So I think “parasitic” is way too harsh!
(My bigger gripe: their fuelling of anon hype accounts which have polluted the X feed and
Thanks to everyone for the kind words on the Galactica paper. Will not be commenting on the wider release, but I am optimistic that LLMs will evolve from tools of association (creativity + search + idea generation) into aligned models that preserve factualness. Stay tuned!
I don’t subscribe to the view that “ideas are cheap, execution is everything”, in the sense that good intuition is hard to come by.
But what is true is that new ideas are incredibly fragile. At an organisation that prioritises reactive plays, these ideas will die without
Great paper. Last year we had a similar problem where Galactica could do SMILES -> IUPAC but not the inverse. The solution was to augment the data and shuffle the PubChem layout (lol).
Simple rephrasing of existing datasets is likely to yield large benefits for generalisation.
Does a language model trained on “A is B” generalize to “B is A”?
E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?”
Our new paper shows they cannot!
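A minimal sketch of the augmentation fix mentioned in the Galactica anecdote above, applied to the SMILES ↔ IUPAC case (the templates and field names here are illustrative, not the actual PubChem preprocessing):

```python
# Emit both directions of each fact so the model sees "A is B" and "B is A",
# and shuffle record layouts so no single ordering is baked in. Illustrative.
import random

def bidirectional_examples(smiles: str, iupac: str) -> list[str]:
    return [
        f"[START_SMILES]{smiles}[END_SMILES] has the IUPAC name {iupac}.",
        f"The IUPAC name {iupac} has SMILES [START_SMILES]{smiles}[END_SMILES].",
    ]

def shuffled_layout(record: dict) -> str:
    fields = list(record.items())
    random.shuffle(fields)  # "shuffle the PubChem layout"
    return "\n".join(f"{key}: {value}" for key, value in fields)
```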
I don’t think anyone has a grasp on what society will look like with each person’s intelligence increased by an order of magnitude. But it does not immediately follow that restricting power and access to an elite minority is better than decentralising this power. Not obvious.
System 2 is fiction. It does not exist as an area in the brain. It might be a useful psychological abstraction, but it doesn’t tell you anything useful about how to achieve goal-directed behaviour computationally.
Great paper with lots of ablations breaking down the benefits of Mamba layers versus self-attention.
TLDR: Mamba layers worse than self-attention on in-context learning and context-based information retrieval. Hybrid model seems to get best of both worlds*.
Also nice tidbit:
An 8B-3.5T hybrid SSM model gets better accuracy than an 8B-3.5T transformer trained on the same dataset:
* 7% attention, the rest is Mamba2
* MMLU jumps from 50 to 53.6%
* Training efficiency is the same
* Inference cost is much less
I signed. Closed source AI is more dangerous than open source AI in the short-medium run due to lack of transparency about how models are built and developed.
Letting just a few companies develop advanced AI makes the problem worse, not better. We need more eyeballs on weights
What would you build if you took deep learning seriously?
Ten years ago the answer was AGI: a term that would make you look kooky in research circles. Now every post-ChatGPT startup has AGI as their mission. It’s no longer a sign of ambition for a new company.
Maybe a slightly
One of the great joys of life: meeting lots of talented people (new and old faces) working on big problems.
Thanks NY & SF - now back to London for a bit 🇬🇧.
Introducing Meta Llama 3: the most capable openly available LLM to date.
Today we’re releasing 8B & 70B models that deliver on new capabilities such as improved reasoning and set a new state-of-the-art for models of their sizes.
Today's release includes the first two Llama 3
If you optimised AI releases for Twitter traction a few years ago, you could expect fairly targeted, well-informed feedback. But nowadays feedback is dominated by a low-quality hype factor. This leads to difficult tradeoffs about how you talk about your work, and
@_xjdr Everyone on X assumes that good LLM performance is due to “fancy secret method”, but more often than not it’s just solid execution of well-established recipes.
Credit rating agencies had misaligned incentives in the 2000s: the providers of the products they rated were the ones paying them. (My first job was regulating CDOs post-crisis, lol)
Similarly a company that sells data to frontier firms for LLMs is probably not the right one to
There is a surprising amount of value in taking common assertions people make, asking “Why?” a couple of times, and finding out they have no grounding other than the fact that lots of people keep saying them.
The introspective version of this: if you’ve believed anything for 5
@OctothorpeVoid Yep, similar to PaLM: when you have a bad spike, record the batch index and skip it. You can also sometimes reduce the LR by a very small magnitude and the spike won't happen (which suggests spikes are due to an unstable loss surface, not the data per se).
@JPobserver This is a popular way to explain things.
I’m a bit sceptical of system 1/system 2 as it’s a psychological abstraction that doesn’t actually tell you how the brain works.
I think you can get a better idea of what LLMs might be missing by looking at how the prefrontal cortex
AGI test: When will AI be able to work as a human-level wedding planner?
Specs:
- Talks to couple, understands their preferences
- Rings up venues, negotiates prices etc
- Hires musicians, catering, arranges transport
- Designs and pushes out the website
- Sends out invites
-
We evaluated regularly to see the effect of spikes. While val loss usually recovered quickly, downstream eval showed it would sometimes forget a task (e.g. Yes/No QA) and take a long while to recover. So when debugging spikes, don’t just look at val loss…
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
abs:
pdf:
- a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry;
- experts who have or are pursuing PhDs in the
X is a study on how most people form expectations adaptively rather than rationally. Explains everything from “e/acc” internal contradictions on AI safety, to “LLMs can’t do x” arguments that are outdated in a few months.
The big changes are when progress undermines current
People doing reasoning self-correction papers: please ablate against simple majority voting. Thanks 🙏
As with many reasoning papers this year, the bulk of the performance increase below comes from GPT-4 distillation, not the method that is introduced — which is likely beaten
Learning From Mistakes Makes LLM Better Reasoner
paper page:
Large language models (LLMs) recently exhibited remarkable reasoning capabilities on solving math problems. To further improve this capability, this work proposes Learning from Mistakes (LeMa),
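The baseline being asked for above is cheap to implement; a minimal sketch, where `sample_answer` is a placeholder for sampling your model at nonzero temperature:

```python
# Majority voting (self-consistency): sample k answers and keep the mode.
# Any self-correction method should beat this before claiming a gain.
from collections import Counter

def majority_vote(question, sample_answer, k=16):
    answers = [sample_answer(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```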
1/12 We are happy to announce the release of our language models for optimizing small organic molecules. Built on top of Galactica and Gemma, they come in three sizes: Chemlactica-125M, Chemlactica-1.3B, and Chemma-2B. The preprint is on arxiv:
We have trained ESM3 and we're excited to introduce EvolutionaryScale.
ESM3 is a generative language model for programming biology. In experiments, we found ESM3 can simulate 500M years of evolution to generate new fluorescent proteins.
Read more:
@ylecun Yes, I’m with you on the inference process (simulating consequences, evaluating them, and planning), and the idea we can use inference compute for policy improvement.
I think what irks me about system 1/2 as a metaphor is:
1) When we overuse two-factor metaphors, people end up
@AlbalakAlon Nice paper! I understand this through an NFL-type argument. There is no such thing as “data quality” because everything (even spam) is a task to be learnt.
However in practice there is a subset of tasks you care about more. That means there is a “free lunch” by reweighting the
@agihippo I wanted to scream when I saw that post this morning. Massive elephant in the room called “pretraining data” that influencers couldn’t seem to see 🐘…
@soumithchintala The Act of Creation by Arthur Koestler is the best philosophy read on this.
My favourite anecdote on how far away AI is: the structure of benzene was discovered by someone having a dream of a snake eating its own tail 🙂.
Update: it looks like the user posted the Claude response in another conversation with Pi, so it echoed the same response, per @inflectionAI.
Guessing game over…
What happened, three possibilities:
1. They both use the same annotation provider for instruction tuning, who has provided the same prompt/answer data to two different companies…
2. The annotators themselves are using the same language models to help annotate (eg GPT-4), which
Had a lot of fun talking to Nathan a few weeks back about various topics in LLM land!
Note: my thought-to-speech module was a little off due to jet lag — so the text transcript is good if you prefer that medium ☺️.
Finally got to chat with @rosstaylor90 -- exactly why I started this series. So much signal on the LLM life cycle from training to demos.
Reasoning, Galactica, demo backlash, post-training, sft vs rlhf, LLMs for science, realistic agents, PRMs, and other topics
> Ed Witten finds Claude's insights on quantum mechanics meh
Humans will reach superhuman performance in goalpost-moving before AI gets superhuman at science
One of the problems with LLM creativity is that new methods and tools have relatively few mentions in the literature, so creativity can’t rely as much on (flexible) weight memory; and instead must be in-context using some form of retrieval.
We often think of creativity as the
If you believe LLMs don’t have meaning because they are “ungrounded”, then you also believe there is no such thing as history. My take on the English civil war is “ungrounded” because it is based on inference from historical texts, not 17th century “world knowledge”.
@_xjdr Traditional CoT data doesn’t have steps like “alternatively, maybe we could try x” (branching) or “actually, that’s wrong” (self-correction) —> so the easiest thing people should try is making an SFT dataset with these types of steps, then do PPO on top, as the model will now have
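A hypothetical example of the kind of SFT record this suggests, with explicit branching and self-correction steps (the format and field names are illustrative assumptions):

```python
# One record for a branching/self-correction SFT dataset. Illustrative only.
sft_example = {
    "prompt": "What is 17 * 24?",
    "response": (
        "17 * 24 = 17 * 20 + 17 * 4 = 340 + 64 = 404.\n"
        "Actually, that's wrong: 17 * 4 = 68, so 340 + 68 = 408.\n"
        "Alternatively, maybe we could try 24 * 17 = 24 * 10 + 24 * 7 "
        "= 240 + 168 = 408.\n"
        "Both routes agree: the answer is 408."
    ),
}
```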
Finally got a chance to visit the Computer History Museum in Mountain View. So inspiring!
Was fun to think of the future sections that will be added. The GenAI boom of the 2020s…and the more important things that come after :)
Basically, it was determined to treat this problem in the same way as the famous 8x8-board-with-two-corners-removed problem, and nothing I said could shake its conviction that that was the style of the correct answer. 5/5
On point. Amazing that last November with Galactica some people thought “not always accurate” implied “has no use case” in math or science… The whole point is to have a human in the loop with an artificial association cortex that makes useful connections.
Terence Tao on his experience with GPT4 in mathematical research:
"The 2023-level AI can already generate suggestive hints and promising leads to a working mathematician and participate actively in the decision-making process."
@Francis_YAO_ We found with Galactica that a well-curated niche corpus can outperform OPT and BLOOM on general benchmarks… so something was up with the pre-training data for those models.
@_arohan_ @paperswithcode Canary questions are a great idea.
Maybe this is possible to measure with some of the existing benchmarks where there are (unintentionally) incorrect labels.
@tegmark @RishiSunak @vonderleyen Altman, Hassabis, and Amodei are the ones doing massive corporate lobbying at the moment.
They are the ones attempting a regulatory capture of the AI industry.
You, Geoff, and Yoshua are giving ammunition to those who are lobbying for a ban on open AI R&D.
If
New preprint post! We show that motor commands in the superior colliculus shift the internal representation of heading during REM sleep despite the immobility of sleeping mice. Thus, the brain simulates actions and their consequences during REM sleep.🧵1/7
For too long, users have lived under the software lottery tyranny of fused attention implementations.
No longer.
Introducing FlexAttention, a new PyTorch API allowing for many attention variants to enjoy fused kernels in a few lines of PyTorch.
1/10
The “CoT capability comes from code” theory makes no sense, especially for MMLU. The reasoning subset of that benchmark depends on equation recall and substitution/rearranging. Code is not relevant; similar pre-training data is. Hence strong Minerva and Galactica results…
People think domain-specific models -> smaller models, but actually it’s the opposite. If you are data-constrained, you need *larger* models for fixed compute. (Why we went for 120B Galactica, not 70B.)
Hallucination is a problem of metacognition - knowing what you don’t know. The solution is not designing models to know less…Talk about throwing the baby out with the bathwater.
@giffmana @BlancheMinerva Even if you exclude the WikiHow extracts in the test set, you still get a big boost from using the rest of the corpus.
I mention this example because it’s public with the Gemini paper, but there are lots of other examples like this where you can find a small, highly in-domain
Treat the births in each hospital as Poisson. Then B - G is Skellam-distributed for each hospital. The mean is 0 in both cases, as boys and girls are equally likely, but the variance (which grows with total births B + G) is higher for the larger hospital, so P(B - G = 0) is lower there. Hence the likelihood of more boys than girls is higher for the larger hospital.
A silly math question posed in our discord: There are two hospitals in a city, a small and a bigger one. In which of them is the likelihood higher of more boys than girls being born in one day? (Provided that girls and boys are on average born equally frequently).
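A quick numerical check of the Skellam argument above (the daily birth rates are made up for illustration):

```python
# P(more boys than girls) = (1 - P(B == G)) / 2 by symmetry; the tie
# probability shrinks as the rate grows, favouring the larger hospital.
from scipy.stats import skellam

for name, rate in [("small hospital", 10), ("large hospital", 100)]:
    p_tie = skellam.pmf(0, rate, rate)   # B - G ~ Skellam(rate, rate)
    print(f"{name}: P(B > G) = {(1 - p_tie) / 2:.3f}")
```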
Toolformer: Language Models Can Teach Themselves to Use Tools
Presents Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.
abs:
@cunha_tristan I think it’s more productive to think in terms of benchmarks where you think some definition of reasoning is necessary.
For example, most people would agree mathematics requires reasoning. If LLMs do well at unseen mathematics tasks, then that would suggest they can at least
Btw this is not to dismiss efforts like @scale_AI’s SEAL leaderboard, which are welcome and well-intentioned.
But worth mentioning the incentive problem now, as it shows problems with evaluation are much deeper than they appear on the surface.
This is really cool. They use Galactica base with their LMX approach to perform symbolic regression. Unlike other SR methods, which require *a lot* of hand-crafted design choices, Galactica just acts as the generator and achieves comparable results. 🤙