Neural network speedrunner and community-funded open source researcher. Set the CIFAR-10 record several times. Send me consulting/contracting work! she/they❤️
After taking ~3.5 months to pursue a personal dream of mine, I'm jumping back into the open source research scene! If you want to help support me, I'm looking to pick up contract/consulting work to help cover the cost/time of the work I do. DMs welcome! 😊
My beliefs on what Q* is doing (and why):
- Train a population of agents to maximize interagent orthogonality and density coverage over a large text corpus.
- Self-play with RL reward signal being any agent returning to original passage at any point.
Reasoning in thread.
The current crop of auto-LLM algorithms is unfortunately incapable of achieving AGI due to a few fundamental limitations, which scaling can't effectively overcome.
That being said, they are a crucial piece of one potential kind of AGI system. I share my perspective below.
Explicit residual layers may no longer be needed for high performance neural networks.
Introducing hlb-CIFAR10 v0.7.0, which no longer requires residual layers and achieves a new 94% WR in just under a blistering 6.3 seconds on a single A100.
You don't need to backprop through discrete samples to learn an effective network.
Introducing an architecture that achieves an impressive 93.11% on CIFAR10 just by predicting its own future state.
This is intended to be one key step in replacing RL w/ cross-entropy objectives. 🧵
Just to clarify, LLMs necessarily learn an implicit world model in their performance limit. Every token in a training set is informatically related to every other token, even if transitively. Under an appropriate compression scheme (aggressive weight decay), the maximally
We are closing in on beating the 16x A100 40GB transformer baseline from Tri Dao w/ just 1 A100 40GB. Roughly ~18.89 perplexity (give or take a little) @ 1024 eval length (same as baseline) in ~5 hours or so.
This is huge! Trained at 1024 ctx length, w/ excellent extrapolation.
Ever wanted to train your language models at lightspeed?
New LatentAttention block replaces separate MLP+Attention layers, new microbatch + weight decay schedulers, & scale 46M -> 1.5B w/ one param change (in alpha). 40% faster than v0.3.0.
hlb-gpt v0.4.0.🧵
Speed up your LLM research exploration with a lightweight, quick-training toolbench! Introducing hlb-gpt, a lean, mean, trimmed-down (<350 lines!) version of
@karpathy
's excellent nanoGPT.
See more in the thread for tech details & tradespace discussion.
As of yesterday,
@kellerjordan0
is the new CIFAR10 world record holder, with an unbelievable 5.48 second runtime. 🎉🎊🎉🎊
Another digit barrier falls!!!🎊🎉🎊🎉
His code is available at
Insane stuff. Brief summary and future integration deets in thread!
Announcing hlb-gpt v0.3.0! In this release, we achieve another reduction in training time (~18-22s faster, at ~138s now!) with the addition of SiGLU, a move to pure bf16, and some misc hyperparameter tuning. Misc details in the thread! <3 :) :D <3
Repo is at
First, population of agents.
GPT already has to learn internal representations of multitudinous agents in the extreme limit of performance as a natural consequence of next token prediction. It also has to learn to switch between them in a context dependent manner.
The first weakness is this -- while it appears very cool at first blush, it becomes quite aimless in how it orders and executes tasks after a certain point. It feels undirected in some planning phase space. For lack of a better phrase, it feels very "Brownian".
Only a matter of time. ~18.89 @ 1024 ctx (previous best) -> ~18.68 (new best) @ 1024 ctx. 8.8 hours, 1 A100 (of course).
Cherrypicked and Rotten-tomatopicked examples in comments. This should already be the fastest-training small LM by far.
Not too much more needed before ~18.6
4 important updates:
-- I am 40-50% confident that this is 60-70% of what OpenAI is doing in their recently-hyped Q* method.
-- An improvement to my original proposal would be using an inference-optimized, large frozen model and multiple trainable LoRAs as virtual agents (rough sketch below).
-
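For the LoRA-as-virtual-agents idea, here's a minimal sketch of what I mean -- one big frozen layer shared by several cheap trainable adapters. Everything here (the LoRALinear name, the rank, the sizes) is an illustrative assumption of mine, not anything confirmed about Q*:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank delta."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # the big shared model stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # delta starts at 0

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

# One shared frozen layer, many cheap 'virtual agents' riding on top of it:
shared = nn.Linear(512, 512)
agents = [LoRALinear(shared, rank=8) for _ in range(4)]
x = torch.randn(2, 512)
outs = [agent(x) for agent in agents]  # same base weights, divergent agents
```

Inference stays cheap because the expensive weights are shared; only the tiny A/B matrices differ per agent.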
@pleasantiser
(Hihi! Thanks for asking. Also, I'm an enby gal <3 :') :') <3333)
Generating text from a person or a model is like balancing on a bike.
Large language models are trained to ride the bike with someone holding it the whole time, so they never learn to actually balance in the way
Here is the critical thing:
The. Model. Is. Not. Trained. To. Recover. From. These. Errors.
This is a critical 'flaw' of any next-token GPT.
This is because we use teacher forcing when training GPTs -- during training, the model only ever conditions on ground-truth context, an ideal it never gets at inference.
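To make that concrete, here's a minimal sketch of the two regimes, assuming a standard causal LM that maps (B, T) tokens to (B, T, vocab) logits. Nothing model-specific, just the generic shape of the mismatch:

```python
import torch
import torch.nn.functional as F

def teacher_forced_step(model, tokens):
    # Training: every position conditions on GROUND-TRUTH context.
    logits = model(tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    return loss  # the model never sees, or recovers from, its own mistakes

@torch.no_grad()
def free_running_generate(model, prompt, n_steps):
    tokens = prompt
    for _ in range(n_steps):
        next_tok = model(tokens)[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)  # conditions on its own output
    return tokens  # errors compound here; training never covered this regime
```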
Sebastian did a really cool ablation in the latest hlb-gpt release, showing how significant the GeGLU activation is for the linear value of attention. It's a really cool and well thought out post probing that side of the changes, give it a read!
Following up on some of the
hlb-gpt release coming soon. new fused, efficient module replaces separate mlp+attention blocks & converges faster. new schedulers & wd strategy. thorough code simplicity overhaul. massively decreased convergence time.
might be one of the most impactful releases for me so far.
Want another way to train your models faster? Introducing hlb-CIFAR10 v0.6.0 -- the Dirac update. We now converge to >94% on average in ~6.84 seconds or so after warmup, making the world record a tad bit faster.
Code is at , deets in the thread! <3 :)
Up front, I'd like to acknowledge the three primary sponsors for this work:
@natfriedman
(),
@danielgross
( and Patreon), and
@go2carter
(Patreon).
Without their generous sponsorship, this work would not have been possible.
Here's what my work was going to be: instead of necessarily doing next-token prediction, train a model to return to the original text distribution after some perturbation (and online inference). This lets us close the domain gap between training and inference.
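Here's one possible (hypothetical!) instantiation of that objective, just to pin the idea down -- my sketch, not a polished method. The corruption scheme, the span size, and the `vocab_size` attribute are all assumptions of mine:

```python
import torch
import torch.nn.functional as F

def return_to_distribution_loss(model, tokens, span=8):
    B, T = tokens.shape
    assert T > 2 * span + 2
    start = torch.randint(1, T - 2 * span, (1,)).item()
    corrupted = tokens.clone()
    # Perturb a span (random tokens here; could also be the model's own samples).
    corrupted[:, start:start + span] = torch.randint_like(
        corrupted[:, start:start + span], high=model.vocab_size)  # assumed attribute
    logits = model(corrupted[:, :-1])
    targets = tokens[:, 1:]
    # Only score positions AFTER the perturbation: the model gets credit for
    # steering back to the original passage, not for memorizing the noise.
    mask = torch.zeros_like(targets, dtype=torch.bool)
    mask[:, start + span:] = True
    return F.cross_entropy(logits[mask], targets[mask])
```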
Merry Christmas Eve! <3 :')))) 🎄🤶🎄🎅🎄🤶🎄
Here's a rough working prototype of a ternary (~1.58 bit) version of hlb-CIFAR10 that trains to >~91.5% in <10 seconds and can fit on just over half of a floppy disk.
Happy tinkering! <3 Have some cocoa for me! <3 :)
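If you're wondering what ternary weights look like mechanically (~1.58 bits per weight, since log2(3) ≈ 1.585), here's a generic straight-through-estimator sketch -- my own generic assumptions, not the actual prototype code:

```python
import torch

class TernarySTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        scale = w.abs().mean()  # one scale per tensor
        # Each weight becomes one of {-scale, 0, +scale}.
        return torch.clamp(torch.round(w / (scale + 1e-8)), -1, 1) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through: treat quantization as identity

def ternarize(w):
    return TernarySTE.apply(w)
```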
When we learn a distribution, oftentimes reducing the direct A:B correlation helps force us to learn a higher-level representation of the scenario. This is, say, how cutout, dropout, etc. work.
Splitting generative knowledge across agents can accomplish a similar thing.
If we want to set up a self-play problem to model a generative scenario, this offers two things for us: a natural splitting point for learning agents, and a constraint on our model's knowledge (which will come in handy for the temporal part of this).
Alright, y'all, stop all this frippery. 'Hallucinations' are literally just ~half of the variance in the Cramér–Rao bound.
They'll basically _always_ exist.
Unless you horribly bias your estimator, i.e. RLHF.
Info theory basics, people. Please.
Wiki:
@finbarrtimbers
Ay this reminds me does anyone remember having to constantly compile TensorFlow over several hours from source with Bazel after doing the janky sign-up-gated pick-your-own-cudnn process? That stuff was 🔥🔥🔥🔥
Introducing hlb-gpt 0.1.0! In this update, we take an experimental tack and introduce alpha versions of two novel methods. With this, we cut our training time almost in *half* (~3 minutes, down from ~6 minutes). Deets in the thread as usual.
Code is at
Now, if multiple agents are engaged in self-play to generate a passage of text, we need a reward function for this. And normal, next-token generation may not work, as it is extremely dynamically constrained.
And RLHF induces bias, which will almost certainly reduce the final performance of the model. (And the speed at which I understand many RLHF-like processes converge indicates it's more of a downselection over existing features, as it were, than actually learning something new.)
Hi! Looking for some help. What's the best pareto-frontier 125M transformer model you know of trained exclusively on WikiText-103? Doesn't matter if it's a 1024 GPU run or not, I need the best.
Definitely haven't made any breakthroughs or anything like that, just asking for a
The feeling is very similar to what I believe many of us had with the constant 'close but no cigar' wrangling of LLMs pre-RLHF, where one could very quickly get 80% of the way there, but even the next 10% of progress felt nearly fruitless.
It's likely (IMO) the same problem!
A day or two ago, I got to play with AutoGPT(), and poked and prodded a bit to see what was possible. It's one of several developing frameworks, and could end up on top or not.
It is a cool demo, and very promising, but also shows a few current weaknesses.
However, we are not limited to solely predicting the next token in a passage -- we can use any series of future tokens; it just gets exponentially harder the further out we go.
@natfriedman
@danielgross
@go2carter
However, there is a simple solution that bridges both worlds, in a sense (please forgive my crude drawing skills):
We can _implicitly_ create residual connections for the information flowing through our network _in the weights themselves_, instead of imposing them externally.
(Side note: determining orthogonality (and the appropriate amount of it, I think) is hard here.)
With multiple agents, we add the missing ingredient in LLMs up until this point: reactive cooperation.
LLMs aren't directly trained to do this with pure next token prediction.
@natfriedman
@danielgross
@go2carter
Back in the last release of hlb-cifar10, I found that we could improve the convergence speed of the network by initializing the layers in the residual branch with torch.nn.init.dirac_() -- basically, a 'no-op' that copied the inputs to the outputs at initialization.
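(For reference, that init really is a no-op at initialization -- a quick sanity check:)

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 16, kernel_size=3, padding=1, bias=False)
torch.nn.init.dirac_(conv.weight)
x = torch.randn(1, 16, 8, 8)
print(torch.allclose(conv(x), x))  # True: the conv starts by copying input to output
```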
@natfriedman
@danielgross
@go2carter
There was a bug w/ the F.normalize() on the weights. I don't believe it was doing anything as it wasn't in-place (big ol' egg-on-face moment, for sure), so I removed that + cleaned up the function definitions.
Look at how beautiful this is. Feels close to Kolmogorov-minimality❤️
Models perform poorly out of distribution. LLMs can never perfectly estimate the original distribution. _In inference_, LLMs are out of distribution on the first step. _LLMs are never explicitly trained to recover from this_.
We desperately need a better solution than this.
However, remember that each token of text in a sequence contains information from _every single token preceding it_. Of course, we do not have all of the preceding information, and that is where the concept of noise in information theory comes from.
Now, why multiple agents with a degree of orthogonality with each other? Well, you introduce a new kind of regularization that forces you to begin learning the temporal dynamics of the text.
What's more, with multiple agents, it becomes dynamic, and something magical happens.
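(For a concrete handle on the orthogonality pressure, here's my guess at one simple form of it -- penalize pairwise similarity between per-agent representations of the same text. Purely speculative; the pooled features and the cosine penalty are assumptions of mine:)

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(agent_feats):
    # agent_feats: (n_agents, d) -- one pooled feature vector per agent.
    z = F.normalize(agent_feats, dim=-1)
    sim = z @ z.T                                        # pairwise cosine similarities
    off_diag = sim - torch.eye(len(z), device=z.device)  # zero out the diagonal of 1s
    return off_diag.pow(2).sum() / (len(z) * (len(z) - 1))  # push agents apart
```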
I generally do not release unfinished research until I have a bow on it and things are nice, clean, and self contained, but I feel it is relevant to this problem.
The way we inference LLMs has one critical flaw that we're not doing anything about, to my knowledge, at least.
1. The cumulative agent population effectively learns a function that keeps generation on topic -- it learns state-space denoising.
2. We learn intrinsic cooperation patterns in conversation between _information-limited agents_, which is a far better match to humans than standard LLMs.
However, we have not necessarily lost information on the whole, as the entire agent population should cumulatively have coverage of our target distribution.
This split allows us an opportunity to learn implicit agent-agent communication and cooperation.
@natfriedman
@danielgross
@go2carter
What we can do is initialize part of the transition layer with a Dirac init -- which, again, just repeats the incoming activations -- and initialize the rest with random filters.
This creates several different 'lanes' of varying depth throughout the network at the start of training.
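Roughly like this, as a sketch (the split point and layer sizes are illustrative, not the exact repo values):

```python
import torch
import torch.nn as nn

def partial_dirac_(conv: nn.Conv2d, n_copy: int):
    """Dirac-init the first n_copy output channels; leave the rest random."""
    with torch.no_grad():
        lane = conv.weight[:n_copy].clone()
        torch.nn.init.dirac_(lane)       # identity 'lane': copies inputs through
        conv.weight[:n_copy] = lane

conv = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)
partial_dirac_(conv, n_copy=32)  # first 32 out-channels pass the input through,
                                 # the other 32 start as ordinary random filters
```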
🧵
Announcing v0.4.0 of hlb-CIFAR10! With this update, you can train CIFAR10 to >94% on average in ~7.68-7.71 seconds or so, or ~95.84% in ~172 seconds. Crazy!
Link to patch notes here, summary notes in thread.
However, with a comically-robust pretraining process, we can end up with a distribution of models that has sufficient coverage over the text distribution that such long range dependencies actually produce signal with meaning for our loss function.
So, naturally, we want a generative process that includes this interactive dynamic in it, without completely butchering our data distribution or resorting to any kind of hack.
This version of the method does that.
There is another hidden benefit as well.
I see lots of people talking about Mojo, but almost nobody talking about Hidet and hidetscript.🧑💻
Recently announced by the PyTorch team, it's available as a backend or as a separate library for flexible GPU programming. Even lower level than Triton. (!)
I could maybe write a small book's worth of unsolicited thoughts on ML research methodology. I really appreciated reading Carmack's opinions here. I agree.
I'd like to share some of my thoughts as to why this motif works so well for ML research, as opposed to software
For the last several years at Id Software, almost all of my coding was inside a single monolithic codebase, with the utilities integrated. Visual Studio was a delight to work with.
Armadillo Aerospace only had two modest programs that evolved over a decade. VS for client and raw
One example that drove it home for me was that I asked it to do a research task where it compiled a summary of a dense topic for me. The agent very quickly got stuck in shallow actions, constantly creating new agents and asking them to do things they weren't capable of doing.
Feel free to DM me or reply to any of these tweets if you have thoughts or questions, this is an interesting topic! <3 ;D :)))) :"D
Lemme hear from ya!
In the same way that a new car loses value as soon as you drive it off the lot, the token you generate (effectively) is (in a manner of speaking) already out of distribution! You're doomed from the start.
No amount of context-size scaling, model size, etc. is going to fix this.
But the math is the same -- it's well-grounded I think in the basics of information theory, and it moves us to the temporal domain. It's very similar to next-token prediction, but adds much more leverage to it through a different domain.
3. We are _still_ learning a generative function, but _now_, the generative element extends _directly to the dynamic state space interactions with others_!!!!
This is unbelievably cool. Seriously. I think it opens a lot of possibilities.
If anyone is wondering what logging tools/metrics I use to optimize model training times, I almost exclusively use wall clock time (in seconds), num_microbatches (for LLMs), and validation loss, sometimes viewed as perplexity. That's basically it.
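In sketch form, the whole 'logging stack' is basically this (`train_step` and `evaluate` are stand-ins for your own functions):

```python
import math
import time

def train_step():           # stand-in for your real training step
    pass

def evaluate() -> float:    # stand-in returning a validation loss (in nats)
    return 3.0

t0 = time.perf_counter()
for step in range(100):
    train_step()
    if step % 20 == 0:
        val_loss = evaluate()
        # wall clock (s), step, val loss, and perplexity = exp(val_loss)
        print(f"{time.perf_counter() - t0:7.2f}s  step {step:4d}  "
              f"val_loss {val_loss:.4f}  ppl {math.exp(val_loss):.2f}")
```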
This variance results in a kind of noise: at each step, the model will sample from a distribution ever-so-slightly different from the true distribution. Even if perfectly converged, it is a finite model, so there will be noise.
This results in errors that accumulate.
@natfriedman
@danielgross
@go2carter
I consider this particular release to be rather significant, as getting around residual layers in neural networks has long been something I've wanted to do.
Let's get into some of the 'what' that has changed to enable this, first.
I hadn't put enough of it together beyond the rough temporal component/domain mismatch business.
Something I firmly believe is that the "return to text distribution" loss is really one of the only ways you can do this.
Why?
G-Maximization (link in post)
This excellent, and very frustratingly seemingly-overlooked, 1986 paper by Pearlmutter (
@BAPearlmutter
) and Hinton (
@geoffreyhinton
) has only been cited 59 times in nearly 30 years. Yet I think some of the concepts, and at the very least,
When we learn a distribution, we learn an estimator of it. An unbiased estimator, as you can guess, is unbiased. For text generation, this is a good thing.
However, unless you have a perfect model, an unbiased estimator will still have some variance.
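(For the info-theory-inclined, these are the textbook identities being leaned on here -- nothing novel:)

```latex
% Bias--variance decomposition for an estimator \hat\theta of \theta:
\mathbb{E}\big[(\hat\theta - \theta)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat\theta] - \theta\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}(\hat\theta)}_{\text{variance}}
% With zero bias, the Cram\'er--Rao bound still floors the variance:
\operatorname{Var}(\hat\theta) \;\ge\; \frac{1}{\mathcal{I}(\theta)}
```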
@natfriedman
@danielgross
@go2carter
In our case, it's likely a bad thing. This is because neural networks are a bit of an odd thing when it comes to information flow. The more noise we have in our forward pass, the less useful our gradients backwards are. This is made even worse by the presence of nonlinearities.
@natfriedman
@danielgross
@go2carter
So, we basically get a residual init 'for free', at the start of training, and the network can eventually learn to undo it if that serves the objective better. But this is an extremely computationally-efficient way to do so, especially as it fuses these ops into a single kernel.
Now, the solution to this maybe isn't too hard -- just specifically training the models on being able to execute tasks, plan ahead, and tackle and recover from dead-ends in the NP-hard problem space that is whatever one would define this open-ended task-assignment space to be.
@karpathy
Aren't approximate KNN lookups really efficient too? Seeing the new libs feels like we're redoing research from years past, sorta groundhog day all over again.
I guess this and the SVM trick are all using the same underlying projection foundation.
@natfriedman
@danielgross
@go2carter
I'd like to refer to this as a kind of soft-architecture, or basically a bias over the information flow within a network via the structure of the weights themselves.
@iliekcomputers
So I recommend setting loss smoothing to .2 when you're working with CIFAR10 like you are in this picture! :D
Honestly the loss looks quite healthy but the network is a bit big with too many degrees of freedom.
That's my best read from the numbers, hope this helps some! <3 :)
One of the questions I got was if the upcoming (~50M) hlb-gpt model with the new attention block scales.
I poked around, one thing led to another, and yesterday, for the first time, I trained a 1.5B model. It got ~21 perplexity on W103 in 3 hours, w/ 1 A100.
I did it again twice.
This is really interesting because RLHF feels very much like an end-all-be-all solution just because it mimics our human communication interface. But it turns out that it results in an LLM that's very good at talking nicely when executing tasks, with very little goal-directed behavior.
First off, the code gist:
I am releasing this as what I call a 'network sketch' -- something not polished to the maximum, but (hopefully) functional enough to prove out an idea.
The network runs in the same number of steps as the baseline, at ~1.7x the wallclock time.
Secondly, within this paradigm, we're seeing lots of attempts at having duplicate agents with different state work together, which can be created and destroyed by other agents. For brevity, I'll refer to this kind of AGI as an Automated Task Manager framework (ATM).
I saw a post talking about a "mysterious and unexplored hack" known as EMA that got 250k views recently.
It is neither of those, it is quite well-known and useful! One of my favorite works involving it is
Favorite adjacent work is
What I see in my feed:
-- Wild speculations about Q*
-- Historical digging through past interviews
-- People complaining about wild speculations about Q*
And the world goes on, as it does.
Well, remember next-token prediction? In this modality, we need access to the underlying stream of information encoded in the tokens.
A temporal dependency is the only way to do this.
And one of the only ways to learn state-space recovery is to run inference during training.
Now, let's get to the magic.
Back to the limited agents. If each agent were a GPT with the same distribution, there would be only symmetry -- nothing would be learned.
The more the information is split up across agents, the more inherent cooperation is needed.
Additionally, I think it fulfills a lot of what we want from LLMs today, as it better matches human expectations when interacting with LLM agents.
Additionally, it should scale better as we're no longer futilely trying to reduce model variance at all costs.
@natfriedman
@danielgross
@go2carter
With a random initialization, we are basically choosing a completely random set of assumptions for the neural network -- and one of them is the weighted mapping of information from one channel to another. This could be a good or a bad thing.
@natfriedman
@danielgross
@go2carter
When we use a Dirac initialization on the network branches, we're basically reducing the noise at the start of training, because there's not as much diffusion of the input signal from the start of the network to the finish.
@karpathy
I'm honored and a bit stunned. Wow, thank you, Andrej! You got me into ML via CharRNN (Cookbooks specifically). MinGPT helped inspire this repo, I've needed something similar for years. It's also a (mild overkill) living resume to help pick up some part time niche specialty work.
If we have some future-return-to-the-original-text-prediction objective, and several mostly-orthogonal agents (think Billy Crystal in The Princess Bride), we now force several things to happen:
@kittytronic
@cremieuxrecueil
It's not an atypical experience. In fact, my understanding is that most gifted children suffer these issues.
The two studies that the OP linked are about University adjustment -- which is a problem for some, but the effect tends to peak in the mid-20's apparently. I'm hoping OP
This is really, really important.
I hope you enjoyed this thread. It feels weird to put out partially-unfinished work for the world to see, but it feels relevant to the movement of the moment.
Writing this down helped clarify my thoughts on this. The grad norm spikes all seem to follow a function with a sharp spike up and a long tail of decay, and it looks like it has some exponential component to it in both the ramp and decay. If you take a look at it, I'm sure you
Experienced LLM-trainers of Twitter, I am looking for your help with a question! I am currently working on optimizing the end-tail of training, and am unfortunately currently wrangling with the dreaded Loss Spikes.
I need some hints as to where to look to address
@natfriedman
@danielgross
@go2carter
The addition operator in the residual of ResNets, for example, basically forces a consistent structure on the latent space of the residual -- this reduces the degrees of freedom of the network, and shortens the effective length that the gradients have to travel in the network.
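(That addition in sketch form -- a generic block, not any particular ResNet. The gradient can flow through the `+` straight to `x`, skipping the body entirely:)

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                  nn.BatchNorm2d(dim), nn.ReLU())

    def forward(self, x):
        return x + self.body(x)  # identity path shortens the gradient's route
```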
Maybe that was short on the math or such, and I think everyone and their parent is talking about AGI/etc for now, so this is from someone who's done applied research for a little while, and this is what I'd focus on to improve it. I believe it's viable. Hopefully it brings good.
The GOAT of tennis
@DjokerNole
said: "35 is the new 25." I say: "60 is the new 35." AI research has kept me strong and healthy. AI could work wonders for you, too!
@natfriedman
@danielgross
@go2carter
For example, one modification that may or may not be needed for this to transfer to larger networks is to weight decay towards a network architecture that enforces identity connections, instead of just zeros.
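A sketch of what that could look like -- standard decoupled weight decay with the target shifted from zero to a Dirac (identity) template. The manual update and hyperparameters are illustrative assumptions of mine:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 64, 3, padding=1, bias=False)
target = torch.zeros_like(conv.weight)
torch.nn.init.dirac_(target)  # the 'no-op' template to decay toward

def decay_toward_identity(w, target, lr=1e-3, wd=1e-2):
    with torch.no_grad():
        w -= lr * wd * (w - target)  # plain weight decay is the target == 0 case

decay_toward_identity(conv.weight, target)
```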
We also have the problems of asynchronous communication and the exponential divergence of information within a system that is chaotic (IIRC) due to information asymmetry between unique agents. There is a new Amdahl's Law waiting to be specified for this particular framework.
Sadly, one of the downsides is that we're opening a new generative dimension -- the state space of LLM interactions simulating human interactions -- which, like most high-dimensional things, esp. w/o fixed points of data, is likely to be extremely compute-intensive (unfortunately).
Now, as we go into the solid climbing phase of the LLM research S-curve, I believe companies will get more and more like hedge funds, where tiny pieces of information are critical and highly guarded.
This will also result in binding legal morass for skilled researchers.
@natfriedman
@danielgross
@go2carter
This is why DenseNets and ResNets changed the game so much when they were introduced, they shortened the effective length of the network when it came to the information flowing through the network, but still allowed for some of the benefits of depth.
When they responded that they weren't, instead of integrating that information, the LLM basically tried to politely brute-force a different way. I remember watching it encounter 403 after 403 on the NIH website, without changing strategy or really reacting to the environment.
@natfriedman
@danielgross
@go2carter
So, we are down from 8 convolutional layers to 7, of which only 6 are trainable (!!!!!).
There are no more explicit residuals, so now it's simply called SpeedyConvNet (for now ;PPPP)