Fern Profile
Fern

@hi_tysam

2,340
Followers
204
Following
142
Media
1,723
Statuses

Neural network speedrunner and community-funded open source researcher. Set the CIFAR-10 record several times. Send me consulting/contracting work! she/they❤️

Joined January 2023
Pinned Tweet
@hi_tysam
Fern
21 days
After taking ~3.5 months to pursue a personal dream of mine, I'm jumping back into the open source research scene! If you want to help support me, I'm looking to pick up contract/consulting work to help cover the cost/time of the work I do. DMs welcome! 😊
2
1
22
@hi_tysam
Fern
1 year
My beliefs on what Q* is doing (and why): - Train a population of agents to maximize interagent orthogonality and density coverage over a large text corpus. - Self-play with RL reward signal being any agent returning to original passage at any point. Reasoning in thread.
23
43
784
@hi_tysam
Fern
2 years
The current crop of auto-LLM algorithms is unfortunately incapable of achieving AGI due to a few fundamental limitations, which scaling can't effectively overcome. That being said, they are a crucial piece of one potential kind of AGI system. I share my perspective below.
18
90
572
@hi_tysam
Fern
1 year
Explicit residual layers may no longer be needed for high performance neural networks. Introducing hlb-CIFAR10 v0.7.0, which no longer requires residual layers and achieves a new 94% WR in just under a blistering 6.3 seconds on a single A100.
9
34
281
@hi_tysam
Fern
11 months
You don't need to backprop through discrete samples to learn an effective network. Introducing an architecture that achieves an impressive 93.11% on CIFAR10 just by predicting its own future state. This intends to be one key step in replacing RL w/ cross-entropy objectives. 🧵
14
19
254
@hi_tysam
Fern
11 months
Just to clarify, LLMs necessarily learn an implicit world model in their performance limit. Every token in a training set is informatically related to every other token, even if transitively. Under an appropriate compression scheme (aggressive weight decay), the maximally
23
17
206
@hi_tysam
Fern
7 months
We are closing in on beating the 16x A100 40GB transformer baseline from Tri Dao w/ just 1 A100 40GB. Roughly ~18.89 perplexity (give or take a little) @ 1024 eval length (same as baseline) in ~5 hours or so. This is huge! Trained at 1024 ctx length, w/ excellent extrapolation.
6
9
155
@hi_tysam
Fern
8 months
Ever wanted to train your language models at lightspeed? New LatentAttention block replaces separate MLP+Attention layers, new microbatch+weight decay schedulers, & scale 46M -> 1.5B w/ one param change (in alpha). 40% faster than v0.3.0 hlb-gpt v0.4.0.🧵
5
20
121
@hi_tysam
Fern
2 years
Speed up your LLM research exploration with a lightweight, quick-training toolbench! Introducing hlb-gpt, a lean, mean, trimmed-down (<350 lines!) version of @karpathy 's excellent nanoGPT. See more in the thread for tech details & tradespace discussion.
3
13
85
@hi_tysam
Fern
11 months
The more I work in ML the more I feel like nearly any loss objective can, and should, be rephrased as its cross-entropy-based analog.
6
6
74
@hi_tysam
Fern
11 months
As of yesterday, @kellerjordan0 is the new CIFAR10 world record holder, with an unbelievable 5.48 second runtime. 🎉🎊🎉🎊 Another digit barrier falls!!!🎊🎉🎊🎉 His code is available at Insane stuff. Brief summary and future integration deets in thread!
1
4
59
@hi_tysam
Fern
2 years
Announcing hlb-gpt v0.3.0! In this release, we achieve another reduction in training time (~18-22s for ~138s now!) with the addition of SiGLU, a move to pure bf16, and some misc hyperparameter tuning. Misc details in the thread! <3 :) :D <3 Repo is at
1
9
57
@hi_tysam
Fern
1 year
First, population of agents. GPT already has to learn internal representations of multitudinous agents in the extreme limit of performance as a natural consequence of next token prediction. It also has to learn to switch between them in a context dependent manner.
1
0
49
@hi_tysam
Fern
2 years
The first weakness is this -- while it appears very cool on first blush, it becomes quite aimless in how it orders and executes tasks after a certain point. It feels undirected in some planning phase space. For lack of a better phrase, it feels very "Brownian".
2
1
44
@hi_tysam
Fern
7 months
Only a matter of time. ~18.89 @ 1024 ctx (previous best) -> ~18.68 (new best) @ 1024 ctx. 8.8 hours, 1 A100 (of course). Cherrypicked and Rotten-tomatopicked examples in comments. This should already be the fastest-training small LM by far. Not too much more needed before ~18.6
2
2
38
@hi_tysam
Fern
1 year
4 important updates: -- I am 40-50% confident that this is 60-70% of what OpenAI is doing in their recently-hyped Q* method. -- An improvement to my original proposal would be using an inference-optimized, large frozen model and multiple trainable LoRAs as virtual agents. -
2
0
34
@hi_tysam
Fern
1 year
@pleasantiser (Hihi! Thanks for asking. Also, I'm an enby gal <3 :') :') <3333) Generating text from a person or a model is like balancing on a bike. Large language models are trained to ride the bike with someone holding it the whole time, so they never learn to actually balance in the way
6
2
33
@hi_tysam
Fern
1 year
Here is the critical thing: The. Model. Is. Not. Trained. To. Recover. From. These. Errors. This is a critical 'flaw' of any next-token GPT. This is because we use teacher-forcing when training GPTs, an ideal we never achieve.
1
2
32
@hi_tysam
Fern
8 months
Sebastian did a really cool ablation in the latest hlb-gpt release, showing how significant the GeGLU activation is for the linear value of attention. It's a really cool and well thought out post probing that side of the changes, give it a read! Following up on some of the
@omouamoua
Sebastian
8 months
The effect of using GeGLU on the value in Attention, as done in the new hlb-gpt release by @hi_tysam is pretty significant:
5
4
28
0
2
31
@hi_tysam
Fern
8 months
hlb-gpt release coming soon. new fused, efficient module replaces separate mlp+attention blocks & converges faster. new schedulers & wd strategy. thorough code simplicity overhaul. massively decreased convergence time. might be one of the most impactful releases for me so far.
6
5
32
@hi_tysam
Fern
2 years
Want another way to train your models faster? Introducing hlb-CIFAR10 v0.6.0 -- the Dirac update. We now converge to >94% on average in ~6.84 seconds or so after warmup, making the world record a tad bit faster. Code is at , deets in the thread! <3 :)
5
3
31
@hi_tysam
Fern
1 year
Up front, I'd like to acknowledge the three primary sponsors for this work: @natfriedman (), @danielgross ( and Patreon), and @go2carter (Patreon). Without their generous sponsorship, this work would not have been possible.
1
1
30
@hi_tysam
Fern
1 year
What my work was going to be was instead of necessarily doing next token prediction, instead, train a model to return to the original text distribution after some perturbation (and online inference). This lets us break the domain gap between training and inference.
3
0
29
@hi_tysam
Fern
11 months
Merry Christmas Eve! <3 :')))) 🎄🤶🎄🎅🎄🤶🎄 Here's a rough working prototype of a ternary (~1.58 bit) version of hlb-CIFAR10 that trains to >~91.5% in <10 seconds and can fit on just over half of a floppy drive. Happy tinkering! <3 Have some cocoa for me! <3 :)
4
3
27
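As a rough illustration of where the "~1.58 bit" figure in the tweet above comes from: a weight restricted to three values carries log2(3) bits. The sketch below is not the prototype's code; `ternarize`, `packed_size_bytes`, and the `threshold` parameter are hypothetical names chosen for this example.

```python
import math

# Information content of one ternary weight: log2(3) bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.585


def ternarize(weights, threshold):
    """Map each float weight to -1, 0, or +1.

    `threshold` is a hypothetical hyperparameter for this sketch: weights
    whose magnitude is at or below it become 0.
    """
    out = []
    for w in weights:
        if w > threshold:
            out.append(1)
        elif w < -threshold:
            out.append(-1)
        else:
            out.append(0)
    return out


def packed_size_bytes(num_weights):
    """Lower bound on storage for `num_weights` ternary values, in bytes."""
    return math.ceil(num_weights * BITS_PER_TERNARY_WEIGHT / 8)


print(round(BITS_PER_TERNARY_WEIGHT, 3))
print(ternarize([0.7, -0.02, -0.9, 0.1], threshold=0.05))
```

`packed_size_bytes` gives only an information-theoretic lower bound; a real packing scheme would likely use slightly more space.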
@hi_tysam
Fern
1 year
When we learn a distribution, oftentimes reducing the direct A:B correlation helps force us to learn a higher level representation of the scenario. This is, say, how cutout, dropout, etc, work. Splitting generative knowledge across agents can accomplish a similar thing.
1
0
27
@hi_tysam
Fern
1 year
If we want to set up a self-play problem to model a generative scenario, this offers two things for us: a natural splitting point for learning agents, and a constraint on our model's knowledge (which will come in handy for the temporal part of this).
1
0
27
@hi_tysam
Fern
8 months
Alright, y'all, stop all this frippery. 'Hallucinations' are literally just ~half of the variance in the Cramér–Rao bound. They'll basically _always_ exist. Unless you horribly bias your estimator, i.e. RLHF. Info theory basics, people. Please. Wiki:
2
2
26
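For reference, the standard statement of the Cramér–Rao bound mentioned in the tweet above, for a scalar parameter. The symbols here (estimator, Fisher information I, density f, bias b) are the usual textbook ones and do not come from the thread. The first line is the unbiased case; the second shows how a biased estimator changes the bound, which is the trade-off the tweet alludes to.

```latex
% Unbiased estimator \hat{\theta} of \theta:
\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)},
\qquad
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta}
  \log f(X;\theta)\right)^{2}\right]

% Estimator with bias b(\theta) = \mathbb{E}[\hat{\theta}] - \theta:
\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{\bigl(1 + b'(\theta)\bigr)^{2}}{I(\theta)}
```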
@hi_tysam
Fern
1 year
@finbarrtimbers Ay this reminds me does anyone remember having to constantly compile TensorFlow over several hours from source with Bazel after doing the janky sign-up-gated pick-your-own-cudnn process? That stuff was 🔥🔥🔥🔥
2
0
27
@hi_tysam
Fern
2 years
Introducing hlb-gpt 0.1.0! In this update, we take an experimental tack and introduce alpha versions of two novel methods. With this, we cut our training time almost in *half* (~3 minutes from ~6 minutes). Deets in the thread as usual. Code is at
2
2
28
@hi_tysam
Fern
10 months
happy 75th anniversary to the invention of the next-token-prediction rnn by claude shannon
1
1
27
@hi_tysam
Fern
1 year
Now, if multiple agents are engaged in self-play to generate a passage of text, we need a reward function for this. And normal, next-token generation may not work, as it is extremely dynamically constrained.
1
0
25
@hi_tysam
Fern
1 year
And RLHF induces bias, which will almost certainly reduce the final performance of the model (and the speed at which I understand many RLHF-like processes converge indicates it's more of a downselection process of features, as it were, than actually learning something new).
1
0
24
@hi_tysam
Fern
7 months
Hi! Looking for some help. What's the best pareto-frontier 125M transformer model you know of trained exclusively on WikiText-103? Doesn't matter if it's a 1024 GPU run or not, I need the best. Definitely haven't made any breakthroughs or anything like that, just asking for a
1
4
24
@hi_tysam
Fern
2 years
The feeling is very similar to what many of us I believe had with the constant 'close but no cigar' wrangling of LLMs pre-RLHF, where very quickly one could get 80% of the way there, but even the next 10% of progress felt nearly fruitless. It's likely (IMO) the same problem!
2
1
24
@hi_tysam
Fern
2 years
A day or two ago, I got to play with AutoGPT(), and poked and prodded a bit to see what was possible. It's one of several developing frameworks, and could end up on top or not. It is a cool demo, and very promising, but also shows a few current weaknesses.
@SmokeAwayyy
Smoke-away
2 years
Auto-GPT. "An experimental open-source attempt to make GPT-4 fully autonomous."
32
101
629
3
2
24
@hi_tysam
Fern
1 year
However, we are not limited to solely predicting the next token in a passage, we can use any series of next tokens, it just gets exponentially harder the further we go.
1
0
22
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter However, there is a simple solution that bridges both worlds, in a sense (please forgive my crude drawing skills): We can _implicitly_ create residual connections for the information flowing through our network _in the weights themselves_, instead of imposing them externally.
Tweet media one
2
2
23
@hi_tysam
Fern
1 year
(Side note, determining orthogonality (and the appropriate amount of it, I think) is hard, here). With multiple agents, we add the missing ingredient in LLMs up until this point: reactive cooperation. LLMs aren't directly trained to do this with pure next token prediction.
1
0
22
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter Back in the last release of hlb-cifar10, I found that we could improve the convergence speed of the network by initializing the layers in the residual branch with torch.nn.init.dirac_() -- basically, a 'no-op' that copied the inputs to the outputs at initialization.
1
1
22
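To make the 'no-op' behavior of a Dirac initialization concrete, here is a minimal pure-Python sketch for a 1D convolution. It mimics the idea behind `torch.nn.init.dirac_` (a single 1 at the center tap of filter i for input channel i) but does not use PyTorch; `dirac_kernel` and `conv1d_same` are hypothetical helper names for this example, not code from hlb-CIFAR10.

```python
def dirac_kernel(out_channels, in_channels, kernel_size):
    """Build a [out][in][k] conv1d weight that passes channels through.

    Channel i of the output copies channel i of the input, via a single 1
    at the kernel's center tap. Everything else is 0.
    """
    center = kernel_size // 2
    weight = [[[0.0] * kernel_size for _ in range(in_channels)]
              for _ in range(out_channels)]
    for i in range(min(out_channels, in_channels)):
        weight[i][i][center] = 1.0
    return weight


def conv1d_same(x, weight):
    """Naive zero-padded 'same' cross-correlation. x is [channels][length]."""
    k = len(weight[0][0])
    pad = k // 2
    length = len(x[0])
    out = []
    for w_out in weight:
        row = []
        for t in range(length):
            acc = 0.0
            for c, w_in in enumerate(w_out):
                for j in range(k):
                    src = t + j - pad
                    if 0 <= src < length:
                        acc += w_in[j] * x[c][src]
            row.append(acc)
        out.append(row)
    return out


x = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
y = conv1d_same(x, dirac_kernel(2, 2, 3))
print(y == x)  # the layer starts out as a no-op
```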
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter There was a bug w/ the F.normalize() on the weights. I don't believe it was doing anything as it wasn't in-place (big ol' egg-on-face moment, for sure), so I removed that + cleaned up the function definitions. Look at how beautiful this is. Feels close to Kolmogorov-minimality❤️
Tweet media one
2
1
23
@hi_tysam
Fern
1 year
Models perform poorly out of distribution. LLMs can never perfectly estimate the original distribution. _In inference_, LLMs are out of distribution on the first step. _LLMs are never explicitly trained to recover from this_. We desperately need a better solution than this.
1
0
22
@hi_tysam
Fern
1 year
However, remember that each token of text in a sequence contains information from _every single token preceding it_. Of course we do not have all of the information preceding, and that is where the concept of noise in information theory comes from.
1
0
22
@hi_tysam
Fern
1 year
Now, why multiple agents with a degree of orthogonality with each other? Well, you introduce a new kind of regularization that forces you to begin learning the temporal dynamics of the text. What's more, with multiple agents, it becomes dynamic, and something magical happens.
2
0
21
@hi_tysam
Fern
1 year
I generally do not release unfinished research until I have a bow on it and things are nice, clean, and self contained, but I feel it is relevant to this problem. The way we inference LLMs has one critical flaw that we're not doing anything about, to my knowledge, at least.
1
0
21
@hi_tysam
Fern
11 months
@go2carter Hey if I'm going to wisp my soul away to the trends of vapidity might as well be honest about it
1
0
21
@hi_tysam
Fern
1 year
1. The cumulative agent population learns a function that keeps on topic, effectively -- it learns state-space denoising. 2. We learn intrinsic cooperation patterns in conversation between _information-limited agents_, which is a far better match to humans than standard LLMs.
1
0
21
@hi_tysam
Fern
1 year
However, we have not necessarily lost information on the whole, as the entire agent population should cumulatively have coverage of our target distribution. This split allows us an opportunity to learn implicit agent-agent communication and cooperation.
1
0
21
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter What we can do is initialize part of the transition layer with a Dirac init -- which again just repeats the incoming layer, and initialize the rest with random filters. This creates several different 'lanes' of varying depth throughout the network at the start of training.
1
0
20
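The partial initialization described above could look roughly like the following for a layer that widens the channel count. This is a sketch under assumptions, not the release's code: `partial_dirac_kernel`, `num_identity`, `std`, and `seed` are hypothetical names, and the Gaussian for the random filters is an arbitrary choice.

```python
import random


def partial_dirac_kernel(out_channels, in_channels, kernel_size,
                         num_identity, std=0.1, seed=0):
    """[out][in][k] conv weight: identity filters first, random ones after.

    `num_identity`, `std`, and `seed` are hypothetical knobs for this sketch.
    """
    if num_identity > min(out_channels, in_channels):
        raise ValueError("num_identity exceeds available channels")
    rng = random.Random(seed)
    center = kernel_size // 2
    weight = []
    for o in range(out_channels):
        if o < num_identity:
            # Identity 'lane': copy input channel o straight through.
            filt = [[0.0] * kernel_size for _ in range(in_channels)]
            filt[o][center] = 1.0
        else:
            # Ordinary randomly initialized filter.
            filt = [[rng.gauss(0.0, std) for _ in range(kernel_size)]
                    for _ in range(in_channels)]
        weight.append(filt)
    return weight


w = partial_dirac_kernel(out_channels=4, in_channels=2, kernel_size=3,
                         num_identity=2)
print(w[0])  # identity filter for input channel 0
```

Signal passing through the identity filters skips the layer's mixing entirely at initialization, while the random filters behave like a normal layer, which is one way to read the 'lanes of varying depth' in the tweet.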
@hi_tysam
Fern
2 years
🧵 Announcing v0.4.0 of hlb-CIFAR10! With this update, you can train CIFAR10 to >94% on average in ~7.68-7.71 seconds or so, or ~95.84% in ~172 seconds. Crazy! Link to patch notes here, summary notes in thread.
1
1
19
@hi_tysam
Fern
1 year
However, with a comically-robust pretraining process, we can end up with a distribution of models that has sufficient coverage over the text distribution that such long range dependencies actually produce signal with meaning for our loss function.
1
0
19
@hi_tysam
Fern
1 year
So, naturally, we want a generative process that naturally includes this interactive dynamic in it, without completely butchering our data distribution or resorting to any kinds of hacks. This version of the method does that. There is another hidden benefit as well.
1
0
19
@hi_tysam
Fern
1 year
I see lots of people talking about Mojo, but almost nobody talking about Hidet and hidetscript.🧑‍💻 Recently announced by the PyTorch team, it's available as a backend or as a separate library for flexible GPU programming. Even lower level than Triton. (!)
0
5
19
@hi_tysam
Fern
9 months
I could maybe write a small book's worth of unsolicited thoughts on ML research methodology. I really appreciated reading Carmack's opinions here. I agree. I'd like to share some of my thoughts as to why this motif works so well for ML research, as opposed to software
@ID_AA_Carmack
John Carmack
9 months
For the last several years at Id Software, almost all of my coding was inside a single monolithic codebase, with the utilities integrated. Visual Studio was a delight to work with. Armadillo Aerospace only had two modest programs that evolved over a decade. VS for client and raw
103
228
4K
2
1
19
@hi_tysam
Fern
2 years
One example that drove it home for me was that I asked it to do a research task where it compiled a summary of a dense topic for me. The agent very quickly got stuck in shallow actions, constantly creating new agents and asking them to do things they weren't capable of doing.
1
1
19
@hi_tysam
Fern
2 years
Feel free to DM me or reply to any of these tweets if you have thoughts or questions, this is an interesting topic! <3 ;D :)))) :"D Lemme hear from ya!
1
2
19
@hi_tysam
Fern
2 years
Agent strategies: Immediate cost, lifetime cost, node-to-neighbor information throughput Agent management: Arbitrary node-to-node information throughput, agent operational independence trustworthiness (!!!!), asynchronicity of agent goals (!!!), information/task thrash
1
1
17
@hi_tysam
Fern
1 year
In the same way that a new car loses value as soon as you drive it off the lot, the token you generate (effectively) is (in a manner of speaking) already out of distribution! You're doomed from the start. No context size scaling, model size, etc is going to fix this.
1
0
16
@hi_tysam
Fern
1 year
But the math is the same -- it's well-grounded I think in the basics of information theory, and it moves us to the temporal domain. It's very similar to next-token prediction, but adds much more leverage to it through a different domain.
1
0
17
@hi_tysam
Fern
1 year
3. We are _still_ learning a generative function, but _now_, the generative element extends _directly to the dynamic state space interactions with others_!!!! This is unbelievably cool. Seriously. I think it opens a lot of possibilities.
2
0
17
@hi_tysam
Fern
8 months
If anyone is wondering what logging tools/metrics I use to optimize model training times, I almost exclusively use wall clock time (in seconds), num_microbatches (for LLMs), and validation loss, sometimes viewed as perplexity. That's basically it.
2
2
17
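A small sketch of the two conversions implied by the tweet above: viewing validation loss as perplexity, and measuring wall-clock time. It assumes the loss is a mean per-token cross-entropy in nats; `perplexity` and `timed` are hypothetical helper names.

```python
import math
import time


def perplexity(mean_cross_entropy_nats):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy_nats)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


val_loss = math.log(18.68)  # loss corresponding to a perplexity of 18.68
ppl, seconds = timed(perplexity, val_loss)
print(round(ppl, 2))
```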
@hi_tysam
Fern
1 year
This variance results in a kind of noise, as at each step, the model will sample from a distribution ever-so-slightly-different from the true distribution, even if perfectly converged, it is a finite model, so there will be noise. This results in errors that accumulate.
1
0
17
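A toy calculation of how small per-token deviations compound. It rests on a strong simplifying assumption that is not in the thread: each generated token independently goes 'off track' with a fixed probability. `prob_still_on_track` and `per_step_error` are hypothetical names.

```python
def prob_still_on_track(per_step_error, num_tokens):
    """Chance that no step has erred after `num_tokens` steps.

    Assumes (for illustration only) independent, identical per-step error.
    """
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be a probability")
    return (1.0 - per_step_error) ** num_tokens


for n in (10, 100, 1000):
    print(n, prob_still_on_track(0.01, n))
```

Under this toy model, lowering the per-step error slows the decay but never stops it, which matches the thread's later point that scaling buys time.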
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter I consider this particular release to be rather significant, as getting around residual layers in neural networks has long been something I've wanted to do. Let's get into some of the 'what' that has changed to enable this, first.
1
0
17
@hi_tysam
Fern
1 year
I hadn't put it enough together beyond the rough temporal component/domain mismatch business. Something I firmly believe is that the "return to text distribution" loss is really one of the only ways you can do this. Why?
1
0
16
@hi_tysam
Fern
11 months
G-Maximization (link in post) This excellent, and very frustratingly seemingly-overlooked, 1986 paper by Pearlmutter ( @BAPearlmutter ) and Hinton ( @geoffreyhinton ) has only been cited 59 times in nearly the last 30 years. Yet I think some of the concepts, and at the very least,
1
2
17
@hi_tysam
Fern
1 year
When we learn a distribution, we learn an estimator of it. An unbiased estimator, as you can guess, is unbiased. For text generation, this is a good thing. Unless you have a perfect model, however, if you have an unbiased estimator, you will still have some variance.
1
0
16
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter In our case, it's likely a bad thing. This is because neural networks are a bit of an odd thing when it comes to information flow. The more noise we have in our forward pass, the less useful our gradients backwards are. This is made even worse by the presence of nonlinearities.
1
0
17
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter So, we basically get a residual init 'for free', at the start of training, and the network can eventually learn to undo it if that serves the objective better. But this is an extremely computationally-efficient way to do so, especially as it fuses these ops into a single kernel.
1
0
17
@hi_tysam
Fern
2 years
Now, the solution to this maybe isn't too hard -- just specifically training the models on being able to execute tasks, plan ahead, and tackle and recover from dead-ends in the NP-hard problem space that is whatever one would define this open-ended task-assignment space to be.
1
1
17
@hi_tysam
Fern
2 years
@karpathy Aren't approximate KNN lookups really efficient too? Seeing the new libs feels like we're redoing research from years past, sorta groundhog day all over again. I guess this and the SVM trick are all using the same underlying projection foundation.
1
0
16
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter I'd like to refer to this as a kind of soft-architecture, or basically a bias over the information flow within a network via the structure of the weights themselves.
1
0
16
@hi_tysam
Fern
2 years
@iliekcomputers So I recommend setting loss smoothing to .2 when you're working with CIFAR10 like you are in this picture! :D Honestly the loss looks quite healthy but the network is a bit big with too many degrees of freedom. That's my best read from the numbers, hope this helps some! <3 :)
2
0
15
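Assuming the "loss smoothing" in the reply above refers to label smoothing, a value of .2 would change the cross-entropy target as sketched below. This follows the common convention of mixing the one-hot target with a uniform distribution; `smoothed_cross_entropy` is a hypothetical name for this example.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.2):
    """Cross-entropy against a label-smoothed target distribution.

    The true class gets probability 1 - smoothing + smoothing / K and every
    other class gets smoothing / K, where K is the number of classes.
    """
    k = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = smoothing / k + (1.0 - smoothing if i == target else 0.0)
        loss -= q * lp
    return loss


# Ten classes, as in CIFAR10; the logits here are made-up numbers.
logits = [2.0, 0.1, -1.0, 0.3, 0.0, -0.5, 1.2, 0.4, -0.2, 0.0]
print(smoothed_cross_entropy(logits, target=0, smoothing=0.2))
print(smoothed_cross_entropy(logits, target=0, smoothing=0.0))
```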
@hi_tysam
Fern
8 months
One of the questions I got was if the upcoming (~50M) hlb-gpt model with the new attention block scales. I poked around, one thing led to another, and yesterday, I trained my first 1.5B model. It got ~21 perplexity on W103 in 3 hours, w/ 1 A100. I did it again twice.
3
2
16
@hi_tysam
Fern
2 years
This is really interesting because RLHF feels very much an end-all-be-all solution just because it mimics our human communication interface. But it turns out that it results in an LLM that's very good at talking nicely when executing tasks, with very little goal-directed behavior
3
3
15
@hi_tysam
Fern
11 months
First off, the code gist: I am releasing this as what I call a 'network sketch' -- something not polished to the maximum, but (hopefully) functional enough to prove out an idea. The network runs in the same number of steps as the baseline, 1.7x wallclock
2
0
16
@hi_tysam
Fern
2 years
Secondly, within this paradigm, we're seeing lots of attempts at having duplicate agents with different state work together, which can be created and destroyed by other agents. For brevity, I'll refer to this kind of AGI as an Automated Task Manager framework (ATM).
1
2
16
@hi_tysam
Fern
10 months
I saw a post talking about a "mysterious and unexplored hack" known as EMA that got 250k views recently. It is neither of those, it is quite well-known and useful! One of my favorite works involving it is Favorite adjacent work is
0
0
15
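For readers unfamiliar with the EMA mentioned above, a minimal sketch of an exponential moving average over model weights follows. The flat list of floats stands in for a model's parameters, and `ema_update` and the `decay` value are assumptions for illustration.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Return decay * ema + (1 - decay) * current, element-wise."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]


ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.5)
print(ema)  # [0.875, 1.75]
```

In practice the averaged copy is typically used for evaluation while the raw weights continue to be trained.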
@hi_tysam
Fern
1 year
What I see in my feed: -- Wild speculations about Q* -- Historical digging through past interviews -- People complaining about wild speculations about Q* And the world goes on, as it does.
3
0
16
@hi_tysam
Fern
1 year
Well, remember next-token prediction? In this modality, we need access to the underlying stream of information encoded in the tokens. A temporal dependency is the only way to do this. One of the only ways to learn state space recovery requires inference.
1
0
15
@hi_tysam
Fern
11 months
@cacooleed shanon gud
1
0
15
@hi_tysam
Fern
1 year
Now, let's get to the magic. Back to the limited agents. If each agent was a GPT with the same distribution, there would be only symmetry, nothing would be learned. The more information is shared across agents, the more inherent cooperation is needed.
1
0
15
@hi_tysam
Fern
1 year
Additionally, it fulfils I think a lot of what we are wanting from LLMs, today, as it better matches the human expectations with interacting with LLM agents. Additionally, it should scale better as we're no longer futilely trying to reduce model variance at all costs.
1
0
15
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter With a random initialization, we are basically choosing a completely random set of assumptions for the neural network -- and one of them is the weighting mappings of information from one channel to another. This could be a good or a bad thing.
1
0
16
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter When we use a dirac initialization on the network branches, we're basically reducing the noise at the start of training, because there's not as much diffusion of the input signal of the network from start to finish.
1
0
16
@hi_tysam
Fern
2 years
@karpathy I'm honored and a bit stunned. Wow, thank you, Andrej! You got me into ML via CharRNN (Cookbooks specifically). MinGPT helped inspire this repo, I've needed something similar for years. It's also a (mild overkill) living resume to help pick up some part time niche specialty work.
3
0
14
@hi_tysam
Fern
1 year
If we have some future-return-to-the-original-text-prediction objective, and several, mostly-orthogonal agents (think Billy Crystal in the Princess Bride) we now force several things to happen:
1
0
14
@hi_tysam
Fern
11 months
@kittytronic @cremieuxrecueil It's not an atypical experience. In fact, my understanding is that most gifted children suffer these issues. The two studies that the OP linked are about University adjustment -- which is a problem for some, but the effect tends to peak in the mid-20's apparently. I'm hoping OP
1
1
13
@hi_tysam
Fern
1 year
This is really, really important. I hope you enjoyed this thread. It feels weird to put out partially-unfinished work for the world to see, but it feels relevant to the movement of the moment.
1
0
15
@hi_tysam
Fern
7 months
Writing this down helped clarify my thoughts on this. The grad norm spikes all seem to follow a function with a sharp spike up and a long tail of decay, and it looks like it has some exponential component to it in both the ramp and decay. If you take a look at it, I'm sure you
@hi_tysam
Fern
7 months
Experienced LLM-trainers of Twitter, I am looking for your help with a question! I am currently working on optimizing up the end-tail of training , and am unfortunately currently wrangling with the dreaded Loss Spikes. I need some help for hints as to where to look to address
2
3
9
2
0
13
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter The addition operator in the residual of ResNets, for example, basically forces a consistent structure on the latent space of the residual -- this reduces the degrees of freedom of the network, and shortens the effective length that the gradients have to travel in the network.
1
0
14
@hi_tysam
Fern
2 years
Maybe that was short on the math or such, and I think everyone and their parent is talking about AGI/etc for now, so this is from someone who's done applied research for a little while, and this is what I'd focus on to improve it. I believe it's viable. Hopefully it brings good.
2
1
14
@hi_tysam
Fern
10 months
He does not use an L2 norm on these weights.
@SchmidhuberAI
Jürgen Schmidhuber
10 months
The GOAT of tennis @DjokerNole said: "35 is the new 25.” I say: “60 is the new 35.” AI research has kept me strong and healthy. AI could work wonders for you, too!
162
157
3K
0
0
12
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter For example, one modification that may or may not be needed for this to transfer to larger networks is to weight decay towards a network architecture that enforces identity connections, instead of just zeros.
1
1
14
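One way to read "weight decay towards ... identity connections, instead of just zeros" is a decoupled decay step that shrinks the difference between the weights and a target matrix. The sketch below is an interpretation, not the author's implementation; `decay_toward`, `lr`, and `weight_decay` are hypothetical names, and a small 2x2 matrix stands in for a layer.

```python
def decay_toward(weights, target, lr, weight_decay):
    """One decoupled weight-decay step that pulls `weights` toward `target`.

    Standard weight decay is the special case where `target` is all zeros.
    """
    return [[w - lr * weight_decay * (w - t) for w, t in zip(w_row, t_row)]
            for w_row, t_row in zip(weights, target)]


identity = [[1.0, 0.0], [0.0, 1.0]]
w = [[0.5, 0.5], [-0.5, 2.0]]
for _ in range(3):
    w = decay_toward(w, identity, lr=1.0, weight_decay=0.5)
print(w)
```

With no gradient signal, repeated steps move the matrix toward the identity rather than toward zero.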
@hi_tysam
Fern
2 years
We also have the problems of asynchronous communication and the exponential divergence of information within a system that is chaotic (IIRC) due to information asymmetry between unique agents. There is a new Amdahl's Law waiting to be specified for this particular framework.
2
2
13
@hi_tysam
Fern
1 year
Sadly one of the downsides is that we're opening a new generative dimension -- the state space of LLM interactions simulating human interactions, and like most high dimensional things, esp w/o fixed points of data, is likely to be extremely compute-intensive (unfortunately).
2
0
13
@hi_tysam
Fern
1 year
Now as we go into the solid climbing phase of the LLM research S curve, I believe companies will get more and more like hedge fund companies, where tiny pieces of information are critical and highly guarded. This will also result in binding legal morass for skilled researchers.
1
0
14
@hi_tysam
Fern
1 year
Feel free to reach out with any questions, thoughts, etc. Happy to chat. It's a fun topic.
6
0
14
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter This is why DenseNets and ResNets changed the game so much when they were introduced, they shortened the effective length of the network when it came to the information flowing through the network, but still allowed for some of the benefits of depth.
1
0
14
@hi_tysam
Fern
2 years
When they responded that they weren't, instead of integrating that information, the LLM basically tried to politely brute force a different way. I remember watching it encounter 403 after 403 on the nih website, without changing strategy or really reacting to the environment.
2
2
14
@hi_tysam
Fern
1 year
What you do get from context size scaling, model size hyperscaling, etc, is less noise. You buy yourself time. So. Let's put this all together, now.
1
0
13
@hi_tysam
Fern
1 year
@natfriedman @danielgross @go2carter So, we are down from 8 convolutional layers to 7, of which only 6 are trainable (!!!!!). There are no more explicit residuals, so now it's simply called SpeedyConvNet (for now ;PPPP)
1
0
13