today, @sfcompute is releasing 3.5 years of audio narrations based on TinyStories as a publicly available dataset, which we hope will help folks explore multimodal models
Announcing our EvoGrad library for grad-based evolution + our Evolvability ES meta-learning algorithm: can scale to deep nets, compete w/ MAML in RL. With @jeffclune, @kenneth0stanley, and @joelbot3000. Blog: , paper:
We think our next SF compute cluster is the largest h100 cluster in the world that can support bursts right now
It's 1k h100s coming in March, and it can do bursts as short as 3 months
We’re releasing Wanderer 2 today! It learned a much fuzzier search function than Google, so you can search in very abstract terms.
We put lots of examples on the website, and we’re serving the model live so you can experiment with your own searches:
Today we’re releasing Wanderer 2, a large language model trained to search over the 2.5 million pages that have been posted to Hacker News! You can play with it here:
it's just that none of the cloud providers will give you a bunch of compute for a short time, so you have to buy the compute outright or rent it for 3 years
buying 128 A100s is closer to $2m
there's this sense that pretraining is super expensive and out of reach unless you raise $40m, but that's not really true
you ought to be able to train stable diffusion on probably 128 A100s in a month, about $100k
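quick back-of-envelope on that $100k figure; the hourly rate here is my assumption about rough short-term A100 market prices, not a quoted number:

```python
# back-of-envelope for the "128 A100s for a month ~ $100k" claim;
# the rental rate is an assumed rough market price, not a quote
A100_RATE = 1.10        # assumed $/GPU-hour for short-term A100 rentals
HOURS_PER_MONTH = 730   # 24 * 365 / 12

gpus = 128
print(f"${gpus * HOURS_PER_MONTH * A100_RATE:,.0f}")  # ~$103,000, i.e. about $100k
```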
the lower the economic barrier to doing large scale pretraining, the more companies and labs will do it
I'm hoping that over the next few years, there will be super diverse and interesting scaled up models that people try
"
@HoneyHiveAI
is an example of companies working to help developers iterate on underlying model architectures, modify prompts, filter & add novel training sets, & even distill models."
Great write-up @palak_go & @jturow! Thread 👇
In 3 months with 1k h100s, you could train a model approaching gpt-4 quality, for $6m
(Buying 1k h100s outright is about $38m, excluding power)
So this is about a factor of 6 improvement in the cost of training a big model
the same is true at larger scale, say if you want to do something on the order of a llama 2 (70b):
- reserving a cluster for 3 years is ~50m
- 3 shots at a one month training run is ~5m
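sanity check on those ratios, using only the dollar figures from these tweets (the implied hourly rate is just the $6m divided back out, not a price list):

```python
# the burst-vs-buy ratios above, checked from the thread's own numbers
HOURS_PER_MONTH = 730

# gpt-4-scale: 1k h100s for 3 months vs buying outright
print(38_000_000 / 6_000_000)                     # ~6.3x, the "factor of 6"
print(6_000_000 / (1_000 * 3 * HOURS_PER_MONTH))  # implied burst rate, ~$2.7/GPU-hr

# llama-2-70b-scale: 3-year reservation vs three 1-month shots
print(50_000_000 / 5_000_000)                     # ~10x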
Today we're officially announcing and opening applications to Campus 🏛️
1/ A “school” for the soul, intellect, and inner-child:
🔭community-created curriculum exploring our curiosities
🌱 50+ circles, salons, juntos and extracurriculars
🤓 "nerd prom" at end of the qtr
We think we can line up financing for bursts on a 25k h100 cluster—if we can pull it off, that will probably be enough to compete with gpt-5, again at a fraction of the cost
This is a cool idea—iirc in the alphazero paper, they did an average instead of minimax because it was more stable for neural networks to learn, but possible that something in between an average and a max would work
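for concreteness, one way to get "between an average and a max" is a temperature-scaled log-mean-exp over the child values; a minimal sketch (the beta knob and the toy values are mine, not from the alphazero paper):

```python
import numpy as np

def soft_backup(values, beta):
    """Interpolate between an average (beta -> 0) and a max (beta -> inf)
    over child values via a temperature-scaled log-mean-exp."""
    v = np.asarray(values, dtype=float)
    if beta == 0.0:
        return v.mean()
    m = v.max()  # subtract the max for numerical stability
    return m + np.log(np.exp(beta * (v - m)).mean()) / beta

vals = [0.1, 0.2, 0.9]
for beta in (0.0, 1.0, 10.0, 100.0):
    print(beta, round(soft_backup(vals, beta), 3))
# beta=0.0 gives 0.4 (the average); beta=100.0 gives ~0.889 (nearly the max)
```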
@EmilWallner @evanjconrad @sfcompute Slightly different on our smallest cluster and on the bigger ones: the smallest has a 10G network, the bigger ones have 100G, and both have 1-2TB of RAM, at least 7TB of disk, and 8 H100s with NVLink
If you want a longer spec sheet, you can send us an email at team@sfcompute.com
The goal is to disentangle how much compute it takes to generate the semantic content of these stories from how much it takes to generate the audio of someone narrating them
Audio is an interesting modality because it has so many bits: a wav file might have 24,000 floating point numbers per second, compared to text, which is maybe 3 integers (tokens) per second
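to make that gap concrete, here's what a fixed context window buys you at each rate (the 8192-token context is an arbitrary example, not a specific model):

```python
# how far a fixed context window reaches at each token rate
context = 8192          # arbitrary example context length

raw_audio_hz = 24_000   # samples/sec for a 24kHz wav, per the tweet above
text_tok_per_sec = 3    # rough narration rate from the tweet above

print(context / raw_audio_hz, "sec of raw audio")          # ~0.34 sec
print(context / text_tok_per_sec / 60, "min of text")      # ~45 min
```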
Either you've got to downsample your sequences to a few tokens per second so that a transformer can look at more than a few seconds at a time, or you've got to use something very different like
@EmilWallner @evanjconrad @sfcompute We don’t have a button for postponing instances, but people often ask us to move around their reservations and we do our best to push the calendar around
I think it ought to be possible to get an audio model trained on TinyNarrations that's semantically as good as a text model trained on TinyStories with ~5x more compute. Maybe as little as 1.5x.
To train a good voice model, you need to somehow get the model to spend most of its effort thinking about the words in the data, and quickly compress all the other stuff going on in the audio
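one plausible shape for that, following the downsampling idea from the earlier tweet: stack strided convs so the sequence model only sees a few latents per second, and let the conv stack soak up the raw-audio detail. a rough sketch, with all the sizes made up:

```python
import torch
import torch.nn as nn

# Sketch of "compress everything but the words": strided 1-D convs
# squeeze 24kHz audio down to a few latent vectors per second, which a
# transformer can then model at roughly text rate. Sizes are illustrative.
class Downsampler(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # five conv layers with stride 6 each: 6**5 = 7776x downsampling,
        # so 24,000 samples/sec -> ~3 latents/sec, roughly text rate
        layers, in_ch = [], 1
        for _ in range(5):
            layers += [nn.Conv1d(in_ch, dim, kernel_size=12, stride=6, padding=3),
                       nn.GELU()]
            in_ch = dim
        self.net = nn.Sequential(*layers)

    def forward(self, wav):           # wav: (batch, 1, samples)
        return self.net(wav)          # -> (batch, dim, ~samples / 7776)

x = torch.randn(1, 1, 24_000)         # one second of fake audio
print(Downsampler()(x).shape)         # torch.Size([1, 256, 3]): ~3 latents/sec
```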