In 2016, when I was working on machine translation, it took me more than a week on a multi-GPU machine to train a competitive system on WMT English-German.
Today, JAX on a TPU v3 supercomputer can train a better model on the same data in 16 seconds!
Can multi-100B param language models be served efficiently? We think so! Today we’re announcing the PaLM inference paper and releasing code for low-latency, high-throughput inference of 8B–540B models on TPU v4.
Paper:
Code: 1/5
Google Colab apparently now gives you one free K80 GPU for up to 12hrs at a time! Note that you have to go to "Runtime" --> "Change runtime type" to add a GPU.
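If you want to verify the GPU actually got attached, here's a quick sanity check from a notebook cell (a sketch assuming the default Colab image, which preinstalls TensorFlow):

```python
# Sanity check that Colab actually attached a GPU.
import subprocess
print(subprocess.check_output(["nvidia-smi"]).decode())  # lists the K80 if present

# Or via TensorFlow, which Colab preinstalls:
import tensorflow as tf
print(tf.test.gpu_device_name())  # '/device:GPU:0' when a GPU is attached
```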
Facebook is announcing a slew of new ML-related software projects and code releases this morning at F8, including Glow (a neural network compiler), PyTorch Translate (a public version of their production NMT code) and the roadmap for PyTorch 1.0
If you missed it, our JAX/Cloud TPU talk is now up!
We announced a new way to access Cloud TPUs, allowing direct SSH access and custom code on the TPU hosts, and gave FOUR demos showing how this supercharges JAX!
Video:
Slides:
Automated essay grading that takes into account factual accuracy and content coherence is several years away. In the meantime, NLP and AI researchers not paid by Pearson should push back against school systems that rely on this deeply flawed technology for standardized testing.
Incredibly excited that Sundar launched the public preview of Cloud TPU v4 Pods at I/O today, with a flythrough video of a datacenter filled with them: ! This is really three separate announcements:
JAX on Cloud TPUs is getting a big upgrade!
Come to our NeurIPS demo Tue. Dec. 8 at 11AM PT/19 GMT to see it in action, plus catch a sneak peek of a new Flax-based library for language research on TPU pods.
Link: ( is still open!)
As for me? I’m excited to join Google Brain later this month to work at the intersection of ML and programming languages. Among other things, I want to help make it easier to build structured NLP models (like those in @gneubig’s group’s 9 fantastic EMNLP papers) at Google scale.
DeepMind shares (some of) its distributed RL secrets!
Many of the authors on this paper were early and passionate advocates of JAX, and the Podracer architectures described here have helped inform the design of our parallelism APIs and distributed programming model.
Podracer architectures for scalable Reinforcement Learning
pdf:
abs:
"we argue that TPUs are particularly well suited for training RL agents in a scalable, efficient and reproducible way"
@seb_ruder
That's not the latest adaptive learning rate method any more 😉, the latest adaptive learning rate method is AdaFactor, quietly added three weeks ago to the Tensor2Tensor repository along with a note reading "TODO(noam): write a paper."
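The memory-saving core of AdaFactor, for the curious: factor the second-moment accumulator. A minimal numpy sketch of that one idea (update clipping, relative step sizes, and everything else in the T2T code are omitted):

```python
import numpy as np

def adafactor_second_moment(R, C, grad, beta2=0.999, eps=1e-30):
    """One step of the factored second-moment estimate.

    Adam stores a full matrix of squared-gradient EMAs; AdaFactor keeps only
    a per-row vector R and a per-column vector C and reconstructs the matrix
    as a rank-1 outer product -- O(n+m) memory instead of O(n*m).
    """
    g2 = grad * grad + eps
    R = beta2 * R + (1 - beta2) * g2.mean(axis=1)  # row stats, shape (n,)
    C = beta2 * C + (1 - beta2) * g2.mean(axis=0)  # column stats, shape (m,)
    V = np.outer(R, C) / R.mean()                  # rank-1 reconstruction
    return R, C, grad / np.sqrt(V)                 # V scales the update
```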
The new way to use Cloud TPUs, enabling direct SSH access and custom code, is now in public preview for TF, PyTorch, and JAX!
Read some testimonials from alpha users like @CohereAI and @KenoFischer, or check it out for yourself with
Facebook's fairseq MT engine is really, really fast... Like, 50% faster than @marian_nmt (which is itself way faster than Sockeye/OpenNMT/Tensor2Tensor/xnmt/Nematus/etc) at generating from the same Transformer model
“Image Transformer” from @nikiparmar09 and the rest of the Transformer team extends self-attention to 2D and provides a substantial quality improvement over the state of the art for image generation and super-resolution
I’m honestly astounded how much cluster babysitting work this (and BigScience) took on GPU systems. TPUs have their own problems, but in my experience they’re MUCH easier from a “how many ways can things go wrong” perspective.
Yesterday was my last day at Salesforce Research. I’m incredibly proud of what the team has accomplished: we built a world-class deep learning research team from scratch, and helped make Salesforce Einstein the most powerful set of AI capabilities in enterprise software.
Something like half the appendix of the DALL-E paper () describes work the authors had to do on GPUs that they wouldn't have had to do on TPUs:
- scaling fp16 mixed precision
- reducing gradient all-reduce comms w/ PowerSGD
- manual optimizer sharding
Did you know? Reading a paper signed by the author doubles your learning rate!
Today we are launching to share our beloved arXiv of signed machine learning papers with the world
All proceeds go to charity 💖
This is one of the most off-base threads I’ve ever seen on this hellhole of a website. 100s of researchers at Brain, FAIR, and other industry ML labs are doing science (not engineering, and not grad student descent, though there’s lots of that) without regard to corporate goals.
A bit about PaLM () infrastructure:
- trained on 6144 TPU v4 chips across two pods, without pipelining
- first use of the Pathways runtime at scale
- achieves 46.2% end-to-end matmul FLOPs utilization (or 57.8% including rematerialization; rough arithmetic below)
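For the curious, that utilization number is roughly the standard 6·N FLOPs-per-token estimate divided by aggregate peak throughput. A back-of-envelope sketch (only the chip and parameter counts come from this thread; the throughput and per-chip peak below are assumptions for illustration):

```python
# Back-of-envelope model FLOPs utilization (MFU) for a decoder-only LM.
n_params = 540e9          # PaLM parameters
n_chips = 6144            # TPU v4 chips (two pods)
peak_per_chip = 275e12    # assumed peak bf16 FLOP/s per v4 chip
tokens_per_sec = 240e3    # assumed aggregate training throughput

achieved = 6 * n_params * tokens_per_sec  # forward+backward matmul FLOPs
peak = n_chips * peak_per_chip
print(f"MFU ~ {achieved / peak:.1%}")     # ~46% with these assumptions
```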
Martin Popel has been the most active non-Googler on the Tensor2Tensor repository for months, and has posted a series of very interesting experiments about training and convergence in issue comments. Very happy to see he's turned them into an arXiv paper!
Anthropic arguably initiated the growing consensus around non-publication of “capabilities” research—they publish openly, but only on safety/interpretability. Don’t lump them with FAIR 😜
9/ Already, many other LLM players like Adept, Character, and Cohere have not published the details of their models.
Just blog posts.
FAIR and Anthropic might remain as the only large open research labs.
And I’ll be speaking tomorrow at 2pm about Matchbox, my brand new package for automatic batching in PyTorch. As @JeffDean says, manual batching “makes my head hurt”—never worry about padding and masking again!
Second, we’re publicizing TPU v4 specs for the first time! In addition to what’s in this table, TPU v4 also has one logical core with a full 32 GiB of HBM (vs. two w/16) and all slices with 64+ chips have wraparound on all three ICI axes, improving collective throughput vs. v3.
First, we’re bringing eight TPU v4 Pods to Google Cloud, in a single datacenter with 90% carbon-free energy. If this were used as a single supercomputer (along the lines of PaLM multi-pod training) we think it’d be the world’s fastest public ML system (9 exaflops peak bfloat16)!
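The arithmetic behind that headline number, assuming the public v4 figures of 4096 chips per pod and 275 TFLOP/s peak bf16 per chip:

```python
# 8 pods x 4096 chips/pod x 275 TFLOP/s peak bf16 per chip
pods, chips_per_pod, peak_per_chip = 8, 4096, 275e12
print(f"{pods * chips_per_pod * peak_per_chip / 1e18:.1f} exaflops")  # 9.0
```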
And we’re publishing some Transformer language model benchmarks we’ve been working on, which show that JAX + GSPMD + XLA + TPU v4 can achieve exceptionally high FLOPs utilization with two different scaling patterns (“optimal” here means Chinchilla-like):
NVIDIA chief scientist Bill Dally at #SysML: fast memory is expensive for the same reason that Palo Alto real estate is expensive―there isn’t much space close to where the compute happens
Researchers at Columbia and DeepMind have independently shown that artificial NNs can learn representations qualitatively similar to grid cells in biological brains
@soumithchintala @PreferredNetJP @ChainerOfficial
Something I just realized the other day is that, since version 2 when CuPy split into a separate package, Chainer is now 100% pure Python other than the NumPy dependency and therefore runs unmodified on iOS/Pythonista!
@TaliaRinger
PyTorch chose Python after trying very hard, over several years, to make Lua work instead. IIRC issues included lack of native OOP, a 32-bit JIT, poor support for large codebases, and the Lua core team’s preference against evolving the language for industry ML needs.
Researchers at MSR seem to have localized the engram (memory image) of certain pieces of world knowledge in a handful of neurons in a pretrained Transformer:
@orthonormalist
Yep, apparently Kwon from the 1st paper got the news that China was starting to make rudimentary LK-99 and published the paper even though he hasn’t been part of the lab since March. Apparently the process has since changed, and all we got is the old recipe in the paper.
@AdaptiveAgents
Can someone explain to me how this is different from the well-known result that doing linear PCA on place cells gives you grid cells? You can get this without the LSTM component.
The original TensorFlow control flow ops (the ones underlying `cond` and `while_loop`) are detailed in a fun new paper:
But they were likely a mistake: if/for/while can be lowered to functional control flow instead, which is easier to implement+parallelize
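For a concrete picture, here's what functional control flow looks like in JAX (a sketch for illustration; TF2's tf.cond/tf.while_loop take the same function-valued form):

```python
import jax
import jax.numpy as jnp
from jax import lax

def relu_ish(x):
    # Functional `if`: both branches are pure functions of the operand.
    return lax.cond(x > 0, lambda v: v, lambda v: jnp.zeros_like(v), x)

def count_halvings(x):
    # Functional `while`: loop state is threaded explicitly through pure
    # cond/body functions, so the op is easy to trace and parallelize.
    def keep_going(state):
        val, _ = state
        return val > 1.0
    def body(state):
        val, n = state
        return val / 2.0, n + 1
    _, n = lax.while_loop(keep_going, body, (x, 0))
    return n

print(jax.jit(relu_ish)(jnp.float32(-3.0)))       # 0.0
print(jax.jit(count_halvings)(jnp.float32(10.0))) # 4
```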
@MattHaneySF
SF should:
- allow vaccination inside drugstores
- allow more people to give shots and pay them more
- open 24hr public sites
- protect good faith vaccinators
- throw an ice cream party this summer for the supe district that vaxxes the fastest
...and dare the governor to stop us
Strong endorse: “When investing in terms of scaling in terms of data, model parameters and compute, we should think of an additional axis which is _data diversity_.”
(Narrow self-supervision datasets cause downstream task performance to saturate.)
I totally missed that the principal example for TensorFlow Eager was ported from my PyTorch SPINN code! It's impressive how one-to-one the conversion is; the framework convergence is real :)
@tszzl
dojo has dramatically more interconnect bandwidth, letting you scale smaller batch sizes on larger systems with simpler/lower-overhead parallelism. I suspect it’s more expensive per flop though (even vs. A100 and almost definitely vs. H100), but I’m assuming I don’t have to pay.
Number of talks I've given by year (excluding teaching).
I'm trying to cut down and get some real work done now.
2018 17 <- as of June 21. Doing better.
2017 56 <- having no life.
2016 54 <- over 1 talk/week
2015...
@jeremyphoward @NvidiaAI
Behind the scenes (starting ~a year ago) NVIDIA has also set up a dedicated engineering team to work on XLA:GPU. (Something Jeremy might get a kick out of is that one of them is Frederic Bastien, the creator of Theano! )
"most of MKL-DNN’s performance is lost during framework integration (Tensorflow in this case) for various reasons such as the lack of fusion, inefficient scratch memory allocation, or thread scheduling"
"Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures," Georganas et al., Intel:
"This proves that CPUs can be a competitive alternative when training neural nets." 🧐
@ClementDelangue @MSFTResearch @GoogleAI @nvidia @OpenAI @BigscienceW
The TPU approach (which matches f32 training; minimal sketch after this list) is to perform all matmuls in bf16*bf16->f32, perform all vector math in f32, and truncate every value stored to HBM to bf16 EXCEPT:
- optimizer state (incl. primary copy of params)
- layernorm intermediates
- attention logits
- final logits
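A minimal JAX sketch of that recipe (assumed shapes, and a layernorm-style stand-in for the vector math; not production code):

```python
import jax.numpy as jnp

def block(x_bf16, w_bf16):
    # Matmul takes bf16 inputs but accumulates and returns f32:
    y = jnp.matmul(x_bf16, w_bf16, preferred_element_type=jnp.float32)
    # Vector math (layernorm-style normalization here) stays in f32:
    mean = y.mean(axis=-1, keepdims=True)
    var = ((y - mean) ** 2).mean(axis=-1, keepdims=True)
    y = (y - mean) / jnp.sqrt(var + 1e-6)
    # Truncate to bf16 only when storing activations back to HBM; optimizer
    # state, layernorm intermediates, and logits stay f32 per the list above.
    return y.astype(jnp.bfloat16)
```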
Kinda weird to me to call this a “transition from Codex to GPT-3.5” when code-davinci-002 _is_ the big 3.5 base model…it feels more like a product safety decision (that I think I grudgingly support?) to not have a base model available
OpenAI is discontinuing Codex.
GPT-3.5 outperforms Codex, and GPT-4 blows it out of the water.
I think the takeaway here is that eventually everything converges to one general purpose model.
kinda sounds like copilot is trying to follow the license terms and people are ignoring it
seriously though, between gpt-3 using copyrighted books and codex using gpl’ed code, openai is tempting fate, and it would be pretty amusing if it’s linus rather than the authors guild
github copilot has, by their own admission, been trained on mountains of gpl code, so i'm unclear on how it's not a form of laundering open source code into commercial works. the handwave of "it usually doesn't reproduce exact chunks" is not very satisfying
@EigenGender
My preferred counterfactual here is “IBM gets into convnets in the 90s and builds an ASIC-based NN training supercomputer instead of Deep Blue”
@andy_l_jones
This reads (to me) like a review by someone who feels left behind by the pace of change in the modern ML conference ecosystem (and a strong reject is not a great way to react to that!)
I think you’d have better luck at NeurIPS—IMO your paper meets their bar.
Third, we use multidimensional partitioning with overlapped collective communication and other low-level optimizations, many of which we believe are new in the literature. Learn more in our paper or consider adapting our code to your own models! 4/5
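Not our serving code, but a toy JAX sketch of the multidimensional-partitioning flavor this builds on (assumes 8 local devices; the mesh shape and axis names are arbitrary):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Lay devices out as a 2D mesh and shard a matmul across both axes.
devices = np.array(jax.devices()).reshape(2, 4)  # assumes 8 devices
mesh = Mesh(devices, axis_names=("data", "model"))

x = jax.device_put(jnp.ones((128, 512)),
                   NamedSharding(mesh, P("data", None)))   # rows over "data"
w = jax.device_put(jnp.ones((512, 1024)),
                   NamedSharding(mesh, P(None, "model")))  # cols over "model"

# XLA/GSPMD inserts the needed collectives (overlapping them with compute
# where it can); the result comes back sharded P("data", "model").
y = jax.jit(jnp.matmul)(x, w)
```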
It's pretty disappointing that Douglas Hofstadter—of all people!—is almost completely incurious about what deep learning is and what our goals and methods as MT researchers actually are.
Slides from my talk last week at Uber Science Day:
Skip about halfway through if you’re more interested in “future” than “past”...
Thanks @savvyRL for the invitation!
Unfortunately Sally Lieber didn’t stand up for SB50 at the climate forum today—only Shelly Masur did. SB50 isn’t a radical bill; it’s table stakes. If you won’t support this first step in the part of the state that needs it most, you’re not a YIMBY.
@penforeveryone @PaloAltoYimby @yimbyaction @cayimby
Fmr Assembly Speaker Pro Tem Sally Lieber: "I consider myself a YIMBY. I think we all should be."
Redwood City Councilmember and all-around badass Shelly Masur: IDs as a YIMBY, saying "The need for housing is critical to addressing climate change and displacement."
😍😍😍
"Attention Solves Your [Traveling Salesman Problem]" from W. W. M. Kool and
@wellingmax
at UvA takes
@IrwanBello
's work on neural combinatorial optimization and swaps the RNN for a Transformer—very cool results!
This is the best article I've seen about how AI research in industry actually works, and the business case for openness and participation in the community
This is a pretty incredible story: former Google eng "alleges that [Pinscreen] submitted false results to SIGGRAPH" and that he was fired and "Pinscreen employees, under [CEO] Li’s commands...physically attacked him" after he pointed out the fraud
Essay translation 🇨🇳➡️🇺🇸 h/t @jjding99: Zhao Tingyang: "Near-term Worries" and "Long-term Concerns" of the Artificial Intelligence "Revolution": An Analysis of Ethics and Ontology.
@giffmana @elonmusk @askerlee
I think it’s pretty straightforwardly true at the hardware level: both Cerebras and Dojo, and to a lesser extent Graphcore, have very high bandwidth (relative to flops) for their parameter/activation memory, and more flexible matmul structure.
Happy to release our work on Language Model Cascades. Read on to learn how we can unify existing methods for interacting models (scratchpad/chain of thought, verifiers, tool-use, …) in the language of probabilistic programming.
paper:
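A toy sketch of the probabilistic-programming view, where chain of thought is just a latent string-valued variable (sample_lm is a hypothetical stand-in, not an API from the paper or code):

```python
import random

def sample_lm(prompt: str) -> str:
    # Hypothetical stand-in for an LM sampler (NOT from the cascades code).
    return random.choice(["a sampled continuation"])

def chain_of_thought(question: str) -> str:
    # A cascade chains string-valued random variables through the LM:
    #   thought ~ p(thought | question)
    #   answer  ~ p(answer  | question, thought)
    thought = sample_lm(f"Q: {question}\nLet's think step by step:")
    return sample_lm(f"Q: {question}\nThought: {thought}\nA:")
```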