Sr. Research Scientist
@AWS
Labs (ph-D
@LTIatCMU
) working on unnatural language processing, speaking πtorch & C++. Opinions sampled from MY OWN 100T param LM.
🧵📢Attention folks working on LONG-document ranking & retrieval! We found evidence of a PROFOUND issue in existing long-document collections, most importantly MS MARCO Documents. It can potentially affect all papers comparing different architectures for long document ranking.⏩
"We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human perf. was 94%. The LARGEST models were generally the LEAST truthful. Models generated many false answers that mimic popular misconceptions
I am excited 🥳🎉🎈to announce that our
@NeurIPSConf
paper was rejected. I am not sad or distressed. I thank the reviewers and I agree we should do a more thorough evaluation in our future submissions.
Fun fact about Python. You can use a sum function to flatten nested lists.
l=[['a', 'b', 'c'], ['1', '2'], ['#']]
sum(l, [])
Result:
['a', 'b', 'c', '1', '2', '#']
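A small aside on the trick above: `sum(l, [])` builds a brand-new list at every step, so it can get slow (roughly quadratic) when there are many sublists. Here is a minimal sketch comparing it with `itertools.chain.from_iterable` from the standard library. The variable names are only for illustration.

```python
from itertools import chain

nested = [['a', 'b', 'c'], ['1', '2'], ['#']]

# The sum() trick: concatenates the lists one by one, copying at each step.
flat_sum = sum(nested, [])

# Linear-time alternative from the standard library.
flat_chain = list(chain.from_iterable(nested))

print(flat_sum)
print(flat_chain)
```

Both produce the same flat list. The `chain` version is usually the safer default for long inputs.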
Everything you wanted to know about activation functions, but were afraid to ask!
Turns out there have been at least 4⃣0⃣0⃣ activation functions proposed over 3⃣ decades, and there is a paper that reviews all of them!
🧵 A longish thread on big models, ChatGPT, and emergence of non-trivial intelligence from big data & compute. 1st, please, do not misinterpret some of my tweets,
@OpenAI
made another amazing breakthrough and pushed the boundaries of conversational AI.
"Bring sheep indoors, and they’re labeled as cats. Pick up a sheep (or a goat) in your arms, and they’re labeled as dogs. Paint them orange, and they become flowers. And if goats climb trees, they become birds. "
One of the most comprehensive books on artificial intelligence, by Russell and Norvig, has a new edition! For so many years this book has been updated to cover key topics in AI!
"Google is somewhat isolated within ML community because of its lack of use of
@PyTorch
and GPUs in favor of its own software stack and hardware. In typical Google fashion they even have a 2nd framework called Jax that competes directly with TensorFlow."
🧵Attention, IR community! The era of cheap UNSUPERVISED domain adaptation has begun! Let me introduce the InPars-Light training recipe, which enables a small MiniLM model (with 30M parameters) to consistently outperform BM25 on all datasets used in the original InPars study.
🧵Big announcements from
@awscloud
🎉🎈🍾. First, the coding assistant (that our team worked on) has become generally available. It is *FREE* for individual use. I have been personally using it recently and find it quite useful (typing effort reduced!).
What I find interesting in the GPT-4 paper is that the paper has to be cited by using OpenAI as a single author. Basically
@OpenAI
denies you individual authorship if you work for them. This reminded me about Cirque du Soleil where all stars are anonymous and interchangeable.
Transformer architecture reality check. Most modifications do not lead to improved performance, unless the # of parameters is increased dramatically. One exception is decoupling input and output embedding parameters. h/t
@seb_ruder
While having little visibility 3 years ago
@Bosch_AI
published > 100 papers in top tier venues in 2019-2020, including 32 papers at NEURIPS, ICML and ICLR in 2020.
👇When u mention beam search, please, don't cite Sutskever or any other modern author (unless it's some specific variant). The term beam-search was possibly coined by the Carnegie Mellon professor Raj Reddy (who received the Turing award for his contributions to AI).
How does
@Stanford
@stanfordnlp
release a fine-tuned LLAMA model when the original LLAMA model is not freely available (unless you use Torrent 😂)? This should raise some interesting legal questions, or am I missing something?
🧵 This is an awesome paper with a very clever and effective idea about how one should implement a global (rather than local) convolution, but I think there are nuances that I explain in the thread. First, the authors build upon the success of the S4 model.
They iteratively train several models. Largest is trained for 3.5 days on 2048-v3 TPU pod. It seems that one hour of this pod is ~$2.5K. Thus, we get an insane price tag ~$200K. The full experiment might cost you about half a MILLION USD if u want to reproduce it.
Example predictions on robustness benchmarks ImageNet-A, C and P. Black texts are correct predictions made by our model and red texts are incorrect predictions by our baseline model.
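The TPU cost estimate in the tweet above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 3.5 days and the ~$2.5K hourly rate are the tweet's own rough numbers, and the variable names are made up for illustration.

```python
# Numbers taken from the tweet; the hourly rate is the tweet's rough estimate.
days_of_training = 3.5
pod_cost_per_hour_usd = 2_500

hours = days_of_training * 24
largest_model_cost = hours * pod_cost_per_hour_usd

print(f"{hours:.0f} pod-hours -> ~${largest_model_cost:,.0f}")
```

That gives 84 pod-hours and about $210K for the largest model alone, in line with the ~$200K figure quoted above.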
👇Never ever use a boolean type in Python argparse unless you use action='store_true'. Inspired by four hours of hunting a bug in a famous library.
Imagine you have a program
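A minimal sketch of the argparse pitfall, using only the standard library. The flag names `--bad_flag` and `--good_flag` are hypothetical and are not taken from the library mentioned in the tweet.

```python
import argparse

parser = argparse.ArgumentParser()
# Pitfall: type=bool calls bool() on the raw string, and any non-empty
# string (including "False") is truthy.
parser.add_argument("--bad_flag", type=bool, default=False)
# Safer: a switch that stays False unless the flag is present.
parser.add_argument("--good_flag", action="store_true")

args = parser.parse_args(["--bad_flag", "False"])
print(args.bad_flag)   # True, although the user typed "False"
print(args.good_flag)  # False, because --good_flag was not passed
```

With `type=bool` the only way to get `False` from the command line is to pass an empty string, which is rarely what anyone intends.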
🧵
@OpenAI
open-sourced their tokenizer's code which is claimed to be ridiculously faster than
@huggingface
tokenizers. Does anybody have HF benchmarks? Here the page claims the speed of 50 MB/sec but # of threads is unknown.
🧵Some back of the envelope calculations on feasibility of using LLM as an addition to search. Let's assume that a somewhat smallish ~100GB params model fits on an 8 GPU 320 GB GPU memory p4d.24xlarge. A 3 year reserved price is $12/hour.
"We present gen. Parameter-Efficient Finetuning framework for tuning LLM with only 𝟬.𝟭%-𝟬.𝟮% of parameters using mix of adaptation modules -> achieve new 𝗦𝗢𝗧𝗔 > standard tuning on both NLU & NLG tasks.
Paper:
Code&Models:
@rasbt
"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."
Now that FlashAttention is flashing and we can now have very long input windows, is anybody training a replacement for BERT? Or do we first need to wait till FlashAttention is fully integrated into
@huggingface
?
Some interesting/notable things that Huang et al did here:
1. A somewhat non-standard Transformer variant (which is claimed to be more stable in training).
2. Both text and images are embedded, however, only text tokens are predicted!
3. Training includes both next-token
Here we go!
Microsoft introduces a multimodal large language model called Kosmos-1.
Achieves great performance on language understanding, OCR-free NLP, perception-language tasks, visual QA, and more.
I am extremely excited to join the Artificial Intelligence lab
@Bosch_AI
and
@zicokolter
team! I am looking forward to new challenges and opportunities! 2 years with M*Modal (also as a part of
@3MHISNews
) were good and productive, but it's time for new adventures!
👇I have 2 internship positions in 2021: CV & NLP.
@Bosch_AI
is an up-and-coming research lab. Bosch is an equal opportunity employer and we welcome candidates from underrepresented groups. More details are in the thread, please, RT. 🧵
"Head-to-Tail: How Knowledgeable are LLMs? Will LLMs Replace Knowledge Graphs? We show that existing LLMs are still *FAR* from being perfect in terms of their grasp of factual knowledge, especially for facts of torso-to-tail entities."
👇I am 😊 to announce that our retrieval framework FlexNeuART (used to produce strong MS MARCO & TREC runs) is now available as a PyPI package (Linux/MacOS).
It's a fully-fledged modular retrieval framework that has a number of SOTA neural and traditional models.
👇MAD-X: a beautiful idea for effective multilingual transfer.
1. Start from multiling BERT. It works ok for many languages, but you can still get better results from unsupervised (LM-mask) fine-tuning for a target language. Downside: we have multiple-language models. Solution?
I see people are being stressed out about things changing too quickly recently. I had somewhat similar feelings, but I also have a feeling of relief that largely outweighs anxiety. Imagine all that old NLP/ML stuff you had to learn before to do well: CRFs and their training
"At Netflix, personalization plays a key role in several aspects of our user experience, from ranking titles to constructing an optimal Homepage. "
No BERT: good old DNNs are all you need.
👇Today is an efficient Transformer day. Previously, I tweeted about a 30x speed-up on a 32-core CPU:
There is a paper getting 233x on 80 CPU cores or roughly 93x speed-up scaled to 32-cores (this is a very rough comparison).
👇Running BERT in Pytorch on CPU efficiently!
1. Can't use GPU, b/c low-latency requests prevent batching.
2. torch.set_num_threads(1) or else each PyTorch instance tries to use most CPU cores
3. Use DistilBERT (2x speed up)
4. No padding (batch size==1)!
5. 8-bit Quantization
Betting on a dead horse. Moore's law is effectively dead and it's not clear when/if it can be resuscitated. Even using a conservative estimate (favorable for tech), we are still SIX orders of magnitude behind the brain.
Jupyter notebooks may be good for prototyping, but eventually one needs proper modular code without weird memory effects. Using them for production-grade work seems to be poor engineering.
I was accused that my software is merely a research toy. There's nothing better than a good toy. My condolences, if your inner child has reached the age of senility.
GPT-4all with 20K stars in a very short period is a crazily successful project. In contrast,
@huggingface
has ~90K.
GPT-4all: Demo, data, and code to train an assistant-style large language model with ~800k GPT-3.5-Turbo Generations based on LLaMa
In summary, I am convinced there is clearly an emergence of behaviors from apparently "dumb" training procedures, which seem to contradict (to some degree) the poverty of stimulus hypothesis. As a scientist, I am looking forward to learning more about these phenomena.
👇Recently we had a discussion about different approaches to applying layer normalization in Transformers. Unlike the original postnorm approach, prenorm applies normalization inside the residual branch, so the skip connection bypasses it, and it was found to train in a more stable way. There's a recent update with residual attention Transformers.
👇 mBERT-like models can have great cross-lingual 0-shot transfer abilities. Just train it on English => apply to another lang. Cao, Kitaev, & Klein showed that mBERT can be improved (for NLI) via cross-lingual adjustment with a small || corpus. We find this may not always work.
Bing uses a 3-layer BERT-like transformer for every query. In addition to model simplification and possibly distillation, they implement more efficient GPU code.
“With these GPU optimizations, we were able to use 2000+ Azure GPU Virtual Machines across four regions to serve over 1 million BERT inferences per second worldwide”
Bing, using (distilled, 3-layer) BERT in production.
via
@rangthan
🧵A lot of multi-modal models were announced recently. A little summary, b/c they all seem to be similar in key aspects:
1. combining language & vision embeddings using (mostly) pre-trained backbones.
2. training composite Transformer model using next token prediction.
Uncomfortable truth: working with (only) corporate APIs that you don't control isn't reproducible research and it has other issues too. In particular, if your strongest technical skill is writing API calls, how much value does it create for workplaces where heavy lifting is still
👇Turns out that one can tune hyperparameters using a smaller version of the model (which is cheaper compared to the full-size model) and transfer these params to the full-size model => boost in accuracy.
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
By transferring from 40M parameters, µTransfer outperforms the 6.7B GPT-3, with tuning cost only 7% of total pretraining cost.
abs:
repo:
Also another point to make. Let's face it, to a large degree reviewing is non-scientific. A bunch of overworked people, many of whom haven't had any hands-on experience for a while, make hasty decisions by skimming a bunch of papers (on topics they do NOT care about). What
I think one of the problems of academic publishing/peer reviews is that there is zero downside of rejecting good ideas.
The reviewers of the DecaNLP paper - which introduced prompt engineering for multitask NLP with large nnets - were so wrong & held back the field for years.
We've finally put out a detailed IEEE/ACM paper on
@Google
's multi-year effort to ease the burden of code review with ML. Google engineers now resolve 7.5% of all code review comments with an ML-suggested edit. But the path to that number has been a fun ML and UX journey!
Best ACL2020 paper:
"While traditional benchmarks indicate that models on these tasks are as accurate as humans, CheckList reveals a variety of severe bugs, where commercial and research models do not effectively handle basic linguistic phenomena"
This is freaking insane [thread]: "we utilized GPipe [4] to train 128-layer Transformers with over 6 billion parameters. Increasing the model capacity resulted in significantly improved performance across all languages by an average of 5 BLEU points."
A reminder for people who want to use Python multiprocessing with CUDA and other tricky to use resources (e.g., Java libraries). It's very likely you need to set multiprocessing.set_start_method('spawn')
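A minimal sketch of what the reminder above looks like in code, using only the standard library. The `square` and `worker` functions are hypothetical stand-ins for real CUDA (or JVM-backed) work, and the example uses `get_context("spawn")`, which scopes the setting to one context object instead of changing it globally with `multiprocessing.set_start_method('spawn')`.

```python
import multiprocessing as mp


def square(x):
    # Pure helper so the example has something to compute.
    return x * x


def worker(x, result):
    # Under "spawn" this runs in a fresh interpreter, so it does not inherit
    # the parent's CUDA context, JVM handles, or other process state.
    result.value = square(x)


def main():
    # Alternative: call mp.set_start_method("spawn") once at program start.
    ctx = mp.get_context("spawn")
    result = ctx.Value("i", 0)  # shared integer, passed to the child at start
    proc = ctx.Process(target=worker, args=(7, result))
    proc.start()
    proc.join()
    print(result.value)


if __name__ == "__main__":
    # The guard matters: spawned children re-import the main module.
    main()
```

The trade-off is start-up cost: a spawned child re-imports modules and re-initializes everything, which is exactly why it avoids the broken state a forked child can inherit.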
👇 "End-to-End learning sounds like a good idea on paper, but for most deployment scenarios, pipelined architectures that are piecewise optimized will continue to stay. "
Google GED is a very interesting QA explainability benchmark. I can't help noticing, though, that the authors postpone the issue of faithfulness to future work, but the problem is likely unsolvable for complete black boxes.
From Awni Hannun (the first author on the 1st Baidu's end-to-end recognition paper): Speech-recognition isn't solved : " The recent improvements on conversational speech are astounding. But, the claims about human-level performance are too broad."
🧵This is a nice adversarial attack on LLMs, where redefining built-in functions (len becomes print and vice versa) confuses the model. It shows the model does not fully grasp semantics and heavily relies on reasoning shortcuts.
New paper:
LLMs are good at programming tasks, and they are now being widely used to assist code generation.
But do they actually understand the semantics of programming languages, or do they rely on superficial, "shortcut" correlations?
Let's find out!
1/
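To make the built-in swap concrete, here is a hypothetical illustration of the kind of program such an attack might use. It is not taken from the paper's benchmark. The swap is kept local to one function so it does not affect the rest of the script.

```python
import builtins


def count_items_swapped(items):
    # Deliberately swap the two names inside this function only:
    # here `print` refers to the real len, and `len` to the real print.
    len, print = builtins.print, builtins.len
    return print(items)  # despite the name, this returns the length


def count_items_normal(items):
    return len(items)


data = ["a", "b", "c"]
assert count_items_swapped(data) == count_items_normal(data)
```

A reader (or model) that relies on the usual meaning of the name `print` will predict the wrong behavior for `count_items_swapped`, even though the code is perfectly valid Python.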
👇Great talk:
1. You need better data not just more data
2. Increasing data has diminishing returns
3. Feature engineering is still important
4. Start with simpler models
5. Don't just increase complexity of models: add more features
6. Be aware of biases in the data
Nice work: Mixtral MoE on small-memory GPUs via offloading weights to RAM. It has 2 ingredients:
1. LRU (least recently used) cache.
2. Speculative loading: estim. expert by applying current layer gating func. to previous layer H.
2-3x speed up compared to naive offload.
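The two ingredients above can be sketched in plain Python. This is only a toy model under stated assumptions: `ExpertLRUCache`, `load_fn`, and `prefetch` are hypothetical names, strings stand in for expert weights, and the guess of which experts to prefetch is passed in from outside instead of being computed by a gating function.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Keeps at most `capacity` experts resident; evicts the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # hypothetical: fetches expert weights from RAM
        self.cache = OrderedDict()  # expert_id -> weights, least recent first

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        weights = self.load_fn(expert_id)      # cache miss: slow load from RAM
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the least recently used
        return weights

    def prefetch(self, guessed_ids):
        # Speculative loading: warm the cache with experts guessed for the next layer.
        for expert_id in guessed_ids:
            self.get(expert_id)


loads = []  # records every slow load, in order
cache = ExpertLRUCache(capacity=2, load_fn=lambda i: loads.append(i) or f"weights-{i}")
cache.get(0)
cache.get(1)
cache.get(0)         # hit: no new load
cache.prefetch([2])  # evicts expert 1, the least recently used
print(loads, list(cache.cache))
```

In a real system the prefetch would overlap with computation of the current layer. Here it only shows which experts end up resident.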
👇What I learned recently about Python, CUDA, and/or
@spacy_io
multi-processing.
1. Python lack of multi-threading is super-annoying but there're crutches that can still be useful.
2. You need to call multiprocessing.set_start_method('spawn') or your CUDA is gonna be screwed up
Evaluating ELMO embeddings: Meaningful gains on many tasks. "We also show, however, that we are still far away from a universal encoder that can perform consistently across several downstream tasks." by Perone et al 2018 …
Short summary of a discussion (but it's worth reading all of it). LLAMA team (as well as many other teams) evaluates a model withOUT actually revealing prompts (at least this is my understanding). Another team tests LLAMA with a different set of prompts and gets noticeably worse
👇BERT does *NOT* rediscover the classic NLP pipeline. "We verify via non-parametric probes that the permutations do in fact make the model worse at syntax-dependent tasks."
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little
MLMs pretrained on sentences with randomly shuffled word order surprisingly achieves high accuracy after fine-tuning on many downstream tasks.
🧵Well-known company released a multi-threaded eval script. It starts 16 threads but 15 have 2% CPU usage.
A big reminder: do *NOT* waste your time trying to harness ThreadPoolExecutor (or any Python "thread" API) for CPU-bound work. It's a pointless exercise b/c CPython's GIL lets only one thread run Python code at a time.
👇"A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets ... "
GPT-3 with 175B params, 10x any prior LM, is out. Setting aside controversies around implications of training such ever-larger models, the fact that they show high zero-shot performance on challenging reasoning tasks is pretty interesting. Ppl often ask me how I feel about all this:
👇"By increasing batch size to the memory limit of TPUv3 Pod, BERT training time can be reduced from 3 days to 76 minutes". Even though this little pod has 1024 TPU chips, it makes training possible for many teams. A v3 TPU is $8/hour. So, the overall training cost is ~$8K.
"Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT"
*CAVEAT* only one small (should I say tiny?) collection is used.
Product search at Amazon via DSSM and (very cool) extreme classification: "This indicates another unique difference between semantic product search problem and the conventional web retrieval problem where BM25 variants are often a strong baseline."
It looks like SOTA in Transformer model efficiency (if you don't use any sparsity / routing / mixture of experts) is a combination of:
1. distillation
2. quantization
3. static or adaptive early exiting / reduced comp. in higher layers
A study finds that ChatGPT is seemingly not at the level of good human annotators as of now. A couple of caveats here:
1. There was no extensive prompt hacking.
2. Performance is measured against existing human annotations.
How noisy and accurate they are is not rechecked.
Can ChatGPT Reproduce Human-Generated Labels?
Nice study on the applicability of ChatGPT for data annotation tasks. Results show that ChatGPT does have the potential to handle different types of data annotation tasks like stance detection, sentiment analysis, and bot detection.
Different interpretation of the same results:
@jacobeisenstein
HMM beats LSTM on small data
@sleepinyourhat
wow, LSTM beats HMM already with 500 sentences. Proof links:
🧵 A nice LLAMA-3 summary by
@natolambert
. In short, performance is better, but largely due to the sheer increase in the pre-training scale. Training data is a secret. ↩️
It looks like the stress is real. In my experience:
1. paradigms shift
2. numbers get higher
3. Yet, unsolved problems remain in big numbers.
Moreover, it looks like new discoveries create new opportunities. It is a bit hard when you are locked in with some topic that you
"Starting from a public multilingual BERT checkpoint,
our final model is 6x smaller and 27x faster, and has higher accuracy than a SOTA multilingual baseline." SOTA is LSTM.
@sarahookr
At the same time, it is not quite clear what the value of a PhD is for most people. The competition to get in heats up, but then what? Becoming a junior ad engineer without any research.
A beautiful idea: inject very small networks, e.g., between layers. During fine-tuning, the main BERT "body" remains unmodified. Thus, effective fine-tuning requires only very small modification of the main network and these modifications can be engineered explicitly!
Adapters are a small set of new weights introduced at every layer of pre-trained transformer models (e.g. BERT, RoBERTa). Only the new Adapter weights (within the pink box) are trained while keeping the original transformer parameters frozen/fixed.
The Hitchhiker's Guide to the Galaxy is an early machine learning book! The computer spits out a number (42), but it refuses to provide an explanation!
@TaliaRinger
That statistical NLP work was still closely coupled with understanding the shape of the problem being solved, specifically in feature engineering. Then (2010s) we got the next "invasion" from ML land (deep learning) where the idea was the computer would learn the features!
2/
👇The largest ever IR collection of news items was released recently! What is totally awesome is the size of the collection: 42M news articles & 10K queries. What's a bit disheartening is that queries are created using the SQuAD approach, i.e., workers were asked to "create a query".
[thread] A multilingual BERT trained with the shared wordpiece inventory permits zero-shot transfer of various models across languages. Quite crazily it works for POS and NER tasks! Even more crazily it works for pairs of languages with DIFFERENT scripts...
🧵"Although transformer-based models have made great success in information retrieval tasks, traditional IR methods such as BM25 can still yield robust performance in real-world web search scenarios. "