Leo Boytsov

@srchvrs

7,896 Followers
1,934 Following
225 Media
24,365 Statuses

Sr. Research Scientist @AWS Labs (PhD @LTIatCMU) working on unnatural language processing, speaking πtorch & C++. Opinions sampled from MY OWN 100T param LM.

Pittsburgh, PA
Joined November 2009
Pinned Tweet
@srchvrs
Leo Boytsov
6 months
🧵📢Attention folks working on LONG-document ranking & retrieval! We found evidence of a PROFOUND issue in existing long-document collections, most importantly MS MARCO Documents. It can potentially affect all papers comparing different architectures for long document ranking.⏩
4
14
128
@srchvrs
Leo Boytsov
2 years
"We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human perf. was 94%. The LARGEST models were generally the LEAST truthful. Models generated many false answers that mimic popular misconceptions
17
107
495
@srchvrs
Leo Boytsov
3 years
I am excited 🥳🎉🎈to announce that our @NeurIPSConf paper was rejected. I am not sad or distressed. I thank the reviewers and I agree we should do a more thorough evaluation in our future submissions.
8
6
458
@srchvrs
Leo Boytsov
8 months
Fun fact about Python. You can use the built-in sum function to flatten nested lists: l = [['a', 'b', 'c'], ['1', '2'], ['#']]; sum(l, []) gives ['a', 'b', 'c', '1', '2', '#'] (see the sketch below).
7
14
395
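A minimal sketch of the trick from the tweet above, using the same toy list; the itertools.chain variant is an added alternative that avoids the repeated list copying sum() performs.

```python
import itertools

l = [['a', 'b', 'c'], ['1', '2'], ['#']]

# sum() starts from [] and concatenates the sublists one by one (flattens one level only)
print(sum(l, []))  # ['a', 'b', 'c', '1', '2', '#']

# itertools.chain.from_iterable gives the same result without the quadratic copying
print(list(itertools.chain.from_iterable(l)))  # ['a', 'b', 'c', '1', '2', '#']
```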
@srchvrs
Leo Boytsov
7 months
Everything you wanted to know about activation functions, but were afraid to ask! Turns out there have been at least 4⃣0⃣0⃣ activation functions proposed over 3⃣ decades, and there is a paper that reviews all of them!
10
84
345
@srchvrs
Leo Boytsov
1 year
" It’s easy to make something cool with LLMs, but very hard to make something production-ready with them."
9
54
328
@srchvrs
Leo Boytsov
2 years
🧵 A longish thread on big models, ChatGPT, and emergence of non-trivial intelligence from big data & compute. 1st, please, do not misinterpret some of my tweets, @OpenAI made another amazing breakthrough and pushed the boundaries of conversational AI.
3
40
300
@srchvrs
Leo Boytsov
7 years
"Bring sheep indoors, and they’re labeled as cats. Pick up a sheep (or a goat) in your arms, and they’re labeled as dogs. Paint them orange, and they become flowers. And if goats climb trees, they become birds. "
8
163
292
@srchvrs
Leo Boytsov
4 years
One of the most comprehensive books on artificial intelligence, by Russell and Norvig, has a new edition! For so many years this book has been updated to cover key topics in AI!
3
79
275
@srchvrs
Leo Boytsov
2 years
"Google is somewhat isolated within ML community because of its lack of use of @PyTorch and GPUs in favor of its own software stack and hardware. In typical Google fashion they even have a 2nd framework called Jax that competes directly with TensorFlow."
6
27
274
@srchvrs
Leo Boytsov
2 years
After 2.75 great years with @Bosch_AI I am excited to join @AWS AI labs as a senior research scientist.
23
2
255
@srchvrs
Leo Boytsov
2 years
🧵Attention, IR community! The era of cheap UNSUPERVISED domain adaptation has begun! Let me introduce the InPars-Light training recipe enabling a small MiniLM model (with 30M parameters) to consistently outperform BM25 on all datasets used in the original InPars study.
6
40
247
@srchvrs
Leo Boytsov
1 year
🧵Big announcements from @awscloud 🎉🎈🍾. First, the coding assistant (that our team worked on) has become generally available. It is *FREE* for individual use. I have been personally using it recently and find it quite useful (typing effort reduced!).
9
45
232
@srchvrs
Leo Boytsov
2 years
What I find interesting in the GPT-4 paper is that the paper has to be cited by using OpenAI as a single author. Basically @OpenAI denies you individual authorship if you work for them. This reminded me of Cirque du Soleil, where all stars are anonymous and interchangeable.
24
7
210
@srchvrs
Leo Boytsov
3 years
Transformer architecture reality check. Most modifications do not lead to improved performance, unless the # of parameters is increased dramatically. One exception is decoupling input and output embedding parameters. h/t @seb_ruder
3
44
217
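A minimal PyTorch sketch of the one exception mentioned above, decoupled vs. tied input/output embeddings; the toy sizes and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 1000, 64  # toy sizes for illustration

embed = nn.Embedding(vocab_size, hidden_dim)             # input embedding table
lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)  # output projection

# Tied (coupled) variant: input and output share one weight matrix.
# lm_head.weight = embed.weight
# Decoupled variant: skip the tying line above, so the output projection
# keeps its own independently trained parameters.

tokens = torch.randint(0, vocab_size, (2, 5))
hidden = embed(tokens)    # (2, 5, hidden_dim); a real model would run Transformer layers here
logits = lm_head(hidden)  # (2, 5, vocab_size)
print(logits.shape)
```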
@srchvrs
Leo Boytsov
4 years
Despite having little visibility 3 years ago, @Bosch_AI published >100 papers in top-tier venues in 2019-2020, including 32 papers at NeurIPS, ICML and ICLR in 2020.
3
8
213
@srchvrs
Leo Boytsov
3 years
👇When u mention beam search, please, don't cite Sutskever or any other modern author (unless it's some specific variant). The term beam-search was possibly coined by the Carnegie Mellon professor Raj Reddy (who received the Turing award for his contributions to AI).
2
17
203
@srchvrs
Leo Boytsov
2 years
How does @Stanford @stanfordnlp release a fine-tuned LLAMA model when the original LLAMA model is not freely available (unless you use a torrent 😂)? This should raise some interesting legal questions, or am I missing something?
12
8
184
@srchvrs
Leo Boytsov
2 years
🧵 This is an awesome paper with a very clever and effective idea about how one should implement a global (rather than local) convolution, but I think there are nuances that I explain in the thread. First, the authors build upon the success of the S4 model.
@ylecun
Yann LeCun
2 years
A new flavor of ConvNet crushes various flavors of transformers (as well as state-space models) for sequence modeling with long-range dependencies.
16
117
915
2
12
158
@srchvrs
Leo Boytsov
5 years
They iteratively train several models. The largest is trained for 3.5 days on a v3-2048 TPU pod. It seems that one hour of this pod is ~$2.5K. Thus, we get an insane price tag of ~$200K. The full experiment might cost you about half a MILLION USD if u want to reproduce it.
@quocleix
Quoc Le
5 years
Example predictions on robustness benchmarks ImageNet-A, C and P. Black texts are correct predictions made by our model and red texts are incorrect predictions by our baseline model.
Tweet media one
3
4
39
4
32
159
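The arithmetic behind the ~$200K figure above, written out with only the numbers given in the tweet (3.5 days of training, ~$2.5K per pod-hour); the half-million estimate for the full experiment is the author's, not recomputed here.

```python
days = 3.5                # training time of the largest model (from the tweet)
pod_hourly_cost = 2500    # ~$2.5K per hour for the TPU pod (from the tweet)

hours = days * 24                          # 84 hours
single_run_cost = hours * pod_hourly_cost
print(single_run_cost)                     # 210000.0, i.e. the "insane price tag of ~$200K"
```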
@srchvrs
Leo Boytsov
2 years
👇Never, ever use a boolean type in Python argparse unless you use action='store_true'. Inspired by four hours of hunting a bug in a famous library. Imagine you have a program
7
11
147
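A minimal reproduction of the pitfall, assuming a made-up --use-cache flag (the "famous library" and its real flag are not named in the tweet): type=bool just calls bool() on the string, and any non-empty string, including "False", is truthy.

```python
import argparse

parser = argparse.ArgumentParser()
# BUGGY: bool("False") == True, so --use-cache False still enables the cache
parser.add_argument("--use-cache", type=bool, default=False)
print(parser.parse_args(["--use-cache", "False"]).use_cache)  # True (!)

# SAFE: a flag that is False unless explicitly passed
parser = argparse.ArgumentParser()
parser.add_argument("--use-cache", action="store_true")
print(parser.parse_args([]).use_cache)               # False
print(parser.parse_args(["--use-cache"]).use_cache)  # True
```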
@srchvrs
Leo Boytsov
2 years
🧵 @OpenAI open-sourced their tokenizer's code, which is claimed to be ridiculously faster than @huggingface tokenizers. Does anybody have HF benchmarks? Here the page claims a speed of 50 MB/sec, but the # of threads is unknown.
4
15
143
@srchvrs
Leo Boytsov
2 years
🧵Some back-of-the-envelope calculations on the feasibility of using an LLM as an addition to search. Let's assume that a somewhat smallish ~100GB-parameter model fits on a p4d.24xlarge with 8 GPUs and 320 GB of GPU memory. A 3-year reserved price is $12/hour.
11
9
132
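A worked version of the back-of-the-envelope setup above; only the $12/hour reserved price comes from the tweet, and the throughput figure below is an arbitrary placeholder just to show how a per-query cost would fall out.

```python
hourly_cost = 12.0        # 3-year reserved p4d.24xlarge price from the tweet, $/hour
monthly_cost = hourly_cost * 24 * 30
print(monthly_cost)       # 8640.0 dollars per month for one 8-GPU instance

# Hypothetical throughput, NOT from the tweet: assume the instance sustains 5 queries/sec
queries_per_hour = 5 * 3600
cost_per_1k_queries = hourly_cost / queries_per_hour * 1000
print(round(cost_per_1k_queries, 4))  # ~0.6667 dollars per 1K queries under that assumption
```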
@srchvrs
Leo Boytsov
2 years
"We present gen. Parameter-Efficient Finetuning framework for tuning LLM with only 𝟬.𝟭%-𝟬.𝟮% of parameters using mix of adaptation modules -> achieve new 𝗦𝗢𝗧𝗔 > standard tuning on both NLU & NLG tasks. Paper: Code&Models:
1
23
134
@srchvrs
Leo Boytsov
2 years
@rasbt "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."
5
4
124
@srchvrs
Leo Boytsov
1 year
Now that FlashAttention is flashing and we can now have very long input windows, is anybody training a replacement for BERT? Or do we first need to wait till FlashAttention is fully integrated into @huggingface ?
13
8
119
@srchvrs
Leo Boytsov
2 years
Some interesting/notable things that Huang et al did here: 1. A somewhat non-standard Transformer variant (which is claimed to be more stable in training). 2. Both text and images are embedded, however, only text tokens are predicted! 3. Training includes both next-token
@omarsar0
elvis
2 years
Here we go! Microsoft introduces a multimodal large language model called Kosmos-1. Achieves great performance on language understanding, OCR-free NLP, perception-language tasks, visual QA, and more.
Tweet media one
30
493
2K
2
16
118
@srchvrs
Leo Boytsov
5 years
I am extremely excited to join the Artificial Intelligence lab @Bosch_AI and @zicokolter team! I am looking forward to new challenges and opportunities! 2 years with M*Modal (also as a part of @3MHISNews ) were good and productive, but it's time for new adventures!
8
3
115
@srchvrs
Leo Boytsov
4 years
A very interesting project, which allows running many traditional ML models on the GPU!
@ml_review
ML Review 🇺🇦
4 years
Hummingbird – compiles trained Scikit-Learn, LightGBM, XGBoost models into PyTorch for faster inference By @scnakandala @MatteInter
Tweet media one
4
87
367
3
26
113
@srchvrs
Leo Boytsov
4 years
👇I have 2 internship positions in 2021: CV & NLP. @Bosch_AI is an up-and-coming research lab. Bosch is an equal opportunity employer and we welcome candidates from underrepresented groups. More details are in the thread, please, RT. 🧵
2
48
113
@srchvrs
Leo Boytsov
1 year
"Head-to-Tail: How Knowledgeable are LLMs? Will LLMs Replace Knowledge Graphs? We show that existing LLMs are still *FAR* from being perfect in terms of their grasp of factual knowledge, especially for facts of torso-to-tail entities."
3
18
104
@srchvrs
Leo Boytsov
3 years
👇I am 😊 to announce that our retrieval toolkit FlexNeuART (used to produce strong MS MARCO & TREC runs) is now available as a PyPI package (Linux/MacOS). It's a fully-fledged modular retrieval framework that has a number of SOTA neural models and traditional models.
2
16
103
@srchvrs
Leo Boytsov
4 years
👇Mad-X: a beautiful idea for effective multi-lingual transfer. 1. Start from multiling BERT. It works ok for many languages, but you can still get better results from unsupervised (LM-mask) fine-tuning for a target language. Downside: we end up with multiple per-language models. Solution?
2
18
102
@srchvrs
Leo Boytsov
1 year
I see people are being stressed out about things changing too quickly recently. I had somewhat similar feelings, but I also have a feeling of relief that largely outweighs anxiety. Imagine all that old NLP/ML stuff you had to learn before to do well: CRFs and their training
8
8
98
@srchvrs
Leo Boytsov
1 year
"At Netflix, personalization plays a key role in several aspects of our user experience, from ranking titles to constructing an optimal Homepage. " No BERT: good old DNNs is all your need.
3
13
96
@srchvrs
Leo Boytsov
4 years
👇Today is an efficient Transformer day. Previously, I tweeted about a 30x speed-up on a 32-core CPU: There is a paper getting 233x on 80 CPU cores, or roughly a 93x speed-up scaled to 32 cores (this is a very rough comparison).
@srchvrs
Leo Boytsov
4 years
👇Running BERT in PyTorch on CPU efficiently! 1. Can't use GPU, b/c low-latency requests prevent batching. 2. torch.set_num_threads(1) or else each PyTorch instance tries to use most CPU cores 3. Use DistilBERT (2x speed-up) 4. No padding (batch size==1)! 5. 8-bit quantization
1
9
43
5
12
90
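A sketch combining points 2-5 from the quoted tweet, using the Hugging Face transformers DistilBERT classes and PyTorch dynamic quantization; treat it as an illustration of the recipe rather than the author's exact production setup.

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel

torch.set_num_threads(1)  # point 2: one thread per process, so instances don't fight over cores

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()  # point 3: DistilBERT

# point 5: 8-bit dynamic quantization of the linear layers
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# points 1 & 4: batch size of one, so no padding is needed
inputs = tokenizer("a single low-latency request", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```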
@srchvrs
Leo Boytsov
2 years
Betting on a dead horse. Moore's law is effectively dead, and it's not clear when/if it can be resuscitated. Even using a conservative estimate (favorable for tech), we are still SIX orders of magnitude behind the brain.
@ethanCaballero
Ethan Caballero is busy
2 years
. @RichardSSutton estimates 50% probability of Human-Level AI by 2040:
Tweet media one
40
80
620
19
11
84
@srchvrs
Leo Boytsov
6 years
Jupyter notebooks may be good for prototyping, but eventually one needs proper modular code without weird memory effects. Using them for production-grade work seems to be poor engineering.
5
9
85
@srchvrs
Leo Boytsov
7 years
[1/3] More insights from people doing real ML: it's VERY hard for us to convince new grads that deep learning isn't the first thing to try.
3
13
79
@srchvrs
Leo Boytsov
3 years
I was accused that my software is merely a research toy. There's nothing better than a good toy. My condolences, if your inner child has reached the age of senility.
3
3
79
@srchvrs
Leo Boytsov
1 year
GPT-4all with 20K stars in a very short period is a crazily successful project. In contrast, @huggingface has ~90K. GPT-4all: Demo, data, and code to train an assistant-style large language model with ~800k GPT-3.5-Turbo Generations based on LLaMa
@andriy_mulyar
AndriyMulyar
1 year
AI winter confirmed
Tweet media one
4
5
76
4
17
74
@srchvrs
Leo Boytsov
4 years
Please don't use the word pioneering with respect to your own work.
11
2
79
@srchvrs
Leo Boytsov
2 years
In summary, I am convinced there is clearly an emergence of behaviors from apparently "dumb" training procedures, which seem to contradict (to some degree) the poverty of stimulus hypothesis. As a scientist, I am looking forward to learning more about these phenomena.
5
6
74
@srchvrs
Leo Boytsov
4 years
👇Recently we had a discussion about different approaches to apply layer normalization in Transformers. Unlike the original postnorm approach, prenorm has a skip connection and was found to train in a more stable way. There's a recent update with residual attention Transformers.
@_akhaliq
AK
4 years
Informer: Transformer Likes Informed Attention pdf: abs:
Tweet media one
0
35
214
1
17
77
@srchvrs
Leo Boytsov
2 years
👇 mBERT-like models can have great cross-lingual 0-shot transfer abilities. Just train it on English => apply to another lang. Cao, Kitaev, & Klein showed that mBERT can be improved (for NLI) via cross-lingual adjustment with a small parallel corpus. We find this may not always work.
4
14
75
@srchvrs
Leo Boytsov
5 years
Bing uses a 3-layer BERT-like transformer for every query. In addition to model simplification and possibly distillation, they implement more efficient GPU code.
@julien_c
Julien Chaumond
5 years
“With these GPU optimizations, we were able to use 2000+ Azure GPU Virtual Machines across four regions to serve over 1 million BERT inferences per second worldwide” Bing, using (distilled, 3-layer) BERT in production. via @rangthan
4
78
384
1
21
71
@srchvrs
Leo Boytsov
10 years
Non-Metric Space Library: A cross platform similarity search library & toolkit to evaluate similarity search methods. http://t.co/PCGcHNjaRz
2
20
69
@srchvrs
Leo Boytsov
2 years
🧵A lot of multi-modal models were announced recently. A little summary, b/c they all seem to be similar in key aspects: 1. combining language & vision embeddings using (mostly) pre-trained backbones. 2. training composite Transformer model using next token prediction.
3
8
71
@srchvrs
Leo Boytsov
2 years
Uncomfortable truth: working with (only) corporate APIs that you don't control isn't reproducible research, and it has other issues too. In particular, if your strongest technical skill is writing API calls, how much value does it create for workplaces where heavy lifting is still
6
6
69
@srchvrs
Leo Boytsov
3 years
👇Turns out that one can tune hyperparameters using a smaller version of the model (which is cheaper compared to the full-size model) and transfer these params to the full-size model => boost in accuracy.
@arankomatsuzaki
Aran Komatsuzaki
3 years
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer By transferring from 40M parameters, µTransfer outperforms the 6.7B GPT-3, with tuning cost only 7% of total pretraining cost. abs: repo:
Tweet media one
5
43
247
2
4
71
@srchvrs
Leo Boytsov
1 year
Also another point to make. Let's face it, to a large degree reviewing is non-scientific. A bunch of overworked people, many of whom haven't had any hands-on experience for a while, make hasty decisions by skimming a bunch of papers (on topics they do NOT care about). What
@RichardSocher
Richard Socher
1 year
I think one of the problems of academic publishing/peer reviews is that there is zero downside of rejecting good ideas. The reviewers of the DecaNLP paper - which introduced prompt engineering for multitask NLP with large nnets - were so wrong & held back the field for years.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
26
99
665
8
7
68
@srchvrs
Leo Boytsov
9 months
"Google engineers now resolve 7.5% of all code review comments with an ML-suggested edit. "
@jacobaustin132
Jacob Austin
9 months
We've finally put out a detailed IEEE/ACM paper on @Google 's multi-year effort to ease the burden of code review with ML. Google engineers now resolve 7.5% of all code review comments with an ML-suggested edit. But the path to that number has been a fun ML and UX journey!
Tweet media one
14
143
766
4
3
71
@srchvrs
Leo Boytsov
4 years
Best ACL2020 paper: "While traditional benchmarks indicate that models on these tasks are as accurate as humans, CheckList reveals a variety of severe bugs, where commercial and research models do not effectively handle basic linguistic phenomena"
0
10
69
@srchvrs
Leo Boytsov
5 years
This is freaking insane [thread]: "we utilized GPipe [4] to train 128-layer Transformers with over 6 billion parameters. Increasing the model capacity resulted in significantly improved performance across all languages by an average of 5 BLEU points."
1
15
68
@srchvrs
Leo Boytsov
2 years
@vboykis I think one secret of productivity is not to waste one's time on reading 300 page productivity blog posts.
0
1
66
@srchvrs
Leo Boytsov
6 months
This is probably the first encoder-decoder model in the recent model race.
@RekaAILabs
Reka
6 months
Along with Core, we have published a technical report detailing the training, architecture, data, and evaluation for the Reka models.
Tweet media one
Tweet media two
2
63
368
8
5
68
@srchvrs
Leo Boytsov
3 years
A reminder for people who want to use Python multiprocessing with CUDA and other tricky-to-use resources (e.g., Java libraries). It's very likely you need to call multiprocessing.set_start_method('spawn')
3
5
64
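A minimal sketch of that advice; the worker here is a dummy that just reports its device rather than a real CUDA or Java workload.

```python
import multiprocessing as mp

import torch


def worker(idx):
    # Each spawned process initializes CUDA on its own; with the default 'fork'
    # start method, a CUDA context inherited from the parent tends to break.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"worker {idx} on {device}"


if __name__ == "__main__":
    mp.set_start_method("spawn")  # must be called before any pool/process is created
    with mp.Pool(2) as pool:
        print(pool.map(worker, range(2)))
```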
@srchvrs
Leo Boytsov
1 year
Am I the only one who gets very upset when hearing the term semantic search?
24
1
65
@srchvrs
Leo Boytsov
5 years
👇 "End-to-End learning sounds like a good idea on paper, but for most deployment scenarios, pipelined architectures that are piecewise optimized will continue to stay. "
@deliprao
Delip Rao e/σ
5 years
New post: The Twelve Truths of #MachineLearning for the Real World
6
86
345
1
10
65
@srchvrs
Leo Boytsov
3 years
Google GED: a very interesting QA explainability benchmark. I can't help noticing, though, that the authors postpone the issue of faithfulness to future work, but the problem is likely unsolvable for complete black boxes.
0
12
63
@srchvrs
Leo Boytsov
7 years
From Awni Hannun (the first author of Baidu's first end-to-end speech recognition paper): Speech recognition isn't solved: "The recent improvements on conversational speech are astounding. But, the claims about human-level performance are too broad."
1
28
61
@srchvrs
Leo Boytsov
2 years
"In all honesty, I suspect a lot of researchers see papers as “CV enhancers” more than scientific contributions. "
@EhudReiter
Ehud Reiter
2 years
New blog: Could some NLP research be fraudulent?
9
8
67
2
4
62
@srchvrs
Leo Boytsov
1 year
🧵This is a nice adversarial attack on LLMs, where redefining built-in functions (len becomes print and vice versa) confuses the model. It shows the model does not fully grasp semantics and heavily relies on reasoning shortcuts.
@AVMiceliBarone
Antonio Valerio Miceli Barone
1 year
New paper: LLMs are good at programming tasks, and they are now being widely used to assist code generation. But do they actually understand the semantics of programming languages, or do they rely on superficial, "shortcut" correlations? Let's find out! 1/
Tweet media one
34
241
1K
3
7
60
@srchvrs
Leo Boytsov
5 years
👇Great talk: 1. You need better data, not just more data 2. Increasing data has diminishing returns 3. Feature engineering is still important 4. Start with simpler models 5. Don't just increase complexity of models: add more features 6. Be aware of biases in the data
1
14
59
@srchvrs
Leo Boytsov
9 months
Nice work: Mixtral MoE on small-memory GPUs via offloading weights to RAM. It has 2 ingredients: 1. An LRU (least-recently-used) cache. 2. Speculative loading: estimate the next expert by applying the current layer's gating func. to the previous layer's hidden state H. 2-3x speed-up compared to naive offload.
1
12
58
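An illustrative sketch of just the LRU-cache ingredient, with an OrderedDict standing in for real expert weights and a dummy loader; this is not the actual Mixtral offloading code.

```python
from collections import OrderedDict

import torch


class ExpertLRUCache:
    """Keep the most recently used experts resident, evict the rest back to CPU RAM."""

    def __init__(self, load_fn, capacity=4):
        self.load_fn = load_fn    # stand-in: loads an expert's weights from CPU RAM
        self.capacity = capacity  # how many experts fit in GPU memory
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)        # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)       # evict the least recently used expert
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]


# Toy usage: "experts" are just random tensors here.
cache = ExpertLRUCache(load_fn=lambda i: torch.randn(8, 8), capacity=2)
for expert_id in [0, 1, 0, 2, 3]:
    cache.get(expert_id)
print(list(cache.cache.keys()))  # only the two most recently used experts remain cached
```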
@srchvrs
Leo Boytsov
5 years
👇What I learned recently about Python, CUDA, and/or @spacy_io multi-processing. 1. Python's lack of multi-threading is super-annoying, but there're crutches that can still be useful. 2. You need to call multiprocessing.set_start_method('spawn') or your CUDA is gonna be screwed up
3
11
60
@srchvrs
Leo Boytsov
6 years
Evaluating ELMO embeddings: Meaningful gains on many tasks. "We also show, however, that we are still far away from a universal encoder that can perform consistently across several downstream tasks." by Perone et al 2018 …
2
16
60
@srchvrs
Leo Boytsov
4 years
An interesting #Neurips2020 tutorial on model interpretations, which covers main approaches in three modalities: text, image, and structured data!
0
24
59
@srchvrs
Leo Boytsov
1 year
Short summary of a discussion (but it's worth reading all of it). LLAMA team (as well as many other teams) evaluates a model withOUT actually revealing prompts (at least this is my understanding). Another team tests LLAMA with a different set of prompts and gets noticeably worse
@GuillaumeLample
Guillaume Lample @ ICLR 2024
1 year
@BlancheMinerva @sam_havens @giffmana @DohmannJeremy @vitaliychiley @MosaicML @HugoTouvron @LavrilThibaut We spent a lot of time investigating what others papers did (Chinchilla, GPT-3, PaLM) but very few of them actually provide any prompt so we just implemented what made sense to us. And we did not observe 20% differences by adding or removing a space in the prompt.
3
1
18
7
6
60
@srchvrs
Leo Boytsov
3 years
👇BERT does *NOT* rediscover the classic NLP pipeline."We verify via non-parametric probes that the permutations do in fact make the model worse at syntax-dependent tasks."
@arankomatsuzaki
Aran Komatsuzaki
3 years
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little MLMs pretrained on sentences with randomly shuffled word order surprisingly achieves high accuracy after fine-tuning on many downstream tasks.
Tweet media one
10
46
298
2
5
58
@srchvrs
Leo Boytsov
2 years
🧵A well-known company released a multi-threaded eval script. It starts 16 threads, but 15 of them sit at 2% CPU usage. A big reminder: do *NOT* waste your time trying to harness ThreadPoolExecutor (or any Python "thread" API). It's a pointless exercise b/c the GIL prevents CPython threads from running Python code in parallel.
7
5
56
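A small demonstration of the point: for a CPU-bound pure-Python function, ThreadPoolExecutor gives essentially no speedup because of the GIL, while ProcessPoolExecutor does; the busy-loop workload is artificial.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n):
    # Pure-Python busy work: holds the GIL the entire time
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(executor_cls, workers=4, n=2_000_000):
    start = time.time()
    with executor_cls(max_workers=workers) as ex:
        list(ex.map(cpu_bound, [n] * workers))
    return time.time() - start


if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))   # roughly serial: the GIL serializes the work
    print("processes:", timed(ProcessPoolExecutor))  # real parallelism across CPython processes
```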
@srchvrs
Leo Boytsov
4 years
👇"A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets ... "
@nasrinmmm
Nasrin Mostafazadeh
4 years
GPT-3 with 175B params,10x any prior LM,is out.Setting aside controversies around implications of training such ever-larger models, the fact that they show high zero-shot performance on challenging reasoning tasks is pretty interesting. Ppl often ask me how I feel about all this:
Tweet media one
Tweet media two
4
61
265
2
6
57
@srchvrs
Leo Boytsov
3 years
@huggingface NLP datasets have won the best demo paper award.
@xwang_lk
Xin Eric Wang
3 years
EMNLP 2021 Best Paper Awards are out:
0
7
41
0
8
56
@srchvrs
Leo Boytsov
4 years
👇"By increasing batch size to the memory limit of TPUv3 Pod, BERT training time can be reduced from 3 days to 76 minutes". Despite this little pod has 1024 GPUs, it makes training possible for many teams. V3 GPU is $8/hour. So, the overall training cost $8K.
1
4
56
@srchvrs
Leo Boytsov
1 year
"Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT" *CAVEAT* only one small (should I say tiny?) collection is used.
5
8
53
@srchvrs
Leo Boytsov
3 years
Product search at Amazon via DSSM and (very cool) extreme classification: "This indicates another unique difference between semantic product search problem and the conventional web retrieval problem where BM25 variants are often a strong baseline."
2
3
55
@srchvrs
Leo Boytsov
5 years
Indexing a trillion vectors with FAISS. This is seriously badass:
0
20
54
@srchvrs
Leo Boytsov
2 years
It looks like SOTA in Transformer model efficiency (if you don't use any sparsity / routing / mixture of experts) is a combination of: 1. distillation 2. quantization 3. static or adaptive early exiting / reduced comp. in higher layers
2
9
53
@srchvrs
Leo Boytsov
1 year
A study finds that ChatGPT is seemingly not at the level of good human annotators as of now. A couple of caveats here: 1. There was no extensive prompt hacking. 2. Performance is measured against existing human annotations. How noisy and accurate they are is not rechecked.
@omarsar0
elvis
1 year
Can ChatGPT Reproduce Human-Generated Labels? Nice study on the applicability of ChatGPT for data annotation tasks. Results show that ChatGPT does have the potential to handle different types of data annotation tasks like stance detection, sentiment analysis, and bot detection.
Tweet media one
3
24
142
8
2
50
@srchvrs
Leo Boytsov
6 years
Different interpretation of the same results: @jacobeisenstein HMM beats LSTM on small data @sleepinyourhat wow, LSTM beats HMM already with 500 sentences. Proof links:
@jacobeisenstein
Jacob Eisenstein
6 years
At #NAACL2018 : @barbara_plank on how the HMM TnT tagger from 1998 outperforms biLSTM on small data
Tweet media one
2
12
61
3
10
51
@srchvrs
Leo Boytsov
2 years
10% of images in ImageNet1K are mislabeled, yet ImageNet SOTA is 90% accuracy.
@davisblalock
Davis Blalock
2 years
Paper: If you like this paper, consider RTing this (or another!) thread to publicize the authors' work, or following the authors: @SashaMTL … [7/8]
1
2
16
6
4
50
@srchvrs
Leo Boytsov
6 months
🧵 A nice LLAMA-3 summary by @natolambert . In short, performance is better, but largely due to the sheer increase in the pre-training scale. Training data is a secret. ↩️
2
4
52
@srchvrs
Leo Boytsov
2 years
It looks like the stress is real. In my experience: 1. paradigms shift 2. numbers get higher 3. Yet, unsolved problems remain in big numbers. Moreover, it looks like new discoveries create new opportunities. It is a bit hard when you are locked in with some topic that you
@mayhewsw
Stephen Mayhew
2 years
when everyone in NLP is having an existential crisis because of GPT-4, but you lived through ELMo/BERT in 2019:
Tweet media one
18
99
1K
5
8
50
@srchvrs
Leo Boytsov
4 years
"Starting from a public multilingual BERT checkpoint, our final model is 6x smaller and 27x faster, and has higher accuracy than a SOTA multilingual baseline." SOTA is LSTM.
0
5
50
@srchvrs
Leo Boytsov
2 years
@sarahookr At the same time, it is not quite clear what the value of a PhD is for most people. The competition to get in heats up, but then what? Becoming a junior ad engineer without any research.
0
0
50
@srchvrs
Leo Boytsov
4 years
A beautiful idea: inject very small networks, e.g., between layers. During fine-tuning, the main BERT "body" remains unmodified. Thus, effective fine-tuning requires only a very small modification of the main network, and these modifications can be engineered explicitly!
@AdapterHub
AdapterHub
4 years
Adapters are a small set of new weights introduced at every layer of pre-trained transformer models (e.g. BERT, RoBERTa). Only the new Adapter weights (within the pink box) are trained while keeping the original transformer parameters frozen/fixed.
Tweet media one
1
6
21
1
6
49
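A hedged sketch of the adapter idea from the quoted tweet: a small bottleneck MLP with a residual connection, trained while the surrounding pre-trained weights stay frozen; dimensions are arbitrary and this is not the AdapterHub implementation.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Small bottleneck network added inside a frozen pre-trained layer."""

    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection: the adapter only learns a small correction
        return hidden_states + self.up(self.act(self.down(hidden_states)))


# During fine-tuning only the adapter parameters are trainable;
# the original transformer weights stay frozen/fixed.
frozen_layer = nn.Linear(768, 768)  # stand-in for a pre-trained sublayer
for p in frozen_layer.parameters():
    p.requires_grad = False

adapter = Adapter()
x = torch.randn(2, 10, 768)
out = adapter(frozen_layer(x))
print(out.shape)  # torch.Size([2, 10, 768])
```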
@srchvrs
Leo Boytsov
5 years
The Hitchhiker's Guide to the Galaxy is an early machine learning book! The computer spits out a number (42), but it refuses to provide an explanation!
@mark_riedl
Mark Riedl
5 years
Hinton has the hackles raised on some of my team. One of them produced this gem of wisdom:
Tweet media one
5
10
91
4
8
47
@srchvrs
Leo Boytsov
6 years
My friend trained an extremely mean regression model. All it predicts is a constant.
3
6
47
@srchvrs
Leo Boytsov
4 years
The primary author of the LSTM paper Sepp Hochreiter gets a well-deserved (and LONG OVERDUE) neural networks pioneering award!
@SchmidhuberAI
Jürgen Schmidhuber
4 years
Congrats to the awesome Sepp Hochreiter for the well-deserved 2021 IEEE Neural Networks Pioneer Award! It was my great honor to be Sepp's nominator.
Tweet media one
10
65
747
1
3
48
@srchvrs
Leo Boytsov
3 years
Wasn't that "feature engineering" more like "add a bunch of n-grams", literally millions of them, and let L1 regularization figure it out?
@TaliaRinger That statistical NLP work was still closely coupled with understanding the shape of the problem being solved, specifically in feature engineering. Then (2010s) we got the next "invasion" from ML land (deep learning) where the idea was the computer would learn the features! 2/
1
4
30
5
2
46
@srchvrs
Leo Boytsov
4 years
👇The largest-ever IR collection of news items was released recently! What is totally awesome is the size of the collection: 42M news articles & 10K queries. What's a bit disheartening is that the queries were created using the SQuAD approach, i.e., workers were asked to "create a query".
2
9
46
@srchvrs
Leo Boytsov
2 years
@deliprao @thegautamkamath You need to focus on key values.
1
0
46
@srchvrs
Leo Boytsov
4 years
"95% of the original perf with only ~5% remaining weights in the encoder!" is a great result by a team of @huggingface researchers
0
7
45
@srchvrs
Leo Boytsov
5 years
[thread] A multilingual BERT trained with the shared wordpiece inventory permits zero-shot transfer of various models across languages. Quite crazily it works for POS and NER tasks! Even more crazily it works for pairs of languages with DIFFERENT scripts...
2
12
45
@srchvrs
Leo Boytsov
2 years
🧵"Although transformer-based models have made great success in information retrieval tasks, traditional IR methods such as BM25 can still yield robust performance in real-world web search scenarios. "
2
2
44