Leo Boytsov @srchvrs Twitter profile | Pikagi

Leo Boytsov

@srchvrs

7,896

Followers

1,934

Following

225

Media

24,365

Statuses

Sr. Research Scientist @AWS Labs (ph-D @LTIatCMU ) working on unnatural language processing, speaking πtorch & C++. Opinions sampled from MY OWN 100T param LM.

Pittsburgh, PA

https://t.co/Do0zhhKvZi

Joined November 2009

Pinned Tweet

@srchvrs

Leo Boytsov

6 months

🧵📢Attention folks working on LONG-document ranking & retrieval! We found evidence of a PROFOUND issue in existing long-document collections, most importantly MS MARCO Documents. It can potentially affect all papers comparing different architectures for long document ranking.⏩

4

14

128

@srchvrs

Leo Boytsov

2 years

"We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human perf. was 94%. The LARGEST models were generally the LEAST truthful. Models generated many false answers that mimic popular misconceptions

TruthfulQA: Measuring How Models Mimic Human Falsehoods

We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law,...

17

107

495

@srchvrs

Leo Boytsov

3 years

I am excited 🥳🎉🎈to announce that our @NeurIPSConf paper was rejected. I am not sad or distressed. I thank the reviewers and I agree we should do a more thorough evaluation in our future submissions.

8

6

458

@srchvrs

Leo Boytsov

8 months

Fun fact about Python. You can use a sum function to flatten nested lists. l=[['a', 'b', 'c'], ['1', '2'], ['#']] sum(l, []) Result: ['a', 'b', 'c', '1', '2', '#']

7

14

395
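
The `sum(l, [])` trick in the tweet above works because `sum` starts from the empty list and applies `+` to each sublist in turn. A minimal sketch follows; the `itertools.chain.from_iterable` variant is not from the tweet and is shown only for comparison, since repeated list concatenation copies the accumulated list at every step and may become slow with many sublists.

```python
from itertools import chain

nested = [['a', 'b', 'c'], ['1', '2'], ['#']]

# sum() starts from [] and concatenates each sublist with +.
flat_sum = sum(nested, [])

# Alternative (an addition, not from the tweet): chain the sublists lazily,
# which avoids building a new intermediate list at every step.
flat_chain = list(chain.from_iterable(nested))

print(flat_sum)
print(flat_chain)
```

Both variants flatten exactly one level of nesting.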

@srchvrs

Leo Boytsov

7 months

Everything you wanted to know about activation functions, but were afraid to ask! Turns out there have been at least 4⃣ 0⃣0⃣ activation functions proposed over 3⃣ decades, and there is a paper that reviews all of them!

Three Decades of Activations: A Comprehensive Survey of 400...

Neural networks have proven to be a highly effective tool for solving complex problems in many areas of life. Recently, their importance and practical usability have further been reinforced with...

10

84

345

@srchvrs

Leo Boytsov

1 year

" It’s easy to make something cool with LLMs, but very hard to make something production-ready with them."

9

54

328

@srchvrs

Leo Boytsov

2 years

🧵 A longish thread on big models, ChatGPT, and emergence of non-trivial intelligence from big data & compute. 1st, please, do not misinterpret some of my tweets, @OpenAI made another amazing breakthrough and pushed the boundaries of conversational AI.

3

40

300

@srchvrs

Leo Boytsov

7 years

"Bring sheep indoors, and they’re labeled as cats. Pick up a sheep (or a goat) in your arms, and they’re labeled as dogs. Paint them orange, and they become flowers. And if goats climb trees, they become birds. "

Do neural nets dream of electric sheep? - AI Weirdness

If you’ve been on the internet today, you’ve probably interacted with a neural network. They’re a type of machine learning algorithm that’s used for everything from language translation to finance...

www.aiweirdness.com

8

163

292

@srchvrs

Leo Boytsov

4 years

One of the most comprehensive books on artificial intelligence, by Russell and Norvig, has a new edition! This book has been updated for so many years to cover key topics in AI!

3

79

275

@srchvrs

Leo Boytsov

2 years

"Google is somewhat isolated within ML community because of its lack of use of @PyTorch and GPUs in favor of its own software stack and hardware. In typical Google fashion they even have a 2nd framework called Jax that competes directly with TensorFlow."

How Nvidia’s CUDA Monopoly In Machine Learning Is Breaking - OpenAI Triton And PyTorch 2.0

Over the last decade, the landscape of machine learning software development has undergone significant changes. Many frameworks have come and gone, but most have relied heavily on leveraging Nvidia's...

www.semianalysis.com

6

27

274

@srchvrs

Leo Boytsov

1 year

"Prompting Is Programming: A Query Language For Large Language Models". 😀

Prompting Is Programming: A Query Language for Large Language Models

Large language models have demonstrated outstanding performance on a wide range of tasks such as question answering and code generation. On a high level, given an input, a language model can be...

5

69

270

@srchvrs

Leo Boytsov

2 years

After 2.75 great years with @Bosch_AI I am excited to join @AWS AI labs as a senior research scientist.

23

2

255

@srchvrs

Leo Boytsov

2 years

🧵Attention the IR community! The era of cheap UNSUPERVISED domain adaptation has begun! Let me introduce the InPars-Light training recipe enabling a small MiniLM model (with 30M parameters) to consistently outperform BM25 on all datasets used in the original InPars study.

6

40

247

@srchvrs

Leo Boytsov

1 year

🧵Big announcements from @awscloud 🎉🎈🍾. First, the coding assistant (that our team worked on) has become generally available. It is *FREE* for individual use. I have been personally using it recently and find it quite useful (typing effort reduced!).

Amazon CodeWhisperer, Free for Individual Use, is Now Generally Available | Amazon Web Services

Today, Amazon CodeWhisperer, a real-time AI coding companion, is generally available and also includes a CodeWhisperer Individual tier that’s free to use for all developers. Originally launched in...

9

45

232

@srchvrs

Leo Boytsov

2 years

What I find interesting in the GPT-4 paper is that the paper has to be cited with OpenAI as a single author. Basically @OpenAI denies you individual authorship if you work for them. This reminded me of Cirque du Soleil, where all stars are anonymous and interchangeable.

24

7

210

@srchvrs

Leo Boytsov

3 years

Transformer architecture reality check. Most modifications do not lead to improved performance, unless the # of parameters is increased dramatically. One exception is decoupling input and output embedding parameters. h/t @seb_ruder

Do Transformer Modifications Transfer Across Implementations and...

The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In...

3

44

217

@srchvrs

Leo Boytsov

4 years

While having little visibility 3 years ago, @Bosch_AI published > 100 papers in top-tier venues in 2019-2020, including 32 papers at NeurIPS, ICML, and ICLR in 2020.

3

8

213

@srchvrs

Leo Boytsov

3 years

👇When u mention beam search, please, don't cite Sutskever or any other modern author (unless it's some specific variant). The term beam-search was possibly coined by the Carnegie Mellon professor Raj Reddy (who received the Turing award for his contributions to AI).

2

17

203

@srchvrs

Leo Boytsov

2 years

How does @Stanford @stanfordnlp release a fine-tuned LLAMA model when the original LLAMA model is not freely available (unless you use Torrent 😂)? This should raise some interesting legal questions, or am I missing something?

12

8

184

@srchvrs

Leo Boytsov

2 years

🧵 This is an awesome paper with a very clever and effective idea about how one should implement a global (rather than local) convolution, but I think there are nuances that I explain in the thread. First, the authors build upon the success of the S4 model.

@ylecun

Yann LeCun

2 years

A new flavor of ConvNet crushes various flavors of transformers (as well as state-space models) for sequence modeling with long-range dependencies.

16

117

915

2

12

158

@srchvrs

Leo Boytsov

5 years

They iteratively train several models. The largest is trained for 3.5 days on a 2048-v3 TPU pod. It seems that one hour of this pod is ~$2.5K. Thus, we get an insane price tag of ~$200K. The full experiment might cost you about half a MILLION USD if u want to reproduce it.

@quocleix

Quoc Le

5 years

Example predictions on robustness benchmarks ImageNet-A, C and P. Black texts are correct predictions made by our model and red texts are incorrect predictions by our baseline model.

3

4

39

4

32

159
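
The price tag in the tweet above follows from two figures it quotes: 3.5 days of training and roughly $2.5K per pod-hour. A small sketch of that arithmetic, taking the tweet's hourly price as a rough estimate and not as an official rate:

```python
# Figures quoted in the tweet; the pod price is the tweet's rough estimate.
TRAINING_DAYS = 3.5
POD_PRICE_PER_HOUR_USD = 2_500

hours = TRAINING_DAYS * 24
largest_model_cost = hours * POD_PRICE_PER_HOUR_USD

print(f"{hours:.0f} pod-hours -> ${largest_model_cost:,.0f}")
```

The result is close to the ~$200K mentioned for the largest model alone. The half-million figure covers the whole iterative experiment, and it cannot be reconstructed from the numbers given here.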

@srchvrs

Leo Boytsov

2 years

👇Never ever use a boolean type in Python argparse unless you use action='store_true'. Inspired by four hours of hunting a bug in a famous library. Imagine you have a program

7

11

147
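
The argparse pitfall from the tweet above can be reproduced in a few lines. This is a minimal sketch, not the program from the original thread; the option names `--use_cache` and `--verbose` are made up for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
# Pitfall: type=bool calls bool() on the raw string, and any non-empty
# string (including "False") is truthy.
parser.add_argument("--use_cache", type=bool, default=False)
# Safer: a flag that is False unless it is present on the command line.
parser.add_argument("--verbose", action="store_true")

args = parser.parse_args(["--use_cache", "False"])
print(args.use_cache)  # the string "False" was converted with bool()
print(args.verbose)
```

Passing `--use_cache False` silently enables the option, which is the kind of bug that can take hours to find.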

@srchvrs

Leo Boytsov

2 years

🧵 @OpenAI open-sourced their tokenizer's code which is claimed to be ridiculously faster than @huggingface tokenizers. Does anybody have HF benchmarks? Here the page claims the speed of 50 MB/sec but # of threads is unknown.

4

15

143

@srchvrs

Leo Boytsov

2 years

🧵Some back-of-the-envelope calculations on the feasibility of using an LLM as an addition to search. Let's assume that a somewhat smallish ~100GB params model fits on a p4d.24xlarge with 8 GPUs and 320 GB of GPU memory. A 3-year reserved price is $12/hour.

11

9

132
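
The starting assumptions of the thread above can be turned into a few derived numbers. The sketch below uses only the figures from the tweet and reads "~100GB params model" as roughly 100 GB of weights, which is an assumption; the rest of the thread (throughput, queries per second) is not reproduced here.

```python
# Figures from the tweet; MODEL_SIZE_GB is an assumed reading of "~100GB params".
NUM_GPUS = 8
TOTAL_GPU_MEMORY_GB = 320
MODEL_SIZE_GB = 100
PRICE_PER_HOUR_USD = 12

memory_per_gpu_gb = TOTAL_GPU_MEMORY_GB / NUM_GPUS
headroom_gb = TOTAL_GPU_MEMORY_GB - MODEL_SIZE_GB  # left for activations, caches
cost_per_day = PRICE_PER_HOUR_USD * 24
cost_per_year = cost_per_day * 365

print(memory_per_gpu_gb, headroom_gb, cost_per_day, cost_per_year)
```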

@srchvrs

Leo Boytsov

2 years

"We present gen. Parameter-Efficient Finetuning framework for tuning LLM with only 𝟬.𝟭%-𝟬.𝟮% of parameters using mix of adaptation modules -> achieve new 𝗦𝗢𝗧𝗔 > standard tuning on both NLU & NLG tasks. Paper: Code&Models:

1

23

134

@srchvrs

Leo Boytsov

2 years

@rasbt "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."

5

4

124

@srchvrs

Leo Boytsov

1 year

Now that FlashAttention is flashing and we can have very long input windows, is anybody training a replacement for BERT? Or do we first need to wait till FlashAttention is fully integrated into @huggingface ?

13

8

119

@srchvrs

Leo Boytsov

2 years

Some interesting/notable things that Huang et al did here: 1. A somewhat non-standard Transformer variant (which is claimed to be more stable in training). 2. Both text and images are embedded, however, only text tokens are predicted! 3. Training includes both next-token

@omarsar0

elvis

2 years

Here we go! Microsoft introduces a multimodal large language model called Kosmos-1. Achieves great performance on language understanding, OCR-free NLP, perception-language tasks, visual QA, and more.

30

493

2K

2

16

118

@srchvrs

Leo Boytsov

5 years

I am extremely excited to join the Artificial Intelligence lab @Bosch_AI and @zicokolter team! I am looking forward to new challenges and opportunities! 2 years with M*Modal (also as a part of @3MHISNews ) were good and productive, but it's time for new adventures!

8

3

115

@srchvrs

Leo Boytsov

4 years

A very interesting project, which allows running many traditional ML models on the GPU!

@ml_review

ML Review 🇺🇦

4 years

Hummingbird – compiles trained Scikit-Learn, LightGBM, XGBoost models into PyTorch for faster inference By @scnakandala @MatteInter

4

87

367

3

26

113

@srchvrs

Leo Boytsov

4 years

👇I have 2 internship positions in 2021: CV & NLP. @Bosch_AI is an up-and-coming research lab. Bosch is an equal opportunity employer and we welcome candidates from underrepresented groups. More details are in the thread, please, RT. 🧵

2

48

113

@srchvrs

Leo Boytsov

1 year

"Head-to-Tail: How Knowledgeable are LLMs? Will LLMs Replace Knowledge Graphs? We show that existing LLMs are still *FAR* from being perfect in terms of their grasp of factual knowledge, especially for facts of torso-to-tail entities."

Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)?...

Since the recent prosperity of Large Language Models (LLMs), there have been interleaved discussions regarding how to reduce hallucinations from LLM responses, how to increase the factuality of...

3

18

104

@srchvrs

Leo Boytsov

3 years

👇I am 😊 to announce that our retrieval toolkit FlexNeuART (used to produce strong MS MARCO & TREC runs) is now available as a PyPI package (Linux/MacOS). It's a fully-fledged modular retrieval framework that has a number of SOTA neural models and traditional models.

GitHub - oaqa/FlexNeuART: Flexible classic and NeurAl Retrieval Toolkit

Flexible classic and NeurAl Retrieval Toolkit. Contribute to oaqa/FlexNeuART development by creating an account on GitHub.

2

16

103

@srchvrs

Leo Boytsov

4 years

👇Mad-X a beautiful idea for effective multi-lingual transfer. 1. Start from multiling BERT. It works ok for many languages, but you can still get better results from unsupervised (LM-mask) fine-tuning for a target language. Downside: we have multiple-language models. Solution?

2

18

102

@srchvrs

Leo Boytsov

1 year

I see people are being stressed out about things changing too quickly recently. I had somewhat similar feelings, but I also have a feeling of relief that largely outweighs anxiety. Imagine all that old NLP/ML stuff you had to learn before to do well: CRFs and their training

8

8

98

@srchvrs

Leo Boytsov

1 year

"At Netflix, personalization plays a key role in several aspects of our user experience, from ranking titles to constructing an optimal Homepage. " No BERT: good old DNNs is all your need.

Search Personalization at Netflix | Companion Proceedings of the ACM Web Conference 2023

3

13

96

@srchvrs

Leo Boytsov

4 years

👇Today is an efficient Transformer day. Previously, I tweeted about a 30x speed-up on a 32-core CPU: There is a paper getting 233x on 80 CPU cores or roughly 93x speed-up scaled to 32-cores (this is a very rough comparison).

@srchvrs

Leo Boytsov

4 years

👇Running BERT in PyTorch on CPU efficiently! 1. Can't use GPU, b/c low-latency requests prevent batching. 2. torch.set_num_threads(1) or else each PyTorch instance tries to use most CPU cores 3. Use DistilBERT (2x speed up) 4. No padding (batch size==1)! 5. 8-bit Quantization

1

9

43

5

12

90
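
The "roughly 93x" figure in the efficient-Transformer tweet above comes from rescaling the 233x speed-up on 80 cores to 32 cores. The sketch below assumes the speed-up scales linearly with core count, which is exactly why the tweet calls the comparison very rough.

```python
reported_speedup = 233   # speed-up reported on 80 CPU cores
reported_cores = 80
target_cores = 32        # core count of the earlier 30x result

# Linear-scaling assumption: speed-up is proportional to the number of cores.
scaled_speedup = reported_speedup * target_cores / reported_cores

print(round(scaled_speedup, 1))
```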

@srchvrs

Leo Boytsov

2 years

Betting on a dead horse. Moore's law is effectively dead and it's not clear when/if it can be resuscitated. Even using a conservative estimate (favorable for tech), we are still SIX orders of magnitude behind the brain.

@ethanCaballero

Ethan Caballero is busy

2 years

. @RichardSSutton estimates 50% probability of Human-Level AI by 2040:

40

80

620

19

11

84

@srchvrs

Leo Boytsov

6 years

Jupyter notebooks may be good for prototyping, but eventually one needs proper modular code without weird memory effects. Using them for production-grade work seems to be poor engineering.

5

9

85

@srchvrs

Leo Boytsov

7 years

[1/3] More insights from people doing real ML: it's VERY hard for us to convince new grads that deep learning isn't the first thing to try.

3

13

79

@srchvrs

Leo Boytsov

3 years

My software was accused of being merely a research toy. There's nothing better than a good toy. My condolences, if your inner child has reached the age of senility.

3

3

79

@srchvrs

Leo Boytsov

1 year

GPT-4all with 20K stars in a very short period is a crazily successful project. In contrast, @huggingface has ~90K. GPT-4all: Demo, data, and code to train an assistant-style large language model with ~800k GPT-3.5-Turbo Generations based on LLaMa

@andriy_mulyar

AndriyMulyar

1 year

AI winter confirmed

4

5

76

4

17

74

@srchvrs

Leo Boytsov

4 years

Please don't use the word pioneering with respect to your own work.

11

2

79

@srchvrs

Leo Boytsov

2 years

In summary, I am convinced there is clearly an emergence of behaviors from apparently "dumb" training procedures, which seem to contradict (to some degree) the poverty of stimulus hypothesis. As a scientist, I am looking forward to learning more about these phenomena.

5

6

74

@srchvrs

Leo Boytsov

4 years

👇Recently we had a discussion about different approaches to applying layer normalization in Transformers. Unlike the original postnorm approach, prenorm keeps the skip connection outside the normalization and was found to train in a more stable way. There's a recent update with residual attention Transformers.

@_akhaliq

AK

4 years

Informer: Transformer Likes Informed Attention pdf: abs:

0

35

214

1

17

77
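
The two layer-normalization placements discussed in the tweet above are commonly written as follows. The notation is assumed here and does not come from the thread: x_l is the input to layer l, and F_l is the attention or feed-forward sublayer.

```latex
\begin{align*}
\text{post-norm:}\quad x_{l+1} &= \mathrm{LayerNorm}\bigl(x_l + F_l(x_l)\bigr) \\
\text{pre-norm:}\quad  x_{l+1} &= x_l + F_l\bigl(\mathrm{LayerNorm}(x_l)\bigr)
\end{align*}
```

In this formulation both variants have a residual path. The difference is that in pre-norm the identity term x_l bypasses the normalization, which is usually given as the reason for the more stable training.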

@srchvrs

Leo Boytsov

2 years

👇 mBERT-like models can have great cross-lingual 0-shot transfer abilities. Just train it on English => apply to another lang. Cao, Kitaev, & Klein showed that mBERT can be improved (for NLI) via cross-lingual adjustment with a small || corpus. We find this may not always work.

4

14

75

@srchvrs

Leo Boytsov

5 years

Bing uses a 3-layer BERT-like transformer for every query. In addition to model simplification and possibly distillation, they implement more efficient GPU code.

@julien_c

Julien Chaumond

5 years

“With these GPU optimizations, we were able to use 2000+ Azure GPU Virtual Machines across four regions to serve over 1 million BERT inferences per second worldwide” Bing, using (distilled, 3-layer) BERT in production. via @rangthan

4

78

384

1

21

71

@srchvrs

Leo Boytsov

10 years

Non-Metric Space Library: A cross platform similarity search library & toolkit to evaluate similarity search methods. http://t.co/PCGcHNjaRz

2

20

69

@srchvrs

Leo Boytsov

2 years

🧵A lot of multi-modal models were announced recently. A little summary, b/c they all seem to be similar in key aspects: 1. combining language & vision embeddings using (mostly) pre-trained backbones. 2. training composite Transformer model using next token prediction.

3

8

71

@srchvrs

Leo Boytsov

2 years

Uncomfortable truth: working with (only) corporate APIs that you don't control isn't reproducible research and it has other issues too. In particular, if your strongest technical skill is writing API calls, how much value does it create for workplaces where heavy lifting is still

6

6

69

@srchvrs

Leo Boytsov

3 years

👇Turns out that one can tune hyperparameters using a smaller version of the model (which is cheaper compared to the full-size model) and transfer these params to the full-size model => boost in accuracy.

@arankomatsuzaki

Aran Komatsuzaki

3 years

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer By transferring from 40M parameters, µTransfer outperforms the 6.7B GPT-3, with tuning cost only 7% of total pretraining cost. abs: repo:

5

43

247

2

4

71

@srchvrs

Leo Boytsov

1 year

Also another point to make. Let's face it, to a large degree reviewing is non-scientific. A bunch of overworked people, many of whom haven't had any hands-on experience for a while, make hasty decisions by skimming a bunch of papers (on the topics they do NOT care about). What

@RichardSocher

Richard Socher

1 year

I think one of the problems of academic publishing/peer reviews is that there is zero downside of rejecting good ideas. The reviewers of the DecaNLP paper - which introduced prompt engineering for multitask NLP with large nnets - were so wrong & held back the field for years.

26

99

665

8

7

68

@srchvrs

Leo Boytsov

9 months

"Google engineers now resolve 7.5% of all code review comments with an ML-suggested edit. "

@jacobaustin132

Jacob Austin

9 months

We've finally put out a detailed IEEE/ACM paper on @Google 's multi-year effort to ease the burden of code review with ML. Google engineers now resolve 7.5% of all code review comments with an ML-suggested edit. But the path to that number has been a fun ML and UX journey!

14

143

766

4

3

71

@srchvrs

Leo Boytsov

4 years

Best ACL2020 paper: "While traditional benchmarks indicate that models on these tasks are as accurate as humans, CheckList reveals a variety of severe bugs, where commercial and research models do not effectively handle basic linguistic phenomena"

0

10

69

@srchvrs

Leo Boytsov

5 years

This is freaking insane [thread]: "we utilized GPipe [4] to train 128-layer Transformers with over 6 billion parameters. Increasing the model capacity resulted in significantly improved performance across all languages by an average of 5 BLEU points."

1

15

68

@srchvrs

Leo Boytsov

2 years

@vboykis I think one secret of productivity is not to waste one's time on reading 300 page productivity blog posts.

0

1

66

@srchvrs

Leo Boytsov

5 months

Retrieval might become a cool topic eventually.

Every LLM company is a search company, and search is hard: the future of LLM retrieval systems

Search is one of the hardest technical problems in computer science. Only a handful of products like Google, Amazon, and Instagram do it well.

www.linkedin.com

3

4

67

@srchvrs

Leo Boytsov

6 months

This is probably the first encoder-decoder model in the recent model race.

@RekaAILabs

Reka

6 months

Along with Core, we have published a technical report detailing the training, architecture, data, and evaluation for the Reka models.

2

63

368

8

5

68

@srchvrs

Leo Boytsov

3 years

A reminder for people who want to use Python multiprocessing with CUDA and other tricky to use resources (e.g., Java libraries). It's very likely you need to set multiprocessing.set_start_method('spawn')

3

5

64
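
The reminder above about the `spawn` start method can be sketched with the standard library alone. This example uses `multiprocessing.get_context("spawn")`, the per-object counterpart of the global `multiprocessing.set_start_method('spawn')` call named in the tweet. The helper name `make_spawn_executor` is made up for illustration, and `os.getpid` stands in for real CUDA or JVM work.

```python
import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor


def make_spawn_executor(max_workers=2):
    # An explicit "spawn" context starts each worker as a fresh interpreter
    # instead of forking the parent together with its CUDA/JVM state.
    ctx = mp.get_context("spawn")
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)


if __name__ == "__main__":
    # The guard is required with "spawn": workers re-import this module.
    with make_spawn_executor() as pool:
        worker_pid = pool.submit(os.getpid).result()
    print(worker_pid != os.getpid())  # the task ran in a separate process
```

A context object keeps the choice local to one pool, which may be preferable in library code where changing the global start method could surprise other callers.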

@srchvrs

Leo Boytsov

1 year

Am I the only one who gets very upset when hearing the term semantic search?

24

1

65

@srchvrs

Leo Boytsov

5 years

👇 "End-to-End learning sounds like a good idea on paper, but for most deployment scenarios, pipelined architectures that are piecewise optimized will continue to stay. "

@deliprao

Delip Rao e/σ

5 years

New post: The Twelve Truths of #MachineLearning for the Real World

6

86

345

1

10

65

@srchvrs

Leo Boytsov

3 years

Google QED is a very interesting QA explainability benchmark. I can't help noticing, though, that the authors postpone the issue of faithfulness to future work, but the problem is likely unsolvable for complete black boxes.

GitHub - google-research-datasets/QED: QED: A Framework and Dataset for Explanations in Question...

QED: A Framework and Dataset for Explanations in Question Answering - google-research-datasets/QED

0

12

63

@srchvrs

Leo Boytsov

7 years

From Awni Hannun (the first author on Baidu's 1st end-to-end recognition paper): Speech recognition isn't solved: "The recent improvements on conversational speech are astounding. But, the claims about human-level performance are too broad."

1

28

61

@srchvrs

Leo Boytsov

2 years

"In all honesty, I suspect a lot of researchers see papers as “CV enhancers” more than scientific contributions. "

@EhudReiter

Ehud Reiter

2 years

New blog: Could some NLP research be fraudulent?

9

8

67

2

4

62

@srchvrs

Leo Boytsov

1 year

🧵This is a nice adversarial attack on LLMs, where redefining built-in functions (len becomes print and vice versa) confuses the model. It shows the model does not fully grasp semantics and heavily relies on reasoning shortcuts.

@AVMiceliBarone

Antonio Valerio Miceli Barone

1 year

New paper: LLMs are good at programing tasks, and they are now being widely used to assist code generation. But do they actually understand the semantics of programming languages, or do they rely on superficial, "shortcut" correlations? Let's find out! 1/

34

241

1K

3

7

60
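
The built-in swap described in the tweet above is perfectly legal Python, which is what makes it a useful probe. Below is a minimal sketch of the setup, not the benchmark from the paper; the function name `swapped_builtins_demo` is made up for illustration.

```python
import builtins


def swapped_builtins_demo(text):
    # Inside this function the two names trade places, as in the attack:
    # `print` now measures length and `len` now writes to stdout.
    len, print = builtins.print, builtins.len
    return print(text)


result = swapped_builtins_demo("hello")
print(result)
```

A reader (or model) that relies on the usual meaning of the names would expect the function to print the text and return `None`; following the actual bindings, it returns the length of the string.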

@srchvrs

Leo Boytsov

7 years

" I distinctly remember emailing the authors of a Deepmind paper when I was having difficulty replicating and being told to change the random seed."

Why are other groups having such a hard time reproducing DeepMind's successes?

Tapa Ghosh's answer: Because most groups don't have access to 1000+ TPUs and GPUs. The other factor is that hyperparameters are extremely important and Deepmind, like any major machine learning...

0

11

61

@srchvrs

Leo Boytsov

5 years

👇Great talk: 1. You need better data not just more data 2. Increasing data has diminishing returns 3. Feature engineering is still important 4. Start with simpler models 5. Don't just increase complexity of models: add more features 6. Be aware of biases of the data

1

14

59

@srchvrs

Leo Boytsov

9 months

Nice work: Mixtral MoE on small-memory GPUs via offloading weights to RAM. It has 2 ingredients: 1. LRU cache. 2. Speculative loading: estim. expert by applying current layer gating func. to previous layer H. 2-3x speed up compared to naive offload.

Fast Inference of Mixture-of-Experts Language Models with Offloading

With the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use...

1

12

58
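
The first ingredient mentioned in the tweet above, a least-recently-used cache of experts, can be sketched as follows. This is an illustration of the general idea under assumptions, not the paper's implementation: the class `ExpertLRUCache`, its `load_fn` callback, and the string "weights" are all hypothetical stand-ins for moving real expert weights from RAM to GPU memory.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Keeps at most `capacity` experts resident; evicts the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn         # hypothetical loader: expert_id -> weights
        self.resident = OrderedDict()  # expert_id -> weights, oldest first
        self.loads = 0                 # number of (slow) loads from RAM

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)     # evict least recently used
        self.loads += 1
        self.resident[expert_id] = self.load_fn(expert_id)
        return self.resident[expert_id]


cache = ExpertLRUCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)

print(cache.loads, list(cache.resident))
```

Speculative loading, the second ingredient, would call `get` ahead of time for the experts predicted for the next layer, so that the load overlaps with computation.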

@srchvrs

Leo Boytsov

5 years

👇What I learned recently about Python, CUDA, and/or @spacy_io multi-processing. 1. Python's lack of multi-threading is super-annoying but there're crutches that can still be useful. 2. You need to call multiprocessing.set_start_method('spawn') or your CUDA is gonna be screwed up

3

11

60

@srchvrs

Leo Boytsov

6 years

Evaluating ELMO embeddings: Meaningful gains on many tasks. "We also show, however, that we are still far away from a universal encoder that can perform consistently across several downstream tasks." by Perone et al 2018 …

2

16

60

@srchvrs

Leo Boytsov

4 years

An interesting #Neurips2020 tutorial on model interpretations, which covers main approaches in three modalities: text, image, and structured data!

NeurIPS 2020 Tutorial

Webpage for AAAI 2021, NeurIPS 2020, and CPAIOR 2021 tutorials.

explainml-tutorial.github.io

0

24

59

@srchvrs

Leo Boytsov

1 year

Short summary of a discussion (but it's worth reading all of it). LLAMA team (as well as many other teams) evaluates a model withOUT actually revealing prompts (at least this is my understanding). Another team tests LLAMA with a different set of prompts and gets noticeably worse

@GuillaumeLample

Guillaume Lample @ ICLR 2024

1 year

@BlancheMinerva @sam_havens @giffmana @DohmannJeremy @vitaliychiley @MosaicML @HugoTouvron @LavrilThibaut We spent a lot of time investigating what others papers did (Chinchilla, GPT-3, PaLM) but very few of them actually provide any prompt so we just implemented what made sense to us. And we did not observe 20% differences by adding or removing a space in the prompt.

3

1

18

7

6

60

@srchvrs

Leo Boytsov

3 years

👇BERT does *NOT* rediscover the classic NLP pipeline. "We verify via non-parametric probes that the permutations do in fact make the model worse at syntax-dependent tasks."

@arankomatsuzaki

Aran Komatsuzaki

3 years

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little MLMs pretrained on sentences with randomly shuffled word order surprisingly achieves high accuracy after fine-tuning on many downstream tasks.

10

46

298

2

5

58

@srchvrs

Leo Boytsov

2 years

🧵Well-known company released a multi-threaded eval script. It starts 16 threads but 15 have 2% CPU usage. A big reminder: do *NOT* waste your time trying to harness ThreadPoolExecutor (or any Python "thread" API) for CPU-bound work. It's a pointless exercise b/c CPython's GIL keeps threads from running Python code in parallel.

7

5

56

@srchvrs

Leo Boytsov

6 years

Sorry folks, I can't tolerate these alchemy and pseudo-science claims any more. Here is my take on this:

Is machine learning a pseudo science?

Leonid Boytsov's answer: This is written in response to Sridhar Mahadevan whose major pet peeves are that ML folks: 1. rarely use statistical tests 2. don’t build their work on a rigorous foundatio...

1

16

59

@srchvrs

Leo Boytsov

4 years

👇"A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets ... "

@nasrinmmm

Nasrin Mostafazadeh

4 years

GPT-3 with 175B params, 10x any prior LM, is out. Setting aside controversies around implications of training such ever-larger models, the fact that they show high zero-shot performance on challenging reasoning tasks is pretty interesting. Ppl often ask me how I feel about all this:

4

61

265

2

6

57

@srchvrs

Leo Boytsov

3 years

@huggingface NLP datasets have won the best demo paper award.

@xwang_lk

Xin Eric Wang

3 years

EMNLP 2021 Best Paper Awards are out:

0

7

41

0

8

56

@srchvrs

Leo Boytsov

4 years

👇"By increasing batch size to the memory limit of TPUv3 Pod, BERT training time can be reduced from 3 days to 76 minutes". Although this little pod has 1024 TPUs, it makes training possible for many teams. A V3 TPU is $8/hour. So, the overall training cost is ~$8K.

1

4

56
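
The cost estimate in the tweet above can be checked with the numbers it quotes. The sketch takes 1024 accelerators at $8 per accelerator-hour at face value; whether the quoted price applies per chip is an assumption.

```python
# Figures quoted in the tweet; the per-accelerator price is assumed to be per chip.
NUM_ACCELERATORS = 1024
PRICE_PER_ACCELERATOR_HOUR_USD = 8
TRAINING_MINUTES = 76

pod_cost_per_hour = NUM_ACCELERATORS * PRICE_PER_ACCELERATOR_HOUR_USD
training_cost = pod_cost_per_hour * TRAINING_MINUTES / 60

print(pod_cost_per_hour, round(training_cost))
```

One pod-hour comes to about $8K, which matches the tweet's estimate; billing the full 76 minutes gives a figure closer to $10K.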

@srchvrs

Leo Boytsov

1 year

"Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT" *CAVEAT* only one small (should I say tiny?) collection is used.

Evaluating the Code Quality of AI-Assisted Code Generation Tools:...

Context: AI-assisted code generation tools have become increasingly prevalent in software engineering, offering the ability to generate code from natural language prompts or partial code inputs....

5

8

53

@srchvrs

Leo Boytsov

3 years

Product search at Amazon via DSSM and (very cool) extreme classification: "This indicates another unique difference between semantic product search problem and the conventional web retrieval problem where BM25 variants are often a strong baseline."

2

3

55

@srchvrs

Leo Boytsov

5 years

Indexing a trillion vectors with FAISS. This is seriously badass:

0

20

54

@srchvrs

Leo Boytsov

2 years

It looks like SOTA in Transformer model efficiency (if you don't use any sparsity / routing / mixture of experts) is a combination of: 1. distillation 2. quantization 3. static or adaptive early exiting / reduced comp. in higher layers

QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses the computational...

2

9

53

@srchvrs

Leo Boytsov

1 year

A study finds that ChatGPT is seemingly not at the level of good human annotators as of now. A couple of caveats here: 1. There was no extensive prompt hacking. 2. Performance is measured against existing human annotations. How noisy and accurate they are is not rechecked.

@omarsar0

elvis

1 year

Can ChatGPT Reproduce Human-Generated Labels? Nice study on the applicability of ChatGPT for data annotation tasks. Results show that ChatGPT does have the potential to handle different types of data annotation tasks like stance detection, sentiment analysis, and bot detection.

3

24

142

8

2

50

@srchvrs

Leo Boytsov

6 years

Different interpretation of the same results: @jacobeisenstein HMM beats LSTM on small data @sleepinyourhat wow, LSTM beats HMM already with 500 sentences. Proof links:

@jacobeisenstein

Jacob Eisenstein

6 years

At #NAACL2018 : @barbara_plank on how the HMM TnT tagger from 1998 outperforms biLSTM on small data

2

12

61

3

10

51

@srchvrs

Leo Boytsov

2 years

10% of images in ImageNet1K are mislabeled, yet ImageNet SOTA is 90% accuracy.

@davisblalock

Davis Blalock

2 years

Paper: If you like this paper, consider RTing this (or another!) thread to publicize the authors' work, or following the authors: @SashaMTL … [7/8]

1

2

16

6

4

50

@srchvrs

Leo Boytsov

6 months

🧵 A nice LLAMA-3 summary by @natolambert . In short, performance is better, but largely due to the sheer increase in the pre-training scale. Training data is a secret. ↩️

2

4

52

@srchvrs

Leo Boytsov

2 years

It looks like the stress is real. In my experience: 1. paradigms shift 2. numbers get higher 3. Yet, unsolved problems remain in big numbers. Moreover, it looks like new discoveries create new opportunities. It is a bit hard when you are locked in with some topic that you

@mayhewsw

Stephen Mayhew

2 years

when everyone in NLP is having an existential crisis because of GPT-4, but you lived through ELMo/BERT in 2019:

18

99

1K

5

8

50

@srchvrs

Leo Boytsov

4 years

"Starting from a public multilingual BERT checkpoint, our final model is 6x smaller and 27x faster, and has higher accuracy than a SOTA multilingual baseline." SOTA is LSTM.

0

5

50

@srchvrs

Leo Boytsov

2 years

@sarahookr At the same time, it is not quite clear what the value of a PhD is for most people. The competition to get in heats up, but then what? Becoming a junior ad engineer without any research.

0

0

50

@srchvrs

Leo Boytsov

4 years

A beautiful idea: inject very small networks, e.g., between layers. During fine-tuning, the main BERT "body" remains unmodified. Thus, effective fine-tuning requires only very small modifications of the main network, and these modifications can be engineered explicitly!

@AdapterHub

AdapterHub

4 years

Adapters are a small set of new weights introduced at every layer of pre-trained transformer models (e.g. BERT, RoBERTa). Only the new Adapter weights (within the pink box) are trained while keeping the original transformer parameters frozen/fixed.

1

6

21

1

6

49

@srchvrs

Leo Boytsov

5 years

The Hitchhiker's Guide to the Galaxy is an early machine learning book! The computer spits out a number (42), but it refuses to provide an explanation!

@mark_riedl

Mark Riedl

5 years

Hinton has the hackles raised on some of my team. One of them produced this gem of wisdom:

5

10

91

4

8

47

@srchvrs

Leo Boytsov

6 years

My friend trained an extremely mean regression model. All it predicts is a constant.

3

6

47

@srchvrs

Leo Boytsov

4 years

The primary author of the LSTM paper Sepp Hochreiter gets a well-deserved (and LONG OVERDUE) neural networks pioneering award!

@SchmidhuberAI

Jürgen Schmidhuber

4 years

Congrats to the awesome Sepp Hochreiter for the well-deserved 2021 IEEE Neural Networks Pioneer Award! It was my great honor to be Sepp's nominator.

10

65

747

1

3

48

@srchvrs

Leo Boytsov

4 years

More gloom and doom as Uber closes its AI Labs:

Coronavirus: Uber announces drastic cuts to secure its future

Uber announces drastic action to scale back its business as its losses balloon amid pandemic lockdowns.

3

11

45

@srchvrs

Leo Boytsov

3 years

Wasn't that "feature engineering" more like "add a bunch of n-grams", literally millions of them, and let L1 regularization figure it out?

@emilymbender

@emilymbender@dair-community.social on Mastodon

3 years

@TaliaRinger That statistical NLP work was still closely coupled with understanding the shape of the problem being solved, specifically in feature engineering. Then (2010s) we got the next "invasion" from ML land (deep learning) where the idea was the computer would learn the features! 2/

1

4

30

5

2

46

@srchvrs

Leo Boytsov

4 years

👇The largest ever IR collection of news items was released recently! What is totally awesome is the size of the collection: 42M news articles & 10K queries. What's a bit disheartening is that the queries were created using the SQuAD approach, i.e., workers were asked to "create a query".

2

9

46

@srchvrs

Leo Boytsov

2 years

@deliprao @thegautamkamath You need to focus on key values.

1

0

46

@srchvrs

Leo Boytsov

4 years

"95% of the original perf with only ~5% remaining weights in the encoder!" is a great result by a team of @huggingface researchers.

0

7

45

@srchvrs

Leo Boytsov

5 years

[thread] A multilingual BERT trained with the shared wordpiece inventory permits zero-shot transfer of various models across languages. Quite crazily it works for POS and NER tasks! Even more crazily it works for pairs of languages with DIFFERENT scripts...

2

12

45

@srchvrs

Leo Boytsov

2 years

🧵"Although transformer-based models have made great success in information retrieval tasks, traditional IR methods such as BM25 can still yield robust performance in real-world web search scenarios."

2

2

44