With so much noise from ChatGPT, LLMs, and the "we're all going to die" crowd, my students have (understandably) been experiencing existential angst, asking me about the implications for NLP and IR.
Reviewers automatically assume that simple is not novel. This is sheer laziness. Yes, it may be simple and obvious in retrospect, but someone had to have that insight first. Simple is good. Simple is robust, easy to implement and reproduce, broadly applicable, etc.
DAAM... You saw it here first! Attribution maps for Stable Diffusion based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork: example for "an angry, bald man doing research" below - demo at
GPT-4 and its ilk are awesome for rapid prototyping and one-offs, but at the end of the day, enterprises will deploy far smaller distilled models in production. Here's my contrarian take -
Still cropping and modifying BERT diagrams from Devlin et al. (2019)? I spent several hours redrawing BERT in PowerPoint so you don't have to... Perfect for use in presentations, papers, etc.! Releasing under CC BY 4.0
Following the AI Residency program by Google, Facebook, Microsoft, Uber, etc., I'd like to start the Waterloo AI Residency program. It's called grad school.
Happy to share an early draft of "Pretrained Transformers for Text Ranking: BERT and Beyond", our forthcoming book (tentatively, early 2021) by
@lintool
@rodrigfnogueira
@andrewyates
My (contrarian?) take: prompt engineering is programming in natural language. We've tried this before, with attempts dating back decades. Recent advances do not change the fact that natural languages are ambiguous, imprecise, under-specified, highly contextual, etc.
Recently,
@CohereAI
boasted "3X better performance" in multilingual text understanding. We tested that claim by evaluating Cohere embeddings on MIRACL:
tl;dr - We weren't able to replicate the 3X claim, but we did observe a 38% improvement over BM25.
RAG is all the RAGe these days, but we (still) don't quite know how to evaluate it properly... This year, we are taking a stab at it in the context of TREC, building on 30+ years of experience in evaluating IR systems.
Case in point: "Passage Re-ranking with BERT" by
@rodrigfnogueira
and
@kchonyc
was never accepted anywhere because of the "too simple, not novel" laziness. Yet that paper is LITERALLY cited in every single BERT-for-ranking paper ever published since.
Just how good are commercially available embedding APIs for vector search? An effort led by
@ehsk0
evaluated a few of them -
@OpenAI
@CohereAI
@Aleph__Alpha
- on BEIR and MIRACL... Check out the results! - forthcoming
#ACL2023
industry track paper
Presenting... RankVicuna 🦙 - the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting! Brought to you by
@rpradeep42
and
@sahel_sharify
On the subject of current citation formats being, essentially, racist: I have two students (current/former) with surname Wang, 2 x Liu, 3 x Yang + 1 collaborator; ~dozen x Zhang in my collaboration circle. A citation like [Zhang et al. 2019] is unhelpful.
I've thought long and hard in recent years about building a culture of reproducibility in research. Over the holidays, I had a chance to organize my thoughts in a piece that has been on the back burner for many months. I'd welcome your comments!
tl;dr - in the enterprise context, you'll start with rapid prototyping using GPT-4 (or another LLM) but eventually end up with a far smaller but just as capable specialized distilled model. That's the journey I see from PoC to prod.
New entrants into the Camelidae family 🦙 for retrieval!
@xueguang_ma
presents RepLLaMA (a dense retrieval model) and RankLLaMA (a pointwise reranker) fine-tuned on (you guessed it!) LLaMA for multi-stage text retrieval:
Introducing RankZephyr 💨 - a fully open-source zero-shot listwise reranking LLM that achieves effectiveness parity with GPT-4! Brought to you by
@rpradeep42
and
@sahel_sharify
!
"Pretrained Transformers for Text Ranking: BERT and Beyond" with
@rodrigfnogueira
and
@andrewyates
- it started on June 18, 2020 and culminates here with the official publication... Enjoy! institutional subscribers: retail orders:
My (contrarian?) predictions on ChatGPT, Bard, and their ilk: Regarding the two biggest problems today, (1) hallucinations and (2) toxicity, the first will be transient (i.e., solved relatively soon) and the second will be perpetual (i.e., will never be solved). Rationale:
Those who argue that curation is the answer to "more ethical AI/ML/LMs" come across as intellectually naive. Archivists have been grappling with this issue, literally, for millennia. They'd be well advised to consult some of the literature from that field.
I'll conclude by being constructive. As an AC, I have overridden this "not novel, too simple" garbage on more than one occasion. In some cases I have spent hours poring over the literature to determine if this paper was indeed the first to have a particular insight. 1/2
Dense retrieval without requiring a dedicated vector DB! Here's a guide on how you can take OpenAI ada2 embeddings (on MS MARCO passages) and perform retrieval directly using Lucene with our group's Anserini toolkit
What a sad state of affairs we find our field in: rejected because you didn't SOTA; rejected because you SOTA'ed (just leaderboard chasing, no insight); flag plant on arXiv, rejected (reviewer cites your paper as evidence of lack of novelty); don't arXiv, your idea is scooped.
The modern search landscape confusing you? Dense retrieval, sparse retrieval, transformer-based rerankers, multi-stage architectures, nearest neighbors, HNSW, blah blah blah. My attempt to sort it all out in a single conceptual framework:
Prompt-decoder LLMs for listwise reranking too large for you? Introducing our new LiT5 family of listwise reranking models: nearly as good but *much* smaller. Yup, T5's still got tricks to offer!
Another addition to the "X is all you need" genre of papers: We took OpenAI embeddings of MS MARCO passages and stuffed them into Lucene - turns out you don't need fancy schmancy vector stores for dense retrieval! Lucene will do.
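In case the mechanics seem mysterious: once passages are embedded, top-k dense retrieval is just (approximate) inner-product search over vectors. A toy sketch below, with made-up 4-dim vectors standing in for 1536-dim ada2 embeddings; this illustrates the underlying idea, not Lucene's actual HNSW implementation.

```python
import numpy as np

# Hypothetical 4-dim "embeddings" standing in for 1536-dim ada2 vectors.
passage_emb = np.array([
    [0.1, 0.9, 0.0, 0.1],   # passage 0
    [0.8, 0.1, 0.1, 0.0],   # passage 1
    [0.0, 0.2, 0.9, 0.1],   # passage 2
], dtype=np.float32)
query_emb = np.array([0.7, 0.2, 0.1, 0.0], dtype=np.float32)

def normalize(x):
    # Unit-normalize so that inner product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

scores = normalize(passage_emb) @ normalize(query_emb)
ranking = np.argsort(-scores)  # highest-scoring passage first
```

Lucene's HNSW support gives you an approximate version of this argsort at scale - which is exactly why you don't need a separate vector store.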
This includes checking with PC chairs on whether the previous work and the paper in question are by the same authors. And if I do recommend rejection based on novelty, it is with a citation to a paper confirmed not to be by the authors of the paper under review. 2/2
With
@rodrigfnogueira
and
@andrewyates
we're happy to share the revised version of our book "Pretrained Transformers for Text Ranking: BERT and Beyond" - significant updates to transformer-based reranking models and dense retrieval techniques!
Presenting AfriBERTa, a pretrained LM for 11 African languages by
@Kelechukwu_
What's neat is that its pretraining corpus is 0.04% the size of XLM-R's (<1GB), but AfriBERTa performs just as well on downstream tasks! To appear at the MRL workshop at
#EMNLP2021
By knocking out two major datacenters, road construction work has literally brought down a non-trivial fraction of all deep learning experiments being conducted in Canada.
Belated, but congrats to
@cohere
for their new Embed v3 model! With
@JinaAI_
making a similar product announcement recently, it's clear that this space is heating up! This is awesome for the community... and of course for my students working in this space!
Yesterday
@rodrigfnogueira
@andrewyates
and I wrapped up the final preproduction version of "Pretrained Transformers for Text Ranking: BERT and Beyond" - posted on arXiv as v3: now in the hands of
@MorganClaypool
and will be in print soon!
Thanks to the tremendous effort of
@edwinzhng
@1729_gupta
@kchonyc
we're proud to present the Neural Covidex, our updated AI-powered search interface to
@allen_ai
's COVID-19 corpus: Powered primarily by Lucene, T5, and BioBERT.
Introducing... SegaBERT! by
@Richard_baihe
et al. The intuition is to introduce hierarchical position embeddings (paragraph, sentence, token) to better capture context during pretraining: simple idea, fairly large gains!
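As I understand the idea (a minimal sketch with toy dimensions and made-up table sizes, not the paper's actual code): instead of a single flat position index, each token's position signal is the sum of paragraph-, sentence-, and token-level embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension
# Separate tables per granularity (randomly initialized here;
# learned during pretraining in the real model).
para_table = rng.normal(size=(4, d))    # paragraph index in document
sent_table = rng.normal(size=(16, d))   # sentence index in paragraph
tok_table = rng.normal(size=(64, d))    # token index in sentence

def hierarchical_position(para_idx, sent_idx, tok_idx):
    # Hierarchical position signal = sum over the three granularities.
    return para_table[para_idx] + sent_table[sent_idx] + tok_table[tok_idx]

v = hierarchical_position(1, 2, 5)
```

The summation mirrors how BERT already combines token, segment, and position embeddings, so it slots into the architecture without structural changes.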
Why I hate doing reimbursements: the default assumption by
@UWaterloo
is that you're a criminal trying to embezzle money from research accounts. Maryland was more sane. What's your experience been like at other places?
If you're interested in dense retrieval, you'll want to check out this DPR replication effort led by
@xueguang_ma
tl;dr - BM25 is better than the original authors made it out to be, and you get a free QA boost with better evidence fusion!
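For anyone fuzzy on what the BM25 baseline actually computes, here's a textbook scorer in a few lines (k1=0.9, b=0.4 are, I believe, Anserini's defaults; the real implementation lives in Lucene and differs in details):

```python
import math
from collections import Counter

docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the quick dog jumps".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))  # document frequencies
k1, b = 0.9, 0.4

def bm25(query, doc):
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if tf[t] == 0:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        # Term-frequency saturation with document-length normalization.
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

scores = [bm25("quick dog".split(), d) for d in docs]
best = max(range(N), key=lambda i: scores[i])  # doc 2 matches both terms
```

Simple, fast, no training data required - which is exactly why it remains such a stubborn baseline.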
"Should You Take My Advice?" I recently wrote up this essay on how I communicate with my students. Maybe it applies more broadly to other advisors as well? Feedback/comments/questions welcome!
Wanna train a multilingual dense retrieval model but not sure where to start? For example: Which backbone? Pre-fine-tune? Use non-target language data? Here's a helpful guide that begins to compile best practices:
We've written up a description of the Neural Covidex and shared a few thoughts about our journey so far in a submission to the ACL 2020 COVID Workshop: Comments and feedback welcome!
#NLProc
#acl2020nlp
New work by
@ralph_tang
@crystina_z
@xueguang_ma
adds yet another prompting technique to the mix: *permutation* self-consistency prompting to overcome positional bias in LLMs. Useful for listwise ranking... read all about it!
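The gist, in a toy simulation (the bias model and mean-rank aggregation below are my simplifications for illustration, not the paper's exact method): shuffle the candidate list several times, rerank each shuffle, and aggregate, so the positional bias averages out.

```python
import random
from collections import defaultdict

items = ["d1", "d2", "d3", "d4"]
true_score = {"d1": 0.9, "d2": 0.7, "d3": 0.5, "d4": 0.3}

def biased_llm_rerank(ordering):
    # Hypothetical stand-in for an LLM reranker with positional bias:
    # items appearing earlier in the prompt get a score bonus.
    bonus = {d: 0.15 * (len(ordering) - i) for i, d in enumerate(ordering)}
    return sorted(ordering, key=lambda d: true_score[d] + bonus[d], reverse=True)

rng = random.Random(0)
rank_sums = defaultdict(float)
n_perms = 200
for _ in range(n_perms):
    perm = items[:]
    rng.shuffle(perm)
    for pos, d in enumerate(biased_llm_rerank(perm)):
        rank_sums[d] += pos
# Aggregate by mean rank across permutations.
final = sorted(items, key=lambda d: rank_sums[d] / n_perms)
```

A single biased call can put d2 ahead of d1 whenever the prompt order favors it; aggregating over many permutations recovers the true ordering.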
I think we've reached the end of the road in how we currently perform inverted indexing. Check out our benchmark of Lucene indexing performance and analysis:
PSA: I say to my students - if you're working on NNs and DL, there are at least 5 PhD students at Tsinghua, Peking, Jiao Tong, Zhejiang, ... working on your idea right now. Be paranoid, execute, and publish!
Our group has this server with 100TB disk... and it's always full. Why? These dense retrieval models take up so much $#!& space. But
@xueguang_ma
et al. came up with simple compression solutions, to appear in
#emnlp2021
My basic point: many people (researchers, companies, etc.) are making claims about their ability to do multi-lingual search. With MIRACL there's actually an objective (vendor-neutral) benchmark over 18 languages
#wsdm2023
- put your 💰 where your 👄 is?
From the "ivory tower" to the "real-world": the story of how block-max WAND made its way into Lucene 8 by
@jpountz
et al. is told in a
#ecir2020
preprint: Lessons for academics seeking to achieve research impact?
At the outset, zero-shot effectiveness of LLMs will be impressive, but it'll likely not be "good enough". Improving the model requires you to more precisely characterize the task - that is, you need what "in the old days" we'd call a task description and annotation guidelines.
The new SPECTER embeddings from
@allen_ai
are awesome: They're even more awesome when integrated into the Neural Covidex to power related article search: given an article, find similar articles. Try it out!
I've written about anti-Asian bias and model minority issues before: ... and I've gotten the eye-roll reaction of "why don't you sit down since you've got it so good already"... which is exactly the problem.
New work on using doc2query for summarization by
@rodrigfnogueira
et al. - works surprisingly well! Samples from CORD-19 corpus related to COVID-19 below.
The path will be from expensive general-purpose models (e.g., GPT-4) to cheaper specialized distilled models (e.g., encoder-only). Here's the progression I see -
Another instance of the disconnect between academic reviewing and real-world impact: tl;dr - our keyword spotting model in JavaScript now powers wake word detection "Hey Firefox!" in Firefox Voice:
Thanks to
@edwinzhng
and
@1729_gupta
our Anserini IR toolkit can now search
@allen_ai
's COVID-19 corpus. -
@kchonyc
's connected it up to SciBERT, and bam, we have a two-stage neural ranking pipeline! Join and build on our work!
LLMs are missing a critical ingredient... and
@Primal
knows what it is! (Hint: knowledge graphs and neuro-symbolic approaches) Here's a writeup of the journey so far, featuring CEO
@YvanCouture
- oh btw, I'm the CTO.
To every university president who's sent mass email condolences (offering support, solidarity, etc.) after previous racially-motivated hate crimes before... we're waiting.
And finally... with a fine-tuned model: "Why do I need all those billions of parameters?" Cost pressure is a powerful economic driver. Distillation (along with pruning, quantization, etc.) provides the answer (and will likely yield safer models).
In a preprint of our
#sigir2019
short, we conducted a meta-analysis of 100+ papers reporting results on Robust04. tl;dr - weak baselines are still prevalent (among both neural and non-neural models).
Further evidence of the one-sidedness of recent attacks on pretrained language models: the critics have conveniently forgotten about the benefits they bring to disadvantaged populations.
We've released a new version (v0.11.0.0) of our Pyserini Python toolkit to support replicable IR research, now providing first-stage retrieval for sparse, dense, and hybrid representations. Our new arXiv paper provides an overview:
I am satisfied to end a recent paper thusly: Finally, in our collective frenzy to improve results on standard benchmarks, we may sometimes forget that the ultimate goal of science is knowledge, not owning the top entry in a leaderboard.
Apparently, I was recognized as an outstanding area chair at
#emnlp2020
and didn't realize it until now...
#humblebrag
(do people even use this hashtag anymore?)
Sparse or dense representations for retrieval? Or hybrid? psssssh, says
@jacklin_64
- neither! Densify sparse lexical reps and tack on dense semantic reps: best of both worlds, and simpler infrastructure to boot (no need for HNSW or inverted indexes!)
The Iron Triangle of LLMs: capable models, low development costs (CapEx), low inference costs (OpEx)... pick two, because you can't have all three! Examples? OpenAI API = low CapEx, high OpEx; BloombergGPT = (very) high CapEx but flexibility in OpEx depending on deployment.
We've connected Anserini to Solr to Blacklight to present a search frontend to
@allen_ai
's COVID-19 corpus! Check out - awesome work by
@edwinzhng
and
@1729_gupta
In the Information Retrieval course, I let my students pick the IR toolkit of their choice among all the solutions we have available as a research community. The clear front-runner by a mile was Pyserini: . In large part thanks to its extensive documentation!
I'll be giving a talk on a Conceptual Framework for a Representational Approach to Information Retrieval on April 5, 4pm PT as a Pinterest Labs Tech Talk
@PinterestEng
. RSVP and learn more here!
Happy to share Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in 11 languages by
@crystina_z
@xueguang_ma
@ShiPeng16
tl;dr - think of this as the open-retrieval condition of TyDi.
Paper:
Data: