With so much noise from ChatGPT, LLMs, and the "we're all going to die" crowd, my students have (understandably) been experiencing existential angst, asking me about the implications for NLP and IR.
Reviewers automatically assume that simple is not novel. This is sheer laziness. Yes, it may be simple and obvious in retrospect, but someone had to have that insight first. Simple is good. Simple is robust, easy to implement and reproduce, broadly applicable, etc.
DAAM... You saw it here first! Attribution maps for Stable Diffusion based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork: example for "an angry, bald man doing research" below - demo at
GPT-4 and its ilk are awesome for rapid prototyping and one-offs, but at the end of the day, enterprises will deploy far smaller distilled models in production. Here's my contrarian take -
Still cropping and modifying BERT diagrams from Devlin et al. (2019)? I spent several hours redrawing BERT in PowerPoint so you don't have to... Perfect for use in presentations, papers, etc.! Releasing under CC BY 4.0
Following the AI Residency program by Google, Facebook, Microsoft, Uber, etc., I'd like to start the Waterloo AI Residency program. It's called grad school.
Happy to share an early draft of "Pretrained Transformers for Text Ranking: BERT and Beyond", our forthcoming book (tentatively, early 2021) by
@lintool
@rodrigfnogueira
@andrewyates
My (contrarian?) take: prompt engineering is programming in natural language. We've tried this before, with attempts dating back decades. Recent advances do not change the fact that natural languages are ambiguous, imprecise, under-specified, highly contextual, etc.
Recently,
@CohereAI
boasted "3X better performance" in multilingual text understanding. We tested that claim by evaluating Cohere embeddings on MIRACL:
tl;dr - We weren't able to replicate the 3X claim, but we did observe a 38% improvement over BM25.
RAG is all the RAGe these days, but we (still) don't quite know how to evaluate it properly... This year, we are taking a stab at it in the context of TREC, building on 30+ years of experience in evaluating IR systems.
Case in point: "Passage Re-ranking with BERT" by
@rodrigfnogueira
and
@kchonyc
was never accepted anywhere because of the "too simple, not novel" laziness. Yet that paper is LITERALLY cited in every single BERT-for-ranking paper ever published since.
Just how good are commercially available embedding APIs for vector search? An effort led by
@ehsk0
evaluated a few of them -
@OpenAI
@CohereAI
@Aleph__Alpha
- on BEIR and MIRACL... Check out the results! - forthcoming
#ACL2023
industry track paper
Presenting... RankVicuna 🦙 - the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting! Brought to you by
@rpradeep42
and
@sahel_sharify
On the subject of current citation formats being, essentially, racist: I have two students (current/former) with surname Wang, 2 x Liu, 3 x Yang + 1 collaborator; ~dozen x Zhang in my collaboration circle. A citation like [Zhang et al. 2019] is unhelpful.
I've thought long and hard in recent years about building a culture of reproducibility in research. Over the holidays, I had a chance to organize my thoughts in a piece that has been on the back burner for many months. I'd welcome your comments!
tl;dr - in the enterprise context, you'll start with rapid prototyping using GPT-4 (or another LLM) but eventually end up with a far smaller but just as capable specialized distilled model. That's the journey I see from PoC to prod.
New entrants into the Camelidae family 🦙 for retrieval!
@xueguang_ma
presents RepLLaMA (a dense retrieval model) and RankLLaMA (a pointwise reranker) fine-tuned on (you guessed it!) LLaMA for multi-stage text retrieval:
Introducing RankZephyr 💨 - a fully open-source zero-shot listwise reranking LLM that achieves effectiveness parity with GPT-4! Brought to you by
@rpradeep42
and
@sahel_sharify
!
"Pretrained Transformers for Text Ranking: BERT and Beyond" with
@rodrigfnogueira
and
@andrewyates
- it started on June 18, 2020 and culminates here with the official publication... Enjoy! institutional subscribers: retail orders:
My (contrarian?) predictions on ChatGPT, Bard, and their ilk: Regarding the two biggest problems today, (1) hallucinations and (2) toxicity, the first will be transient (i.e., solved relatively soon) and the second will be perpetual (i.e., will never be solved). Rationale:
Those who argue that curation is the answer to "more ethical AI/ML/LMs" come across as intellectually naive. Archivists have been grappling with this issue, literally, for millennia. They'd be well advised to consult some of the literature from that field.
I'll conclude by being constructive. As an AC, I have overridden this "not novel, too simple" garbage on more than one occasion. In some cases I have spent hours poring over the literature to determine if this paper was indeed the first to have a particular insight. 1/2
Dense retrieval without requiring a dedicated vector DB! Here's a guide on how you can take OpenAI ada2 embeddings (on MS MARCO passages) and perform retrieval directly using Lucene with our group's Anserini toolkit
What a sad state of affairs we find our field in: rejected because you didn't SOTA; rejected because you SOTA'ed (just leaderboard chasing, no insight); flag plant on arXiv, rejected (reviewer cites your paper as evidence of lack of novelty); don't arXiv, your idea is scooped.
The modern search landscape confusing you? Dense retrieval, sparse retrieval, transformer-based rerankers, multi-stage architectures, nearest neighbors, HNSW, blah blah blah. My attempt to sort it all out in a single conceptual framework:
Prompt-decoder LLMs for listwise reranking too large for you? Introducing our new LiT5 family of listwise reranking models: nearly as good but *much* smaller. Yup, T5's still got tricks to offer!
Another addition to the "X is all you need" genre of papers: We took OpenAI embeddings of MS MARCO passages and stuffed them into Lucene - turns out you don't need fancy schmancy vector stores for dense retrieval! Lucene will do.
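In case the mechanics seem mysterious: once passages are embedded, top-k dense retrieval is just (approximate) inner-product search over vectors. A toy sketch below, with made-up 4-dim vectors standing in for 1536-dim ada2 embeddings; this illustrates the underlying idea, not Lucene's actual HNSW implementation.

```python
import numpy as np

# Hypothetical 4-dim "embeddings" standing in for 1536-dim ada2 vectors.
passage_emb = np.array([
    [0.1, 0.9, 0.0, 0.1],   # passage 0
    [0.8, 0.1, 0.1, 0.0],   # passage 1
    [0.0, 0.2, 0.9, 0.1],   # passage 2
], dtype=np.float32)
query_emb = np.array([0.7, 0.2, 0.1, 0.0], dtype=np.float32)

def normalize(x):
    # Unit-normalize so that inner product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

scores = normalize(passage_emb) @ normalize(query_emb)
ranking = np.argsort(-scores)  # highest-scoring passage first
```

Lucene's HNSW support gives you an approximate version of this argsort at scale - which is exactly why you don't need a separate vector store.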
This includes checking with PC chairs on whether the previous work and the paper in question are by the same authors. And if I do recommend rejection based on novelty, it is with a citation to a paper confirmed not to be by the authors of the paper under review. 2/2
With
@rodrigfnogueira
and
@andrewyates
we're happy to share the revised version of our book "Pretrained Transformers for Text Ranking: BERT and Beyond" - significant updates to transformer-based reranking models and dense retrieval techniques!
Presenting AfriBERTa, a pretrained LM for 11 African languages by
@Kelechukwu_
What's neat is that its pretraining corpus is 0.04% the size of XLM-R's (<1GB), but AfriBERTa performs just as well on downstream tasks! To appear at the MRL workshop at
#EMNLP2021
By knocking out two major datacenters, road construction work has literally brought down a non-trivial fraction of all deep learning experiments being conducted in Canada.
Belated, but congrats to
@cohere
for their new Embed v3 model! With
@JinaAI_
making a similar product announcement recently, it's clear that this space is heating up! This is awesome for the community... and of course for my students working in this space!
Yesterday
@rodrigfnogueira
@andrewyates
and I wrapped up the final preproduction version of "Pretrained Transformers for Text Ranking: BERT and Beyond" - posted on arXiv as v3: now in the hands of
@MorganClaypool
and will be in print soon!
Thanks to the tremendous effort of
@edwinzhng
@1729_gupta
@kchonyc
we're proud to present the Neural Covidex, our updated AI-powered search interface to
@allen_ai
's COVID-19 corpus: Powered primarily by Lucene, T5, and BioBERT.
Introducing... SegaBERT! by
@Richard_baihe
et al. The intuition is to introduce hierarchical position embeddings (paragraph, sentence, token) to better capture context during pretraining: simple idea, fairly large gains!
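As I understand the idea (a minimal sketch with toy dimensions and made-up table sizes, not the paper's actual code): instead of a single flat position index, each token's position signal is the sum of paragraph-, sentence-, and token-level embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension
# Separate tables per granularity (randomly initialized here;
# learned during pretraining in the real model).
para_table = rng.normal(size=(4, d))    # paragraph index in document
sent_table = rng.normal(size=(16, d))   # sentence index in paragraph
tok_table = rng.normal(size=(64, d))    # token index in sentence

def hierarchical_position(para_idx, sent_idx, tok_idx):
    # Hierarchical position signal = sum over the three granularities.
    return para_table[para_idx] + sent_table[sent_idx] + tok_table[tok_idx]

v = hierarchical_position(1, 2, 5)
```

The summation mirrors how BERT already combines token, segment, and position embeddings, so it slots into the architecture without structural changes.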
Why I hate doing reimbursements: the default assumption by
@UWaterloo
is that you're a criminal trying to embezzle money from research accounts. Maryland was more sane. What's your experience been like at other places?
If you're interested in dense retrieval, you'll want to check out this DPR replication effort led by
@xueguang_ma
tl;dr - BM25 is better than the original authors made it out to be, and you get a free QA boost with better evidence fusion!
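For anyone fuzzy on what the BM25 baseline actually computes, here's a textbook scorer in a few lines (k1=0.9, b=0.4 are, I believe, Anserini's defaults; the real implementation lives in Lucene and differs in details):

```python
import math
from collections import Counter

docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the quick dog jumps".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))  # document frequencies
k1, b = 0.9, 0.4

def bm25(query, doc):
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if tf[t] == 0:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        # Term-frequency saturation with document-length normalization.
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

scores = [bm25("quick dog".split(), d) for d in docs]
best = max(range(N), key=lambda i: scores[i])  # doc 2 matches both terms
```

Simple, fast, no training data required - which is exactly why it remains such a stubborn baseline.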
"Should You Take My Advice?" I recently wrote up this essay on how I communicate with my students. Maybe it applies more broadly to other advisors as well? Feedback/comments/questions welcome!
Wanna train a multilingual dense retrieval model but not sure where to start? For example: Which backbone? Pre-fine-tune? Use non-target language data? Here's a helpful guide that begins to compile best practices:
We've written up a description of the Neural Covidex and shared a few thoughts about our journey so far in a submission to the ACL 2020 COVID Workshop: Comments and feedback welcome!
#NLProc
#acl2020nlp
New work by
@ralph_tang
@crystina_z
@xueguang_ma
adds yet another prompting technique to the mix: *permutation* self-consistency prompting to overcome positional bias in LLMs. Useful for listwise ranking... read all about it!
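The gist, in a toy simulation (the bias model and mean-rank aggregation below are my simplifications for illustration, not the paper's exact method): shuffle the candidate list several times, rerank each shuffle, and aggregate, so the positional bias averages out.

```python
import random
from collections import defaultdict

items = ["d1", "d2", "d3", "d4"]
true_score = {"d1": 0.9, "d2": 0.7, "d3": 0.5, "d4": 0.3}

def biased_llm_rerank(ordering):
    # Hypothetical stand-in for an LLM reranker with positional bias:
    # items appearing earlier in the prompt get a score bonus.
    bonus = {d: 0.15 * (len(ordering) - i) for i, d in enumerate(ordering)}
    return sorted(ordering, key=lambda d: true_score[d] + bonus[d], reverse=True)

rng = random.Random(0)
rank_sums = defaultdict(float)
n_perms = 200
for _ in range(n_perms):
    perm = items[:]
    rng.shuffle(perm)
    for pos, d in enumerate(biased_llm_rerank(perm)):
        rank_sums[d] += pos
# Aggregate by mean rank across permutations.
final = sorted(items, key=lambda d: rank_sums[d] / n_perms)
```

A single biased call can put d2 ahead of d1 whenever the prompt order favors it; aggregating over many permutations recovers the true ordering.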
I think we've reached the end of the road in how we currently perform inverted indexing. Check out our benchmark of Lucene indexing performance and analysis:
PSA: I say to my students - if you're working on NNs and DL, there are at least 5 PhD students at Tsinghua, Peking, Jiao Tong, Zhejiang, ... working on your idea right now. Be paranoid, execute, and publish!
Our group has this server with 100TB disk... and it's always full. Why? These dense retrieval models take up so much $#!& space. But
@xueguang_ma
et al. came up with simple compression solutions, to appear in
#emnlp2021
My basic point: many people (researchers, companies, etc.) are making claims about their ability to do multi-lingual search. With MIRACL there's actually an objective (vendor-neutral) benchmark over 18 languages
#wsdm2023
- put your 💰 where your 👄 is?
From the "ivory tower" to the "real-world": the story of how block-max WAND made its way into Lucene 8 by
@jpountz
et al. is told in a
#ecir2020
preprint: Lessons for academics seeking to achieve research impact?
At the outset, zero-shot effectiveness of LLMs will be impressive, but it'll likely not be "good enough". Improving the model requires you to more precisely characterize the task - that is, you need what "in the old days" we'd call a task description and annotation guidelines.
The new SPECTER embeddings from
@allen_ai
are awesome: They're even more awesome when integrated into the Neural Covidex to power related article search: given an article, find similar articles. Try it out!
I've written about anti-Asian bias and model minority issues before: ... and I've gotten the eye-roll reaction of "why don't you sit down since you've got it so good already"... which is exactly the problem.
New work on using doc2query for summarization by
@rodrigfnogueira
et al. - works surprisingly well! Samples from CORD-19 corpus related to COVID-19 below.
The path will be from expensive general-purpose models (e.g., GPT-4) to cheaper specialized distilled models (e.g., encoder-only). Here's the progression I see -
Another instance of the disconnect between academic reviewing and real-world impact: tl;dr - our keyword spotting model in JavaScript now powers wake word detection "Hey Firefox!" in Firefox Voice:
Thanks to
@edwinzhng
and
@1729_gupta
our Anserini IR toolkit can now search
@allen_ai
's COVID-19 corpus. -
@kchonyc
's connected it up to SciBERT, and bam, we have a two-stage neural ranking pipeline! Join and build on our work!
LLMs are missing a critical ingredient... and
@Primal
knows what it is! (Hint: knowledge graphs and neuro-symbolic approaches) Here's a writeup of the journey so far, featuring CEO
@YvanCouture
- oh btw, I'm the CTO.
To every university president who's sent mass email condolences (offering support, solidarity, etc.) after previous racially-motivated hate crimes before... we're waiting.
And finally... with a fine-tuned model: "Why do I need all those billions of parameters?" Cost pressure is a powerful economic driver. Distillation (along with pruning, quantization, etc.) provides the answer (and will likely yield safer models).
In a preprint of our
#sigir2019
short, we conducted a meta-analysis of 100+ papers reporting results on Robust04. tl;dr - weak baselines are still prevalent (among both neural and non-neural models).
Further evidence of the one-sidedness of recent attacks on pretrained language models: the critics have conveniently forgotten about the benefits they bring to disadvantaged populations.
We've released a new version (v0.11.0.0) of our Pyserini Python toolkit to support replicable IR research, now providing first-stage retrieval for sparse, dense, and hybrid representations. Our new arXiv paper provides an overview:
I am satisfied to end a recent paper thusly: Finally, in our collective frenzy to improve results on standard benchmarks, we may sometimes forget that the ultimate goal of science is knowledge, not owning the top entry in a leaderboard.
Apparently, I was recognized as an outstanding area chair at
#emnlp2020
and didn't realize it until now...
#humblebrag
(do people even use this hashtag anymore?)
Sparse or dense representations for retrieval? Or hybrid? psssssh, says
@jacklin_64
- neither! Densify sparse lexical reps and tack on dense semantic reps: best of both worlds, and simpler infrastructure to boot (no need for HNSW or inverted indexes!)
The Iron Triangle of LLMs: capable models, low development costs (CapEx), low inference costs (OpEx)... pick two, because you can't have all three! Examples? OpenAI API = low CapEx, high OpEx; BloombergGPT = (very) high CapEx but flexibility in OpEx depending on deployment.
We've connected Anserini to Solr to Blacklight to present a search frontend to
@allen_ai
's COVID-19 corpus! Check out - awesome work by
@edwinzhng
and
@1729_gupta
In the Information Retrieval course, I let my students pick the IR toolkit of their choice among all the solutions we have available as a research community. The clear front-runner by a mile was Pyserini: . In large part thanks to its extensive documentation!
I'll be giving a talk on a Conceptual Framework for a Representational Approach to Information Retrieval on April 5, 4pm PT as a Pinterest Labs Tech Talk
@PinterestEng
. RSVP and learn more here!
Happy to share Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in 11 languages by
@crystina_z
@xueguang_ma
@ShiPeng16
tl;dr - think of this as the open-retrieval condition of TyDi.
Paper:
Data: