Cohere is growing! If you’re passionate about building world-class LLMs and delivering them to customers, you should apply. I’m specifically looking for folks with experience in NLP data, eval, and annotation. Check out the roles here:
DMs are open!
Our new work "Asking and Answering Questions to Evaluate the Factual Consistency of Summaries" does exactly that. We use question generation and question answering models to evaluate whether summaries are factually consistent w/ the source text.
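For a sense of the pipeline's shape, here is a rough sketch of a QG/QA-style consistency check like the one described above: generate questions from the summary, answer them against both the summary and the source, and compare answers. The question-generation checkpoint name is a placeholder, and the scoring details (sampling, token-F1 answer comparison) are illustrative rather than the exact setup from the paper.

```python
# Illustrative QG/QA consistency sketch (not the paper's exact models or settings).
from collections import Counter
from transformers import pipeline

qg = pipeline("text2text-generation", model="YOUR_QG_CHECKPOINT")  # placeholder QG model
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def token_f1(a: str, b: str) -> float:
    """SQuAD-style token-overlap F1 between two answer strings."""
    a_toks, b_toks = a.lower().split(), b.lower().split()
    common = sum((Counter(a_toks) & Counter(b_toks)).values())
    if not a_toks or not b_toks or common == 0:
        return 0.0
    p, r = common / len(a_toks), common / len(b_toks)
    return 2 * p * r / (p + r)

def consistency_score(source: str, summary: str, n_questions: int = 5) -> float:
    # 1) Generate questions conditioned on the summary.
    outputs = qg(f"generate questions: {summary}",
                 num_return_sequences=n_questions, do_sample=True)
    questions = [o["generated_text"] for o in outputs]
    # 2) Answer each question against the summary and the source, 3) compare answers.
    scores = []
    for q in questions:
        ans_summary = qa(question=q, context=summary)["answer"]
        ans_source = qa(question=q, context=source)["answer"]
        scores.append(token_f1(ans_summary, ans_source))
    return sum(scores) / len(scores) if scores else 0.0
```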
🎉🎉🎉
Also, I'm hiring for an MLE/SWE! If you want to build LLMs with
@cohere
and are interested in developing challenging model evaluation settings + curating high-quality data, please reach out!
ICYMI: We also just opened our NYC office 👀
[Arena Update]
@cohere
's Command R is now top-10 on the Arena leaderboard🔥
It's now one of the best open models, reaching the level of top proprietary models. We find the model great at handling longer context, which we plan to separate out as a new category in Arena very soon.
Hello. I am popping up from Twitter lurking to claim the "Longest Time Between Life Update and Actually Announcing It" award:
I graduated from NYU this May and started working at
@CohereAI
in August as a tech lead for Data+Evaluation!
📣 We heard you liked the open weights we dropped last month, so we're doing it again, except more.
🎉 Introducing Command R+! 🎉 Really proud of what we've built and excited to see what y'all build on top of this!
⌘R+
Welcoming Command R+, our latest model focused on scalability, RAG, and Tool Use. Like last time, we're releasing the weights for research use; we hope they're useful to everyone!
Already tired of months-old papers at ACL? Looking for a hot, new preprint?
Check out SQuALITY 💨🍵!
SQuALITY is a long-document, question-focused summarization dataset. Unlike many existing summ. datasets, SQuALITY summaries are fully crowdsourced!
(1/8)
It's true, I successfully defended my dissertation yesterday! Big thanks to
@hhexiy
,
@ml_perception
,
@JoaoSedoc
for serving on the committee, and an especially big thank you to my advisors
@sleepinyourhat
and
@kchonyc
for advising and supporting me over the past five years.
there is an unreasonable number of "alex wang"s in the LLM space, between myself at Cohere, an Alex Wang at Perplexity,
@alexandr_wang
at Scale...truly blursed.
s/o Alex L. Wang for once maintaining a disambiguation of "alex wang"s in ML
Excited to share this work with the world, both the results and the actual model weights. Looking forward to seeing what the community will build with this! Stay tuned for more!
✍️details:
⚖️weights:
🤖chat:
⌘-R
Introducing Command-R, a model focused on scalability, RAG, and Tool Use. We've also released the weights for research use; we hope they're useful to the community!
I live in Toronto now and
#ACL2023NLP
happens to be here too! If you want to chat about LLMs, where to eat/drink in Toronto, or opportunities at
@cohere
, feel free to reach out or stop by the Cohere booth!
This drove me crazy for a while: We had internal experiments showing RM > LLM for evaluation, which felt really counterintuitive to me. Nice to get external confirmation, and thanks for building the benchmark
@natolambert
! :)
If you're interested in working with us shoot me a DM or email about yourself and something cool you've worked on recently! I'm looking for people interested in LLM evaluation and data creation, but we have plenty of other roles.
The new NYC office is sweet!!
Introducing Rerank 3! Our latest model focused on powering much more complex and accurate search.
It's the fastest, cheapest, and highest-performing reranker available. We're really excited to see how this model influences RAG applications and search stacks.
The SustaiNLP2020 (at
@emnlp2020
) Call for Submissions is up at . Submissions are evaluated on SuperGLUE performance and energy efficiency as measured by
@PeterHndrsn
's library. Come develop more energy-efficient NLP models! The deadline is Aug 28, and baseline code will be available soon!
New-ish paper for ACL 2019 comparing a diverse set of tasks for pretraining sentence encoders and augmenting existing pretrained LMs, made possible by great collaborators from Brown, Google, JHU, and many more, as well as oodles of compute.
Come hear me attempt to recap a couple years of progress in NLP in 5m, or tell me your favorite glue puns at the poster session immediately afterwards!
#NeurIPS2019
, catch the spotlight on our recently created SuperGLUE benchmark, which helps language understanding researchers set a new, higher bar for
#NLP
research. It's Wed 4:55-5:00 PM West Ballrooms A + B. Read more:
Benchmark:
🚨PREPRINT ALERT🚨
"On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research" w/ Beyza Ermis,
@PSH_Lewis
, and
@sarahookr
Paper:
Code:
Excited to be co-hosting a mentoring session on "establishing collaborations and networking" and "managing up" with
@ryanzhumich
and
@yangfeng_ji
for
#acl2020nlp
on 7/8 at 12pm ET. I imagine there will be a lot of learning from this session, especially by me😅
Sam is an amazing advisor and human being! This is incredibly deserved. The lab is also looking to hire researchers at various levels, and that's a great opportunity to work with Sam and the rest of us!
Excited to share this work! We look at (1) methods for measuring bias in word embeddings applied to sentence encoders, finding that these methods don't straightforwardly apply, and (2) tests for nuanced social biases that are difficult or impossible to study at the word level
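As background for (1), here is a minimal sketch of the word-level association test (a WEAT-style effect size) that this line of work builds on and adapts to sentence encoders. The vectors and target/attribute sets are supplied by the caller; this illustrates the general recipe rather than the paper's exact procedure.

```python
# WEAT-style effect size over embedding vectors (illustrative sketch).
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    # Differential association of embedding w with attribute sets A and B.
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # Standardized difference of mean associations for target sets X and Y.
    x_assoc = [association(x, A, B) for x in X]
    y_assoc = [association(y, A, B) for y in Y]
    pooled = np.std(x_assoc + y_assoc, ddof=1)
    return (np.mean(x_assoc) - np.mean(y_assoc)) / pooled
```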
The birds-of-a-feather session on generation at
#acl2020nlp
was awesome! Discussion was lively and spawned *multiple* followup discussions. Kudos to
@sebgehr
,
@gh_marjan
, and another moderator whose name I missed!
There's another one in a few hours (5pm ET), highly recommended!
I'll also be presenting "Asking and Answering Questions to Evaluate the Factual Consistency of Summaries" (joint work with
@kchonyc
and
@ml_perception
) at sessions 9A (7/7, 1pm ET) and 10B (7/7, 5pm ET). Come chat and hang out!
The past four months have been a blitz of fast, fun, and cool projects. And I've been fortunate to learn from
@egrefen
@Nils_Reimers
Phil Blunsom and many others. There's cool stuff from Cohere on the horizon that I'm excited to share soon.
This is being presented at
@emnlpmeeting
on Friday at Session 2! Sadly none of us (
@yzpang97
,
@_angie_chen
,
@zhansheng
,
@sleepinyourhat
) could make it to Abu Dhabi, but feel free to reach out if you have questions or want to talk about summ., data quality, or crowdsourcing!
We're throwing 4 hackathons at each of our offices around the world!! If you're in NYC, London, Toronto, or SF come hang out with us and build with Command R and R+ 🛠️
There's a lot to do: making use of the multiple references, developing efficient human evaluation of long texts, and enabling long-text summ. with prompting. If this sounds interesting to you, check out the links below:
paper:
data:
(7/8)
We're now accepting applications for the 6th CSLI Undergraduate Summer Internship Program, which places students in Stanford labs for 8 weeks of mentored research. Housing and a stipend provided. Prior research experience not required:
Protein language models (pLMs) can give protein sequences likelihood scores, which are commonly used as a proxy for fitness in protein engineering. But what do likelihoods encode?
In a new paper (w/
@JacobSteinhardt
) we find that pLM likelihoods have a strong species bias!
1/
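For context on what a "likelihood score" means in practice, here is a minimal sketch of a masked-LM pseudo-log-likelihood for a protein sequence: mask each residue in turn and sum the log-probability the model assigns to it. The ESM-2 checkpoint is chosen purely for illustration and is not necessarily the scoring setup used in the paper.

```python
# Pseudo-log-likelihood of a protein sequence under a masked protein LM (sketch).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

@torch.no_grad()
def pseudo_log_likelihood(sequence: str) -> float:
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    total = 0.0
    # Skip the special tokens at the start and end (CLS/EOS in ESM tokenizers).
    for pos in range(1, ids.shape[1] - 1):
        masked = ids.clone()
        masked[0, pos] = tokenizer.mask_token_id
        logits = model(input_ids=masked).logits
        log_probs = torch.log_softmax(logits[0, pos], dim=-1)
        total += log_probs[ids[0, pos]].item()
    return total

print(pseudo_log_likelihood("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```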
Also, while I have you here, consider taking the NLP Community Metasurvey! Having an opinion is fun and seeing how your opinion lines up with the rest of the community is extra fun!
Inspecting the generated questions, we were surprised to find that they are often fluent, on-topic, and sensible. Nvidia has a great paper pushing on the question generation capabilities of existing models: .
SQuALITY is question-focused and multi-reference: For each story there are 5 questions, and for each question there are 4 reference summaries. The responses are highly diverse, an aspect of summarization that isn't well-represented in existing single-reference datasets. (4/8)
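Purely to illustrate that structure (field names here are hypothetical, not the released schema), one example might be represented roughly like this:

```python
# Hypothetical representation of one SQuALITY example (illustrative schema only).
from dataclasses import dataclass
from typing import List

@dataclass
class QuestionEntry:
    question: str
    references: List[str]  # 4 crowdsourced reference summaries per question

@dataclass
class StoryExample:
    story: str                      # Project Gutenberg story, roughly 4-6k words
    questions: List[QuestionEntry]  # 5 question-focused prompts per story
```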
Probably one of the best decisions I've made in the past five years has been to do my PhD at NYU. It's a great place to do cutting-edge ML and NLP research. Not to mention it's in NYC!
We spent several months working with Upwork writers and undergraduates to create summaries of Project Gutenberg stories (4-6k words long). We put a big focus on developing a protocol for collecting text responses that is cost-efficient while also maintaining quality. (3/8)
#EMNLP2020
is great, but it can be challenging to engage with so much research when it's getting late and you've spent most of the day "at" the conf... shout out to
@gregd_nlp
for putting the Language Generation session on his back and keeping the questions+discussion flowing😅
Human evaluators consider human-written summaries to be substantially better than summaries from state-of-the-art supervised summarization systems along several dimensions. Also, automatic metrics are a poor indicator of model quality for SQuALITY. (5/8)
The group is very collaborative and supportive, and is pursuing excitingly risky and fun lines of research. I highly recommend collaborating with the folks here and visiting whenever you're able!
Mondays are dumb.
Cloudy Mondays are dumb.
Damn, cloudy Mondays are dumb.
Those damn cloudy Mondays are dumb.
Dominate those damn cloudy dumb Mondays.
Tl;dr: dom dem dam dim dum days
#MondayMotivation
Common approaches for building summarization datasets (scraping, developing heuristics) have led to unexpected amounts of noise in the datasets. Crowdsourcing summaries is expensive (and consequently understudied), but, done carefully, it is one way to mitigate noise. (2/8)
Using NLP models to evaluate generated text is a promising direction, but it's clear there is a lot of (exciting!) work to be done to make these methods reliable.
This method correlates much better with human judgments of consistency than existing metrics on the XSUM and CNN/DM summarization datasets. Our method is especially effective on the latter, likely due to the somewhat extractive nature of the dataset.
Hongyao and Eric TA'd several of my classes, and I can attest that they are super smart and kind people working on exciting problems. I remember talking with Hongyao about strategic Doodle voting, the lessons of which I continue to use today. Congrats
@hongyaoma
and Eric!
The ACM SIGecom Dissertation Award for 2019 goes to Hongyao Ma
@hongyaoma
, with honorable mentions going to Rediet Abebe
@red_abebe
and Eric Balkanski. Read more about their dissertations here:
@rajammanabrolu
@natolambert
Mostly believing in the magic of a general-purpose LLM working better than a smaller task-specific model, nothing especially principled
@joechoochoy
@idavidrein
Anecdotally, there are some activities (mostly cognitively intensive games) where I'm starving afterwards even though I've only really been thinking
On the other hand, we find that the bottleneck in our metric is the QA model breaking down, despite the models being pretrained and finetuned on data sources quite similar to the test setting.