Karthik Narasimhan

@karthik_r_n

3,114
Followers
456
Following
7
Media
260
Statuses

Head of Research @SierraPlatform , Associate Professor @PrincetonCS . Previously @OpenAI , PhD @MIT_CSAIL , BTech @iitmadras

Princeton, NJ
Joined July 2015
Pinned Tweet
@karthik_r_n
Karthik Narasimhan
3 months
Excited to release τ-bench (TAU for Tool-Agent-User ⚒️-🤖-🧑), a new benchmark to evaluate AI agents' performance and reliability in real-world settings with dynamic user and tool interaction. Paper: , Blog:
4
26
117
@karthik_r_n
Karthik Narasimhan
6 months
SWE-agent is finally out. A few highlights: 1. Agent-Computer Interface (ACI) design will be critical for the success of AI agents, much like HCI is critical for how effective humans are with computers. 2. You can use SWE-agent out of the box on any github issue. (1/2)
Tweet media one
@jyangballin
John Yang
6 months
SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source! We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code
Tweet media one
64
426
2K
13
22
168
@karthik_r_n
Karthik Narasimhan
1 year
LMs are not just about generating text in a linear fashion anymore. They can be used to explore unique points of attack, backtrack, search, and much more to solve complex problems and puzzles. Check out 🌲Tree-of-Thought, our new framework for systematic reasoning with LLMs!
@ShunyuYao12
Shunyu Yao
1 year
Still use ⛓️Chain-of-Thought (CoT) for all your prompting? May be underutilizing LLM capabilities🤠 Introducing 🌲Tree-of-Thought (ToT), a framework to unleash complex & general problem solving with LLMs, through a deliberate 'System 2' tree search.
Tweet media one
93
594
3K
1
23
105
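For readers unfamiliar with the framework, the search loop behind Tree-of-Thought can be summarized in a few lines. The sketch below is illustrative only and is not the paper's implementation; `propose` and `evaluate` are hypothetical stand-ins for LLM calls.

```python
# Minimal sketch of the Tree-of-Thought idea: breadth-first search over
# partial "thoughts", keeping only the top-scoring candidates at each depth.
# `propose` and `evaluate` stand in for LLM calls and are hypothetical.

def propose(state, k):
    """Ask an LLM for k candidate next thoughts given the partial solution."""
    raise NotImplementedError  # e.g., one LLM call returning k strings

def evaluate(state):
    """Ask an LLM to score how promising a partial solution looks (0-1)."""
    raise NotImplementedError

def tree_of_thought(problem, depth=3, breadth=5, keep=2):
    frontier = [problem]                       # start from the bare problem statement
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose(state, breadth):
                candidates.append(state + "\n" + thought)
        # keep only the most promising partial solutions (pruning = backtracking)
        frontier = sorted(candidates, key=evaluate, reverse=True)[:keep]
    return max(frontier, key=evaluate)
```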
@karthik_r_n
Karthik Narasimhan
2 years
We teach language models to both reason and act in the same breath. Exciting since it provides a natural mechanism for allowing LLMs to incorporate external knowledge (APIs, databases, web) vs black box internal reasoning, for tasks from QA to webpage navigation.
@ShunyuYao12
Shunyu Yao
2 years
Large Language Models (LLM) are 🔥in 2 ways: 1.🧠Reason via internal thoughts (explain jokes, math reasoning..) 2.💪Act in external worlds (SayCan, ADEPT ACT-1, WebGPT..) But so far 🧠 and💪 remain distinct methods/tasks... Why not 🧠+💪? In our new work ReAct, we show 1+1>>2!
10
66
389
2
3
50
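As an editorial aside, the interleaved reason-and-act loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code; `llm` and `tools` are hypothetical stand-ins for a language model call and a dictionary of external tools.

```python
# Minimal sketch of a ReAct-style loop: the model alternates free-form
# "Thought:" steps with "Action:" steps that call external tools, and each
# tool result is fed back into the context as an "Observation:".

def react(question, llm, tools, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")        # model emits a thought (+ optionally an action)
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:                      # e.g. "Action: Search[Princeton]"
            name, _, arg = step.split("Action:")[-1].strip().partition("[")
            result = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {result}\n"
    return None
```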
@karthik_r_n
Karthik Narasimhan
4 years
Thank you @AmazonScience for supporting our research! Hope to make some useful advances in conversational systems!
@AmazonScience
Amazon Science
4 years
Congratulations to the 51 award recipients of the 2019 Amazon Research Awards, who represent 39 universities in 10 countries. View the full list and find out how to be added to the 2020 Call For Proposal distribution list here: #AmazonResearchAwards
1
11
69
5
1
47
@karthik_r_n
Karthik Narasimhan
3 years
Even though we're still at the tip of the iceberg here, I'm very excited about this direction and its future potential. As we continue to scale up our neural networks, we will need more methods to enable more efficient and cost-effective inference per input instance.
@VishvakM
Vishvak Murahari
3 years
Data Multiplexing for Neural Networks🔀 Can neural networks process multiple instances simultaneously as a single mixed input, similar to how radio channels can share bandwidth to carry multiple signals? Surprisingly, we find they can indeed!! 📜 [1/6]
10
70
397
0
4
47
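A rough picture of the multiplexing idea, for illustration only (the shapes, mixing scheme, trivial "shared model", and untrained heads below are assumptions, not the paper's architecture): several inputs are combined into one vector with fixed per-slot transforms, the shared network runs once on the mixture, and per-slot heads recover each instance's output.

```python
import numpy as np

# Toy sketch of data multiplexing: N inputs are mixed into a single vector
# with fixed random per-slot transforms, a shared model processes the mixture
# once, and per-slot heads (learned in the actual paper) recover each output.
rng = np.random.default_rng(0)
N, d = 4, 16                                    # instances per mixture, feature dim
P = rng.normal(size=(N, d, d)) / np.sqrt(d)     # one fixed mixing transform per slot
x = rng.normal(size=(N, d))                     # N separate input instances

mixed = sum(P[i] @ x[i] for i in range(N)) / N  # multiplex: one vector carries N inputs
h = mixed                                       # stand-in for a single shared forward pass
heads = rng.normal(size=(N, d, d))              # per-slot demultiplexing heads (untrained here)
outputs = [heads[i] @ h for i in range(N)]      # one recovered representation per instance
```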
@karthik_r_n
Karthik Narasimhan
1 year
Very excited for this upcoming Center for Language and Intelligence @ Princeton! We are actively recruiting!
@prfsanjeevarora
Sanjeev Arora
1 year
Princeton has a new Center for Language and Intelligence, researching LLMs + large AI models, as well as their interdisciplinary applications. Looking for postdocs/research scientists/engineers; attractive conditions.
21
115
617
0
1
43
@karthik_r_n
Karthik Narasimhan
7 months
Software engineering is so much more than just generating code. Exciting to see good progress on SWE-bench () - solving ~13% of real-world bugs is impressive. Still a long way to go though!
@cognition_labs
Cognition
7 months
Today we're excited to introduce Devin, the first AI software engineer. Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork. Devin is
5K
11K
45K
3
4
38
@karthik_r_n
Karthik Narasimhan
2 years
The web (or even a simulation of it) is a great environment for building language-grounded RL agents! Has real-world content, is interactive and can be dynamically changing, is easily scalable and fast to run, and has direct practical applications in assisting humans with tasks.
@ShunyuYao12
Shunyu Yao
2 years
What if you had a bot you could just instruct in English to shop online for you? Check out our latest work 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents w/ @__howardchen @jyangballin , @karthik_r_n @princeton_nlp
1
14
61
1
7
35
@karthik_r_n
Karthik Narasimhan
9 months
Thank you for inviting me! Much of the work I presented was led by @ShunyuYao12 (who is on the job market!), @jyangballin (who is applying to PhD programs this year!), @_carlosejimenez , and @tedsumers
@stanfordnlp
Stanford NLP Group
9 months
We were really excited to have @karthik_r_n from @princeton_nlp join us today to give us the good oil on the future of language agents in the age of Foundation Models! #NLProc
Tweet media one
2
4
63
0
4
29
@karthik_r_n
Karthik Narasimhan
4 years
If you work on RL agents for text-based games (or other interactive envs with language), this might be interesting! We show that agents may not be adequately leveraging the rich semantics present in the observations and there's lots of room for more semantic-centric approaches
@ShunyuYao12
Shunyu Yao
4 years
For autonomous tasks with language (e.g. text games), how much does an agent rely on language semantics vs. memorization? Our #NAACL2021 paper (, joint w/ @karthik_r_n , @mhauskn ) proposes ablation studies with surprising findings and useful insights! (1/3)
Tweet media one
1
2
32
1
4
26
@karthik_r_n
Karthik Narasimhan
2 years
Training agents to operate autonomously over the web will be the next frontier for AI. If you are looking for an open-source benchmark to build and test your own web agents, check out (by @ShunyuYao12 , @__howardchen , @jyangballin )
@AdeptAILabs
Adept
2 years
1/7 We built a new model! It's called Action Transformer (ACT-1) and we taught it to use a bunch of software tools. In this first video, the user simply types a high-level request and ACT-1 does the rest. Read on to see more examples ⬇️
136
919
5K
1
5
26
@karthik_r_n
Karthik Narasimhan
4 years
Seems like a very cool RL testbed for many different reasons!
@egrefen
Edward Grefenstette
4 years
Want to help push the boundaries of RL research? Need a rich, difficult, and procedurally-generated environment with loads of structure and intricacy? An astounding amount of human play data? Sophisticated strategies and documentation? We got you (and it's faster than ALE!) [1/6]
Tweet media one
2
49
171
0
3
23
@karthik_r_n
Karthik Narasimhan
3 years
Check out our new #ACL2021 paper which shows how self-attention nets can learn languages with bounded hierarchical structure (which applies to most practical uses of human languages). A step towards understanding why Transformers are 🔥 for NLP!
@ShunyuYao12
Shunyu Yao
3 years
Hierarchical structure is a core aspect of language syntax. Recurrent networks can systematically process recursion by emulating stacks, but can self-attention networks? If so, how? Our #ACL2021 paper sheds light on this fundamental issue! (1/5)
Tweet media one
1
6
46
0
2
23
@karthik_r_n
Karthik Narasimhan
4 years
Looking forward to this exciting workshop!
@egrefen
Edward Grefenstette
4 years
"Going" to @icmlconf ? Come hear about the future of language+RL at the #LaReL2020 workshop on Language in Reinforcement Learning, held July 18. Here's a short thread introducing some of the highlights. [1/9]
Tweet media one
1
47
128
0
3
21
@karthik_r_n
Karthik Narasimhan
1 year
Check out our latest paper on a large-scale toxicity analysis of 'persona-assigned' ChatGPT! Beyond the direct findings that assigning personas can make these models behave in undesirable ways, our study also hints at a couple of broader points:
@AmeetDeshpande_
Ameet Deshpande
1 year
Large language models and chatbots are being ubiquitously used. But are they safe? In our large-scale toxicity analysis of ChatGPT, we find that assigning it a "persona" significantly increases its toxicity (up to 6X). Paper: Blog:
Tweet media one
4
33
137
1
2
21
@karthik_r_n
Karthik Narasimhan
2 years
DataMUX () wins the @BellLabs 2nd Prize! Proud of the students @VishvakM @_carlosejimenez @RunzheYang . Humbling experience visiting the campus with all its history and interacting with the researchers and esteemed judges (incl. a Nobel Prize winner!)
Tweet media one
@BellLabs
Bell Labs
2 years
The runners-up of the #BellLabsPrize22 are... 🥈 2nd: Karthik Narasimhan, Carlos Jimenez, Vishvak Murahari, & Runzhe Yang of @Princeton . 🥉 3rd: Xiangfeng Duan of @UCLA .
0
1
5
1
2
20
@karthik_r_n
Karthik Narasimhan
1 year
Check out @RunzheYang 's SocraticAI framework! Language models like GPT are trained on vast amounts of data, so they surely have the knowledge to solve a wide variety of problems. The key question has always been - how does one elicit the right knowledge?
@RunzheYang
"Tony" Runzhe Yang
1 year
🧙Introducing SocraticAI, a multi-agent framework for using independent instantiations of large language models (LLMs) to collaboratively and creatively solve problems.
2
39
143
1
1
18
@karthik_r_n
Karthik Narasimhan
3 years
Super excited about this new workshop on language supervision! (organized w/ @jacobandreas @aidanematzadeh ) Language will play an increasingly important role in building scalable, value-aligned ML systems in the future, beyond just being an applied area for ML methods.
@LNLSWorkshop
Workshop on Learning with NL Supervision
3 years
Interested in: - training image classifiers & robots using language? - using info from text corpora in other ML applications? - understanding the role of language in human concept learning? Join us for the Workshop on Learning with Natural Language Supervision ( @aclmeeting '22)!
Tweet media one
1
8
23
0
9
18
@karthik_r_n
Karthik Narasimhan
1 year
Check out our latest paper connecting long-standing ideas in cognitive architectures with facets of modern language agents, and subsequently sketching a blueprint for improving agent capabilities.
@ShunyuYao12
Shunyu Yao
1 year
Language Agents are cool & fast-moving, but no systematic way to understand & design them.. So we use classical CogSci & AI insights to propose Cognitive Architectures for Language Agents (🐨CoALA)! w/ great @tedsumers @karthik_r_n @cocosci_lab (1/6)
Tweet media one
10
116
464
0
0
16
@karthik_r_n
Karthik Narasimhan
4 years
@emnlp2020 We could start using openreview with public anonymous reviews? Might help increase reviewer accountability.
0
0
16
@karthik_r_n
Karthik Narasimhan
4 years
Check out how we use language models for better exploration in text adventure games! #emnlp2020
@ShunyuYao12
Shunyu Yao
4 years
Happy to announce our new #emnlp2020 paper "Keep CALM and Explore: Language Models for Action Generation in Text-based Games" is online! w/ Rohan, @mhauskn , @karthik_r_n arxiv: code: more below (1/n)
1
9
36
0
2
16
@karthik_r_n
Karthik Narasimhan
5 years
Very nice overview of recent work in language+RL! Solving language will be key for scaling up RL and making it more robust and sample efficient.
@_rockt
Tim Rocktäschel
5 years
How can RL agents exploit the compositional, relational and hierarchical structure of the world? A growing number of authors propose learning from natural language. We are excited to share our @IJCAIconf survey of this emerging field! TL;DR:🤖+📖=📈🎯🏆🥳
Tweet media one
2
73
258
0
2
15
@karthik_r_n
Karthik Narasimhan
1 year
Thanks to TAS Princeton () for facilitating this seminar series! Was a great experience to interact with the teachers and discuss how LLMs like ChatGPT will change K-12 education
@EPrinceton
Princeton Engineering
1 year
Last week, @karthik_r_n presented a seminar for 8 HS & middle school teachers on #naturallanguageprocessing , including tools like #ChatGPT . They learned about challenges of the technology, & shared ideas on how to thoughtfully incorporate ChatGPT into lessons. #TeachersAsScholars
Tweet media one
0
1
3
0
1
14
@karthik_r_n
Karthik Narasimhan
2 years
Some of the jailbreaking #ChatGPT examples in are crazy. Building LLM-based systems that are 100% robust to such attacks feels so hard since it seems like one can always go one level deeper with prompts (think Inception-style)
1
0
14
@karthik_r_n
Karthik Narasimhan
2 years
@percyliang @percyliang you were way ahead of your time with WoB () 🙂 - one of my favorite papers and a huge inspiration for our own recent work. Now we finally have stronger hammers (models) to tackle these challenges and push further in this space
1
2
13
@karthik_r_n
Karthik Narasimhan
2 years
@kroscoo Absolutely. Simplest and most natural way to introduce the idea of LMs as probability distributions over sequences, MLE, etc. in my opinion
0
1
10
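For context, the standard formulation being referred to is the autoregressive factorization of sequence probability together with its maximum-likelihood training objective:

```latex
p_\theta(w_1, \dots, w_T) = \prod_{t=1}^{T} p_\theta(w_t \mid w_{<t}),
\qquad
\hat{\theta} = \arg\max_\theta \sum_{w \in \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t})
```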
@karthik_r_n
Karthik Narasimhan
2 years
I think we haven't yet fully imagined the capabilities of ReAct-style models (or 'language agents' as @ShunyuYao12 likes to call them). It's going to be a very interesting next few months!
@random_walker
Arvind Narayanan
2 years
For connecting LLMs to the Internet, I'm using the ReAct paper (which I thought was elegant and brilliant *before* realizing it was coauthored by my Princeton CS colleagues 💯): I used @simonw 's implementation as a starting point.
6
49
568
1
0
11
@karthik_r_n
Karthik Narasimhan
1 year
Good point! Historically, we split up the general 'AI' problem into fields like NLP, CV, etc. and further subfields to make problems tractable. Now that some of these atomic problems are being solved, it's time to move back up the layers again, towards more holistic NLP/AI systems
@YiTayML
Yi Tay
1 year
Just a few years ago, research was mostly sorted by "applications". When folks asked what research you're working on, you were expected to say something like "oh I work in question answering" or "sentiment analysis" or something 😅. In fact, all the conference tracks are sorted as
20
43
367
1
1
12
@karthik_r_n
Karthik Narasimhan
3 years
Very excited about our latest work on 'Semantic Supervision' as an alternative paradigm to standard supervised classification. We provide semantic descriptions of output classes (in English, JSON, etc.) directly as choices for the classifier ...
@AmeetDeshpande_
Ameet Deshpande
3 years
Can standard ML classifiers generalize to 1️⃣New classes❓ 2️⃣Novel superclasses❓ 3️⃣Unseen tasks❓ 4️⃣Rephrased class descriptions❓ In our paper "Semantic Supervision" we propose a unified framework to enable it all✅! 🌐 (w Austin Wang, @karthik_r_n ) 🧵
2
21
80
1
1
11
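To make the "descriptions as choices" idea concrete, here is a minimal sketch (the encoders and scoring below are hypothetical stand-ins, not the SemSup architecture): the classifier scores the input against a natural-language description of each class, so adding or rephrasing a class only requires a new description.

```python
import numpy as np

# Minimal sketch of semantic supervision: classify by scoring the input
# against natural-language class descriptions instead of integer class IDs.
# `encode_input` and `encode_description` are hypothetical learned encoders.

def encode_input(text: str) -> np.ndarray: ...
def encode_description(text: str) -> np.ndarray: ...

def classify(text: str, class_descriptions: list[str]) -> int:
    x = encode_input(text)
    scores = [float(x @ encode_description(d)) for d in class_descriptions]
    return int(np.argmax(scores))   # new or unseen classes just need a new description

# Example usage (hypothetical descriptions):
# descriptions = ["a review expressing satisfaction with a product",
#                 "a review expressing disappointment or frustration"]
# classify("The battery died after two days.", descriptions)
```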
@karthik_r_n
Karthik Narasimhan
1 year
Imitation learning (IL) for decision making is similar to next token prediction in language models. Given the role of scaling in modern LLMs, should it not help IL similarly? Check out @JensTuyls 's work on pushing IL to the limit for a new SOTA (by 2x) on the challenging game of NetHack!
@JensTuyls
Jens Tuyls
1 year
Imitation learning is one of the most widely used methods in ML, but how does compute affect its performance? We explore this question in the challenging game of NetHack and find our scaled-up agent to outperform prior SOTA by 2x! [1/6]
Tweet media one
2
19
107
0
0
11
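The analogy in the tweet can be written out explicitly: behavioral cloning, the simplest form of imitation learning, optimizes the same kind of cross-entropy objective as next-token prediction, with (state, action) pairs playing the role of (context, next token):

```latex
\mathcal{L}_{\text{IL}}(\theta) = -\sum_{t} \log \pi_\theta(a_t \mid s_{\le t}),
\qquad
\mathcal{L}_{\text{LM}}(\theta) = -\sum_{t} \log p_\theta(w_t \mid w_{<t})
```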
@karthik_r_n
Karthik Narasimhan
2 years
Semantic supervision with class descriptions provides a simple, scalable solution for few-shot extreme classification (XC) over millions of labels! Check out @PranjalAggarw16 's paper below which achieves SOTA results on XC
@PranjalAggarw16
Pranjal Aggarwal
2 years
Check out our new paper 'SemSup-XC: Semantic Supervision for Zero and Few-shot Extreme Classification' where we tackle the task of text classification over millions of unseen labels, with state-of-the-art results!📈 🌐 (w @AmeetDeshpande_ , @karthik_r_n )
1
2
15
1
1
10
@karthik_r_n
Karthik Narasimhan
5 years
How can we measure (and fix) discrepancies in long-term properties of language models (e.g. entropy)? Check out our latest work: (w/ Mark Braverman, Xinyi Chen, Sham Kakade, Cyril Zhang, Yi Zhang)
0
1
10
@karthik_r_n
Karthik Narasimhan
4 years
In our new #emnlp2020 findings paper on handling spatial references, we find that Relation Nets can enable more fine-grained and robust grounding (compared to more prevalent CNN-based approaches).
@JimmyTYYang1
Jimmy T.Y. Yang
4 years
Can we learn a multi-modal representation that is robust to noise unseen during training? Happy to share our new EMNLP Findings paper w/ Andrew, @karthik_r_n Paper: (1/n)
Tweet media one
1
0
11
0
1
9
@karthik_r_n
Karthik Narasimhan
7 months
0
0
8
@karthik_r_n
Karthik Narasimhan
1 year
LLMs like GPT-4 do more than generate text, but we don't have systematic, scalable benchmarks to measure reasoning capabilities. COLLIE is a framework that allows generating "unit tests" for LLMs that combine text generation with types of reasoning like logic, arithmetic, etc.
@ShunyuYao12
Shunyu Yao
1 year
Write a sentence with "dog, frisbee, catch, throw" 👉Too easy for 7B LM... Will (constrained) text generation (CTG) "die out" like many other NLP tasks, in face of LLM? 👉Excited to introduce 🐕COLLIE, next-gen CTG that even challenges GPT-4! (1/n)
Tweet media one
2
13
94
1
1
9
@karthik_r_n
Karthik Narasimhan
4 years
Attention Guidance for Transformers: #emnlp2020 Findings
@AmeetDeshpande_
Ameet Deshpande
4 years
Can attention heads in Transformers be modified to improve performance and convergence speed during pre-training? Find out from our paper (w @karthik_r_n ) accepted at Findings of EMNLP! #nlproc #emnlp2020 Paper: Code: (1/4)
Tweet media one
1
5
36
1
1
9
@karthik_r_n
Karthik Narasimhan
1 year
We need to come up with new challenges for NLP 2.0 and redefine what "language processing" means today, which is no longer purely about text reading or generation or specific use cases
1
0
8
@karthik_r_n
Karthik Narasimhan
3 years
We propose a new method to tackle the classic explore-exploit dilemma in RL! Building on prior insights like PC-PG (), Go-Explore (), and the classic E3 algorithm (), we get huge boosts on playing text games.
@JensTuyls
Jens Tuyls
3 years
How can RL agents deal with both sparse rewards and large, dynamic action spaces - a key challenge in text games? Our method eXploit-Then-eXplore (XTX) tackles these challenges and achieves a more than 2x improvement on Zork! #ICLR2022 Spotlight 📜[1/5]
Tweet media one
6
11
44
1
1
8
@karthik_r_n
Karthik Narasimhan
7 years
1
1
7
@karthik_r_n
Karthik Narasimhan
4 years
Inspiring talk by Lester Mackey at #ICML2020 . Anyone aware of similar 'NLP for social good' initiatives? Or interested in starting one?
@icmlconf
ICML Conference
4 years
Today at #ICML2020 (Tu, Jul 14) 👥Invited speaker Lester Mackey of MSFT 🕛8am EDT (12pm UTC) 👉🏾Doing Some Good with Machine Learning Registered attendees can join here:
Tweet media one
0
7
23
0
1
7
@karthik_r_n
Karthik Narasimhan
10 months
@srush_nlp Isn't everything in ML (and life?) just search, with some guiding heuristics (finding parameters via gradient descent, generating text with token-level probabilities, etc.)?
0
0
7
@karthik_r_n
Karthik Narasimhan
7 years
Check out our latest arXiv preprint on grounded spatial reasoning:
1
1
6
@karthik_r_n
Karthik Narasimhan
2 years
@DrJimFan or we could just do away with typing altogether and just speak to our phones. not too far now... :)
0
0
6
@karthik_r_n
Karthik Narasimhan
2 years
Our paper draws on previous work like World of Bits (, ), WikiNav (), AndroidEnv (), and text game benchmarks () - see the paper () for details
0
1
6
@karthik_r_n
Karthik Narasimhan
1 year
@McaleerStephen Nice work! Would love to see how it does on WebShop () (basically a harder version of miniWob++ with more abstract instructions and real-world language elements)
1
0
6
@karthik_r_n
Karthik Narasimhan
6 years
Hope this helps push transfer learning for RL!
@OpenAI
OpenAI
6 years
Introducing the OpenAI Retro Contest - a contest where agents use their past experience to adapt to new environments:
12
272
713
0
1
5
@karthik_r_n
Karthik Narasimhan
6 years
Very impressive results! Great to see pre-training being pushed to its limits for NLP
0
0
4
@karthik_r_n
Karthik Narasimhan
2 years
Having LLMs like GPT-4 assist the peer review process (e.g. by auto-evaluating reviews or helping refine them to be of better quality) seems like a very valuable feature to have @openreviewnet @MicrosoftCMT
@AmeetDeshpande_
Ameet Deshpande
2 years
An LLM-driven tool for *verifying* review quality? Out of our nine ICML reviews, 5 were very helpful (2 critical, 3 positive). But it was easy to detect that no effort went into the other four (including a positive one)! Don't replace reviewers, but help improve review quality!
1
2
12
2
0
4
@karthik_r_n
Karthik Narasimhan
3 months
The paper contains more results and analysis on failures, and discussion on how to further build on the benchmark. Work w/ the amazing Sierra research team @ShunyuYao12 , @noahrshinn , Pedram Razavi
Tweet media one
0
0
5
@karthik_r_n
Karthik Narasimhan
6 years
Very cool post by @harvardnlp on transformers
@srush_nlp
Sasha Rush
6 years
The Annotated Transformer: Line-by-Line PyTorch implementation of "Attention is All You Need"
Tweet media one
5
279
708
0
1
4
@karthik_r_n
Karthik Narasimhan
2 years
Check out @ShunyuYao12 's #ICLR2022 paper on connecting emergent communication protocols to human languages by 1) using them to generate corpora for pre-training language models and 2) using translation scores as a measure of their 'naturalness'
@ShunyuYao12
Shunyu Yao
2 years
ICLR week! Finally mustered up a long-due tweet for our spotlight work: Linking Emergent and Natural Languages via Corpus Transfer paper: code: poster: Apr 27 13:30 - 15:30 EDT 1/n
Tweet media one
2
8
39
0
0
4
@karthik_r_n
Karthik Narasimhan
3 months
Key differentiators of τ-bench: (1) simulating users with LLMs by specifying complex scenarios in text, (2) testing agent ability to follow complex rules, and (3) measuring agent reliability by re-running conversational variants of the same scenario.
1
0
4
@karthik_r_n
Karthik Narasimhan
2 years
@random_walker for example, see @LangChainAI 's integration with Zapier API that allows use of 20k+ tools with ReAct prompts - you can now automate your own workflows in 2 mins
@hwchase17
Harrison Chase
2 years
Want to give your agent access to 20k+ tools? 🔥 @LangChainAI x @zapier 🔥 Integration now out in Python and JS Blog Post: Python Docs: JS Docs:
Tweet media one
14
121
664
0
0
4
@karthik_r_n
Karthik Narasimhan
1 year
This intersection of LLMs and multi-agent systems has exciting potential for more exploration! also related is the Generative Agents work () by @joon_s_pk et al, though it focuses more on the social interactions between agents
2
0
3
@karthik_r_n
Karthik Narasimhan
2 years
Potentially interesting connections to the old ideas of inner speech/verbal thinking in CogSci as well
0
0
3
@karthik_r_n
Karthik Narasimhan
8 years
If you use overleaf and want/prefer offline editing with version control, check this out:
0
1
3
@karthik_r_n
Karthik Narasimhan
1 year
... which of course means we need stronger methods for deeply aligning LLMs and other AI systems with human values, or setting up deterministic guardrails during inference, rather than just patching the model post-hoc
0
0
3
@karthik_r_n
Karthik Narasimhan
2 years
@maxhkw What about learning by reading? Maybe they didn't have paper back then? :) Humans have "progressed" so quickly only because of stored wisdom from previous generations
2
1
3
@karthik_r_n
Karthik Narasimhan
1 year
@random_walker I don't think Embra was really using agents - most use cases seem to be just generating text using screen context, no real actions (which is probably why it was renamed). Language agents will definitely work - just not as trivial as a few prompts/wrappers around GPT, as most people think.
0
0
3
@karthik_r_n
Karthik Narasimhan
1 year
One can easily use COLLIE's grammar-based framework to create new and harder tests - hopefully it proves useful for the community! Check out for more info
0
0
3
@karthik_r_n
Karthik Narasimhan
1 year
SocraticAI explores a role-playing multi-agent setup where multiple instances of the same LLM have different roles (analyst, proofreader, etc. ) to collaboratively self-discover knowledge and solve problems (). Having multiple roles allows for...
1
0
2
@karthik_r_n
Karthik Narasimhan
2 years
The dynamic reasoning trace helps prevent hallucinations and error propagation in decision making, while also providing some measure of interpretability for the model's action choices -> this can potentially be corrected by a human to "fix" model behavior.
1
0
2
@karthik_r_n
Karthik Narasimhan
3 years
... this is like providing the model multiple choices to 'read' and understand before picking the most appropriate choice instead of just picking from integer class IDs
1
0
2
@karthik_r_n
Karthik Narasimhan
3 months
We find that function calling/ReAct agents are far from production-ready, both in terms of avg. performance and consistency (<25% pass^8)
Tweet media one
Tweet media two
1
0
3
@karthik_r_n
Karthik Narasimhan
2 years
We will be presenting DataMUX at #neurips2022 later this month - so do stop by our poster if you are around! You can also get a sneak preview of some exciting new results from our current efforts
1
0
2
@karthik_r_n
Karthik Narasimhan
3 months
๐œ-bench bridges this gap, featuring realistic dialog and tool use in a customer support setting, open-ended and diverse tasks, faithful objective evaluation, and a modular framework for further extension.
Tweet media one
Tweet media two
1
0
3
@karthik_r_n
Karthik Narasimhan
3 months
To measure consistency and reliability, we introduce the pass^k metric, which evaluates whether the agent completes a specific task in multiple (k) trials
Tweet media one
1
0
3
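One way to estimate such a metric from repeated runs, shown purely as an illustration (the exact estimator in the paper may differ): run n i.i.d. trials per task and compute the probability that k randomly chosen trials all succeed, mirroring the familiar pass@k-style estimator.

```python
from math import comb

# Illustrative estimator of pass^k: for each task, run n >= k i.i.d. trials,
# count c successes, and take the probability that k randomly chosen trials
# ALL succeed; then average over tasks.
def pass_hat_k(results_per_task, k):
    """results_per_task: list of per-task lists of boolean trial outcomes."""
    vals = []
    for trials in results_per_task:
        n, c = len(trials), sum(trials)
        vals.append(comb(c, k) / comb(n, k) if c >= k else 0.0)
    return sum(vals) / len(vals)

# e.g. an agent that succeeds 3/4 times on each of two tasks:
# pass_hat_k([[1, 1, 1, 0], [1, 0, 1, 1]], k=2) -> 0.5
```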
@karthik_r_n
Karthik Narasimhan
1 year
This work seems highly relevant too! (by @guohao_li , @hammh0a , @itanih0 , et al.)
1
0
2
@karthik_r_n
Karthik Narasimhan
2 years
@yoavgo I actually enjoyed its 'weirdness' and found it quite fun and thought provoking. IMO, was easy to get what 'eigen-simulacra' meant from the context (though I agree we will need more standard terminology as these concepts, like LLMs as simulators, become more mainstream)
0
0
2
@karthik_r_n
Karthik Narasimhan
2 years
Interestingly, we see that agents trained on WebShop perform equally well on real-world websites like ! This sim-to-real transfer (like recent results in robotics) is exciting for training practical agents for interacting with the real WWW autonomously.
1
0
1
@karthik_r_n
Karthik Narasimhan
8 years
Clearly captures the rise of arXiv.
0
0
1
@karthik_r_n
Karthik Narasimhan
1 year
Prior work has tried different prompting strategies like chain-of-thought ( @_jasonwei ), ReAct ( @ShunyuYao12 ), inner monologue ( @wenlong_huang ), etc. but they all consider a single agent with a single stream of reasoning.
1
0
1
@karthik_r_n
Karthik Narasimhan
2 years
@rahulgk Right, the problem seems to be that LLMs are great at simulating themselves (+variations), which can create an infinite recursion. It's tricky to try and block effects like toxic responses completely at every level without destroying the LM's simulation capabilities...
2
0
1
@karthik_r_n
Karthik Narasimhan
3 years
@iandanforth @Thom_Wolf @zacharylipton @hardmaru Not sure if anyone's tried with garbage disposals yet :) but you might find these of interest:
0
0
1
@karthik_r_n
Karthik Narasimhan
3 months
Autonomous agents have great potential for redefining industries, but deploying them in real-world scenarios is hard! We need benchmarks that test agents in realistic scenarios with humans in the loop.
1
0
2
@karthik_r_n
Karthik Narasimhan
2 years
not to mention the frustration over a badly written review (whether you're an author, an AC or even one of the other reviewers for the paper)
1
0
1
@karthik_r_n
Karthik Narasimhan
1 year
2. Methods like RLHF may not be sufficient to weed out all undesirable behavior, since it is a feedback signal for outputs conditioned on a very specific context. One could (potentially) always simulate a new scenario and elicit that behavior - very large # of possibilities
1
0
1
@karthik_r_n
Karthik Narasimhan
3 years
Our work has interesting connections to existing literature in zero-shot learning with text, contrastive learning, prompting pre-trained models, learning with NL explanations or NL descriptions, etc. - see paper for exact comparisons!
0
0
1
@karthik_r_n
Karthik Narasimhan
2 years
we may also have to start flagging purely machine-written reviews soon... @edward_the6
0
0
1
@karthik_r_n
Karthik Narasimhan
2 years
There is a lot to be done yet - improving agent performance, scaling up to new domains and tasks, thinking carefully about safety issues, etc. - but I'm excited to see how far we can push this direction!
1
0
1
@karthik_r_n
Karthik Narasimhan
2 years
@AmeetDeshpande_ Great idea. @kchonyc maybe this would assist ACs and program chairs in weeding out noise from the signal? :)
1
0
1
@karthik_r_n
Karthik Narasimhan
8 years
A single page for references makes no sense whatsoever... #nips2016
0
0
1
@karthik_r_n
Karthik Narasimhan
3 years
@michahu8 @PrincetonCS Congrats Michael!
0
0
1
@karthik_r_n
Karthik Narasimhan
3 years
SemSup organically opens up a wide array of possibilities for generalization in the output space since the notion of classes is now more fluid/continuous rather than discrete โ€” we demonstrate a few cases in the paper including unseen descriptions, classes, superclasses, etc.
1
0
1
@karthik_r_n
Karthik Narasimhan
1 year
... more focused reasoning within each agent and the ability to solve difficult puzzles (game of 24) or answering high-level research questions ("estimate the connection density of a fly brain"), while also allowing for human-in-the-loop collaboration.
1
0
1
@karthik_r_n
Karthik Narasimhan
1 year
Our first benchmark COLLIE-v1 contains very simple tests (generate a sentence with 10 words, where the fifth word is 'language'), which surprisingly trip up even models like GPT-4. If we want LLMs to be used in day-to-day systems, we probably want them to pass these "unit tests"
1
0
1
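As an illustration of what such a "unit test" checks (this is not the COLLIE API, just a hand-written checker for the constraint quoted above):

```python
# Illustrative checker for the COLLIE-style constraint mentioned above:
# "a sentence with 10 words, where the fifth word is 'language'".
def satisfies(sentence: str) -> bool:
    words = sentence.strip().rstrip(".!?").split()
    return len(words) == 10 and words[4].lower() == "language"

print(satisfies("We should check that language models can follow simple constraints."))  # True
```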
@karthik_r_n
Karthik Narasimhan
2 years
esp. since it is a purely volunteer-driven process with large time investments, and the number of paper submissions is exploding
1
0
1
@karthik_r_n
Karthik Narasimhan
9 months
@ChrSzegedy @SimianLuo It was NeurIPS :)
1
0
1