Karthik Narasimhan

@karthik_r_n

3,114
Followers
456
Following
7
Media
260
Statuses

Head of Research @SierraPlatform , Associate Professor @PrincetonCS . Previously @OpenAI , PhD @MIT_CSAIL , BTech @iitmadras

Princeton, NJ
Joined July 2015
Pinned Tweet
@karthik_r_n
Karthik Narasimhan
3 months
Excited to release τ-bench (TAU for Tool-Agent-User ⚒️-🤖-🧑), a new benchmark to evaluate AI agents' performance and reliability in real-world settings with dynamic user and tool interaction. Paper: , Blog:
4
26
117
@karthik_r_n
Karthik Narasimhan
6 months
SWE-agent is finally out. A few highlights: 1. Agent-Computer Interface (ACI) design will be critical for the success of AI agents, much like HCI is critical for how effective humans are with computers. 2. You can use SWE-agent out of the box on any github issue. (1/2)
Tweet media one
@jyangballin
John Yang
6 months
SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source! We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code
Tweet media one
64
426
2K
13
22
168
@karthik_r_n
Karthik Narasimhan
1 year
LMs are not just about generating text in a linear fashion anymore. They can be used to explore unique points of attack, backtrack, search, and much more to solve complex problems and puzzles. Check out 🌲Tree-of-Thought, our new framework for systematic reasoning with LLMs!
@ShunyuYao12
Shunyu Yao
1 year
Still use ⛓️Chain-of-Thought (CoT) for all your prompting? May be underutilizing LLM capabilities🤠 Introducing 🌲Tree-of-Thought (ToT), a framework to unleash complex & general problem solving with LLMs, through a deliberate 'System 2' tree search.
Tweet media one
93
594
3K
1
23
105
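For readers unfamiliar with the framework, the search loop behind Tree-of-Thought can be summarized in a few lines. The sketch below is illustrative only and is not the paper's implementation; `propose` and `evaluate` are hypothetical stand-ins for LLM calls.

```python
# Minimal sketch of the Tree-of-Thought idea: breadth-first search over
# partial "thoughts", keeping only the top-scoring candidates at each depth.
# `propose` and `evaluate` stand in for LLM calls and are hypothetical.

def propose(state, k):
    """Ask an LLM for k candidate next thoughts given the partial solution."""
    raise NotImplementedError  # e.g., one LLM call returning k strings

def evaluate(state):
    """Ask an LLM to score how promising a partial solution looks (0-1)."""
    raise NotImplementedError

def tree_of_thought(problem, depth=3, breadth=5, keep=2):
    frontier = [problem]                       # start from the bare problem statement
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose(state, breadth):
                candidates.append(state + "\n" + thought)
        # keep only the most promising partial solutions (pruning = backtracking)
        frontier = sorted(candidates, key=evaluate, reverse=True)[:keep]
    return max(frontier, key=evaluate)
```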
@karthik_r_n
Karthik Narasimhan
2 years
We teach language models to both reason and act in the same breath. Exciting since it provides a natural mechanism for allowing LLMs to incorporate external knowledge (APIs, databases, web) vs black box internal reasoning, for tasks from QA to webpage navigation.
@ShunyuYao12
Shunyu Yao
2 years
Large Language Models (LLM) are 🔥in 2 ways: 1.🧠Reason via internal thoughts (explain jokes, math reasoning..) 2.💪Act in external worlds (SayCan, ADEPT ACT-1, WebGPT..) But so far 🧠 and💪 remain distinct methods/tasks... Why not 🧠+💪? In our new work ReAct, we show 1+1>>2!
10
66
389
2
3
50
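As an editorial aside, the interleaved reason-and-act loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code; `llm` and `tools` are hypothetical stand-ins for a language model call and a dictionary of external tools.

```python
# Minimal sketch of a ReAct-style loop: the model alternates free-form
# "Thought:" steps with "Action:" steps that call external tools, and each
# tool result is fed back into the context as an "Observation:".

def react(question, llm, tools, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")        # model emits a thought (+ optionally an action)
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:                      # e.g. "Action: Search[Princeton]"
            name, _, arg = step.split("Action:")[-1].strip().partition("[")
            result = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {result}\n"
    return None
```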
@karthik_r_n
Karthik Narasimhan
4 years
Thank you @AmazonScience for supporting our research! Hope to make some useful advances in conversational systems!
@AmazonScience
Amazon Science
4 years
Congratulations to the 51 award recipients of the 2019 Amazon Research Awards, who represent 39 universities in 10 countries. View the full list and find out how to be added to the 2020 Call For Proposal distribution list here: #AmazonResearchAwards
1
11
69
5
1
47
@karthik_r_n
Karthik Narasimhan
3 years
Even though we're still at the tip of the iceberg here, I'm very excited about this direction and its future potential. As we continue to scale up our neural networks, we will need more methods to enable more efficient and cost-effective inference per input instance.
@VishvakM
Vishvak Murahari
3 years
Data Multiplexing for Neural Networks🔀 Can neural networks process multiple instances simultaneously as a single mixed input, similar to how radio channels can share bandwidth to carry multiple signals? Surprisingly, we find they can indeed!! 📜 [1/6]
10
70
397
0
4
47
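A rough picture of the multiplexing idea, for illustration only (the shapes, mixing scheme, trivial "shared model", and untrained heads below are assumptions, not the paper's architecture): several inputs are combined into one vector with fixed per-slot transforms, the shared network runs once on the mixture, and per-slot heads recover each instance's output.

```python
import numpy as np

# Toy sketch of data multiplexing: N inputs are mixed into a single vector
# with fixed random per-slot transforms, a shared model processes the mixture
# once, and per-slot heads (learned in the actual paper) recover each output.
rng = np.random.default_rng(0)
N, d = 4, 16                                    # instances per mixture, feature dim
P = rng.normal(size=(N, d, d)) / np.sqrt(d)     # one fixed mixing transform per slot
x = rng.normal(size=(N, d))                     # N separate input instances

mixed = sum(P[i] @ x[i] for i in range(N)) / N  # multiplex: one vector carries N inputs
h = mixed                                       # stand-in for a single shared forward pass
heads = rng.normal(size=(N, d, d))              # per-slot demultiplexing heads (untrained here)
outputs = [heads[i] @ h for i in range(N)]      # one recovered representation per instance
```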
@karthik_r_n
Karthik Narasimhan
1 year
Very excited for this upcoming Center for Language and Intelligence @ Princeton! We are actively recruiting!
@prfsanjeevarora
Sanjeev Arora
1 year
Princeton has a new Center for Language and Intelligence, researching LLMs + large AI models, as well as their interdisciplinary applications. Looking for postdocs/research scientists/engineers; attractive conditions.
21
115
617
0
1
43
@karthik_r_n
Karthik Narasimhan
7 months
Software engineering is so much more than just generating code. Exciting to see good progress on SWE-bench () - solving ~13% of real-world bugs is impressive. Still a long way to go though!
@cognition_labs
Cognition
7 months
Today we're excited to introduce Devin, the first AI software engineer. Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork. Devin is
5K
11K
45K
3
4
38
@karthik_r_n
Karthik Narasimhan
2 years
The web (or even a simulation of it) is a great environment for building language-grounded RL agents! Has real-world content, is interactive and can be dynamically changing, is easily scalable and fast to run, and has direct practical applications in assisting humans with tasks.
@ShunyuYao12
Shunyu Yao
2 years
What if you had a bot you could just instruct in English to shop online for you? Check out our latest work 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents w/ @__howardchen @jyangballin , @karthik_r_n @princeton_nlp
1
14
61
1
7
35
@karthik_r_n
Karthik Narasimhan
9 months
Thank you for inviting me! Much of the work I presented was led by @ShunyuYao12 (who is on the job market!), @jyangballin (who is applying to PhD programs this year!), @_carlosejimenez , and @tedsumers
@stanfordnlp
Stanford NLP Group
9 months
We were really excited to have @karthik_r_n from @princeton_nlp join us today to give us the good oil on the future of language agents in the age of Foundation Models! #NLProc
Tweet media one
2
4
63
0
4
29
@karthik_r_n
Karthik Narasimhan
4 years
If you work on RL agents for text-based games (or other interactive envs with language), this might be interesting! We show that agents may not be adequately leveraging the rich semantics present in the observations and there's lots of room for more semantic-centric approaches
@ShunyuYao12
Shunyu Yao
4 years
For autonomous tasks with language (e.g. text games), how much does an agent rely on language semantics vs. memorization? Our #NAACL2021 paper (, joint w/ @karthik_r_n , @mhauskn ) proposes ablation studies with surprising findings and useful insights! (1/3)
Tweet media one
1
2
32
1
4
26
@karthik_r_n
Karthik Narasimhan
2 years
Training agents to operate autonomously over the web will be the next frontier for AI. If you are looking for an open-source benchmark to build and test your own web agents, check out (by @ShunyuYao12 , @__howardchen , @jyangballin )
@AdeptAILabs
Adept
2 years
1/7 We built a new model! It's called Action Transformer (ACT-1) and we taught it to use a bunch of software tools. In this first video, the user simply types a high-level request and ACT-1 does the rest. Read on to see more examples ⬇️
136
919
5K
1
5
26
@karthik_r_n
Karthik Narasimhan
4 years
Seems like a very cool RL testbed for many different reasons!
@egrefen
Edward Grefenstette
4 years
Want to help push the boundaries of RL research? Need a rich, difficult, and procedurally-generated environment with loads of structure and intricacy? An astounding amount of human play data? Sophisticated strategies and documentation? We got you (and it's faster than ALE!) [1/6]
Tweet media one
2
49
171
0
3
23
@karthik_r_n
Karthik Narasimhan
3 years
Check out our new #ACL2021 paper which shows how self-attention nets can learn languages with bounded hierarchical structure (which applies to most practical uses of human languages). A step towards understanding why Transformers are 🔥 for NLP!
@ShunyuYao12
Shunyu Yao
3 years
Hierarchical structure is a core aspect of language syntax. Recurrent networks can systematically process recursion by emulating stacks, but can self-attention networks? If so, how? Our #ACL2021 paper sheds light on this fundamental issue! (1/5)
Tweet media one
1
6
46
0
2
23
@karthik_r_n
Karthik Narasimhan
4 years
Looking forward to this exciting workshop!
@egrefen
Edward Grefenstette
4 years
"Going" to @icmlconf ? Come hear about the future of language+RL at the #LaReL2020 workshop on Language in Reinforcement Learning, held July 18. Here's a short thread introducing some of the highlights. [1/9]
Tweet media one
1
47
128
0
3
21
@karthik_r_n
Karthik Narasimhan
1 year
Check out our latest paper on a large-scale toxicity analysis of 'persona-assigned' ChatGPT! Beyond the direct findings that assigning personas can make these models behave in undesirable ways, our study also hints at a couple of broader points:
@AmeetDeshpande_
Ameet Deshpande
1 year
Large language models and chatbots are being ubiquitously used. But are they safe? In our large-scale toxicity analysis of ChatGPT, we find that assigning it a "persona" significantly increases its toxicity (up to 6X). Paper: Blog:
Tweet media one
4
33
137
1
2
21
@karthik_r_n
Karthik Narasimhan
2 years
DataMUX () wins the @BellLabs 2nd Prize! Proud of the students @VishvakM @_carlosejimenez @RunzheYang . Humbling experience visiting the campus with all its history and interacting with the researchers and esteemed judges (incl. a Nobel Prize winner!)
Tweet media one
@BellLabs
Bell Labs
2 years
The runners-up of the #BellLabsPrize22 are... 🥈 2nd: Karthik Narasimhan, Carlos Jimenez, Vishvak Murahari, & Runzhe Yang of @Princeton . 🥉 3rd: Xiangfeng Duan of @UCLA .
0
1
5
1
2
20
@karthik_r_n
Karthik Narasimhan
1 year
Check out @RunzheYang 's SocraticAI framework! Language models like GPT are trained on vast amounts of data, so they surely have the knowledge to solve a wide variety of problems. The key question has always been - how does one elicit the right knowledge?
@RunzheYang
"Tony" Runzhe Yang
1 year
🧙Introducing SocraticAI, a multi-agent framework for using independent instantiations of large language models (LLMs) to collaboratively and creatively solve problems.
2
39
143
1
1
18
@karthik_r_n
Karthik Narasimhan
3 years
Super excited about this new workshop on language supervision! (organized w/ @jacobandreas @aidanematzadeh ) Language will play an increasingly important role in building scalable, value-aligned ML systems in the future, beyond just being an applied area for ML methods.
@LNLSWorkshop
Workshop on Learning with NL Supervision
3 years
Interested in: - training image classifiers & robots using language? - using info from text corpora in other ML applications? - understanding the role of language in human concept learning? Join us for the Workshop on Learning with Natural Language Supervision ( @aclmeeting '22)!
Tweet media one
1
8
23
0
9
18
@karthik_r_n
Karthik Narasimhan
1 year
Check out our latest paper connecting long-standing ideas in cognitive architectures with facets of modern language agents, and subsequently sketching a blueprint for improving agent capabilities.
@ShunyuYao12
Shunyu Yao
1 year
Language Agents are cool & fast-moving, but no systematic way to understand & design them.. So we use classical CogSci & AI insights to propose Cognitive Architectures for Language Agents (🐨CoALA)! w/ great @tedsumers @karthik_r_n @cocosci_lab (1/6)
Tweet media one
10
116
464
0
0
16
@karthik_r_n
Karthik Narasimhan
4 years
@emnlp2020 We could start using openreview with public anonymous reviews? Might help increase reviewer accountability.
0
0
16
@karthik_r_n
Karthik Narasimhan
4 years
Check out how we use language models for better exploration in text adventure games! #emnlp2020
@ShunyuYao12
Shunyu Yao
4 years
Happy to announce our new #emnlp2020 paper "Keep CALM and Explore: Language Models for Action Generation in Text-based Games" is online! w/ Rohan, @mhauskn , @karthik_r_n arxiv: code: more below (1/n)
1
9
36
0
2
16
@karthik_r_n
Karthik Narasimhan
5 years
Very nice overview of recent work in language+RL! Solving language will be key for scaling up RL and making it more robust and sample efficient.
@_rockt
Tim Rocktäschel
5 years
How can RL agents exploit the compositional, relational and hierarchical structure of the world? A growing number of authors propose learning from natural language. We are excited to share our @IJCAIconf survey of this emerging field! TL;DR:🤖+📖=📈🎯🏆🥳
Tweet media one
2
73
258
0
2
15
@karthik_r_n
Karthik Narasimhan
1 year
Thanks to TAS Princeton () for facilitating this seminar series! Was a great experience to interact with the teachers and discuss how LLMs like ChatGPT will change K-12 education
@EPrinceton
Princeton Engineering
1 year
Last week, @karthik_r_n presented a seminar for 8 HS & middle school teachers on #naturallanguageprocessing , including tools like #ChatGPT . They learned about challenges of the technology, & shared ideas on how to thoughtfully incorporate ChatGPT into lessons. #TeachersAsScholars
Tweet media one
0
1
3
0
1
14
@karthik_r_n
Karthik Narasimhan
2 years
Some of the jailbreaking #ChatGPT examples in are crazy. Building LLM-based systems that are 100% robust to such attacks feels so hard since it seems like one can always go one level deeper with prompts (think Inception-style)
1
0
14
@karthik_r_n
Karthik Narasimhan
2 years
@percyliang @percyliang you were way ahead of your time with WoB () 🙂 - one of my favorite papers and a huge inspiration for our own recent work. Now we finally have stronger hammers (models) to tackle these challenges and push further in this space
1
2
13
@karthik_r_n
Karthik Narasimhan
2 years
@kroscoo Absolutely. Simplest and most natural way to introduce the idea of LMs as probability distributions over sequences, MLE, etc. in my opinion
0
1
10
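For context, the standard formulation being referred to is the autoregressive factorization of sequence probability together with its maximum-likelihood training objective:

```latex
p_\theta(w_1, \dots, w_T) = \prod_{t=1}^{T} p_\theta(w_t \mid w_{<t}),
\qquad
\hat{\theta} = \arg\max_\theta \sum_{w \in \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t})
```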
@karthik_r_n
Karthik Narasimhan
2 years
I think we haven't yet fully imagined the capabilities of ReAct-style models (or 'language agents' as @ShunyuYao12 likes to call them). It's going to be a very interesting next few months!
@random_walker
Arvind Narayanan
2 years
For connecting LLMs to the Internet, I'm using the ReAct paper (which I thought was elegant and brilliant *before* realizing it was coauthored by my Princeton CS colleagues 💯): I used @simonw 's implementation as a starting point.
6
49
568
1
0
11
@karthik_r_n
Karthik Narasimhan
1 year
Good point! Historically, we split up the general 'AI' problem into fields like NLP, CV, etc. and further subfields to make problems tractable. Now that some of these atomic problems are being solved, it's time to move back up the layers again, towards more holistic NLP/AI systems
@YiTayML
Yi Tay
1 year
Just a few years ago, research was mostly sorted by "applications". When folks asked what research you're working on, you were expected to say something like "oh I work in question answering" or "sentiment analysis" or something 😅. In fact, all the conference tracks are sorted as
20
43
367
1
1
12
@karthik_r_n
Karthik Narasimhan
3 years
Very excited about our latest work on 'Semantic Supervision' as an alternative paradigm to standard supervised classification. We provide semantic descriptions of output classes (in English, JSON, etc.) directly as choices for the classifier ...
@AmeetDeshpande_
Ameet Deshpande
3 years
Can standard ML classifiers generalize to 1️⃣New classes❓ 2️⃣Novel superclasses❓ 3️⃣Unseen tasks❓ 4️⃣Rephrased class descriptions❓ In our paper "Semantic Supervision" we propose a unified framework to enable it all✅! 🌐 (w Austin Wang, @karthik_r_n ) 🧵
2
21
80
1
1
11
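To make the "descriptions as choices" idea concrete, here is a minimal sketch (the encoders and scoring below are hypothetical stand-ins, not the SemSup architecture): the classifier scores the input against a natural-language description of each class, so adding or rephrasing a class only requires a new description.

```python
import numpy as np

# Minimal sketch of semantic supervision: classify by scoring the input
# against natural-language class descriptions instead of integer class IDs.
# `encode_input` and `encode_description` are hypothetical learned encoders.

def encode_input(text: str) -> np.ndarray: ...
def encode_description(text: str) -> np.ndarray: ...

def classify(text: str, class_descriptions: list[str]) -> int:
    x = encode_input(text)
    scores = [float(x @ encode_description(d)) for d in class_descriptions]
    return int(np.argmax(scores))   # new or unseen classes just need a new description

# Example usage (hypothetical descriptions):
# descriptions = ["a review expressing satisfaction with a product",
#                 "a review expressing disappointment or frustration"]
# classify("The battery died after two days.", descriptions)
```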
@karthik_r_n
Karthik Narasimhan
1 year
Imitation learning (IL) for decision making is similar to next token prediction in language models. Given the role of scaling in modern LLMs, should it not help IL similarly? Check out @JensTuyls 's work on pushing IL to the limit for a new SOTA (by 2x) on the challenging game of NetHack!
@JensTuyls
Jens Tuyls
1 year
Imitation learning is one of the most widely used methods in ML, but how does compute affect its performance? We explore this question in the challenging game of NetHack and find our scaled-up agent to outperform prior SOTA by 2x! [1/6]
Tweet media one
2
19
107
0
0
11
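The analogy in the tweet can be written out explicitly: behavioral cloning, the simplest form of imitation learning, optimizes the same kind of cross-entropy objective as next-token prediction, with (state, action) pairs playing the role of (context, next token):

```latex
\mathcal{L}_{\text{IL}}(\theta) = -\sum_{t} \log \pi_\theta(a_t \mid s_{\le t}),
\qquad
\mathcal{L}_{\text{LM}}(\theta) = -\sum_{t} \log p_\theta(w_t \mid w_{<t})
```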
@karthik_r_n
Karthik Narasimhan
2 years
Semantic supervision with class descriptions provides a simple, scalable solution for few-shot extreme classification (XC) over millions of labels! Check out @PranjalAggarw16 's paper below which achieves SOTA results on XC
@PranjalAggarw16
Pranjal Aggarwal
2 years
Check out our new paper 'SemSup-XC: Semantic Supervision for Zero and Few-shot Extreme Classification' where we tackle the task of text classification over millions of unseen labels, with state-of-the-art results!📈 🌐 (w @AmeetDeshpande_ , @karthik_r_n )
1
2
15
1
1
10
@karthik_r_n
Karthik Narasimhan
5 years
How can we measure (and fix) discrepancies in long-term properties of language models (e.g. entropy)? Check out our latest work: (w/ Mark Braverman, Xinyi Chen, Sham Kakade, Cyril Zhang, Yi Zhang)
0
1
10
@karthik_r_n
Karthik Narasimhan
4 years
In our new #emnlp2020 findings paper on handling spatial references, we find that Relation Nets can enable more fine-grained and robust grounding (compared to more prevalent CNN-based approaches).
@JimmyTYYang1
Jimmy T.Y. Yang
4 years
Can we learn a multi-modal representation that is robust to noise unseen during training? Happy to share our new EMNLP Findings paper w/ Andrew, @karthik_r_n Paper: (1/n)
Tweet media one
1
0
11
0
1
9
@karthik_r_n
Karthik Narasimhan
7 months
0
0
8
@karthik_r_n
Karthik Narasimhan
1 year
LLMs like GPT-4 do more than generate text, but we don't have systematic, scalable benchmarks to measure reasoning capabilities. COLLIE is a framework that allows generating "unit tests" for LLMs that combine text generation with types of reasoning like logic, arithmetic, etc.
@ShunyuYao12
Shunyu Yao
1 year
Write a sentence with "dog, frisbee, catch, throw" 👉Too easy for 7B LM... Will (constrained) text generation (CTG) "die out" like many other NLP tasks, in face of LLM? 👉Excited to introduce 🐕COLLIE, next-gen CTG that even challenges GPT-4! (1/n)
Tweet media one
2
13
94
1
1
9
@karthik_r_n
Karthik Narasimhan
4 years
Attention Guidance for Transformers: #emnlp2020 Findings
@AmeetDeshpande_
Ameet Deshpande
4 years
Can attention heads in Transformers be modified to improve performance and convergence speed during pre-training? Find out from our paper (w @karthik_r_n ) accepted at Findings of EMNLP! #nlproc #emnlp2020 Paper: Code: (1/4)
Tweet media one
1
5
36
1
1
9
@karthik_r_n
Karthik Narasimhan
1 year
We need to come up with new challenges for NLP 2.0 and redefine what "language processing" means today, which is no longer purely about text reading or generation or specific use cases
1
0
8
@karthik_r_n
Karthik Narasimhan
3 years
We propose a new method to tackle the classic explore-exploit dilemma in RL! Building on prior insights like PC-PG (), Go-Explore (), and the classic E3 algorithm (), we get huge boosts on playing text games.
@JensTuyls
Jens Tuyls
3 years
How can RL agents deal with both sparse rewards and large, dynamic action spaces - a key challenge in text games? Our method eXploit-Then-eXplore (XTX) tackles these challenges and achieves a more than 2x improvement on Zork! #ICLR2022 Spotlight 📜[1/5]
Tweet media one
6
11
44
1
1
8
@karthik_r_n
Karthik Narasimhan
7 years
1
1
7
@karthik_r_n
Karthik Narasimhan
4 years
Inspiring talk by Lester Mackey at #ICML2020 . Anyone aware of similar 'NLP for social good' initiatives? Or interested in starting one?
@icmlconf
ICML Conference
4 years
Today at #ICML2020 (Tu, Jul 14) 👥Invited speaker Lester Mackey of MSFT 🕛8am EDT (12pm UTC) 👉🏾Doing Some Good with Machine Learning Registered attendees can join here:
Tweet media one
0
7
23
0
1
7
@karthik_r_n
Karthik Narasimhan
10 months
@srush_nlp Isn't everything in ML (and life?) just search, with some guiding heuristics (finding parameters via gradient descent, generating text with token-level probabilities, etc.)?
0
0
7
@karthik_r_n
Karthik Narasimhan
7 years
Check out our latest arXiv preprint on grounded spatial reasoning:
1
1
6
@karthik_r_n
Karthik Narasimhan
2 years
@DrJimFan or we could just do away with typing altogether and just speak to our phones. not too far now... :)
0
0
6
@karthik_r_n
Karthik Narasimhan
2 years
Our paper draws on previous work like World of Bits (, ), WikiNav (), AndroidEnv (), and text game benchmarks () - see the paper () for details
0
1
6
@karthik_r_n
Karthik Narasimhan
1 year
@McaleerStephen Nice work! Would love to see how it does on WebShop () (basically a harder version of miniWob++ with more abstract instructions and real-world language elements)
1
0
6
@karthik_r_n
Karthik Narasimhan
6 years
Hope this helps push transfer learning for RL!
@OpenAI
OpenAI
6 years
Introducing the OpenAI Retro Contest - a contest where agents use their past experience to adapt to new environments:
12
272
713
0
1
5
@karthik_r_n
Karthik Narasimhan
6 years
Very impressive results! Great to see pre-training being pushed to its limits for NLP
0
0
4
@karthik_r_n
Karthik Narasimhan
2 years
Having LLMs like GPT-4 assist the peer review process (e.g. by auto-evaluating reviews or helping refine them to be of better quality) seems like a very valuable feature to have @openreviewnet @MicrosoftCMT
@AmeetDeshpande_
Ameet Deshpande
2 years
An LLM-driven tool for *verifying* review quality? Out of our nine ICML reviews, 5 were very helpful (2 critical, 3 positive). But it was easy to detect that no effort went into the other four (including a positive one)! Don't replace reviewers, but help improve review quality!
1
2
12
2
0
4
@karthik_r_n
Karthik Narasimhan
3 months
The paper contains more results and analysis on failures, and discussion on how to further build on the benchmark. Work w/ the amazing Sierra research team @ShunyuYao12 , @noahrshinn , Pedram Razavi
Tweet media one
0
0
5
@karthik_r_n
Karthik Narasimhan
6 years
Very cool post by @harvardnlp on transformers
@srush_nlp
Sasha Rush
6 years
The Annotated Transformer: Line-by-Line PyTorch implementation of "Attention is All You Need"
Tweet media one
5
279
708
0
1
4
@karthik_r_n
Karthik Narasimhan
2 years
Check out @ShunyuYao12 's #ICLR2022 paper on connecting emergent communication protocols to human languages by 1) using them to generate corpora for pre-training language models and 2) using translation scores as a measure of their 'naturalness'
@ShunyuYao12
Shunyu Yao
2 years
ICLR week! Finally mustered up a long-due tweet for our spotlight work: Linking Emergent and Natural Languages via Corpus Transfer paper: code: poster: Apr 27 13:30 - 15:30 EDT 1/n
Tweet media one
2
8
39
0
0
4
@karthik_r_n
Karthik Narasimhan
3 months
Key differentiators of τ-bench: (1) simulating users with LLMs by specifying complex scenarios in text, (2) testing agent ability to follow complex rules, and (3) measuring agent reliability by re-running conversational variants of the same scenario.
1
0
4
@karthik_r_n
Karthik Narasimhan
2 years
@random_walker for example, see @LangChainAI 's integration with Zapier API that allows use of 20k+ tools with ReAct prompts - you can now automate your own workflows in 2 mins
@hwchase17
Harrison Chase
2 years
Want to give your agent access to 20k+ tools? 🔥 @LangChainAI x @zapier 🔥 Integration now out in Python and JS Blog Post: Python Docs: JS Docs:
Tweet media one
14
121
664
0
0
4
@karthik_r_n
Karthik Narasimhan
1 year
This intersection of LLMs and multi-agent systems has exciting potential for more exploration! also related is the Generative Agents work () by @joon_s_pk et al, though it focuses more on the social interactions between agents
2
0
3
@karthik_r_n
Karthik Narasimhan
2 years
Potentially interesting connections to the old ideas of inner speech/verbal thinking in CogSci as well
0
0
3
@karthik_r_n
Karthik Narasimhan
8 years
If you use overleaf and want/prefer offline editing with version control, check this out:
0
1
3
@karthik_r_n
Karthik Narasimhan
1 year
... which of course means we need stronger methods for deeply aligning LLMs and other AI systems with human values, or setting up deterministic guardrails during inference, rather than just patching the model post-hoc
0
0
3
@karthik_r_n
Karthik Narasimhan
2 years
@maxhkw What about learning by reading? Maybe they didn't have paper back then? :) Humans have "progressed" so quickly only because of stored wisdom from previous generations
2
1
3
@karthik_r_n
Karthik Narasimhan
1 year
@random_walker I don't think Embra was really using agents - most use cases seem to be just generating text using screen context, no real actions (which is probably why it was renamed). Language agents will definitely work - just not as trivial as a few prompts/wrappers around GPT, as most people think.
0
0
3
@karthik_r_n
Karthik Narasimhan
1 year
One can easily use COLLIE's grammar-based framework to create new and harder tests - hopefully it proves useful for the community! Check out for more info
0
0
3
@karthik_r_n
Karthik Narasimhan
1 year
SocraticAI explores a role-playing multi-agent setup where multiple instances of the same LLM have different roles (analyst, proofreader, etc. ) to collaboratively self-discover knowledge and solve problems (). Having multiple roles allows for...
1
0
2
@karthik_r_n
Karthik Narasimhan
2 years
The dynamic reasoning trace helps prevent hallucinations and error propagation in decision making, while also providing some measure of interpretability for the model's action choices -> this can potentially be corrected by a human to "fix" model behavior.
1
0
2
@karthik_r_n
Karthik Narasimhan
3 years
... this is like providing the model multiple choices to 'read' and understand before picking the most appropriate choice instead of just picking from integer class IDs
1
0
2
@karthik_r_n
Karthik Narasimhan
3 months
We find that function calling/ReAct agents are far from production-ready, both in terms of avg. performance and consistency (<25% pass^8)
Tweet media one
Tweet media two
1
0
3
@karthik_r_n
Karthik Narasimhan
2 years
We will be presenting DataMUX at #neurips2022 later this month - so do stop by our poster if you are around! You can also get a sneak preview of some exciting new results from our current efforts
1
0
2
@karthik_r_n
Karthik Narasimhan
3 months
๐œ-bench bridges this gap, featuring realistic dialog and tool use in a customer support setting, open-ended and diverse tasks, faithful objective evaluation, and a modular framework for further extension.
Tweet media one
Tweet media two
1
0
3
@karthik_r_n
Karthik Narasimhan
3 months
To measure consistency and reliability, we introduce the pass^k metric, which evaluates whether the agent completes a specific task in multiple (k) trials
Tweet media one
1
0
3
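One way to estimate such a metric from repeated runs, shown purely as an illustration (the exact estimator in the paper may differ): run n i.i.d. trials per task and compute the probability that k randomly chosen trials all succeed, mirroring the familiar pass@k-style estimator.

```python
from math import comb

# Illustrative estimator of pass^k: for each task, run n >= k i.i.d. trials,
# count c successes, and take the probability that k randomly chosen trials
# ALL succeed; then average over tasks.
def pass_hat_k(results_per_task, k):
    """results_per_task: list of per-task lists of boolean trial outcomes."""
    vals = []
    for trials in results_per_task:
        n, c = len(trials), sum(trials)
        vals.append(comb(c, k) / comb(n, k) if c >= k else 0.0)
    return sum(vals) / len(vals)

# e.g. an agent that succeeds 3/4 times on each of two tasks:
# pass_hat_k([[1, 1, 1, 0], [1, 0, 1, 1]], k=2) -> 0.5
```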
@karthik_r_n
Karthik Narasimhan
1 year
This work seems highly relevant too! (by @guohao_li , @hammh0a , @itanih0 , et al.)
1
0
2
@karthik_r_n
Karthik Narasimhan
2 years
@yoavgo I actually enjoyed its 'weirdness' and found it quite fun and thought provoking. IMO, was easy to get what 'eigen-simulacra' meant from the context (though I agree we will need more standard terminology as these concepts, like LLMs as simulators, become more mainstream)
0
0
2
@karthik_r_n
Karthik Narasimhan
2 years
Interestingly, we see that agents trained on WebShop perform equally well on real-world websites like ! This sim-to-real transfer (like recent results in robotics) is exciting for training practical agents for interacting with the real WWW autonomously.
1
0
1
@karthik_r_n
Karthik Narasimhan
8 years
Clearly captures the rise of arXiv.
0
0
1
@karthik_r_n
Karthik Narasimhan
1 year
Prior work has tried different prompting strategies like chain-of-thought ( @_jasonwei ), ReAct ( @ShunyuYao12 ), inner monologue ( @wenlong_huang ), etc. but they all consider a single agent with a single stream of reasoning.
1
0
1
@karthik_r_n
Karthik Narasimhan
2 years
@rahulgk Right, the problem seems to be that LLMs are great at simulating themselves (+variations), which can create an infinite recursion. It's tricky to try and block effects like toxic responses completely at every level without destroying the LM's simulation capabilities...
2
0
1
@karthik_r_n
Karthik Narasimhan
3 years
@iandanforth @Thom_Wolf @zacharylipton @hardmaru Not sure if anyone's tried with garbage disposals yet :) but you might find these of interest:
0
0
1
@karthik_r_n
Karthik Narasimhan
3 months
Autonomous agents have great potential for redefining industries, but deploying them in real-world scenarios is hard! We need benchmarks that test agents in realistic scenarios with humans in the loop.
1
0
2
@karthik_r_n
Karthik Narasimhan
2 years
not to mention the frustration over a badly written review (whether you're an author, an AC or even one of the other reviewers for the paper)
1
0
1
@karthik_r_n
Karthik Narasimhan
1 year
2. Methods like RLHF may not be sufficient to weed out all undesirable behavior, since it is a feedback signal for outputs conditioned on a very specific context. One could (potentially) always simulate a new scenario and elicit that behavior - very large # of possibilities
1
0
1
@karthik_r_n
Karthik Narasimhan
3 years
Our work has interesting connections to existing literature in zero-shot learning with text, contrastive learning, prompting pre-trained models, learning with NL explanations or NL descriptions, etc. - see paper for exact comparisons!
0
0
1
@karthik_r_n
Karthik Narasimhan
2 years
we may also have to start flagging purely machine-written reviews soon... @edward_the6
0
0
1
@karthik_r_n
Karthik Narasimhan
2 years
There is a lot to be done yet - improving agent performance, scaling up to new domains and tasks, thinking carefully about safety issues, etc. - but I'm excited to see how far we can push this direction!
1
0
1
@karthik_r_n
Karthik Narasimhan
2 years
@AmeetDeshpande_ Great idea. @kchonyc maybe this would assist ACs and program chairs in weeding out noise from the signal? :)
1
0
1
@karthik_r_n
Karthik Narasimhan
8 years
A single page for references makes no sense whatsoever... #nips2016
0
0
1
@karthik_r_n
Karthik Narasimhan
3 years
@michahu8 @PrincetonCS Congrats Michael!
0
0
1
@karthik_r_n
Karthik Narasimhan
3 years
SemSup organically opens up a wide array of possibilities for generalization in the output space since the notion of classes is now more fluid/continuous rather than discrete โ€” we demonstrate a few cases in the paper including unseen descriptions, classes, superclasses, etc.
1
0
1
@karthik_r_n
Karthik Narasimhan
1 year
... more focused reasoning within each agent and the ability to solve difficult puzzles (game of 24) or answering high-level research questions ("estimate the connection density of a fly brain"), while also allowing for human-in-the-loop collaboration.
1
0
1
@karthik_r_n
Karthik Narasimhan
1 year
Our first benchmark COLLIE-v1 contains very simple tests (generate a sentence with 10 words, where the fifth word is 'language'), which surprisingly trip up even models like GPT-4. If we want LLMs to be used in day-to-day systems, we probably want them to pass these "unit tests"
1
0
1
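As an illustration of what such a "unit test" checks (this is not the COLLIE API, just a hand-written checker for the constraint quoted above):

```python
# Illustrative checker for the COLLIE-style constraint mentioned above:
# "a sentence with 10 words, where the fifth word is 'language'".
def satisfies(sentence: str) -> bool:
    words = sentence.strip().rstrip(".!?").split()
    return len(words) == 10 and words[4].lower() == "language"

print(satisfies("We should check that language models can follow simple constraints."))  # True
```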
@karthik_r_n
Karthik Narasimhan
2 years
esp. since it is a purely volunteer-driven process with large time investments, and the number of paper submissions is exploding
1
0
1
@karthik_r_n
Karthik Narasimhan
9 months
@ChrSzegedy @SimianLuo It was NeurIPS :)
1
0
1