New paper from
@Berkeley_AI
on Autonomous Evaluation and Refinement of Digital Agents!
We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing the state of the art by 29% to 75%.
[🧵]
🎉Release day!
We develop RL techniques / infra to post-train VLM agents for device control.
Our 2B VLM, when post-trained with an autonomous evaluator (reward model), improves its success rate on Android device-control tasks from 17% to 67%.
🚨 New paper: we trained a SOTA (> GPT4, Gemini) VLM agent, DigiRL, that can do tasks on an Android phone in real time, in the wild, via autonomous offline + online RL
Web:
Paper:
🧵 ⬇️ / little gif of learning progress👇:
Just finished all my grad school applications yesterday. It was such a great opportunity to reflect on my research, career goals, and future. Excited to take a break from the pressure of "greedy optimization" and focus on the bigger picture.
Introducing ArCHer, our latest effort to develop better RL algorithms for LM agents.
This multi-turn RL algorithm significantly outperforms all baselines and achieves up to 100x greater sample efficiency compared to PPO. It was a pleasure to be part of the team.
How can we train LLM agents to learn from their own experience autonomously?
Introducing ArCHer, a simple (i.e., small change on top of standard RLHF) and effective way of doing so with multi-turn RL 🧵⬇️
Paper:
Website:
OpenDevin is more than just a reproduction of Devin—it's a vibrant community of researchers and engineers with an exciting, ambitious roadmap ahead. This can potentially provide lasting value to the community
Don't miss out!
Introducing OpenDevin CodeAct 1.0 - a new state-of-the-art open coding agent! It achieves a 21% unassisted resolve rate on SWE-Bench Lite, a 17% relative improvement over the previous SOTA set by SWE-Agent.
Check out our blog or the thread 🧵for more details:
I will be at CVPR next week! If you are interested in:
• Building (real-time) VLMs
• Post-training (multi-modal) generalist agents
We should talk! My DM is open :)
Thanks Aran for sharing!
AI feedback will enable autonomous evaluation and improvement of language agents at scale. We have a thread here if you wanna learn more :)
Autonomous Evaluation and Refinement of Digital Agents
Improves WebArena's GPT-4 SOTA agent by 30%+ and CogAgent on iOS by 75%, with no extra supervision beyond a VLM-based evaluator
repo:
abs:
🎉So excited to see our work recognized at
#ACL2023NLP
! Our work, bridging grounding capabilities in Vision-Language Models, serves both practical and scientific purposes.
Extremely grateful to have been on this journey with my awesome mentor and advisor at
@SLED_AI
.
🎉Thrilled to share that our paper "World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models" was selected for the outstanding paper award at
#ACL2023NLP
! Thanks
@aclmeeting
:-)
Let's take grounding seriously in VLMs because...
🧵[1/n]
Excited to present our projects—Autonomous Evaluation & Refinement + ArCHer—at next week's CMU agents workshop!
I’ll also be in Ann Arbor over the weekend catching up with friends.
DM if you’re up for a chat or boba🥤!
GPT-4V with an assistive tool (Vimium) can be a decent web agent. Happy to share a proof-of-concept project I built last night. It's under 300 lines of code
This seems like the most principled approach to parallel decoding for LMs so far + has interesting connections to the original consistency models popular in accelerating diffusion models.
I'll try to share one paper I particularly like every week on Twitter starting with this one :)
Check out Consistency LLMs (to appear at ICML'24)!
We found that we can easily adapt an LLM into a parallel decoder by training it on auto-generated Jacobi decoding trajectories using a consistency loss -- just like how we train consistency models in diffusion.
The model quickly learns
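For intuition, here is a minimal sketch of the consistency-training idea as I read it (assuming a HuggingFace-style causal LM interface; this is not the official CLLM implementation): run Jacobi decoding to collect intermediate states, then train the model so that every intermediate state maps straight to the trajectory's fixed point.

```python
import torch
import torch.nn.functional as F

def jacobi_trajectory(model, prompt_ids, n_tokens, max_iters=32):
    """Run Jacobi (parallel) decoding and record the intermediate guesses."""
    guess = torch.randint(0, model.config.vocab_size, (1, n_tokens))
    states = [guess]
    for _ in range(max_iters):
        logits = model(torch.cat([prompt_ids, guess], dim=1)).logits
        # Every guess position is re-predicted in parallel from the current guess.
        new_guess = logits[:, prompt_ids.shape[1] - 1 : -1].argmax(-1)
        if torch.equal(new_guess, guess):  # fixed point: matches greedy AR decoding
            break
        guess = new_guess
        states.append(guess)
    return states  # states[-1] is the fixed point y*

def consistency_loss(model, prompt_ids, states):
    """Train the model to jump from any intermediate state straight to y*."""
    y_star = states[-1]
    loss = 0.0
    for s in states[:-1]:
        logits = model(torch.cat([prompt_ids, s], dim=1)).logits
        pred = logits[:, prompt_ids.shape[1] - 1 : -1]  # (1, n_tokens, vocab)
        loss = loss + F.cross_entropy(pred.transpose(1, 2), y_star)
    return loss / max(len(states) - 1, 1)
```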
@ArmenAgha
Maybe they just aren't that different, a pretty sketchy chain of thought:
1. prompting is constrained / manually optimized prefix tuning
2. full-weight fine-tuning ≈ fine-tuning w/ LoRA
3. prefix tuning ≈ LoRA:
We get prompting ≈ fine-tuning when "done right"
@nishuang
For a while, it was very popular in NLP to name models after Sesame Street characters: "To date, this new breed of language AIs includes an ELMo, a BERT, a Grover, a Big BIRD, a Rosita, a RoBERTa, at least two ERNIEs (three if you include ERNIE 2.0), and a KERMIT." So it's pretty safe to say Baidu was deliberately leaning into this naming scheme 🤣
Excited to share InfEdit, which delivers the 𝐛𝐞𝐬𝐭 𝐞𝐝𝐢𝐭𝐢𝐧𝐠 𝐪𝐮𝐚𝐥𝐢𝐭𝐲, 𝐬𝐩𝐞𝐞𝐝, 𝐚𝐧𝐝 𝐜𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐜𝐲 among training-free algorithms. It was also my first project on diffusion models, where I learned so much from my awesome collaborators. Check out the paper/demo below👇
Want to edit your image with language descriptions in less than 3s? Ever questioned the need for prolonged inversion in text-guided editing? We are happy to release ♾ InfEdit (with demo), a flexible framework for fast, faithful and consistent editing.
🔗
@xwang_lk
Despite missing the third and fourth columns, my reproduction results seem relatively positive. Our recent work on visual illusions also found something similar, where the model excels at standard test images found online but struggles to generalize to novel ones
I switched from Obsidian to Notion for the same reason I switched from Emacs to VSCode:
I found myself spending more time tweaking the system than gaining efficiency from it.
Which app do you use for note-taking? 📝
I've just made the switch from Notion to Obsidian and I find it's much faster, less cluttered and makes it easier to focus on the content itself.
Research thrives on faith. I think there will be greater open progress in improving LLMs' reasoning capabilities soon, simply because someone (OpenAI) made people aware it is achievable.
Achievement unlocked: 🥳
Our research analyzing VLMs' perception under visual illusions is covered by Scientific American.
It was one of my favorite magazines back in high school!
We hope our results convey to you the potential of using open-ended model-based evaluators in evaluating and improving language agents.
All code is available at:
Paper:
Work w/
@594zyc
@NickATomlin
@YifeiZhou02
@svlevine
@alsuhr
Are you using Figma to create figures for your papers? Be aware that Safari (or Figma?) has a bug that sometimes prevents Figma images from rendering😅
Make sure to double-check this before it’s too late
For this week, I am sharing
@dwarkesh_sp
podcast with John Schulman
@johnschulman2
, which is particularly informative
• How to enable LLMs for long-horizon tasks -- just train them to do so
• Insights into OpenAI’s post-training stack
• His prediction on future progress
Honored to be part of this exhilarating journey alongside other team members. The knowledge we've garnered from this competition will fuel our excitement for the next stage of advancing embodied AI.
Do VLMs perceive visual illusions like humans or faithfully represent reality? Our
#EMNLP2023
paper analyzed this question systematically across 4 models. Come and check it out!
1/ Excited to share our latest research "Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?" at
#EMNLP2023
🎉 Discover how VLMs fare against tricky visual illusions👀➡️
@OfirPress
I think this is a special case of offline/online RL, and the most general statement should be that doing RL on the agent really works.
In both recent papers, the filtered BC technique (the most trivial offline RL algorithm?) almost doubles the agent's success rate: see the iOS
New paper from
@Berkeley_AI
on Autonomous Evaluation and Refinement of Digital Agents!
We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing the state of the art by 29% to 75%.
[🧵]
LLM agents have demonstrated promise in their ability to automate computer tasks, but face challenges with multi-step reasoning and planning. Towards addressing this, we propose an inference-time tree search algorithm for LLM agents to explicitly perform exploration and
We begin by developing two types of evaluators: one that directly queries GPT-4V and another that employs an open-weight solution. Our best model shows 82% / 93% agreement with oracle evaluations in web browsing and Android device control settings, respectively.
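For a concrete picture, here is a rough sketch of the "directly query GPT-4V" evaluator idea (my paraphrase, not the paper's exact prompt or code). `query_vlm` is a hypothetical helper wrapping whichever VLM API you use; it takes a text prompt plus an image and returns text.

```python
EVAL_PROMPT = """You are evaluating a digital agent.
User instruction: {instruction}
Agent action history: {actions}
Based on the attached final screenshot, did the agent complete the task?
Answer with exactly one word: SUCCESS or FAILURE."""

def autonomous_eval(instruction, actions, final_screenshot) -> bool:
    # Ask the VLM judge for a verdict; no ground-truth labels are involved.
    answer = query_vlm(
        prompt=EVAL_PROMPT.format(instruction=instruction, actions=actions),
        image=final_screenshot,
    )
    return answer.strip().upper().startswith("SUCCESS")
```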
There's a lot of redundant info in vision; MAE shows that 25% of the patches are enough for encoder training. CrossMAE goes further - no need to decode the whole image either!
Open question: how can we transfer this success to generative models and make them efficient as well?
UC Berkeley presents CrossMAE
CrossMAE matches MAE in performance with 2.5 to 3.7x less decoding compute via independent partial patch reconstruction
proj:
abs:
Some additional ✨speculation✨
Our preliminary results showed that inference-time improvement w/ Reflexion was very dependent on the performance of the critic model. A bad critic often tanks model performance
Lastly, we experiment with improving CogAgent on iOS, for which there is no existing benchmark environment or training data.
By using the evaluator to filter sampled trajectories for behavior cloning, we significantly improve CogAgent's success rate by a relative 75%.
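A minimal sketch of what evaluator-filtered behavior cloning looks like (the interfaces agent.rollout / evaluator.is_success / agent.finetune are placeholders, not the paper's code): sample trajectories, keep only those the autonomous evaluator judges successful, and fine-tune on the filtered set.

```python
def filtered_bc(agent, evaluator, tasks, n_samples=16):
    kept = []
    for task in tasks:
        for _ in range(n_samples):
            traj = agent.rollout(task)            # screenshots + actions, no labels
            if evaluator.is_success(task, traj):  # VLM evaluator stands in for reward
                kept.append(traj)
    agent.finetune(kept)                          # standard behavior cloning / SFT
    return agent
```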
🚨 New paper on RL, synthetic data, LLM math reasoning (MATH / GSM8K)
TL;DR: RL on wrong responses (yes, "proper" RL, not filtered SFT or STaR / RFT) scales the utility of synthetic data by **8x**,
❌spurious correlations
✅stitching, credit assignment
🧵⬇️
We see that the improvement our evaluators provide scales favorably with evaluator capability, with the best evaluator achieving a 29% improvement over the previous SOTA.
Fine-tuning LM agents with a success signal can be really juicy, and SFT on good trajectories would be a good first step.
The finetuning part in Ofir's post is quite similar to the iOS filtered-BC experiment in our work, where we see a 75% relative improvement. I'm curious about the results we
Predictions:
>=2 orgs will get 35% on SWE-bench by Aug 1, 2024.
A fully open source system will reach 35% by Nov 1, 2024. Probably based on SWE-agent + ACI improvements: debugger, better code retrieval, lang. server protocol. The LM will be finetuned on ~500 good trajectories
This looks cool:
• Predict downstream performance directly from training FLOPs
• Generalize across different model families
• Can be derived from widely available data - training FLOPs and benchmark results
Looking forward to the code release so we can try it out firsthand
Will LM agents continue to scale? Which LM post-training methods work at scale?
To answer these questions, we built Observational Scaling Laws: a generalization of scaling laws that makes accurate predictions without model training, using existing public LMs.
GPT-4o can also generate any combination of audio, text, and image outputs, which leads to interesting new capabilities we are still exploring.
See e.g. the "Explorations of capabilities" section in our launch blog post (), or these generated images:
@xhluca
A shameless plug for our Agent-Eval-Refine paper.
Autonomous evaluators turn any in-the-wild digital environment into an effective benchmark/training environment.
New paper from
@Berkeley_AI
on Autonomous Evaluation and Refinement of Digital Agents!
We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing the state of the art by 29% to 75%.
[🧵]
Next, we show how they could be used for improving agents, either through inference-time guidance or fine-tuning.
We start with WebArena, a popular web agent benchmark. We experiment with integrating the SOTA agent with the Reflexion algorithm, using our evaluators as the reward function.
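A toy sketch of how an autonomous evaluator can drive Reflexion-style retries (the interfaces here are placeholders, not the paper's implementation): the evaluator replaces the benchmark's ground-truth reward, and failed attempts produce verbal reflections that are fed back into the next attempt.

```python
def reflexion_with_evaluator(agent, evaluator, task, max_trials=3):
    reflections = []
    traj = None
    for _ in range(max_trials):
        traj = agent.rollout(task, reflections=reflections)
        if evaluator.is_success(task, traj):           # reward signal from the evaluator
            break
        reflections.append(agent.reflect(task, traj)) # self-critique of the failure
    return traj
```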
In hindsight, it's not sci-fi but today's tech leveraged well.
Previous video models weren't scaled up like LLMs. VideoPoet already creates good video at a cost akin to an 8B LLM. Dramatic improvement with further scaling and a good recipe is certain.
But seeing the outputs personally? Wow
Excited to share what
@billpeeb
@_tim_brooks
and my team have been working on for the past year! Our text-to-video model Sora can generate videos of complex scenes up to a minute long. We're excited about making this step toward AI that can reason about the world like we do.
OpenRLHF
An Easy-to-use, Scalable and High-performance RLHF Framework
As large language models (LLMs) continue to grow by scaling laws, reinforcement learning from human feedback (RLHF) has gained significant attention due to its outstanding performance. However,
@alex_lacoste_
This looks great! I wish we had this infrastructure during our previous project on auto-eval-refinement.
One quick question: Does your WebArena agent receive images as input? And do you have any intuition on how much benefit this offers?
This makes so much sense.
When the output y is actually a distribution given the input x, regression only gives you the weighted average, which isn't optimal at all.
With the expressiveness of a classification loss, we can simply model the distribution p(y|x). (A toy sketch follows the quoted paper below.)
Stop Regressing
Training Value Functions via Classification for Scalable Deep RL
Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained using a mean squared error regression objective to
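Here is that toy sketch of one such construction (a "two-hot" categorical target over value bins; the bin range, shapes, and names are illustrative assumptions, not taken from the paper): replace the MSE regression head with a distribution over bins and train with cross-entropy.

```python
import torch
import torch.nn.functional as F

NUM_BINS, V_MIN, V_MAX = 51, -10.0, 10.0
bin_centers = torch.linspace(V_MIN, V_MAX, NUM_BINS)

def two_hot(returns):
    """Project scalar returns (shape [B]) onto the two nearest value bins."""
    returns = returns.clamp(V_MIN, V_MAX)
    idx = torch.bucketize(returns, bin_centers).clamp(1, NUM_BINS - 1)
    lo, hi = bin_centers[idx - 1], bin_centers[idx]
    w_hi = (returns - lo) / (hi - lo)
    dist = torch.zeros(returns.shape[0], NUM_BINS)
    dist.scatter_(1, (idx - 1).unsqueeze(1), (1 - w_hi).unsqueeze(1))
    dist.scatter_(1, idx.unsqueeze(1), w_hi.unsqueeze(1))
    return dist

def value_loss(logits, returns):
    """Cross-entropy against the two-hot target instead of MSE on a scalar head."""
    return F.cross_entropy(logits, two_hot(returns))

def value_estimate(logits):
    """Read out a scalar value as the expectation over bin centers."""
    return (logits.softmax(-1) * bin_centers).sum(-1)
```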
In our new work on socially compliant navigation, we show how real-world RL-based finetuning can enable mobile robots to adapt on the fly to the behavior of humans, to obstacles, and other challenges associated with real-world navigation:
@frankxu2004
@berkeley_ai
Good question! Evaluators should work just fine, whether on live websites or in other domains, for tasks of similar complexity
In fact, each evaluator shares the same weights across all experiments, with only a change in the prompt. And WebArena isn't part of its training data
@AhmadMustafaAn1
@xiao_ted
They do, but they are in fact from Microsoft Research Asia, based in Beijing, which could have a slightly different culture from the US-based teams.
@tomosman
@garrytan
@ylecun
If high-bandwidth visual input were really the key to human intelligence, people without vision would be at a significant disadvantage. However, this is clearly not the case.
No LLM is secure! A year ago, we unveiled the first of many automated jailbreaks capable of cracking all major LLMs. 🚨
But there is hope?!
We introduce Short Circuiting: the first alignment technique that is adversarially robust. 🧵
📄 Paper:
@dwarkesh_sp
@johnschulman2
Week 7:
There are many intuitive reasons why "transcendence" would happen, but having some rigorous empirical evidence / theoretical support is fantastic.
This paper seems very interesting: say you train an LLM to play chess using only transcripts of games of players up to 1000 elo. Is it possible that the model plays better than 1000 elo? (i.e. "transcends" the training data performance?). It seems you get something from nothing,
@mrdrozdov
It actually can. Sorry that I didn't keep the screenshot, but in my case I successfully invoked insert mode, wrote something, exited with :wq, and used cat to inspect the edited result
@shuyanzhxyc
@sreecharan93
Probably because TACL and ACL/EMNLP/NAACL/etc. are all sponsored by the ACL? I imagine doing so for conferences that don't share the same root would be much harder.