Highly recommend you ask people to sign your poster tube at conferences. I tried it for the first time this NeurIPS and it feels like I’m that cool kid who had a cast signed by the whole class :D
Describing Differences in Image Sets with Natural Language
paper page:
How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical.
[1/5] Introducing VisDiff - an #AI tool that describes differences in image sets with natural language.
VisDiff can summarize model failures, compare models, find nuanced dataset differences, discover what makes an image memorable, and so much more!
I feel like the academic equivalent of the iPhone alarm noise is the slack message “found this recent work on arxiv, seems similar to what you have been working on”
My advisor was surprised that my code has docstrings and i realized (1) my advisor has low standards and (2) i have a ton of docstrings because i need to explain what i'm doing so copilot can code it for me
A fantastic lecture from @jefrankle taught me that (yet again) random sampling is the strongest baseline out there. I’m getting a PhD at one of the best universities in the world, and I’m going to be bested by np.random.choice
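(Obligatory sketch of the baseline that keeps winning — uniform random selection of an eval subset; `pool` and `k` are placeholders:)

```python
import numpy as np

# The humbling baseline: pick k examples uniformly at random.
# Any fancy selection method has to beat this first.
rng = np.random.default_rng(0)
pool = np.arange(10_000)   # placeholder: indices of candidate examples
k = 500
baseline_subset = rng.choice(pool, size=k, replace=False)
```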
Had a great BAIR salon on Science and AI! I will say there is nothing quite like 75 scientists (including students and faculty from CS and bio, a dean, a Nobel laureate, and many more) in a room debating whether GPT will solve science :P
Exciting news - Chatbot Arena now supports image uploads📸
Challenge GPT-4o, Gemini, Claude, and LLaVA with your toughest questions. Plot-to-code, VQA, storytelling, you name it.
Let's get creative and have fun! Leaderboard coming soon.
Credits to builders
@chrischou03
There is a mysterious new model called gpt2-chatbot accessible from a major LLM benchmarking site. No one knows who made it or what it is, but I have been playing with it a little and it appears to be in the same rough ability level as GPT-4. A mysterious GPT-4 class model? Neat!
In the first year of my PhD I was obsessed with this idea of live benchmarks which would apply filtering to a stream of data to create eval sets which always represent the current world. I completely failed at this ofc, but happy to see a much better version come out :)
Introducing Arena-Hard – a pipeline to build our next generation benchmarks with live Arena data.
Highlights:
- Significantly better separability than MT-bench (22.6% -> 87.4%)
- Highest agreement to Chatbot Arena ranking (89.1%)
- Fast & cheap to run ($25)
- Frequent updates
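(For the curious: a rough sketch of how a separability number like that can be computed — bootstrap a confidence interval on each model's win rate, then count the model pairs whose intervals don't overlap. This is my reading of the idea, not the actual Arena-Hard code; `model_wins` would be a per-model 0/1 array of battle outcomes:)

```python
import numpy as np

def bootstrap_ci(wins, n_boot=1000, alpha=0.05):
    """CI on a model's win rate; `wins` is a 0/1 array, one entry per battle."""
    rng = np.random.default_rng(0)
    rates = [rng.choice(wins, size=len(wins), replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(rates, alpha / 2), np.quantile(rates, 1 - alpha / 2)

def separability(cis):
    """Fraction of model pairs whose confidence intervals do not overlap."""
    names = list(cis)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    disjoint = sum(cis[a][1] < cis[b][0] or cis[b][1] < cis[a][0]
                   for a, b in pairs)
    return disjoint / len(pairs)

# usage sketch: cis = {m: bootstrap_ci(w) for m, w in model_wins.items()}
```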
Me: "Why don't you caption this image?"
LLaVA 1.5: "I'm sorry, I cannot caption this image as it is a close-up of a bear climbing a tree, and it is not possible to provide a caption that accurately describes the scene without potentially causing harm to the bear or the tree."
We tag all the conversations containing code snippets in Coding Arena. In this domain, we find GPT-4-Turbo performs even more strongly.
This aligns with recent findings on challenging coding benchmarks such as LiveCodeBench by
You can also easily view
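(For the curious, the "contains a code snippet" tag can be as simple as a regex pass — a hypothetical sketch, not the actual Arena pipeline:)

```python
import re

# Crude heuristics: fenced code blocks, indented blocks, or common keywords.
CODE_PATTERNS = [
    re.compile(r"```.*?```", re.DOTALL),           # markdown code fences
    re.compile(r"^(?: {4}|\t).+", re.MULTILINE),   # indented code lines
    re.compile(r"\b(?:def|class|import)\b|#include"),  # common keywords
]

def contains_code(conversation_text: str) -> bool:
    return any(p.search(conversation_text) for p in CODE_PATTERNS)
```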
My findings from an afternoon of playing around with captioning/VQA models -
(1) they don't like captioning
(2) they have the attitude of an edgy teen
If you want to laugh at the audacity of BLIP/InstructBLIP, check out these examples:
I'm sure you've wondered:
Can GPT-4v draw a TikZ Unicorn if we give it visual feedback?
I am here to settle this open problem
An 8-part 🧵 on my attempt to get GPT-4v to draw a 🦄 as good as @SebastienBubeck et al's, when given multiple rounds for improvement.
TL;DR: i failed
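(The loop itself is simple: generate TikZ, compile it, screenshot the PDF, and feed the image back. A rough sketch of one round — the render helper and prompt are hypothetical, and it assumes the OpenAI chat API with image inputs as it existed at the time:)

```python
import base64
from openai import OpenAI

client = OpenAI()

def critique_round(tikz_code: str, png_path: str) -> str:
    """Show GPT-4V its own rendered unicorn and ask for revised TikZ."""
    with open(png_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Here is your TikZ unicorn and its rendered output. "
                         "Improve the drawing and return only TikZ code.\n\n"
                         + tikz_code},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# loop: tikz = critique_round(tikz, render_tikz_to_png(tikz))  # render helper not shown
```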
So,
1. My friend @conor_power23 and I have a podcast
2. I have Tourette's and we are still figuring out the proper accommodations, so there is some sniffling n' things in the recordings
3. We got separate mics for everyone so the audio should improve
My friends @lisabdunlap and @conor_power23 run a podcast called “Thinking About Thinking About Computers,” chatting with CS PhD students about their research journeys & life outside of work. Great resource (especially if you’re considering a PhD program):
Super excited to share that VisDiff has been accepted to #CVPR2024 and selected as an oral (90/11,532)! We will give a 15-min presentation going through the methods and exciting applications enabled by VisDiff. See you in Seattle!
@deliprao
@profjoeyg
@beenwrekt
One of my favorite group meeting presentations: telling Joey I was going to explain diffusion models, then spending an hour convincing everyone we should get TikTok famous (slide for evidence)
🔥Exciting News — we are thrilled to announce Chatbot Arena’s Vision Leaderboard!
Over the past 2 weeks, we’ve collected 17K+ votes across diverse use cases.
Highlights:
- GPT-4o leads the way, followed by Claude 3.5 Sonnet in #2 and Gemini 1.5 Pro in #3
- Open model
Check out our recent podcast episode with the amazing @shmu_h, talking about Art & CS, AI & copyright, and the pros and cons of having an emotional connection with your research.
Youtube:
Spotify:
It’s also just fun to play around with cool models… if I can learn interesting differences between models people use every day, I’m happy to be a horribly paid data scientist
A bit disappointed to see the bigger picture missed.
The goal of the Arena project is to build a better evaluation platform to advance the field. We offer free LLM services, leaderboard insights, and open conversation & feedback data for better evals.
Our aim is to bring value
Congrats @tianjun_zhang on some amazing work! Fun fact: Joey’s group made Tianjun give a practice talk for this work while wine tasting a few months back. The waiter was quite confused…
📢 Excited to release RAFT: Retrieval Augmented Fine-Tuning for domain-specific RAG, a three-way collaboration between @berkeley_ai, @Azure, and @AIatMeta! 🤝
Drawing parallels between LLMs and students in open-book (RAG) 📔 and closed-book exams (SFT) 🧠, we present a better recipe
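(As I understand the recipe, each training example pairs a question with the document that actually answers it plus a few sampled distractors, and sometimes drops the oracle document entirely so the model also learns to answer from memory. A hypothetical sketch of the data construction — field names are mine, not the paper's:)

```python
import random

def make_raft_example(question, answer, oracle_doc, corpus,
                      n_distractors=3, p_drop_oracle=0.2):
    """Build one RAFT-style training example: question + mixed context."""
    distractors = random.sample([d for d in corpus if d != oracle_doc],
                                n_distractors)
    # Occasionally omit the oracle doc to simulate failed retrieval.
    docs = distractors if random.random() < p_drop_oracle \
        else distractors + [oracle_doc]
    random.shuffle(docs)
    context = "\n\n".join(docs)
    return {"prompt": f"Context:\n{context}\n\nQuestion: {question}",
            "completion": answer}
```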
Stop by Arch Exhibit Hall #102 today to chat about our work on describing differences in image sets! And if you have some sets you want compared, we will be taking requests : D
Excited to be at #CVPR2024 in Seattle! We will be presenting:
1. Describing Differences Between Two Image Sets (w/ @lisabdunlap):
Oral: Friday 1-2:30 PM @ Summit Hall C
Poster:
- Tuesday (tomorrow) 9:40-11 AM @ Arch Exhibit Hall #102
- Friday 5-6:30 PM @ Arch 4A-E #115
2.
Exciting new blog -- What’s up with Llama-3?
Since Llama 3’s release, it has quickly jumped to the top of the leaderboard. We dive into our data and answer the questions below:
- What are users asking? When do users prefer Llama 3?
- How challenging are the prompts?
- Are certain users
Biases in vision models pose a critical threat to the deployment of AI systems. How can we detect potential issues (semi-)automatically?
#CVPR2024 highlight
Saw cool work at ICCV by @BrinkmannJannik on the impact of data, model type, and training objective on bias. Made me rethink my strong stance that "biased model = biased dataset". Perhaps the model itself actually does matter a lot...
[4/5] Some things we found with VisDiff:
* ResNet trained on ImageNet does worse when a person is present
* StableDiffusionV2 produces more images in picture frames than V1
* ImageNetV2 (likely) contains more social media images than ImageNet
Sitting in the computer vision bay in lab often feels like I’m on the Big Bang Theory: one-liners followed by thunderous laughter every 45 seconds or so.
@sh_reya
I really struggle with this. To some degree I think working in a large field that values novelty is bound to come with some pressure, but I should feel excited to read a paper on a topic I'm clearly interested in. If anyone has managed that mindset, I would love some wisdom.
How far can we go with vision alone?
Excited to reveal our Large Vision Model! Trained with 420B tokens, effective scalability, and enabling new avenues in vision tasks! (1/N)
Kudos to @younggeng, @Karttikeya_m, @_amirbar, @YuilleAlan, Trevor Darrell, @JitendraMalikCV, and Alyosha Efros!
Tried using GPT-4V to generate LaTeX code for my plots given a reference style. It got it wrong at first, but I gave it a pic of the incorrect plot and it corrected its mistakes (although not perfectly).
Next I need to see if it can generate code to match my seaborn plots...
🔥Exciting news -- GPT-4-Turbo has just reclaimed the No. 1 spot on the Arena leaderboard again! Woah!
We've collected over 8K user votes from diverse domains and observe its strong coding & reasoning capabilities over other models. Hats off to @OpenAI for this incredible launch!
To offer
@PreetumNakkiran
The ambiguity of the term out of distribution really irks me. Someone could show me a test example and my arguments for it being in or out of distribution would be equally valid
[2/5] We explore a task we dub set difference captioning: given two image sets A and B, output language descriptions that are more true for A than B. To evaluate VisDiff, we construct VisDiffBench, a benchmark of 187 paired image sets with ground-truth differences.
It's concerning how similar my high school self is to an LLM: when prompted incorrectly, we get the right answer but then overthink and change it at the last minute. Anyways, some very cool work by very cool people!
Language models can imitate patterns in prompts. But this can lead them to reproduce inaccurate information if present in the context.
Our work () shows that when given incorrect demonstrations for classification tasks, models first compute the correct
To @Wenliang_Dai and other InstructBLIP authors: why do so many captions get the first few sentences right but then hallucinate by saying something like "In addition, there are several other X scattered throughout the scene"?
Ex:
[3/5] We create VisDiff, a super stellar system that proposes and scores differences given two sets, returning a list of difference descriptions sorted by how well they differentiate the two groups.
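(For a flavor of the scoring half: one simple way to rank a candidate description is to embed it with CLIP and compare its average similarity to set A vs. set B. A toy sketch, not the full VisDiff pipeline:)

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def score_description(description, set_a_paths, set_b_paths):
    """Higher score = description is more true of set A than set B."""
    def mean_sim(paths):
        images = [Image.open(p) for p in paths]
        inputs = processor(text=[description], images=images,
                           return_tensors="pt", padding=True)
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return (img @ txt.T).mean().item()
    return mean_sim(set_a_paths) - mean_sim(set_b_paths)
```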
At the British museum and I really don’t understand why everyone here looks so calm; the first room I walked into had a wall of Greek pottery from the 4th century BCE presented like it was no big thing and I have been losing my mind since then
Exciting news -- we're thrilled to announce that LMSYS + @kaggle are launching a human preference prediction competition with $100,000 in prizes!
Your challenge is to predict which responses users will prefer in head-to-head battles between LLMs in the Chatbot Arena real-world
New podcast ep with @lgrinberg, talking about his transition from software engineer to lawyer! Very interesting to learn about how different the legal industry is from tech :)
📢 Excited to release the live Berkeley Function-Calling Leaderboard! 🔥 Also debuting openfunctions-v2 🤩, the latest open-source SoTA function-calling model, on par with GPT-4. 🆕 Native support for JavaScript, Java, and REST! 🫡
Leaderboard:
Blog:
Anyone have a good universal model for finding subtle differences between two images? I found quite a few papers on change detection, but they all seem to be task-specific.
I tried feeding two images to LLaVA side-by-side, but it isn't looking too promising...
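(In the meantime, the simplest baseline seems to be pixel-level SSIM — it at least localizes where two aligned images disagree. A quick sketch with scikit-image, assuming same-size, roughly aligned images:)

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

img1 = np.array(Image.open("a.png").convert("L"))  # grayscale
img2 = np.array(Image.open("b.png").convert("L"))

score, diff_map = structural_similarity(img1, img2, full=True)
print(f"SSIM: {score:.3f}")           # 1.0 means identical
changed = diff_map < 0.5              # rough mask of where they differ
print(f"{changed.mean():.1%} of pixels flagged as different")
```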
@pecey01
For figures, start with a very simple version and have 5-10 people look at it who have only read your abstract. I have wasted many hours on complex figures that no one understands. Similarly, consider adding a teaser fig which just defines inputs/outputs.
@KenAKAFrosty
I haven't noticed this in my analysis (im-also-a-good-gpt2-chatbot's performance on creative writing, at least, was quite good, same with longer queries), but it would be interesting to look into this further
@pecey01
So many, but the most TL;DR version is: read one of the many Medium articles on how to make plots look good (i am also partial to the Paired color palette).
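(Concretely, the palette bit is one line with seaborn — a tiny example:)

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", palette="Paired")  # the Paired palette
for i in range(4):
    plt.plot(range(10), [i + 0.1 * x for x in range(10)], label=f"run {i}")
plt.legend()
plt.show()
```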
@sh_reya
Agreed. I found a bug last night because I chose to visualize the images… I think a good 25% of my bugs are caught by looking at data/predictions
@joe_hellerstein
@sh_reya
@lintool
Thank you for the supportive words. Luckily research at Berkeley is so dang fun that even with the anxiety-inducing Slack messages, it’s well worth it :)