Lisa Dunlap

@lisabdunlap

680 Followers · 181 Following · 31 Media · 169 Statuses

messing around with model evals @berkeley_ai and @lmsysorg

Joined October 2021
@lisabdunlap
Lisa Dunlap
9 months
Highly recommend you ask people to sign your poster tube at conferences. I tried it for the first time this NeurIPS and it feels like I’m that cool kid who had a cast signed by the whole class :D
Tweet media one
3
12
176
@lisabdunlap
Lisa Dunlap
9 months
I finally feel like an ML researcher.
@_akhaliq
AK
9 months
Describing Differences in Image Sets with Natural Language paper page: How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of
Tweet media one
1
34
155
5
3
143
@lisabdunlap
Lisa Dunlap
9 months
[1/5] Introducing VisDiff - an #AI tool that describes differences in image sets with natural language. VisDiff can summarize model failures, compare models, find nuanced dataset differences, discover what makes an image memorable, and so much more!
Tweet media one
4
30
126
@lisabdunlap
Lisa Dunlap
11 months
I feel like the academic equivalent of the iPhone alarm noise is the slack message “found this recent work on arxiv, seems similar to what you have been working on”
1
4
54
@lisabdunlap
Lisa Dunlap
11 months
My advisor was surprised that my code has docstrings and i realized (1) my advisor has low standards and (2) i have a ton of docstrings because i need to explain what i'm doing so copilot can code it for me
5
0
47
@lisabdunlap
Lisa Dunlap
1 year
A fantastic lecture from @jefrankle taught me that (yet again) random sampling is the strongest baseline out there. I’m getting a PhD at one of the best universities in the world, and I’m going to be bested by np.choice
2
2
39
@lisabdunlap
Lisa Dunlap
10 months
Had a great BAIR salon on Science and AI! I will say there is nothing quite like 75 scientists (including students and faculty from CS and bio, a dean, a Nobel laureate, and many more) in a room debating whether GPT will solve science :P
Tweet media one
0
4
33
@lisabdunlap
Lisa Dunlap
3 months
We added images to chatbot arena! Here’s another fun demo:
@lmsysorg
lmsys.org
3 months
Exciting news - Chatbot Arena now supports image uploads📸 Challenge GPT-4o, Gemini, Claude, and LLaVA with your toughest questions. Plot to code, VQA, story telling, you name it. Let's get creative and have fun! Leaderboard coming soon. Credits to builders @chrischou03
13
67
421
0
9
34
@lisabdunlap
Lisa Dunlap
4 months
Battle with the mysterious gpt2! I even added a comically large vote button to the huggingface leaderboard so people know what to click :P
@emollick
Ethan Mollick
4 months
There is a mysterious new model called gpt2-chatbot accessible from a major LLM benchmarking site. No one knows who made it or what it is, but I have been playing with it a little and it appears to be in the same rough ability level as GPT-4. A mysterious GPT-4 class model? Neat!
Tweet media one
Tweet media two
95
212
1K
4
7
29
@lisabdunlap
Lisa Dunlap
4 months
After a grueling few days of having to click accept to view the chatbot leaderboard, we put it back on HF :p
0
5
28
@lisabdunlap
Lisa Dunlap
5 months
In the first year of my PhD I was obsessed with this idea of live benchmarks which would apply filtering to a stream of data to create eval sets which always represent the current world. I completely failed at this ofc, but happy to see a much better version come out :)
@lmsysorg
lmsys.org
5 months
Introducing Arena-Hard – a pipeline to build our next generation benchmarks with live Arena data. Highlights: - Significantly better separability than MT-bench (22.6% -> 87.4%) - Highest agreement to Chatbot Arena ranking (89.1%) - Fast & cheap to run ($25) - Frequent update
Tweet media one
20
122
636
0
2
26
@lisabdunlap
Lisa Dunlap
11 months
Me: "Why don't you caption this image?" LLaVA 1.5: "I'm sorry, I cannot caption this image as it is a close-up of a bear climbing a tree, and it is not possible to provide a caption that accurately describes the scene without potentially causing harm to the bear or the tree."
Tweet media one
1
6
22
@lisabdunlap
Lisa Dunlap
5 months
Excited to be a part of creating the category-based leaderboard! Now I have a ton of category-based model insights and Gradio knowledge
@lmsysorg
lmsys.org
5 months
We tag all the conversations containing code snippets in Coding Arena. In this domain, we find GPT-4-Turbo performs even stronger. This aligns with the recent finding in challenging coding benchmark such as LiveCodeBench by You can also easily view
Tweet media one
3
10
115
2
4
22
@lisabdunlap
Lisa Dunlap
1 year
My findings from an afternoon of playing around with captioning/VQA models: (1) they don't like captioning, (2) they have the attitude of an edgy teen. If you want to laugh at the audacity of BLIP/InstructBLIP, check out these examples:
Tweet media one
1
3
18
@lisabdunlap
Lisa Dunlap
10 months
I would like to throw in a unicorn we asked GPT-4 to make right after its release that has been hanging in our lab ever since
Tweet media one
@DimitrisPapail
Dimitris Papailiopoulos
10 months
I'm sure you've wondered: Can GPT-4v draw a TikZ Unicorn if we give it visual feedback? I am here to settle this open problem An 8-part 🧵on my attempt to get GPT-4v to draw a🦄 as good as @SebastienBubeck et al's, when given multiple rounds for improvement. TL;DR: i failed
Tweet media one
Tweet media two
5
16
142
2
1
17
@lisabdunlap
Lisa Dunlap
4 months
Can confirm, very good.
@sama
Sam Altman
4 months
it is a very good model (we had a little fun with the name while testing)
Tweet media one
51
184
2K
2
1
14
@lisabdunlap
Lisa Dunlap
4 months
Please stop reminding me that there is another L. Dunlap who is doing far cooler research
Tweet media one
0
0
13
@lisabdunlap
Lisa Dunlap
6 months
So, 1. My friend @conor_power23 and I have a podcast 2. I have Tourette's and we are still figuring out the proper accommodations, so there is some sniffling n' things in the recordings 3. We got separate mics for everyone so the audio should improve
@sh_reya
Shreya Shankar
6 months
My friends @lisabdunlap and @conor_power23 run a podcast called “Thinking About Thinking About Computers,” chatting with CS PhD students about their research journey & life outside of work. Great resource (especially if you’re considering a PhD program):
2
10
79
1
1
12
@lisabdunlap
Lisa Dunlap
5 months
We need to get some good matching outfits for our talk; suggestions are very welcome
@Zhang_Yu_hui
Yuhui Zhang
5 months
Super excited to share that VisDiff has been accepted to #CVPR2024 and selected as an oral (90/11,532)! We will give a 15-min presentation going through the methods and exciting applications enabled by VisDiff. See you in Seattle!
1
13
57
1
1
12
@lisabdunlap
Lisa Dunlap
3 months
I am going to be at CVPR next week, let me know if you want to meet up!
0
0
11
@lisabdunlap
Lisa Dunlap
7 months
@deliprao @profjoeyg @beenwrekt One of my favorite group meeting presentations: telling Joey I was going to explain diffusion models, then spending an hour convincing everyone we should get TikTok famous (slide for evidence)
Tweet media one
4
2
11
@lisabdunlap
Lisa Dunlap
2 months
Blog:
@lmsysorg
lmsys.org
2 months
🔥Exciting News — we are thrilled to announce Chatbot Arena’s Vision Leaderboard! Over the past 2 weeks, we’ve collected 17K+ votes across diverse use cases. Highlights: - GPT-4o leads the way, followed by Claude 3.5 Sonnet in #2 and Gemini 1.5 Pro in #3 - Open model
Tweet media one
15
77
447
0
2
11
@lisabdunlap
Lisa Dunlap
11 months
My new favorite hobby: helping improve people’s plots and figures. Fun to do and the payoff is so satisfying.
4
0
10
@lisabdunlap
Lisa Dunlap
9 months
Time for the obligatory “I’m going to NeurIPS” tweet. Please reach out if you’re interested in meeting up!
0
0
10
@lisabdunlap
Lisa Dunlap
3 months
Can’t wait for the AP Test on prompt engineering to drop
0
0
9
@lisabdunlap
Lisa Dunlap
4 months
It’s also just fun to play around with cool models… if I can learn interesting differences between models people use every day I’m happy to be a horribly paid data scientist
@infwinston
Wei-Lin Chiang
4 months
A bit disappointed to see the bigger picture missed. The goal of the Arena project is to build a better evaluation platform to advance the field. We offer free LLM services, leaderboard insights, and open conversation & feedback data for better evals. Our aim is to bring value
12
3
51
1
1
9
@lisabdunlap
Lisa Dunlap
11 months
Spooky season @berkeley_ai
Tweet media one
1
0
9
@lisabdunlap
Lisa Dunlap
7 months
How TA'ing computer vision is going: tried helping a student in office hours only for him to tell me I was wrong because "that's just how images work"
2
0
8
@lisabdunlap
Lisa Dunlap
11 months
Sometimes I feel like the LLM academic landscape is just a ton of CS people discovering psychology exists
0
1
7
@lisabdunlap
Lisa Dunlap
6 months
Congrats @tianjun_zhang on some amazing work! Fun fact: Joey's group made Tianjun give a practice talk for this work while wine tasting a few months back. The waiter was quite confused…
@tianjun_zhang
Tianjun Zhang
6 months
📢 Excited to release RAFT: Retriever-Aware FineTuning for domain-specific RAG, a three-way collaboration between @berkeley_ai @Azure and @AIatMeta ! 🤝 Drawing parallels between LLMs and students in open-book (RAG) 📔 and closed-book exams (SFT) 🧠, we present a better recipe
Tweet media one
6
49
231
1
1
7
@lisabdunlap
Lisa Dunlap
3 months
Stop by Arch Exhibit Hall #102 today to chat about our work in describing differences in image sets! And if you have some sets you want described, we will be taking requests :D
@Zhang_Yu_hui
Yuhui Zhang
3 months
Excited to be at #CVPR2024 in Seattle! We will be presenting: 1. Describing Differences Between Two Image Sets (w/ @lisabdunlap ): Oral: Friday 1-2:30 PM @ Summit Hall C Poster: - Tuesday (tomorrow) 9:40-11 AM @ Arch Exhibit Hall #102 - Friday 5-6:30 PM @ Arch 4A-E #115 2.
Tweet media one
1
4
23
0
0
8
@lisabdunlap
Lisa Dunlap
9 months
@sarahookr Still recovering from the workshop panel: “What is overhyped? Mechanistic interpretability” :p
0
0
7
@lisabdunlap
Lisa Dunlap
4 months
I <3 Llama 3
@lmsysorg
lmsys.org
4 months
Exciting new blog -- What’s up with Llama-3? Since Llama 3’s release, it has quickly jumped to top of the leaderboard. We dive into our data and answer below questions: - What are users asking? When do users prefer Llama 3? - How challenging are the prompts? - Are certain users
Tweet media one
14
120
744
0
1
7
@lisabdunlap
Lisa Dunlap
5 months
If you like VisDiff, there’s some other great work on discovering biases in AI systems also at CVPR!
@sangwoomo
Sangwoo Mo
5 months
Biases in vision models pose a critical threat to the deployment of AI systems. How can we detect potential issues (semi-)automatically? #CVPR2024 highlight
1
7
29
0
0
6
@lisabdunlap
Lisa Dunlap
11 months
don't forget the new (+improved) Sky logo
Tweet media one
@profjoeyg
Joey Gonzalez
11 months
My students decorated the Sky Computing Lab @UCBerkeley with a 🍭 candy land theme 🍬. Was this your doing @lisabdunlap ?
Tweet media one
1
2
15
0
0
6
@lisabdunlap
Lisa Dunlap
11 months
Saw cool work at ICCV by @BrinkmannJannik on the impact of data, model type, and training objective on bias. Made me rethink my strong stance that "biased model = biased dataset". Perhaps the model itself actually does matter a lot...
1
3
5
@lisabdunlap
Lisa Dunlap
1 year
@conor_power23 Also don’t forget Georgios Pavlakos going to UT Austin for computer vision! He’s THE guy if you want to do 3D vision
0
0
5
@lisabdunlap
Lisa Dunlap
6 months
My friend Suzie wrote a blog about chickens.
0
0
5
@lisabdunlap
Lisa Dunlap
10 months
World’s coldest take, but Gmail please add an ACK feature; it feels like too much to send an email just saying “sounds good”
0
0
5
@lisabdunlap
Lisa Dunlap
9 months
[4/5] Some things we found with VisDiff: * ResNet trained on ImageNet does worse when a person is present * StableDiffusionV2 produces more images in picture frames than V1 * ImageNetV2 (likely) contains more social media images than ImageNet
Tweet media one
2
2
5
@lisabdunlap
Lisa Dunlap
1 year
Sitting in the computer vision bay in lab often feels like I’m on the Big Bang Theory: one-liners followed by thunderous laughter every 45 seconds or so.
1
0
5
@lisabdunlap
Lisa Dunlap
10 months
@sh_reya I really struggle with this. To some degree I think working in a large field which values novelty needs to come with some pressure, but I should feel excited to read a paper about a topic I clearly am interested in. If anyone has this mindset I would love some wisdom.
2
1
5
@lisabdunlap
Lisa Dunlap
9 months
This is the most excited I have been about a new model in a while, amazing work Yutong!
@YutongBAI1002
Yutong Bai
9 months
How far can we go with vision alone? Excited to reveal our Large Vision Model! Trained with 420B tokens, effective scalability, and enabling new avenues in vision tasks! (1/N) Kudos to @younggeng @Karttikeya_m @_amirbar , @YuilleAlan Trevor Darrell @JitendraMalikCV Alyosha Efros!
18
160
1K
0
0
5
@lisabdunlap
Lisa Dunlap
11 months
I love the commitment to keeping the item “as is”.
Tweet media one
1
0
5
@lisabdunlap
Lisa Dunlap
10 months
@IanArawjo Ah yes, I had this happen a few months ago…
0
0
5
@lisabdunlap
Lisa Dunlap
11 months
Tried using GPT-4V to generate LaTeX code for my plots given a reference style. Got it wrong at first, but I gave it a pic of the incorrect plot and it corrected its mistakes (though not perfectly). Next I need to see if it can generate code to match my seaborn plots...
Tweet media one
2
1
3
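In case anyone wants to try the same screenshot-feedback loop, here is a rough sketch, not the exact setup from the tweet. It assumes the OpenAI Python client (the model name, prompts, and file names are placeholders) plus pdflatex and pdftoppm on PATH for rendering the attempt back into an image.

```python
import base64, pathlib, shutil, subprocess, tempfile
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; any vision-capable chat model

def image_part(png_path: str) -> dict:
    """Encode a local PNG as an image_url content part for the chat API."""
    b64 = base64.b64encode(pathlib.Path(png_path).read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def render_latex_to_png(tex_source: str, out_png: str) -> None:
    """Compile standalone LaTeX and rasterize page 1 (pdflatex + pdftoppm assumed installed)."""
    with tempfile.TemporaryDirectory() as d:
        (pathlib.Path(d) / "plot.tex").write_text(tex_source)
        subprocess.run(["pdflatex", "-interaction=nonstopmode", "plot.tex"], cwd=d, check=True)
        subprocess.run(["pdftoppm", "-png", "-singlefile", "plot.pdf", "plot"], cwd=d, check=True)
        shutil.copy(pathlib.Path(d) / "plot.png", out_png)

def ask(messages) -> str:
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

# Round 1: ask for plot code in the style of a reference figure.
messages = [{"role": "user", "content": [
    {"type": "text", "text": "Write standalone pgfplots LaTeX that recreates my data "
                             "in the same visual style as this reference plot."},
    image_part("reference_plot.png"),
]}]
latex_code = ask(messages)
# (in practice you may need to strip markdown fences from the reply before compiling)

# Round 2: render the attempt and send the screenshot back for corrections.
render_latex_to_png(latex_code, "attempt.png")
messages += [
    {"role": "assistant", "content": latex_code},
    {"role": "user", "content": [
        {"type": "text", "text": "Here is what your code renders to; fix whatever "
                                 "does not match the reference style."},
        image_part("attempt.png"),
    ]},
]
latex_code = ask(messages)
```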
@lisabdunlap
Lisa Dunlap
5 months
@lmsysorg
lmsys.org
5 months
🔥Exciting news -- GPT-4-Turbo has just reclaimed the No. 1 spot on the Arena leaderboard again! Woah! We collect over 8K user votes from diverse domains and observe its strong coding & reasoning capability over others. Hats off to @OpenAI for this incredible launch! To offer
Tweet media one
52
202
1K
2
0
4
@lisabdunlap
Lisa Dunlap
6 months
Tweet media one
0
0
4
@lisabdunlap
Lisa Dunlap
1 year
@PreetumNakkiran The ambiguity of the term out of distribution really irks me. Someone could show me a test example and my arguments for it being in or out of distribution would be equally valid
1
0
4
@lisabdunlap
Lisa Dunlap
9 months
[2/5] We explore a task we dub set difference captioning: given two image sets A and B, output language descriptions which are more true for A than B. To evaluate VisDiff we construct VisDiffBench, a benchmark of 187 paired image sets with ground-truth differences.
Tweet media one
1
1
4
@lisabdunlap
Lisa Dunlap
6 months
It's concerning how similar my high school self is to an LLM: when prompted incorrectly, we get the right answer but then overthink and change it at the last minute. Anyways, some very cool work by very cool people!
@dannyhalawi15
Danny Halawi
6 months
Language models can imitate patterns in prompts. But this can lead them to reproduce inaccurate information if present in the context. Our work () shows that when given incorrect demonstrations for classification tasks, models first compute the correct
Tweet media one
4
15
89
1
0
4
@lisabdunlap
Lisa Dunlap
11 months
@profjoeyg @UCBerkeley i will say i cannot take full responsibility for this masterpiece, @conor_power23 + many other students were also down to clown
0
0
3
@lisabdunlap
Lisa Dunlap
9 months
To @Wenliang_Dai and other InstructBLIP authors: why do so many captions get the first few sentences right but then hallucinate by saying something like "In addition, there are several other X scattered throughout the scene" Ex:
1
0
3
@lisabdunlap
Lisa Dunlap
1 year
@conor_power23 Jealous of their soon-to-be students, these people are truly fantastic!
0
0
2
@lisabdunlap
Lisa Dunlap
11 months
@brjathu well played my friend.
Tweet media one
0
0
3
@lisabdunlap
Lisa Dunlap
9 months
[3/5] We create VisDiff, a super stellar system that proposes and scores differences given two sets, returning a list of difference descriptions sorted by how well they differentiate the two groups.
1
1
3
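For the curious, here is a toy sketch of the propose-then-rank idea described above. It is not the actual VisDiff implementation: the real proposer (an LLM over captions) and ranker (a vision-language scorer) are abstracted behind the hypothetical `propose` and `applies` callables you would supply yourself.

```python
from typing import Callable, List, Tuple

def rank_differences(
    set_a: List[str],                                      # image paths/ids for set A
    set_b: List[str],                                      # image paths/ids for set B
    propose: Callable[[List[str], List[str]], List[str]],  # hypothetical: proposes candidate descriptions
    applies: Callable[[str, str], float],                  # hypothetical: how well a description fits an image
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Propose candidate difference descriptions, then sort them by how well
    they separate set A from set B (higher mean score on A than on B)."""
    candidates = propose(set_a, set_b)
    scored = []
    for description in candidates:
        mean_a = sum(applies(description, img) for img in set_a) / len(set_a)
        mean_b = sum(applies(description, img) for img in set_b) / len(set_b)
        scored.append((description, mean_a - mean_b))  # separation score
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With those two callables in place, a description like "contains a person" would rank highly if it holds for most images in A and few in B.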
@lisabdunlap
Lisa Dunlap
11 months
At the British museum and I really don’t understand why everyone here looks so calm; the first room I walked into had a wall of Greek pottery from the 4th century BCE presented like it was no big thing and I have been losing my mind since then
0
0
3
@lisabdunlap
Lisa Dunlap
4 months
Learn about how humans interact with LLMs and win $25k in the process... win-win if you ask me
@lmsysorg
lmsys.org
4 months
Exciting news -- we're thrilled to announce that LMSYS + @kaggle are launching a human preference prediction competition with $100,000 in prizes! Your challenge is to predict which responses users will prefer in head-to-head battles between LLMs in the Chatbot Arena real-world
Tweet media one
8
69
491
0
0
3
@lisabdunlap
Lisa Dunlap
4 months
Absolutely fascinating conversations, I didn’t think traffic law could be so philosophical
@conor_power23
Conor Power
4 months
New podcast ep with @lgrinberg talking about his transition from software engineer to lawyer! Very interesting to learn about how different the legal industry is from tech :)
0
0
5
0
0
3
@lisabdunlap
Lisa Dunlap
6 months
Congrats on another positively perfect project my friend :)
@shishirpatil_
Shishir Patil
6 months
📢Excited to release the live Berkeley Function-Calling Leaderboard! 🔥 Also debuting openfunctions-v2 🤩 the latest open-source SoTA function-calling model on-par with GPT-4🆕Native support for Javascript, Java, REST! 🫡 Leaderboard: Blog:
Tweet media one
9
68
300
0
0
3
@lisabdunlap
Lisa Dunlap
11 months
@conor_power23 Ayy that was my first conference too! Very good vibe and fun people
0
0
2
@lisabdunlap
Lisa Dunlap
9 months
@conor_power23 Here’s my pitch: rent out a Carnival cruise ship to hold the next ML conference
1
0
2
@lisabdunlap
Lisa Dunlap
11 months
When your safety training was too effective..
1
0
2
@lisabdunlap
Lisa Dunlap
11 months
@conor_power23 Do the Adam Sandler method: make all your fun lab mates co-authors
1
0
2
@lisabdunlap
Lisa Dunlap
1 year
@eternalroree @conor_power23 I would argue it’s more of an epidemic than a bug
0
0
2
@lisabdunlap
Lisa Dunlap
11 months
@conor_power23 @justinjaffray Be the change that you want to see in the world, as they say
1
0
2
@lisabdunlap
Lisa Dunlap
11 months
Anyone have a good universal model for finding subtle differences in two images? I found quite a few papers on change detection but they all seem to be task specific. I tried feeding in two images to LLaVa side-by-side but it isn't looking too promising...
Tweet media one
1
0
1
@lisabdunlap
Lisa Dunlap
11 months
@pecey01 For figures, start with a very simple version and have 5-10 people look at it who have only read your abstract. I have wasted many hours on complex figures that no one understands. Similarly, consider adding a teaser fig which just defines inputs/outputs.
0
0
2
@lisabdunlap
Lisa Dunlap
11 months
Good point from @profjoeyg , the feedback image may not have been used since I said the graph was cut off…
0
0
2
@lisabdunlap
Lisa Dunlap
11 months
Special thanks to @enfleisig and @YutongBAI1002 for curating the vibe
0
0
2
@lisabdunlap
Lisa Dunlap
1 year
@beenwrekt Via Rorschach rules, the number is actually dependent on the personality traits of the observer.
0
0
2
@lisabdunlap
Lisa Dunlap
1 year
Welcoming in the spooky season with @genmoai 's new video model
2
0
2
@lisabdunlap
Lisa Dunlap
1 year
this one is just cool
0
0
2
@lisabdunlap
Lisa Dunlap
9 months
I will be requiring all those who visit to sign my poster tube
0
0
2
@lisabdunlap
Lisa Dunlap
11 months
1
0
1
@lisabdunlap
Lisa Dunlap
4 months
@KenAKAFrosty I haven't noticed this from my analysis (im-also-a-good-gpt2-chatbot performance on creative writing at least was quite good, same with creative writing for longer queries), but it would be interesting to look into this further
1
0
1
@lisabdunlap
Lisa Dunlap
1 year
WandB stop manipulating me.
Tweet media one
0
0
1
@lisabdunlap
Lisa Dunlap
9 months
@Zhang_Yu_hui @berkeley_ai We even both went to @ZhongRuiqi separately and asked for feedback lol
0
0
1
@lisabdunlap
Lisa Dunlap
7 months
@deliprao @profjoeyg @beenwrekt warning: it's still in the cringe stage
1
0
0
@lisabdunlap
Lisa Dunlap
5 months
@sarahookr ...yes we do...
0
0
0
@lisabdunlap
Lisa Dunlap
11 months
@sh_reya are you sure this person wasn’t me? :p
0
0
1
@lisabdunlap
Lisa Dunlap
11 months
@pecey01 So many, but the most TL;DR version is read one of the many medium articles on how to make plots look good (i am also partial to the Paired color palette).
0
0
1
@lisabdunlap
Lisa Dunlap
6 months
@sh_reya @conor_power23 Top tier guest might I add
0
0
1
@lisabdunlap
Lisa Dunlap
11 months
@profjoeyg @charlespacker amazing ep but we really need a better camera setup in lab...
0
0
1
@lisabdunlap
Lisa Dunlap
9 months
@sarahookr Highly recommend the Lindsay Lohan Christmas movie on Netflix if you love Hallmark-style movies of questionable quality
0
0
0
@lisabdunlap
Lisa Dunlap
9 months
@untitled01ipynb @lisabdunlap - I want to flex on my lab's meme Slack channel
1
0
1
@lisabdunlap
Lisa Dunlap
9 months
@lateinteraction What can I say, it’s my favorite open source project
1
0
1
@lisabdunlap
Lisa Dunlap
1 year
another eerie one
0
0
1
@lisabdunlap
Lisa Dunlap
10 months
@sh_reya Agreed. I found a bug last night because I chose to visualize the images… I think a good 25% of my bugs are caught by looking at data/predictions
0
0
1
@lisabdunlap
Lisa Dunlap
9 months
@sangwoomo I’d love to chat!
0
0
1
@lisabdunlap
Lisa Dunlap
6 months
@shmu_h Apologies on my sound quality, we are still working on it...
0
0
1
@lisabdunlap
Lisa Dunlap
5 months
@profjoeyg And vibe based evals!
1
0
1
@lisabdunlap
Lisa Dunlap
10 months
@joe_hellerstein @sh_reya @lintool Thank you for the supportive words. Luckily research at Berkeley is so dang fun that even with the anxiety-inducing Slack messages it’s well worth it :)
0
0
1
@lisabdunlap
Lisa Dunlap
7 months
I still don't know who was right, nor do I know if I understand how images work
0
0
1