Generated images not following your prompt?
Introducing 𝔻𝕣𝕖𝕒𝕞𝕊𝕪𝕟𝕔 from
@GoogleAI
: improving alignment + aesthetics of image generation models with feedback from VLMs!
✅ Model Agnostic
✅ Plug and Play
❌ RL
❌ Human Annotation
❌ Real Image
Honored to receive the 🥇BEST PAPER AWARD🥇 at CVPR 2024. Please consider using our collected fine-grained feedback!
Huge shout out to our work DreamSync, the key method we use to turn the fine-grained feedback into model improvements; details in my pinned tweet! 🚀
🌟Rich Human Feedback for Text-to-Image Generation selected as CVPR 2024 Best Paper Award Candidate (top 1%)🌟
Current text-to-image models are not perfect, but where exactly? They suffer from artifacts, misalignment, and poor aesthetics. We collect feedback on 18K images to capture all of them.
Can LLMs generate exactly 5 words? No
How about 5 sentences? No
How about 5 paragraphs? No
🤷🏻♀️
In , we evaluate the performance of LLMs on various controlled generation tasks, including numerical planning, story generation, paraphrase generation, etc. (1/n)
Today I defended my thesis and became Dr. Sun! 🌞
Thank you my committee members
@MaxMa1987
@VioletNPeng
@jonathanmay
@emilio__ferrara
and Dan O’Leary!
The slides of my presentation are here: .
Ph.D done but research never ends!
Fight on!
Thanks
@CSatUSC
for capturing one of the most important moments of my life!
Thanks to my family and my dearest advisor
@MaxMa1987
for making it happen!
#PhD
A team of collaborators from ALL different institutes? 5 female researchers + 1 high school student? I am excited that our fairness work "Pretty Princess vs. Successful Leader: Gender Roles in Greeting Card Messages" is conditionally accepted by
#CHI2022
! Stay tuned for details!
After being four-year LinkedIn-less, I’m finally back! Let’s connect and chat if you:
- are hiring and have an opening that I might be a fit for!
- are graduating, let’s go through the job searching together!
- know me or my work!
- just want to know me!
Wouldn't it be a 🌩️DISASTER if evaluation metrics always rate American English 10 times better than Indian English?
⚠️We (🔗) study dialect robustness systematically, find that current evaluation metrics are NOT robust to dialects🤯, and propose NANO🧵
Can we paraphrase sentences into desirable syntactic structures? How do we select syntactic parses that can properly guide paraphrase generation? 🤔
Our
#EMNLP2021
paper AESOP (w/
@MaxMa1987
@VioletNPeng
) proposes an adaptive way to retrieve compatible parses! 😎(1/6)
I’m working on
@eccvconf
rebuttal, and here’s one of the reviews:
“The reliance on training data may raise concerns about the model's generalizability to unseen prompts and scenarios.”
How should I rebut this? 🥲 I’m so speechless right now…
While
#Wikipedia
has been a great resource for knowledge, implicit biases can be subtle and detrimental. In our new
#ACL2021
paper (w/
@VioletNPeng
), we found that
#Wikipedia
pages intermingle professional career events with personal events in a systematically biased way.
1/5
🤶 Pretty Princess vs. Successful Leader?
Have you ever sent someone greeting cards? People write greeting card messages out of goodwill, but gender stereotypes in these messages may be reinforced without being noticed! Check out our
#chi2022
work for a systematic analysis! (1/n)
I will be at ACL next week to present this work! Look forward to connecting with folks who work on evaluation, data and beyond! HMU if any of these sounds interesting to you! DMs are open
I enjoyed the interview with Amazon a lot! It is not only a summary of my experience in natural language generation, but also a deep conversation about how my work connects and contributes to the community! Read to learn more about me, my Amazon internship, and more! 👇
Can AI help an aspiring author write a novel? Could machines learn how to make jokes? Inspired by these questions, Jiao Sun has been exploring the potential of AI-generated text. Now, as an Amazon ML Fellow, she's hoping to develop her research further.
#ConvAI
#NLProc
Sebastian was my internship mentor for 6 months. He taught me everything: technical skills, how to write a better paper, and how to collaborate with others more efficiently! If you want a lifelong mentor and to do great NLP research, I don’t see any reason why you wouldn’t apply!
My group is hiring interns for summer 2023. If you are a current PhD student and interested, please email me.
Info on internship topics:
There are also multiple open full-time roles in AI Engineering - feel free to reach out :)
Are you excited about pun generation? In
#EMNLP2022
, we have two works accepted in the main conference:
1️⃣ Context-Situated Pun Generation 👉 a brand-new task!
2️⃣ ExPUNations: Augmenting Puns with Keywords and Explanations 👉 a new dataset!
Learn more! 🧵👇
My awesome co-first author Deqing is looking for a research internship opportunity this summer; he’s one of the fastest-moving researchers I’ve seen in years!
We would appreciate it if you could send him a DM if you are recruiting interns to work on LLMs/large vision models!
🚨New paper alert🚨
With 𝔻𝕣𝕖𝕒𝕞𝕊𝕪𝕟𝕔, large language models (LLMs), vision-language models (VLMs), and text-to-image (T2I) models 𝕊𝕪𝕟𝕔 together!
They interactively and iteratively improve alignments and aesthetics of T2I models.
No RL needed. No human annotation needed.
Wow, thanks for the nice words! I think EVERY modeling work from the creative generation community should really think about having context as the constraint for generation!
When I had the idea of Pun Generation one year ago, I told myself it was not going to be possible. Until I saw this in
#EMNLP2022
from the incredible author
@sunjiao123sun_
. So exciting to see creative language generation paper in our community!
What does it mean for a generative AI model for code to be explainable? My internship work at IBM Research investigated the XAI needs under 3 scenarios: code translation, code autocompletion, and natural language to code. To appear at
#IUI2022
#HCI
😏 (1/n)
Thanks
@QVeraLiao
for the warm welcome! A super late announcement: I will be doing a research internship
@IBMResearch
on code generation! The great combination of my beloved text generation and Human-AI collaboration! Saying I’m excited would be a massive understatement! 💪💯
We ⚠️Investigate the Benefits of Free-form Rationales in our
#EMNLP2022
findings work, from both the human and the model perspectives. For humans, do rationales aid human interpretability? For models, do rationales boost the model performance? (0/n)
Would NLU models trained on EN-US generalize well to EN-IN (Indian English)/ EN-GB (British English)? I am thinking about exploring the transferability of models between dialects. Does anyone here know some good datasets for this task? 🙏
LLMs just cannot count and generate exactly the number of words that we ask for! With 7 being the magic number at which models start to struggle! (3/n)
Tu has been my amazing Google internship mate, close friend and life mentor. Can’t wait to see what he will achieve at his Google & VT adventure! All the best Tu!
I successfully defended my Ph.D. thesis. A special thank you to the members of my thesis committee: my wonderful advisor
@MohitIyyer
,
@MajiSubhransu
,
@HamedZamani
,
@lmthang
, and
@colinraffel
for their insightful feedback and advice on my research and career plans.
Honored to be part of the effort. Check out the magic that LIMA can achieve with only 1,000 prompts! Our human eval resonates with the GPT-4 eval, showing that humans prefer LIMA over, or on par with, other LLMs!
The key recipes of DreamSync are:
1. Diverse text prompts from LLMs
2. VQA feedback (TIFA score) for alignment and VILA feedback for aesthetics
3. Rejection Sampling with feedback
4. LoRA Fine-tuning
5. Multiple Iterations
(2/n)
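The recipe above can be sketched as a simple training loop. This is a minimal toy illustration with hypothetical helper names (`model.generate`, `tifa_score`, `vila_score`, `lora_finetune`), not DreamSync's actual code:

```python
def dreamsync_iteration(model, prompts, tifa_score, vila_score,
                        n_samples=8, tifa_thresh=0.9, vila_thresh=0.7):
    """One DreamSync-style round: sample, filter with VLM feedback, finetune.
    `model`, `tifa_score`, and `vila_score` are stand-ins for a T2I model,
    a VQA-based alignment scorer, and an aesthetics scorer."""
    accepted = []
    for prompt in prompts:
        # 1. Diverse prompts come from an LLM; here they are just inputs.
        # 2-3. Rejection sampling: score each sampled image with VQA
        #      (alignment) and VILA (aesthetics); keep those passing both.
        for _ in range(n_samples):
            image = model.generate(prompt)
            if tifa_score(image, prompt) >= tifa_thresh and vila_score(image) >= vila_thresh:
                accepted.append((prompt, image))
                break  # one accepted image per prompt is enough
    # 4. LoRA fine-tune on the self-generated, feedback-filtered pairs.
    return model.lora_finetune(accepted)

# 5. Multiple iterations: call dreamsync_iteration repeatedly,
#    each round starting from the improved model of the previous one.
```

The key point is that all supervision is self-generated and VLM-filtered: no RL, no human annotation, no real images.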
I'm in
#gradcohort2021
organized by amazing
@CRA_WP
! I've been enjoying the event a lot as it provides a platform for us female PhD students to connect and support each other! If you are here as well, feel free to drop me an email and we should talk!
I’m excited to see awesome things
#chatGPT
can do, but we need to make sure it’s not producing gobbledygook that merely seems right; that is misleading and can be harmful for knowledge queries. What is needed to explain generative models? Re-sharing our work:
Thanks for featuring my work with
@QVeraLiao
and all other colleagues at IBM Research. There has been an increasing effort around generative AI, and our work outlines how explainability would benefit users of these models!
Generative AI is taking the industry by storm and has become a niche of its own.
How can we make Generative AI Models Explainable?🤔
This paper by
@sunjiao123sun_
attempts to make Code-based GenAI Models explainable, let's break it down. 🧵
Excited to share our self-labeled counterfactual paper
@emnlpmeeting
#EMNLP2023
with
@ameya_godbole1
and
@robinomial
: we develop an automated procedure that generates hard negative examples (e.g., subtle unanswerable questions) from positive examples (e.g. answerable examples).
In total, we include five controlled generation tasks and show a spectrum of LLM abilities.
They are good at: constrained content generation (e.g., sentiment), story generation, and rationale generation!
Bad at: numerical planning and paraphrase generation! (4/n)
Congratulations! If you are interested in decoding methods for generation, please check out the paper: . The look-back decoding method automatically removes potential failures, repetitions, and topic drift during decoding!
🌟Thrilled to share that our paper "Look-back Decoding for Open-Ended Text Generation" won the Outstanding Paper Award at EMNLP 2023! Immense gratitude to the anonymous reviewers and to my incredible collaborators
@violet_zct
,
@real_asli
and
@MaxMa1987
.
#EMNLP2023
Excited to share I’ve joined
@Google
to lead product for AI Studio and support the Gemini API.
Lots of hard work ahead, but we are going to make Google the best home for developers building with AI.
I’m not going to settle for anything less.
@ReviewAcl
so will the April 15th review cycle be 4 weeks or 6 weeks? It is important, as many of us want the reviews back before EMNLP’s May 24 decision deadline if submitting to softconf. Btw, not a big fan of the “surprise” announcement 📣🥲
Our work on "Intriguing Properties of Compression on Multilingual Models" has been accepted to EMNLP 2022.
A collaboration led by Kelechi Ogueji w
@orevaahia
@lekeonilude
,
@sebgehr
,
@KreutzerJulia
. 🎉🔥
Great news to hear at the end of a long two weeks of travel.
The deadline is around the corner, please consider voting for Kai-Wei!
Please search for “sigdat elections” in your email inbox; it should take less than two minutes to vote!
Your support is greatly appreciated! ❤️
I am honored to be nominated by SIGDAT (the org that oversees EMNLP) to run for VP-elect with other awesome candidates who share the goal of improving our community. Please check your email to vote by 3/24.🗳️ See details:
It’s interesting to see how regional stereotypes get reflected in LLMs just by adding a country tag in the prompts! Awesome work led by
@esindurmusnlp
!
We develop a method to test global opinions represented in language models. We find the opinions represented by the models are most similar to those of the participants in USA, Canada, and some European countries. We also show the responses are steerable in separate experiments.
I sadly cannot make it to EMNLP, but please talk to
@yufei_t
about our work, especially the numerical planning part!
A lot of people have reached out about the code release; we are sorry for the delay and are working on it. The first release of our inputs and outputs will come very soon! :)
As my EMNLP trip comes closer, I wonder if there is a list of people who will be attending in person, so that I don’t need to stalk everyone’s Twitter?
@emnlpmeeting
if not, I’m happy to start one where people who want to connect can put down their names and websites 👩🏻💻
@mark_riedl
Well, I really want to self-recommend my two pun generation papers that are going to appear at EMNLP 2022, but I’m pretty sure they are not the “best” 😑 How about checking out AmbiPun first! By
@yufei_t
Among all the tasks, the Numerical Planning Benchmark (NPB) is the most intuitive: LLMs are asked to generate text matching exact numerical constraints, such as a count of words or syllables. The motivation comes from real-world scenarios such as creative writing. (2/n)
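To make the task concrete, here is a toy checker for the word-count variant (just an illustration, not the benchmark's official evaluation code):

```python
def meets_word_count(text: str, target: int) -> bool:
    """Check whether a generation contains exactly `target` words
    (whitespace tokenization, as a rough approximation)."""
    return len(text.split()) == target

# An LLM asked for "exactly 5 words" often fails this simple check:
assert meets_word_count("The quick brown fox jumps", 5)
assert not meets_word_count("The quick brown fox jumps high", 5)
```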
Thank you so much, Nedjma, for liking our work and for such a wonderful summarization! 💯 We hope you enjoyed our talk, and we would love to spark more discussion about event fairness in the community!
Looking for a high-quality QA dataset for event-centric reasoning? You definitely don’t want to miss out ESTER with **FIVE** event relation types! We are looking forward to seeing everyone’s great efforts on solving this challenging task! 💪💪
(1/5) Introducing our
#EMNLP21
paper “ESTER: A Machine Reading Comprehension Dataset for Event Semantic Relation Reasoning.” We invite everyone interested in event-centric reasoning to test your models on ESTER and submit results to our leaderboard:
Finally, we show that the predicted rich human feedback can be leveraged to improve image generation quality. Following the same recipe as in DreamSync, we use the rich human feedback to select high-quality training data to finetune and improve the generative models! (4/n)
You should catch me at the conference if you are attending in person! 👇
1️⃣ Context-Situated Pun Generation (Dec 9th 16:00-17:30 @ Atrium)
2️⃣ ExPUNations: Augmenting Puns with Keywords and Explanations (Dec 11th 15:30-17:00 @ Atrium)
Look forward to seeing many of you there!
Thanks Vera! Please swing by my talk; I look forward to talking to folks interested in code generation + explainability! It will happen on Wednesday, March 23rd, around 9:20am EDT. 🤓
Trying to attend as many
#IUI2022
sessions as I can this week. Looking forward to catching up!
If you are at IUI, check out the XAI session on Wednesday and
@sunjiao123sun_
's talk on "Investigating Explainability of Generative Models for Code through Scenario-based Design"😇
Awesome collaboration with our student intern lead Yowei Liang from UCSD, Junfeng He, Gang Li, Peizhao, Arseniy,
@N_Carolan
and all the other Google folks who are not on X at all 🤣. Feedback and discussions are absolutely welcome! (n/n)
A bit surprised, but this is important for folks who are having a hard time deciding between ACL and EACL.
Also, the EACL anonymity deadline is October 13th; it sounds like a good combo of arXiv + EACL + ACL.
[1/3] Cross-submission policy with ACL 2023:
As the
#EACL2023
notification deadline and
#ACL2023
submission deadline are unfortunately on the same day, you may submit your paper to ACL 2023 while it is still under review at EACL 2023. Keep reading...
First, where did we get the prompts for training? We utilize the LLM’s creativity (PaLM 2 in our case)! Check out the qualitative examples for a glimpse of the diverse prompts in our training, which sets a solid foundation for DreamSync’s performance. (3/n)
Congrats on the fine work
@yufei_t
! Actually, AESOP, my EMNLP work on paraphrasing, contributes by converting the generated hyperboles into more natural expressions! This is a great use case showing how much paraphrasing can help! Please stay tuned for my new post about AESOP!
Is generating hyperboles easy? Our machine says yes!
Check our new
#EMNLP2021
Findings paper "HypoGen: Hyperbole Generation with Commonsense and Counterfactual Knowledge" with Arvind and
@VioletNPeng
!🧾
Code and data coming soon!
We also evaluate DreamSync on two benchmarks, covering both text faithfulness and visual appeal. DreamSync performs the best among all the methods for textual faithfulness! (5/n)
@mark_riedl
@yufei_t
Thanks Mark! This gives me motivation to put them on arXiv first; I will post them here once they are live on arXiv. Here is the link for the AmbiPun paper:
From the annotation example, you can see that we not only 1) mark the image regions that are misaligned or implausible, but also 2) provide which words in the text prompts are misrepresented or missing! (2/n)
1/N
Tired of listening to your multilingual TTS models?
SQuId 🦑 is an automatic metric for multilingual speech synthesis: give it a waveform, it predicts how natural it sounds. To develop the model we gathered 1.9 Million listening tests in 65 locales.
With the collected data, we train a multimodal transformer to predict the rich feedback (plausibility/alignment/aesthetics scores) automatically. Our model greatly outperforms CLIP (with and without finetuning) in terms of correlation coefficients on our test set. (3/n)
Join us for the 12:30-12:30 AST poster session on 11/8!
@sunjiao123sun_
will present our work on adaptive syntactically controlled paraphrase generation. Joint work w/
@MaxMa1987
. She had a more interesting introduction 👇👇👇
In summary, yes, rationales can help with both human interpretability and model performance, but with many caveats that people should mind before drawing any conclusions! I will present this poster tomorrow
#BlackboxNLP
(Dec 8th) 11:00-12:30 at Mezzanine and Hall!
As an iterative approach, we also see the progressive improvement after each iteration quantitatively, both for text faithfulness and aesthetics. (6/n)
Ideally, dialects that share the same semantics should get exactly the same score! But this is too strict and is easily violated.
We introduce semantic perturbations and define relaxed dialect robustness: dialects should score higher than semantic perturbations! (1/n)
We investigate XAI needs for generative AI models for code through scenario-based design. More specifically, we conducted 9 workshops with 43 software engineers, using **real examples** from state-of-the-art generative AI models to elicit users' explainability needs! (2/n)
This work is conducted with my amazing mentor
@QVeraLiao
and
@mayankagarwal__
, together with expert
@michael_muller
, Stephanie Houde,
@kr_t
and fabulous manager Justin Weisz (
@gratefulspam
(🧐)) ! Please check out our paper for more details, and HMU if you want to discuss! ❤️
@WikiResearch
@USC
@WikiWomenInRed
Thanks for tagging! We hope our work brings awareness of potential event gender biases in knowledge sources (e.g., Wikipedia! I personally use it everyday 🥸), and urges Wikipedia contributors to be cautious when contributing to the pages! Check out my pinned Tweet for more!
"Men Are Elected, Women Are Married: Events Gender Bias on
#Wikipedia
": an event-centric study of gender biases on a large English Wikipedia corpus, showing that personal-life-related events are more likely to appear for females than for males.
(Sun et al, 2021)
We ask two questions:
1️⃣ HOW MUCH do dialect rewrites improve the metric value over semantic perturbations?
2️⃣ HOW OFTEN do dialect rewrites score higher than semantic perturbations?
We find that existing metrics (BLEURT, Prism, YiSi, BLEU, chrF) struggle at both. (2/n)
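The two questions above can be answered from per-sentence metric scores for dialect rewrites vs. semantic perturbations. A hypothetical sketch (not NANO's actual evaluation code):

```python
def dialect_robustness(dialect_scores, perturb_scores):
    """Given one metric score per sentence for its dialect rewrite and
    its semantic perturbation, return:
      - the mean margin (HOW MUCH dialect rewrites beat perturbations)
      - the success rate (HOW OFTEN the dialect rewrite scores higher)."""
    deltas = [d - p for d, p in zip(dialect_scores, perturb_scores)]
    margin = sum(deltas) / len(deltas)                   # HOW MUCH
    success = sum(d > 0 for d in deltas) / len(deltas)   # HOW OFTEN
    return margin, success
```

A robust metric should have a clearly positive margin and a success rate near 1; the finding is that existing metrics often fail both.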
In addition, our experiments show that NANO also helps improve metric performance on the standard metric benchmarks! You should use our metric if you are evaluating dialectal texts and want a fairer judgment! (6/n)
According to the value of {dialect} - {semantic perturb}, NANO helps improve the dialect robustness across different model sizes and languages (English, Portuguese, and Mandarin Chinese!) The success rates of {dialect} > {semantic perturb} also indicate that NANO helps! (5/n)
ExPUNations: Augmenting Puns with Keywords and Explanations 🧵
Humor understanding and generation are challenging even for humans! E.g., getting the funniness of the pun "the sushi said to the bee, wasabi!" requires the commonsense knowledge that wasabi often goes with sushi! (0/2)
@BlancheMinerva
You can probably refer to what we did in our work. We took the mC4 corpus, got the region information from the URL (.in), combined it with the output of a language-identification model (English), and used those texts as en-IN, aka Inglish. This is a rough approximation, but it benefits from scale.
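That filtering idea can be sketched in a few lines. This is only an illustration of the described approximation; `detect_lang` stands in for any language-ID model (hypothetical here):

```python
from urllib.parse import urlparse

def tag_dialect(url: str, text: str, detect_lang):
    """Approximate dialect tagging: region from the URL's top-level
    domain, language from a language-ID model run on the text."""
    tld = urlparse(url).netloc.rsplit(".", 1)[-1]
    if tld == "in" and detect_lang(text) == "en":
        return "en-IN"  # "Inglish": English text from an Indian domain
    return None
```

As noted above, this is coarse (e.g., Indian English on .com domains is missed), but it scales to web-sized corpora.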
We include 10 languages, 95 language variants in pretraining. Then, we adapt the metric to different use cases including within-language assessment and quality estimation with or without references. (4/n)
We find that people tend to talk about achievement and career for males but appearance and domestic topics for females. Using WEAT scores, we find that AI-generated (GPT-2) greeting card messages further amplify such stereotypes!! 🥲 Check out the techniques below: (2/n)
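For reference, the WEAT effect size used above can be computed from word embeddings as follows (a from-scratch toy sketch of the standard WEAT formulation on plain Python vectors, not the paper's code):

```python
import math

def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def _assoc(w, A, B):
    # Mean similarity to attribute set A minus mean similarity to B.
    return sum(_cos(w, a) for a in A) / len(A) - sum(_cos(w, b) for b in B) / len(B)

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: differential association of target word sets
    X, Y with attribute sets A, B, in units of the pooled std dev."""
    sx = [_assoc(x, A, B) for x in X]
    sy = [_assoc(y, A, B) for y in Y]
    all_s = sx + sy
    mean = sum(all_s) / len(all_s)
    std = math.sqrt(sum((s - mean) ** 2 for s in all_s) / (len(all_s) - 1))
    return (sum(sx) / len(sx) - sum(sy) / len(sy)) / std
```

A large positive effect size means X (e.g., male terms) associates with A (e.g., career words) more than Y does, which is the kind of amplification measured in the greeting card messages.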
@BlancheMinerva
It depends on whether you want a very accurate or a coarse approximation of Inglish. For the analysis part of our Inglish dataset, we use the dataset from , and I think this is probably the best you can refer to! If you just want an approximation, (to be continued)
Although I'm still in the mood of a shattered Hawaii dream, I want to share a pre-print of our accepted
#CHI2020
paper "FDHelper: Assist Unsupervised Fraud Detection Experts with Interactive Feature Selection and Evaluation". Find out more here: . 🧡
@JiaosunT
To facilitate this new setup, we collect a corpus that contains 4,551 tuples of context keywords and associated pun pairs, labeled with whether they are compatible for composing a pun, and a human-written pun for each compatible tuple. (1/3)