Consistent Diffusion Meets Tweedie.
Our latest paper introduces an exact framework to train/finetune diffusion models like Stable Diffusion XL solely with noisy data.
A year's worth of work, a breakthrough in reducing memorization, with implications for copyright 🧵
DALLE-2 has a secret language.
"Apoploe vesrreaitais" means birds.
"Contarra ccetnxniams luryca tanniounons" means bugs or pests.
The prompt: "Apoploe vesrreaitais eating Contarra ccetnxniams luryca tanniounons" gives images of birds eating bugs.
A thread (1/n)🧵
Another example: "Two whales talking about food, with subtitles". We get an image with the text "Wa ch zod rea" written on it. Apparently, the whales are actually talking about their food in the DALLE-2 language. (4/n)
The discovery of the DALLE-2 language creates many interesting security and interpretability challenges.
Currently, NLP systems filter text prompts that violate the policy rules. Gibberish prompts may be used to bypass these filters. (6/n)
A known limitation of DALLE-2 is that it struggles with text. For example, the prompt: "Two farmers talking about vegetables, with subtitles" gives an image that appears to have gibberish text on it.
However, the text is not as random as it initially appears... (2/n)
We feed the text "Vicootes" from the previous image to DALLE-2. Surprisingly, we get (dishes with) vegetables! We then feed the words: "Apoploe vesrreaitars" and we get birds. It seems that the farmers are talking about birds, messing with their vegetables! (3/n)
Some words from the DALLE-2 language can be learned and used to create absurd prompts. For example, "painting of Apoploe vesrreaitais" gives a painting of a bird. "Apoploe vesrreaitais" means to the model "something that flies" and can be used across diverse styles. (5/n)
Why are there so many different methods for using diffusion models for inverse problems? 🤔
And how do these methods relate to each other?
In this survey, we review more than 35 different methods and we attempt to unify them into common mathematical formulations.
Announcing Soft Diffusion: A framework to correctly schedule, learn and sample from general diffusion processes.
State-of-the-art results on CelebA, outperforms DDPMs and vanilla score-based models.
A 🧵to learn about Soft Score Matching, Momentum Sampling and the role of noise
Based on valid comments, we updated our paper with a discussion on Limitations and changed the title to Discovering the Hidden Vocabulary of DALLE-2. Thanks to
@mraginsky
@rctatman
@benjamin_hilton
and others for useful comments.
Stable Diffusion and other text-to-image models sometimes blatantly copy from their training images.
We introduce Ambient Diffusion, a framework to train/finetune diffusion models given only *corrupted* images as input. This reduces the memorization of the training set.
A 🧵
Today was my first day as a Research Scientist Intern at NVIDIA 🥳
Will be working with
@ArashVahdat
and the team on some pretty exciting research directions around generative models in the coming months 👌
Looking forward to it!
Solving inverse problems (e.g. inpainting/deblurring) for general domain images is hard🤷♂️
Magic Eraser and other tools use separately trained models for each task.
We introduce PSLD, a method that uses Stable Diffusion to solve all linear problems without any extra training.
New paper: "Intermediate Layer Optimization for Inverse Problems using Deep Generative Models".
Paper:
Code:
Colab:
Below a video of the Mona Lisa with inpainted eyes and a thread🧵
Multiresolution Textual Inversion.
Given a few images, we learn pseudo-words that represent a concept at different resolutions.
"A painting of a dog in the style of <jane(number)>" gives different levels of artistic freedom to match the <jane> style based on the number index.
However, "Apodidae Ploceidae" (two names of real bird families) indeed gives 10/10 birds.
Therefore, one possible explanation is that our gibberish tokens are mashups of parts of real words. This seems reasonable.
It is interesting that DALLE-2 generates those mashups.
(6/N)
We want to emphasize that this is an adversarial attack and hence does not need to work all the time.
If a system behaves in an unpredictable way, even if that happens 1/10 times, that is still a massive security and interpretability issue, worth understanding. (10/N, N=10).
Is it possible to reconstruct 3-D geometry of a face from a single photo?
This requires solving an inverse problem for a NerfGAN, but previous methods create artifacts as shown.
During my Google internship, we developed a new method to solve this problem.
A thread 🧵(1/n)
Excited to announce our paper: Your Local GAN.
Paper:
Code:
We obtain a 14.53% FID improvement over SAGAN on ImageNet by changing only the attention layer.
We introduce a new sparse attention layer with 2-D locality. Thread: 1/n
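The idea of 2-D locality can be sketched with a toy mask (an illustrative numpy sketch, not the paper's actual layer; `local_2d_mask` and the Chebyshev-radius neighborhood are my own simplifications):

```python
import numpy as np

def local_2d_mask(h, w, radius=1):
    """Boolean attention mask for an h*w image grid: token i may attend
    to token j only if their pixels are within Chebyshev distance
    `radius` in 2-D (unlike 1-D banded masks, which ignore row wrapping)."""
    ys, xs = np.divmod(np.arange(h * w), w)    # 2-D coords of each token
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    return (dy <= radius) & (dx <= radius)

mask = local_2d_mask(4, 4)   # 16 tokens for a 4x4 grid
```

Each attention score outside the mask would be set to -inf before the softmax, so the layer's cost scales with the neighborhood size rather than the full token count.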
New paper: Your Local GAN: a new layer of two-dimensional sparse attention and a new generative model. Also progress on inverting GANs which may be useful for inverse problems.
with
@giannis_daras
from NTUA and
@gstsdn
@Han_Zhang_
from
@googleai
Does having a better generator always lead to better priors for inverse problems? (hint: no!)
Diffusion models trained with only corrupted data can outperform models trained on clean data for several image restoration tasks🤯
Here is the story behind our new paper Ambient DPS👇
An update on the hidden vocabulary of DALLE-2.
While a lot of the feedback we received was constructive, some of the comments need to be addressed.
A thread, with some new gibberish text and some discussion 🧵 (1/N)
@BarneyFlames,
@mattgroh
pointed out that "Apoploe", our gibberish word for birds, has a similar BPE encoding to "Apodidae".
Interestingly, "Apodidae" produces ~1/10 birds (but many flying insects), while our gibberish "Apoploe" gives 10/10.
(5/N)
@benjamin_hilton
said that we got lucky with the whales example.
We found another similar example.
"Two men talking about soccer, with subtitles" gives the word "tiboer". This seems to give sports in ~4/10 images. (2/N)
Ambient Diffusion got accepted to NeurIPS 2023 🥳
Useful for training/finetuning generative models in applications where access to uncorrupted data is expensive or undesirable (because of memorization).
Very excited about this research direction.
See you all in New Orleans! 🎷
A few people, including
@realmeatyhuman,
asked whether our method works beyond natural images (of birds, etc.).
Yes, we found some examples that seem statistically significant.
E.g. "doitcdces" seems related (~4/10 images) to students (or learning). (3/N)
Stable Diffusion XL and other state-of-the-art models memorize examples from their training sets.
We discover that SDXL can reconstruct images from LAION even when whole faces or objects are missing.
Row 1: Images from LAION, Row 2: Masked Input to SDXL, Row 3: Reconstruction
Our hidden vocabulary seems robust in easy and sometimes neutral prompts but not in hard ones.
These tokens may produce low confidence in the generator, and small perturbations may move it in random directions.
"vicootes" means vegetables in some contexts and not in others. (9/N)
Excited to be at NeurIPS 2023, presenting some papers we have been working on over the last few months 🎯
The first work is Consistent Diffusion Models 😊
Excited to announce our
#NeurIPS2020
paper:
SMYRF: Efficient Attention using Asymmetric Clustering.
Paper:
Code:
We propose a novel way to approximate *pre-trained* attention layers or train from scratch.
New ICML paper: Score-Guided Intermediate Layer Optimization (SGILO).
We train diffusion models on the latent space of StyleGAN and we show provable mixing of Langevin Dynamics for random generators.
Reconstructions for *extremely sparse* (<1%) measurements.
A thread🧵(1/N)
Our gibberish tokens might have many meanings.
@benjamin_hilton
ran "Contarra ccetnxniams luryca tanniounons" and pointed out that not all results are bugs. Indeed, our gibberish text produces a statistically significant fraction, but rarely a 100% match to the target concept. (7/N)
Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data.
Accepted to ICML 2024 🥳
Come and meet me in Vienna to learn about how to train/finetune diffusion models with noisy data 🚧
Our gibberish tokens have varying degrees of robustness when combined with different contexts.
E.g. if xx produces birds, ‘xx flying’ is an easy prompt
‘xx on a table’ is a neutral prompt, and ‘xx in space’ is a hard prompt.
(8/N)
This is me (not) preparing hard for our ICML poster session this week in Hawaii 🌴
Can’t wait for the conference this week.
As always, please reach out to talk about diffusion models, inverse problems, surfing and other equally fun topics 🏄♂️
Slides from my talk at
#spaCyIRL
regarding sparse attention factorizations are available here: …
Thanks for the massive interest, paper and code are going to be released soon.
Slides partially describe joint work with:
@georgepar_91
@AlexGDimakis
@apotam
SDXL can further reconstruct training images given heavily noisy measurements.
The reconstruction task is not important.
What matters is that these models have memorized their training set.
@benjamin_hilton
Finally, as noted by many, this is far from a language. It lacks grammar, syntax, coherence and many other things. We changed the title to: "Discovering the Hidden Vocabulary of DALLE-2" and we made the limitations explicit in the paper. Thanks for all the feedback!
An alternative to SURE that learns the optimal denoiser with only noisy observations can be found in our ICML paper: Consistent Diffusion Meets Tweedie
Bonus: it doesn't require computing the divergence and it is very similar to DSM.
The idea is closely related to Noise2Noise.
Stein's unbiased risk estimate (SURE) is an almost magical formula that enables the computation of the mean squared error of a denoiser (used, for example, in denoising score matching) using only the noisy observation y, without requiring the clean data x.
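For intuition, SURE can be sketched in a few lines of numpy. The divergence term is estimated with random probes (Hutchinson-style), and `denoiser` is a toy linear shrinkage rather than a real network; this is a hedged illustration, not any paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 0.5
x = rng.normal(size=n)                    # clean signal (unknown in practice)
y = x + sigma * rng.normal(size=n)        # noisy observation

def denoiser(z, a=0.8):
    # toy linear shrinkage denoiser (its divergence is exactly a * n)
    return a * z

def sure(y, f, sigma, eps=1e-3, probes=10):
    """Monte Carlo SURE: estimates ||f(y) - x||^2 from y alone.
    The divergence is approximated with Hutchinson-style random probes."""
    fy = f(y)
    div = np.mean([v @ (f(y + eps * v) - fy) / eps
                   for v in rng.normal(size=(probes, y.size))])
    return np.sum((fy - y) ** 2) - y.size * sigma**2 + 2 * sigma**2 * div

est = sure(y, denoiser, sigma)
true_err = np.sum((denoiser(y) - x) ** 2)  # needs x; only for sanity-checking
```

The estimate tracks the true squared error without ever touching `x`, which is exactly why SURE is useful for training denoisers on noisy data alone.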
📢: Tomorrow, at 12:30 Central Time, I am giving a talk at UW-Madison.
I will present two accepted papers at NeurIPS 2023 🥳: Consistent Diffusion Models (not to be confused with Consistency Models🤷♂️) and Ambient Diffusion.
Feel free to join us remotely or in person 👇
Introducing CommonPool, the largest collection of image-text pairs, 2.5x the size of LAION.
A 1.4B subset of our pool outcompetes compute-matched CLIP models from OpenAI and LAION.
DataComp, a new benchmark for multimodal datasets, is here!
Introducing DataComp, a new benchmark for multimodal datasets!
We release 12.8B image-text pairs, 300+ experiments and a 1.4B subset that outcompetes compute-matched CLIP runs from OpenAI & LAION
📜
🖥️
🌐
We open-source our code to enable further research in this area.
We are excited to see how this work is going to be used to mitigate memorization and in applications where data is inherently noisy.
If you have ideas, ping me, would love to collab!
Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration
abs:
Inversion by Direct Iteration (InDI) is a new formulation for supervised image restoration that avoids the so-called “regression to the mean” effect and
Cognitive science research indicates that bilingualism reduces the rate of cognitive decline.
Does this happen in neural networks too?
We train monolingual and bilingual GPT models and we show that the bilingual's performance decays slower under various weight corruptions.
Human bilinguals are more robust to dementia and cognitive decline. In our recent NeurIPS paper we show that bilingual GPT models are also more robust to structural damage in their neuron weights.
Further, we develop a theory… (1/n)
I was training a PyTorch model on multiple GPUs and ran out of memory because the loss was computed on a single GPU. This amazing gist written by
@Thom_Wolf
is a nice and clean workaround. Check it out, if you haven't already.
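The trick, as I understand it, is to compute the loss inside `forward()` so each replica reduces its own chunk instead of gathering all logits on one device. A minimal sketch (my own paraphrase, not the gist's exact code; `ModelWithLoss` is a hypothetical name):

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Wrap a model so the loss is computed inside forward(). Under
    nn.DataParallel, forward runs on each GPU, so only a scalar loss per
    replica is gathered on the default device, not the full logits."""
    def __init__(self, model, loss_fn):
        super().__init__()
        self.model = model
        self.loss_fn = loss_fn

    def forward(self, x, target):
        logits = self.model(x)
        # return a 1-element tensor so DataParallel can gather the
        # per-replica losses along dim 0
        return self.loss_fn(logits, target).unsqueeze(0)

model = nn.Linear(10, 3)
wrapped = ModelWithLoss(model, nn.CrossEntropyLoss())
if torch.cuda.device_count() > 1:
    wrapped = nn.DataParallel(wrapped)

x = torch.randn(8, 10)
y = torch.randint(0, 3, (8,))
loss = wrapped(x, y).mean()  # average the gathered per-replica losses
loss.backward()
```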
Please check our preprint (work in progress):
This work is part of my internship at Google this summer, with amazing collaborators.
@2ptmvd
@docmilanfar
@AlexGDimakis
+ Hossein Talebi, thank you for this opportunity!
Happening now, in person! My Ph.D advisor,
@AlexGDimakis
, turns Prof.
@StefanoErmon
into a frog using our algorithm, Intermediate Layer Optimization. Many interesting points on fairness and modularity of algorithms that use deep generative models.
Great job
@mengweir
on your paper and poster - your hard work really paid off - Congrats!
+ thanks to the very capable social media chair
@CSProfKGD
for the great photo
We can apply our method to learn to represent any concept, given only a few images.
Here is an example of generating Grand Theft Auto (GTA) artwork at different resolutions.
The GTA artwork concept was learned with only 4 input images.
This memorization behavior has led to a series of lawsuits against the research labs that developed these models.
We develop the first framework to train/finetune diffusion models with noisy data.
The model generates high-quality images without ever seeing a clean image 🤯
My 2020: Graduated from
@ntua
, started a Ph.D. at
@UTCompSci
working with
@AlexGDimakis
, moved from Greece to the US, got my first papers accepted at
@CVPR
,
@NeurIPSConf
, got an exciting internship offer for summer 21, and created wonderful memories with friends & family.
Deterministic diffusion samplers (e.g. DDIM) can efficiently sample from any distribution given an estimation of the underlying score function!
We also show how to extend the DDIM idea to any diffusion (linear or non-linear), similar to Soft/Cold Diffusion samplers.
To appear at ICML ’23! We obtain non-asymptotic convergence bounds for *deterministic* diffusion model samplers, as well as a new operational interpretation for the probability flow ODE 🏖 1/7
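For intuition, a deterministic sampler can be sketched on toy 1-D Gaussian data, where the score is known in closed form. This is an illustrative Euler discretization of the probability-flow ODE under variance-exploding noising, not the DDIM algorithm or the paper's sampler:

```python
import numpy as np

rng = np.random.default_rng(1)

def score(x, sigma):
    # analytic score of N(0, 1) data noised to level sigma:
    # p_sigma = N(0, 1 + sigma^2)
    return -x / (1.0 + sigma**2)

def deterministic_sample(n=5000, sigma_max=20.0, steps=300):
    """Euler discretization of the probability-flow ODE
    dx/dsigma = -sigma * score(x, sigma), run from sigma_max down to ~0."""
    sigmas = np.linspace(sigma_max, 1e-3, steps)
    x = rng.normal(size=n) * np.sqrt(1.0 + sigma_max**2)  # pure-noise init
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d = -s_cur * score(x, s_cur)       # ODE drift at the current level
        x = x + d * (s_next - s_cur)       # Euler step (s_next < s_cur)
    return x

samples = deterministic_sample()           # should be ~ N(0, 1)
```

No noise is injected during sampling: the same initial noise always maps to the same sample, which is the defining property of these deterministic samplers.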
Thrilled to share with you that our paper, Your Local GAN, got accepted in
@CVPR
! I feel grateful that my first ever paper as an undergrad of
@ntua
got accepted in such a conference. This work is the result of an awesome collaboration with
@AlexGDimakis
,
@gstsdn
@Han_Zhang_
.
@benjamin_hilton
I think there are three concerns in this thread: 1) gibberish texts don't have 1-1 mappings with English texts, 2) the meaning of gibberish texts changes when the context changes, and 3) the attack method doesn't always work. (1/N)
When I joined the lab, I asked
@AlexGDimakis
to name a few of his past Ph.D. students who impressed him the most.
I won't disclose the full answer, but I will say one thing:
@DimitrisPapail
was among the top in this (short) list.
So, thank you
@DimitrisPapail
, means a lot!
There's a distinct sense of pride when your academic siblings thrive during their PhDs and beyond. Although you're not directly involved, you still feel incredibly proud of their successes. Hook 'em horns
Recipe to finetune without memorizing:
1) Take your dataset and encode it to latent space using SDXL.
2) Add (a lot of) noise to the latents.
3) Use our training objective to fine-tune your diffusion model.
As you increase the dataset noise, memorization gets reduced.
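A rough sketch of the recipe's shape, with everything labeled as a stand-in: `encode` mimics the SDXL VAE encoder with a random projection, and `ambient_loss` is only a Noise2Noise-style placeholder, NOT our actual training objective:

```python
import numpy as np

rng = np.random.default_rng(2)

def encode(images):
    # hypothetical stand-in for the SDXL VAE encoder:
    # a fixed random projection to a small latent space
    proj = rng.normal(size=(images.shape[1], 4)) / np.sqrt(images.shape[1])
    return images @ proj

def ambient_loss(denoiser, noisy_latents, sigma_data, sigma_train):
    # placeholder objective: predict the observed noisy latents from an
    # even noisier version, so training never touches a clean image
    extra = np.sqrt(sigma_train**2 - sigma_data**2)
    noisier = noisy_latents + extra * rng.normal(size=noisy_latents.shape)
    return np.mean((denoiser(noisier) - noisy_latents) ** 2)

images = rng.normal(size=(16, 64))                    # toy "dataset"
latents = encode(images)                              # step 1: encode
sigma_data = 0.5                                      # step 2: noise level
noisy = latents + sigma_data * rng.normal(size=latents.shape)
# step 3: fine-tune with the objective at sigma_train > sigma_data
loss = ambient_loss(lambda z: 0.8 * z, noisy, sigma_data, sigma_train=1.0)
```

Raising `sigma_data` corrupts the dataset more heavily, which is the knob that trades fidelity for reduced memorization.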
@rctatman
The criticism here is very fair. We changed the "Secret Language" to "Hidden Vocabulary" in the title and we added a section on Limitations in our paper. Thanks for the constructive feedback!
Very cool work ✨
Proposes an alternative to Ambient Diffusion for training diffusion models from corrupted data based on EM.
Awesome to see more work in this space. If you are interested in diffusion models and inverse problems, I invite you to think about this problem.
📢 Are you interested in diffusion models and inverse problems? Check out our new preprint "Learning Diffusion Priors from Observations by Expectation Maximization" with
@g4ndry
, François Lanusse and
@glouppe
! 🧵
Tomorrow (Friday, May 5), at 12pm PT, I am giving a talk at the Grundfest Memorial Lecture Series.
I will talk about recent work in the intersection of Generative Models and Computational Imaging.
Join us (online) to hear about diffusion models for and from inverse problems!
Giannis Daras
@giannis_daras
discusses "Generative Models for Reconstruction, Art and Things in Between: A short introduction to Intermediate Layer Optimization" FRIDAY, 4/22, at the
@UTAustin
Machine Learning Lab Research Symposium. Register today:
Our method uses a double application of Tweedie's formula and a consistency loss that allows us to extend sampling to noise levels below the observed data noise.
This is the first method that trains exact models using noisy data, solving an open problem in this space.
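For intuition, here is a single application of Tweedie's formula on toy Gaussian data, where the score is analytic (an illustrative sketch, not our training method):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.7
x = rng.normal(size=100_000)              # clean data ~ N(0, 1)
y = x + sigma * rng.normal(size=x.size)   # noisy observations

def tweedie_denoise(y, sigma):
    # Tweedie's formula: E[x | y] = y + sigma^2 * d/dy log p(y).
    # For N(0, 1) data, p(y) = N(0, 1 + sigma^2), so the score is analytic.
    score = -y / (1.0 + sigma**2)
    return y + sigma**2 * score           # = y / (1 + sigma^2)

x_hat = tweedie_denoise(y, sigma)
mse_tweedie = np.mean((x_hat - x) ** 2)   # matches sigma^2 / (1 + sigma^2)
mse_raw = np.mean((y - x) ** 2)           # ~ sigma^2
```

The formula recovers the exact posterior mean from the noisy observation and the score alone, which is what lets diffusion models act as optimal denoisers.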
@Plinz
I agree with this thread. I don't believe there is anything "cryptic", probably the word "secret" in the title is more clickbait than it should have been. That said, the realization that "random" strings map to consistent visual concepts creates many security challenges.
Exciting personal news: Today is the first day of my Research Internship at
@Google
! I will be working with Abhishek Kumar (
@studentofml
), Vincent Chu and Dmitry Lagun (
@DmitryLagun
) on NERF-related research ideas.
@googlestudents
I will be in Vienna for ICML 2024✈️.
Please reach out if you are there and you would like to talk 💬
I am excited about generative models and particularly learning from corrupted data, memorization, and inverse problems.
@benjamin_hilton
1) Indeed, a gibberish text can mean more than one thing. But this is also true for words in English (homonyms). Also, DALLE-2 text might resemble clusters of things. For example, we found that the word "comafuruder" has something to do with hospitals/doctors/illness.
This survey is a collab between UT Austin, Google, KAIST, and Sony.
It's here for you, just in time for your ICLR submissions:
@article{diffusion_survey,
  title={A Survey on Diffusion Models for Inverse Problems},
  url={},
}
Amazed by the simplicity and the elegance of the code from the paper: "Elucidating the Design Space of Diffusion-Based Generative Models".
Extremely easy to experiment with different ideas for sampling from diffusion models.
A side benefit of our approach: significant computational benefits.
Deblurring (with little noise) seems to be a more efficient operation compared to denoising for image generation.
One of the greatest things I earned by doing my Ph.D. in the US is the ability to work with super-talented people (incl. professors, students, and industry researchers).
One lesson I learned is that even the most incredible researchers might be wrong about how certain things
@awjuliani
good question, we don't know, but it would be interesting to explore! As many mentioned, we discovered that there is a gibberish vocabulary, not a gibberish language. We currently have no evidence proving (or disproving) that there is some sort of syntax/grammar.
Really excited about NeurIPS next week!
I will be in New Orleans from Nov. 27 to Dec. 4.
If you want to chat about generative models, diffusion, sampling, inverse problems, or any other cool related research topic, please reach out!
The first family of methods we identify proposes closed-form approximations for the measurement score.
These approximations are not always made explicit.
We did our best to write them all in a unified mathematical way.
@benjamin_hilton
3) The attack method doesn't always work. This is true -- we did a couple of runs to get this working. However, it works *sometimes*, and it is interesting that the model is revealing its adversarial examples. Another example: "Two men talking about soccer, with subtitles".
@ArthurB
I see some consistency in the generated outputs. It seems to me entirely possible that Midjourney has its own vocabulary - a set of words that seem random to humans but are consistently mapped to visual concepts. Let us know if you find any!
By deleting several other letters, we get an even weirder result. Usually DALL·E delivers crisp images with perfect composition; here we see birds being masked out.
Ingredient 2: Momentum Sampling
We show that the choice of sampler has a dramatic effect on the quality of the generated samples.
We propose Momentum Sampler, a novel sampling scheme to reverse general linear corruption processes, inspired by momentum methods in optimization.
New paper!🤗
Do all your samples from Stable Diffusion or Dall-E look very similar to each other? It turns out IID sampling is to blame! We study this problem and propose Particle Guidance, a technique to obtain diverse samples that can be readily applied to your diffusion model!
These issues motivate training with corrupted samples: more data & less memorization of the training set.
But is it possible to train diffusion models that generate clean images without ever seeing one? 🤔
Our framework, Ambient Diffusion (NeurIPS 2023), solves this problem.
@benjamin_hilton
2) Yes, gibberish text changes meaning based on the context (but not always). I do not yet understand when/why this is happening -- but I think it is worth exploring.
Exciting first day at the IFML (
@MLFoundations
) GenAI workshop today!
Some interesting discussions about open science and about how far the capabilities of GPT-N might go.
Cool use of video inits using
#DeforumDiffusion
/
#stablediffusion
from u/EsdricoXD on Reddit.
prompt: A film still of lalaland, artwork by studio ghibli, makoto shinkai, pixv
sampler: euler ancestral
Steps: 45
scale: 14
strength: 0.55
Coherent and really Effective 🔥
To learn more, refer to our paper:
We also open-source our models and code:
Joint work with amazing collaborators:
@AlexGDimakis
, Adam Klivans,
@YuvalDagan3
, Kulin Shah, and Aravind Gollakota.
Research paper:
This is a work that I have been working on for a year under the supervision of two wonderful mentors: Alex Dimakis (
@AlexGDimakis
) and Constantinos Daskalakis (
@KonstDaskalakis
).
A couple of people reached out about this survey and provided great feedback. Thanks for that, you are helping us make it better!
I realized my DMs were closed, so if you couldn't DM me before, please try again or shoot me an email.