Can you make a jigsaw puzzle with two different solutions? Or an image that changes appearance when flipped?
We can do that, and a lot more, by using diffusion models to generate optical illusions!
Continue reading for more illusions and method details 🧵
What do you see in these images?
These are called hybrid images, originally proposed by Aude Oliva et al. They change appearance depending on size or viewing distance, and are just one kind of perceptual illusion that our method, Factorized Diffusion, can make.
I'm at CVPR presenting "Visual Anagrams" on
- Tuesday: 10am, Poster #429
- Friday: Oral 6B @ 1pm, Poster #118 (pm)
Let me know if you want to chat!
Also, we manufactured a bunch of these "jigsaws with two solutions." If you want one, just hunt me down in the conference hall :)
Can we use motion to prompt diffusion models?
Our #ICLR2024 paper does just that. We propose Motion Guidance, a technique that allows users to edit an image by specifying “where things should move.”
See our website, paper, and code for more details (and more illusions)!
Website:
arXiv:
Code:
Colab notebook:
Big thanks to my collaborators @invernopark and @andrewhowens!
This is an image of Corgis, but when played as a spectrogram it sounds like dogs barking!
Really thankful I got the chance to work on this super fun project with first author @CzyangChen. Check out his thread for many more examples, and to see how they're made!
These spectrograms look like images, but can also be played as a sound! We call these images that sound.
How do we make them?
Look and listen below to find out, and to see more examples!
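Roughly, the recipe is to denoise a single spectrogram-shaped canvas with two pretrained diffusion models at once, one scoring it as an image and one scoring it as audio, and mix their noise estimates. A minimal sketch with placeholder names (not the actual implementation, which needs the two models to share a representation, e.g. a common latent space):

```python
def images_that_sound_step(x_t, image_prompt, audio_prompt,
                           image_denoiser, audio_denoiser, t, w=0.5):
    """Sketch: denoise one canvas with an image model and an audio model.

    `image_denoiser` and `audio_denoiser` are placeholders for two pretrained
    diffusion models that both operate on the same spectrogram-shaped input.
    """
    eps_img = image_denoiser(x_t, image_prompt, t)   # "looks like the image prompt"
    eps_aud = audio_denoiser(x_t, audio_prompt, t)   # "sounds like the audio prompt"
    return w * eps_img + (1 - w) * eps_aud           # mix the two estimates
```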
We can also make these images that change when viewed in grayscale. Since the human eye can't see color under dim lighting, there is actually a physical mechanism for this illusion: these images change appearance when taken from a bright room to a dim one!
Most orthogonal transformations on an image are pretty meaningless, but luckily permutations are a subset of these transformations. This is where the idea of a “visual anagram” comes from: images that change appearance under arbitrary permutations of their pixels!
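Roughly, a single denoising step then looks something like this (a minimal sketch with hypothetical function names, not the actual codebase): apply each view to the noisy image, predict noise under that view's prompt, undo the view, and average.

```python
import torch

def anagram_noise_estimate(x_t, prompts, views, inverse_views, denoiser, t):
    """Sketch: combine noise estimates across views.

    Placeholders:
    - denoiser(x, prompt, t)       -> the model's noise prediction
    - views[i] / inverse_views[i]  -> an orthogonal pixel transformation and
      its inverse (e.g. a flip, rotation, or jigsaw-piece permutation)
    """
    estimates = []
    for prompt, view, inv_view in zip(prompts, views, inverse_views):
        eps = denoiser(view(x_t), prompt, t)   # denoise the transformed image
        estimates.append(inv_view(eps))        # map the estimate back
    return torch.stack(estimates).mean(dim=0)  # average in the original frame
```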
But there’s a catch! We found that not every view would work. The view needs to satisfy two conditions. The first is linearity, which ensures the transformed image is the correct mix of noise and signal:
The second condition we call “statistical consistency.” The transformed noise needs to be iid Gaussian, as that’s the assumption in diffusion. It turns out this is only possible if your transformation is orthogonal.
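Concretely, in standard DDPM notation (my shorthand here, not a quote from the paper): a noisy image is x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε with ε ~ N(0, I). For a linear view v, v(x_t) = √(ᾱ_t)·v(x_0) + √(1 − ᾱ_t)·v(ε), so the transformed image is still the correct mix of signal and noise. And if v is given by an orthogonal matrix A, then Aε ~ N(0, AAᵀ) = N(0, I), so the transformed noise is still iid Gaussian, exactly what the denoiser expects.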
And by extracting low frequencies from a real image, and generating the missing high frequencies with our method, we can make hybrid images from real images. In effect we are solving a (noiseless) inverse problem. Anyways, here's Thomas Edison turning into a lightbulb:
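A hedged sketch of that projection step (hypothetical names; `lowpass` stands in for any blur-style low-pass operator):

```python
def keep_real_low_frequencies(x0_estimate, real_image, lowpass):
    """Sketch: fix the low frequencies to a real image, generate only the rest.

    At each denoising step, the low-frequency component of the model's current
    clean-image estimate is swapped for the low frequencies of the real image,
    so only the missing high frequencies are actually generated.
    """
    return lowpass(real_image) + (x0_estimate - lowpass(x0_estimate))
```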
Our method works by decomposing an image into a sum of components. For example, into high and low frequencies, or into grayscale and color components. We then use a diffusion model to control each of these components individually, in a zero-shot manner.
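For the hybrid-image case, a heavily simplified sketch (hypothetical function names, not the paper's actual code): get a noise estimate for each prompt, then take the low frequencies from one and the high frequencies from the other.

```python
import torch.nn.functional as F

def lowpass(x, k=16):
    """Hypothetical low-pass filter: a stride-1 box blur (a Gaussian also works)."""
    return F.avg_pool2d(x, kernel_size=2 * k + 1, stride=1, padding=k)

def hybrid_noise_estimate(x_t, prompt_far, prompt_near, denoiser, t):
    """Sketch: control low and high frequencies with different prompts.

    `denoiser(x, prompt, t)` is a placeholder for a pretrained text-conditional
    diffusion model's noise prediction. The combined estimate takes its low
    frequencies from one prompt (what you see from far away) and its high
    frequencies from the other (what you see up close).
    """
    eps_far = denoiser(x_t, prompt_far, t)
    eps_near = denoiser(x_t, prompt_near, t)
    return lowpass(eps_far) + (eps_near - lowpass(eps_near))
```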
I had a wonderful time working w/ @ZhaoyingPan and @andrewhowens on our #NeurIPS2023 paper "Self-Supervised Motion Magnification." We propose a simple method for magnifying tiny motions in video, and also show some neat tricks like magnification targeting and test-time adaptation.
Our work is inspired by/related to DragGAN (@XingangP), DragonDiffusion (Chong Mou), and DragDiffusion (@YujunPeiyangShi). The technique from Universal Guided Diffusion (@arpitbansal297) is also quite important for our method to work.
Big thanks to @andrewhowens for advising me on this project. Please check out the links for more info and results!
website:
arXiv:
code:
visualization code:
Finally, using our method with certain decompositions reduces (roughly!) to prior work on spatial or compositional control of diffusion models. Details are in the paper.
@HaareBlond
Thanks! You may be interested in our recent work, led by @CzyangChen, which does weird but really, really cool things to sound and spectrograms.
Diffusion models have amazing image creation abilities. But how well does their generative knowledge transfer to discriminative tasks?
We present Diffusion Classifier: strong classification results with pretrained conditional diffusion models, *with no additional training*!
1/9
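The rough idea (a simplified sketch with placeholder names, not the exact procedure from the paper): score each candidate class by how well the class-conditioned model predicts the noise added to the image, and pick the class with the lowest error.

```python
import torch

def diffusion_classifier(x0, class_prompts, denoiser, add_noise, n_trials=64):
    """Sketch: zero-shot classification with a conditional diffusion model.

    Placeholders:
    - denoiser(x_t, prompt, t) -> predicted noise
    - add_noise(x0, eps, t)    -> the noisy image x_t at timestep t
    The class whose conditioning best explains the added noise wins.
    """
    errors = torch.zeros(len(class_prompts))
    for _ in range(n_trials):
        t = torch.randint(1, 1000, (1,)).item()  # random timestep
        eps = torch.randn_like(x0)               # random noise
        x_t = add_noise(x0, eps, t)
        for i, prompt in enumerate(class_prompts):
            pred = denoiser(x_t, prompt, t)
            errors[i] += ((pred - eps) ** 2).mean().item()
    return int(errors.argmin())                  # lowest error = predicted class
```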
Our method requires no finetuning, works on real images, and enables fine-grained editing of images with pretty complex motion. Here, we visualize the optical flow, and corresponding points between the original image and the “motion edited” image.
We can also extract motion from an existing video, and apply that motion to images. Here we take the spinning of the earth, and use it to rotate various animal faces.
@danbgoldman @andrewhowens @invernopark
Hi Dan, we were thinking of trying to print more. I'll add your name to a list of people who want one and I'll let you know if we figure it out. (Big fan of your work btw!)
Our method also has limitations, such as (a) failures on OOD flow fields, (b) potential identity loss, and (c) occasional convergence issues. It is also slow to sample from. We hope future work can help alleviate these issues.
@_jasonliu_
This is a cool idea! We were thinking that these images could be a form of steganography. Like, you're a spy and a message only appears when you look at the photo in dim lighting. It could also act as really lossy compression, but I think there are probably more practical methods.
We achieve this by doing diffusion guidance through an off-the-shelf optical flow network. Our proposed guidance loss encourages the edited image to have the user-specified motion w.r.t. the source image, as estimated by the flow network.
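In pseudocode, the core of that loss might look something like this (a hedged sketch with placeholder names; the actual method has additional terms and tricks):

```python
def motion_guidance_loss(src_image, edited_estimate, target_flow, flow_net):
    """Sketch of a flow-matching guidance loss.

    `flow_net(a, b)` is a placeholder for an off-the-shelf optical flow network
    (RAFT-like) that returns the flow from image a to image b. The loss asks
    the edited image to move the way the user specified.
    """
    pred_flow = flow_net(src_image, edited_estimate)
    return (pred_flow - target_flow).abs().mean()

# During sampling, the gradient of this loss w.r.t. the current clean-image
# estimate nudges each denoising step (classifier-guidance style), so the
# diffusion model itself never needs finetuning.
```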
We also wrote a simple GUI to make these dense motion fields. By just clicking and dragging, a user can segment out an object with SAM and create complex flow fields.
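For instance, a toy sketch (not the actual GUI code): given a SAM-style mask and a single drag vector, one can build a dense flow field that moves just the selected object.

```python
import numpy as np

def drag_to_flow(mask, dx, dy):
    """Toy sketch: turn a segmentation mask plus a drag (dx, dy) into dense flow.

    mask: (H, W) boolean array, e.g. from SAM
    returns: (H, W, 2) flow field that moves the masked object by (dx, dy)
             and leaves the background untouched.
    """
    flow = np.zeros((*mask.shape, 2), dtype=np.float32)
    flow[mask] = (dx, dy)
    return flow
```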
@eerac
This might work, you could try it out! I think you would have to be careful with the noise though... An uninvertible transformation might mess up the iid Gaussian-ness of it
@NagabhushanSN95
It's related! I think it's more that high frequency components of the image go away when you downsample. You could check out the hybrid images paper if you want more details:
@HaareBlond @invernopark @andrewhowens
Yeah, like you said, latent diffusion doesn't work *well* (but it does kind of work). Audio is really interesting as well! We sort of lucked out tho, because the views that work with this method correspond to visually interpretable views. idk if the same would hold for audio