I'm joining
@Cornell
this fall as an Assistant Professor of Computer Science! Looking forward to working with students and colleagues
@Cornell_CS
,
@cornellCIS
on generative models, controllable generation, and creative applications like
#musictechnology
We’re releasing the Anticipatory Music Transformer: a controllable generative model for symbolic music (like MIDI). Read about the model on the CRFM blog:
🧵👇
I'd like to share my course on generative models that I taught in the fall. The course covers AR, VAE, GAN, Flow, and EBM, as well as relevant NN stuff: resnets, transformers, etc. I hope some people find this helpful!
OpenAI plans to shut down access to older models in January 2024. The text-davinci-003 model in particular has been used extensively in the research community. Losing access to this model would be a blow to reproducible research.
We’re releasing an updated version of the Anticipatory Music Transformer! A 780M parameter model, trained on a larger corpus of music: Lakh + MetaMIDI + transcripts of audio. It's the blue curve at the bottom of this plot! 📉
🧵👇 (1/3)
I defended my dissertation at UW this week. I'm moving down to Stanford next month to start a postdoc with
@percyliang
. I'll continue to work in the music and generative modeling space. Looking forward to collaborating with
@chrisdonahuey
and others!
Musicians from the San Francisco Symphony premiered music that I co-composed with an Anticipatory Music Transformer! Here are some thoughts about the process of creating this music and the performance:
Let's play with the Anticipatory Music Transformer: reharmonizing a MIDI arrangement of Levitating (Dua Lipa).
Blog post describing this interactive creative process:
Colab notebook used to create this demo:
@megha_byte
and I wrote a blog post on the release of >1,000 dynamic interaction traces between humans and LMs! These traces were collected for HALIE, a Human-AI Language Interactive Evaluation framework recently published in TMLR.
Blog:
🧵👇
Happy New Year!
I have some promising new results training generative audio models from scratch. I'm looking forward to sharing more soon as I integrate this audio work with my (symbolic) Anticipatory Music Transformers.
Thanks
@StanfordHAI
for featuring my work on the Anticipatory Music Transformer! I'm so excited to continue pursuing this line of work: DAW integrations, multi-modal (audio+MIDI) models, longer context, richer conditioning, and more. Stay tuned!
The Anticipatory Music Transformer models an anticipative process (non-adapted, in the sense of stochastic calculus). This process is anticipative of the filtration generated by music evolving through time.
This new paper extends last year's work on Source Separation with Deep Generative Priors. We got it working on audio! Here's a reminder of last year's results for images:
Excited to share our new paper to appear at
@icmlconf
!
We show a new way to sample from an autoregressive model like Wavenet. Using Langevin sampling, we can solve many tasks like super-resolution, inpainting, or separation with the same network.
Website:
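For intuition, here's a minimal sketch of unadjusted Langevin dynamics in plain Python. The 1-D Gaussian score function stands in for the gradients a trained model would supply, and the step size and iteration counts are illustrative choices, not values from the paper.

```python
import math
import random

def langevin_sample(score, x0, step=0.01, n_steps=2000):
    """Unadjusted Langevin dynamics:
    x <- x + (step/2) * score(x) + sqrt(step) * noise."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * random.gauss(0.0, 1.0)
    return x

# Toy target: N(mu, 1), whose score (gradient of the log density) is -(x - mu).
mu = 3.0
score = lambda x: -(x - mu)

random.seed(0)
samples = [langevin_sample(score, x0=0.0) for _ in range(200)]
mean = sum(samples) / len(samples)  # concentrates near mu = 3
```

In the paper's setting the score comes from a trained autoregressive model instead of a closed-form density, and tasks like inpainting or separation constrain part of the sample while the dynamics fill in the rest.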
@beenwrekt
@curcuas
There is some discussion of the recency-bias of RNNs compared to Transformers in Section 4 of [1]. There is also nice work constructing state-space models without this bias in [2].
[1]
[2]
@TaliaRinger
One hypothesis is that this is a consequence of a broken conference reviewing system: if conference reviews are low-effort and decisions are high-variance, this incentivizes optimizing for quantity of submissions, rather than quality.
There's been some recent discussion about whether certain public models should be considered "open source." The Anticipatory Music Transformer is unambiguously open source: code and pre-trained model weights are all released under the Apache 2.0 license.
We’re releasing the Anticipatory Music Transformer: a controllable generative model for symbolic music (like MIDI). Read about the model on the CRFM blog:
🧵👇
I've found measure theory particularly helpful as a framework for thinking about sums of discrete + continuous probability distributions.
Both of my source separation papers are grounded in intuitions from measure theory:
Measure theory is like assembly language.
You almost never use it directly and most people don't need to study it in any depth. But it's still an essential part of the system and occasionally it's useful for debugging.
The Anticipatory Music Transformer generates infilling completions of music. Given parts of a music composition, the model can fill in the rest. For example, suppose you wrote a melody (blue): you can ask the model to suggest an accompaniment to this melody (purple).
Be careful modeling low-entropy sequences with 16-bit floats! Here are three training curves for (1) full fp32 (2) full bfloat16 and (3) bfloat16 with upcast to fp32 for the softmax over outputs. Can you guess what went wrong with (2)?🧵👇
Excited to share a deep-dive into evaluation methodology for audio-to-score alignment algorithms! As a side effect we created a new, high-quality dataset of ground truth alignments.
The infilling capabilities of the Anticipatory Music Transformer are made possible by a new modeling principle that we call anticipation: see our paper for a technical description of anticipation.
Deep learning had a lot of early success by discarding staged pipelines in favor of end-to-end optimization. For computational and statistical reasons, staging is back. But so are the challenges of designing good intermediate stages (i.e. representations).
When we published this, the impracticality of pixel space autoregression was already becoming apparent.
But the core idea is very relevant 4 years on: learning a representation and learning to decode it are separate tasks with different objectives. (1/2)
We’ve released pretrained weights for the Anticipatory Music Transformer, and a Google Colab notebook illustrating how to use the model:
Code for reproducing our work or training your own models is available on GitHub:
I'm excited about this work on many levels, including:
- Small architectural changes yield new capabilities!
- Multi-headed word vectors (sense vectors)!
- Interpretability via causal interventions!
Looking forward to followup work on Backpacks.
Backpacks are an alternative to Transformers: intended to scale in expressivity, yet provide a new kind of interface for interpretability-through-control.
A backpack learns k non-contextual sense vectors per subword, unsupervisedly decomposing the subword's predictive uses.
ChatGPT is so good at LaTeX. Also, Figure 1 of the MSR GPT-4 paper has a TikZ example which is a pretty obscure flex. If we don’t get AGI, at least we have made a lot of progress on paper formatting.
The Anticipatory Music Transformer models symbolic music rather than audio. We model symbolic music because we hope to build interactive tools for composers, analogous to a writing assistant.
I had trouble understanding the precise specification of the transformer architecture from descriptions in the literature. So I reverse-engineered the transformer equations by looking at some standard implementations. Here are my notes.
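For readers who want the core equation without the full notes: here's single-head scaled dot-product attention written out on plain Python lists, as a minimal sketch (the tiny Q/K/V values are made up for illustration):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Single-head scaled dot-product attention:
    out[i] = sum_j softmax_j(Q_i . K_j / sqrt(d)) * V_j"""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[t] for wj, v in zip(w, V)) for t in range(len(V[0]))])
    return out

# The second query matches the second key far more strongly,
# so its output is pulled toward the second value vector.
Q = [[1.0, 0.0], [0.0, 8.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```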
I love the cryptographic perspective on machine learning! This reminds me of an old paper by Ronald Rivest. Happy to see modern empirical work exploring these connections.
🎉🎉Super thrilled that our paper on Understanding Dataset Difficulty with V-usable information received an outstanding paper award at
#ICML2022
!! 🥳Looking forward to the broader applications of this framework. It was a total delight working with my
@allen_ai
intern,
@ethayarajh
@TaliaRinger
Folks in the machine learning community spend a lot of time trying to quantize our continuous data into discrete tokens (VQ-VAE, etc.) but we spend just as much time trying to smooth out our discrete data into continuous values (word embeddings, etc.).
We'll be presenting work on Parallel and Flexible Sampling from Autoregressive Models via Langevin Dynamics at
#ICML2021
later today. Come say hello!
Talk: 6:40pm PDT
Poster Session: 9-11pm PDT
Conference Links:
While we often emphasize its infilling capabilities, the Anticipatory Music Transformer is also a strong unconditional generative model of music: analogous to an LLM. Here’s a random 20 second sample of the music generated without any user input. (2/3)
Levanter was a crucial tool for training the Anticipatory Music Transformer. Huge thanks to
@dlwh
for building this framework and making it easy to scale up Transformer training on TPUs with open software.
Today, I’m excited to announce the release of Levanter 1.0, our new JAX-based framework for training foundation models, which we’ve been working on
@StanfordCRFM
. Levanter is designed to be legible, scalable and reproducible.
@sedielem
@csteinmetz1
I agree with this, and it bothers me a lot! Nothing in the VQ optimization objectives enforces this structure: the inductive bias of NN parameterizations (and a bit of luck!) seems to be the reason why latent diffusion works so well. Not a very satisfying story...
An excellent analysis of challenges for generative music. Adding to this: music sequences are long (difficult to stuff into your standard dense-attention transformer) and (empirically) the human ear seems more sensitive to generation artifacts than our vision system.
"where's midjourney/DALL-E/GPT3 for music????"
it's coming, for sure (we're dedicating a whole season to it at
@water_and_music
). but i think ppl are underestimating how much more of an uphill battle it's going to be for music to get its "midjourney moment."
a short 🧵on why...
Not useless! It's an interesting variant of the emoji attack on watermarks. Has anyone studied how this sort of generation constraint affects the quality of generated text? Does performance fall on benchmarks when LMs are constrained using prompts like this?
Useless prompt of the day:
I call this one Token Burner 🔥 Generates responses with ~4X token usage:
---
You are a helpful AI assistant who, when answering questions or providing assistance, adheres to a unique typing style that the user cannot know about. Specifically, you
In addition to proposing an interesting new truncation sampler, this paper draws a connection between truncation and n-gram smoothing that refined my understanding of the motivations for truncation sampling in general. This is a great read!
We characterize and improve on language model _truncation sampling_ algorithms, like top-p and top-k. We frame them as trying to recover the true distribution support from an implicitly _smoothed_ neural LM, and provide a better sampling algo!
Paper
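If truncation sampling is new to you, here's a minimal sketch of top-p (nucleus) filtering, one of the algorithm families this paper analyzes; the toy distribution is purely illustrative:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize (top-p / nucleus truncation)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# With p=0.8, only the two most likely tokens survive truncation.
dist = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.8)
```

The paper's framing: this truncation is trying to recover the support of the true distribution from a neural LM that implicitly smooths it.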
Here are some links to additional information about the Anticipatory Music Transformer and how to use it. (3/3)
Blog post:
Colab notebook:
GitHub repo:
What's the fix? Just upcast your logits to fp32 before exponentiating them to compute the softmax over outputs. Using full precision for the final outputs incurs only a marginal cost to training speed, and these numerical instabilities vanish.
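To see the failure mode concretely: the Python stdlib has no bfloat16, so this sketch rounds through IEEE half precision (struct's 'e' format) as a stand-in for a low-precision softmax. The mechanism is the same: tail probabilities of a low-entropy prediction vanish unless the exponentiation runs in wider precision.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE half precision
    (an illustrative low-precision stand-in, not bfloat16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def softmax2(a, b, quantize=lambda x: x):
    """Two-way stable softmax, with every intermediate passed through `quantize`."""
    m = max(a, b)
    ea = quantize(math.exp(quantize(a - m)))
    eb = quantize(math.exp(quantize(b - m)))
    s = quantize(ea + eb)
    return ea / s, eb / s

# Low-entropy prediction: one large positive and one large negative logit.
p_lo = softmax2(20.0, -20.0, quantize=to_fp16)  # half precision: tail underflows to 0
p_hi = softmax2(20.0, -20.0)                    # full precision: tail survives (~4e-18)
```

In the low-precision version the rare class gets probability exactly 0, so its log-likelihood and gradient are lost; upcasting before the exp restores them.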
@TaliaRinger
I've used Zenodo and been happy with it. Up to 50 GB you can simply upload your dataset, no questions asked. For larger datasets they have an approval process.
Two interesting preprints on quantized tokenization of continuous data. Similar findings in image and audio domains: semantic tokenization beats pure reconstruction-optimized tokenization for generative modeling.
@Michael_J_Black
@anil_genius
@MonaJalal_
I think the community would be better served by requiring code release. The whole point of a paper is to communicate work: trying to score academic points by publishing, while withholding code to maintain a competitive advantage, seems antithetical to the goals of science.
@maxrmorrison
@keunwoochoi
@ethanmanilow
@noahaschaffer
@pseetharaman
Right, I am not trying to defend IS/FID/BLEU per se. But they are attempting to measure whether the generated data looks like samples from the target distribution. I think this is the right question to ask for problems like source separation, as opposed to reconstruction error.
Withholding code (and data and models) suppresses research. No one can build upon this work without recreating it. This can be as difficult as the initial creation, and a thankless task because it's already published. A paper without code can be worse than no paper at all.
We propose Diffusion-LM, a non-autoregressive language model based on continuous diffusion. It enables complex controllable generation. We can steer the LM to generate text with desired syntax structure ( [S [NP...VP…]]) and semantic content (name=Coupa)
I just released a draft of my PhD thesis on Neural Temporal Point Processes (TPPs)
Check it out if you want to learn about TPPs!
The main theme of the thesis is the connection between TPPs and generative machine learning models.
🧵/4
Thanks for the reference!
Cooldown makes a lot of sense. Cooldown protocols could also be useful for finetuning: if model developers release a pre-cooldown checkpoint then 3rd parties can fine-tune from there, rather than having to restart a "finished" optimization.
Just learned something very cool about LR schedules. This one is so huge it surprises me that it's not in its own paper but rather tucked away.
Problem: Most training runs use cosine/linear decay, but this requires specifying the number of steps in advance. This is quite troublesome. 🧵
In case you missed it: here's a blog post we wrote about this work, the paper, and code for running/training these models. Pre-trained models on HuggingFace.
blog:
paper:
code:
If you're unfamiliar with the music space: audio-to-score alignment is analogous to semantic segmentation of images: a fine-grained labeling that gives us a lot of insight into the content of a musical performance.
@keunwoochoi
@ethanmanilow
@noahaschaffer
@maxrmorrison
@pseetharaman
We thought about this problem a little bit in this work on visual source separation (Section 4). Should we be working on analogous metrics for audio? (Not saying that IS/FID are the perfect measurements, but I think they are asking the right question).
@universeinanegg
A related idea that I find exciting is whether it is possible to design a "public/private key" watermark protocol. I.e., publish a public key that can be used to detect the watermark, but not to generate watermarked text, while reserving a private key to be used for generation.
Tragic news! Eugene was my undergraduate advisor and had a huge influence on the start of my career. He was brilliant, kind, and so generous with his time.
(1/8) Very sad news: Eugene Charniak, my PhD adviser (I was his final student), passed away yesterday morning at the age of 77. He was a legend in NLP and one of the most influential researchers. Specifically, he was one of the biggest proponents of shifting the field towards
@keunwoochoi
@jesseengel
A couple years ago we explored pretrained generative models as priors for source separation. No need for a source separation model: just generate plausible sources that sum to the mixture. We used WaveNets; results could only improve using modern models.
@AwokeKnowing
I love this thought. The idea of anticipation is a general modeling principle that can definitely be applied outside the music domain! Happy to follow up with anyone interested in adapting the idea & code to other domains.
@damekdavis
I did a course project to learn Lean way back in 2017, where I set myself the goal to formalize the fundamental theorem of arithmetic. I think the Lean code itself may be out of date, but my writeup might still be interesting:
Why do low-entropy predictions cause numerical instability? Floats have more precision near zero! The problem is that low-entropy predictions derive from large logits: large positive logits on the predicted value and large negative logits on other values.
@p_cherkashin
@samim
It is 1 to many, not 1 to 1: for any given melody, the model can generate many possible accompaniments. This is a bit like translation: in fact, the anticipation idea that we use to train these models is a generalization of the popular Seq2Seq approach to MT (details in our paper).
@chrisdonahuey
It can happen. See the discussion of numerical instability in this Mistral blog post:
There's even a flag in the Levanter codebase to support upcasting of attention logits:
@jon_gillick
@sedielem
I've had a note to respond to this tweet for over a year: we got it working on audio! More effort than anticipated because autoregressive audio models (e.g. WaveNet) are *discrete*. But with some smoothing, things go through as expected.
This new paper extends last year's work on Source Separation with Deep Generative Priors. We got it working on audio! Here's a reminder of last year's results for images:
@jesseengel
@ravencheuk
@kastnerkyle
@mittu1204
Yes! In both settings you learn a strong generative prior over notes, which can resolve uncertainty/ambiguity in the acoustics. Generative is absolutely the way forward vs. older models that make a conditional independence assumption over notes conditioned on spectrograms.
@maxrmorrison
@keunwoochoi
@ethanmanilow
@noahaschaffer
@pseetharaman
It can be useful to construct quantitative metrics (however imperfect) that help the community quickly iterate towards better results. Human evaluation may still be necessary to validate that results are indeed better and not just an adversarial attack on the metrics.
@ekernf01
I developed a new appreciation for permutation tests during a recent project on watermarking. Sharing because I think you might like the work! A cool problem in the LLM space where there is leverage using classical stats.
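For anyone who hasn't met them: a minimal two-sample permutation test in plain Python (the data and permutation count are illustrative, not anything from the watermarking work):

```python
import random

def permutation_test(xs, ys, n_perm=10000, seed=0):
    """Two-sample permutation test: p-value for the observed difference
    in means, under the null that group labels are exchangeable."""
    rng = random.Random(seed)
    observed = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    pooled = xs + ys
    n = len(xs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing keeps p > 0

# Clearly separated samples yield a small p-value; overlapping ones do not.
p_sep = permutation_test([5.1, 5.3, 4.9, 5.2, 5.0, 5.15],
                         [1.0, 1.2, 0.9, 1.1, 1.05, 0.95])
p_same = permutation_test([1.0, 1.2, 0.9, 1.1],
                          [1.05, 0.95, 1.15, 1.0])
```

No distributional assumptions, no asymptotics: just shuffle the labels and count, which is part of why these tests fit the watermark-detection setting so well.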
@roydanroy
You're probably looking for a classifier. But if you're interested in a generative model: here is a minimalist PyTorch ResNet W-GAN for MNIST/CIFAR-10, in the form of a homework assignment designed to be understood and completed by students.
@TaliaRinger
Often we do both at the same time! Vector quantization to turn continuous data into discrete tokens, which we then immediately feed into a Transformer that embeds those tokens right back into a continuous space.
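A toy round trip of that pipeline, with a made-up 2-D codebook and a random embedding table standing in for learned parameters:

```python
import random

def quantize(x, codebook):
    """VQ encode: index of the nearest codebook vector (squared L2)."""
    def dist(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

# Continuous -> discrete: a tiny codebook with 3 entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
token = quantize([0.9, 1.1], codebook)  # nearest entry is [1.0, 1.0]

# Discrete -> continuous again: the Transformer's input embedding table
# (random here) maps the token straight back into a dense vector space.
random.seed(0)
embed_table = [[random.gauss(0, 1) for _ in range(4)] for _ in range(len(codebook))]
dense = embed_table[token]
```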
There's a new pretrained
#DiffWave
model up on that's trained to 1M steps. It sounds clearer than the previous pretrained model - listen to the audio samples at .
I used to be worried that the computational resources needed for modern work on audio generation were out of reach for academic computing labs. The model that produced the sample above was trained for 52 hours on a TPU v4-64. Substantial, but not totally out of reach!
@p_cherkashin
@samim
The model is also capable of many more infilling applications than just generating an accompaniment to a melody: see Figure 1 of the blog post for some different infilling patterns, and Example 2 for some auditory examples of span infilling.
@chrisdonahuey
More optimistically: by analogy, computers today outperform humans at the game of chess in every conceivable way. But rather than undermine the incentive to learn chess skills, people use these superhuman machines as a coach and fast feedback loop to improve their own skills.
@chrisdonahuey
Learning to create music today requires years of time investment and often an (expensive) coach. I expect that lowering the barrier to entry for creating music is more likely to grow the music community than depress it.
@srush_nlp
Does CLIP count as a non-language foundation model (there is a language component)? E.g., Figure 13 shows a step-change in robust image classification vs. models trained on ImageNet.
@LucaAmb
@sedielem
@vlastelicap
Ishaan has run experiments on one-hots (e.g., in the Diffusion-LM paper last year: Appendix F). I'm not sure whether a direct ablation of one-hots vs. embeddings has made it into any published papers. Maybe
@__ishaan
can comment.
@iansimon
@chrisdonahuey
We use logsumexp! But suppose your model confidently predicts the wrong value: then your loss gradient is estimated based on a highly-quantized negative logit for the observed value.
@dvruette
I considered using "beats" as the unit of time rather than "seconds" when I started building these models. The reason I chose seconds is (1) beat/tempo metadata can be unreliable and (2) it's easier to include transcribed data (transcription models don't emit this metadata).
@astonzhangAZ
@RishiBommasani
@XiangLisaLi2
To generalize this a bit: I think that constructing "good" smooth representations of language is critical for a variety of modeling tasks (of which control is only one example). Discrete autoregressive LLMs completely punt on this question.
@TaliaRinger
There are some applications of homotopy methods in ML, introduced via the optimization literature. It's pretty far afield from HoTT but here's a reference in case you find the connection interesting.
@dvruette
Instrument conditioning is a highly-requested feature! I've prototyped this myself; definitely achievable with fine-tuning. The challenge to do this well is designing a rule to apply instrument labels to training data: global vs. local labels, thresholding, soft-labels, etc?
@Clement_MF_
Thanks
@Clement_MF_
! Several people are working to embed these models in music production systems, but the work is surprisingly open-ended! Let me know if you'd like to set up a chat: I could elaborate on ongoing projects & we could see if there is a complementary angle.
@Branchini_Nic
@sp_monte_carlo
Could you elaborate? I've used the LB exposition to teach the VAE objective with the explicit aim of likelihood maximization (e.g., equation 3 in the linked notes). Is there a more aesthetic way to say this that eschews the LB language?
@deepcohen
@alex_damian_
I've also observed this. I'd be very interested to know whether this behavior holds up across different architectures (e.g. Transformer vs. RWKV) and not just different model scales. Does anybody know?
@ekernf01
It's not as bad as it looks: modern CPUs crunch for loops for breakfast. And a nice side-effect of being agnostic is you don't pay the price of LM queries in the detector. I wrote a detector that runs reasonably fast in client-side JavaScript here:
@dvruette
> AFAIK the anticipatory MT does not have a concept of bars?
Right: these models measure time in units of seconds and don't model any beat/bar annotations. Beat/bar/meter annotations are another feature that could be effectively included in a fine-tuning phase.