I'm joining
@Cornell
this fall as an Assistant Professor of Computer Science! Looking forward to working with students and colleagues
@Cornell_CS
,
@cornellCIS
on generative models, controllable generation, and creative applications like
#musictechnology
We’re releasing the Anticipatory Music Transformer: a controllable generative model for symbolic music (like MIDI). Read about the model on the CRFM blog:
🧵👇
I'd like to share my course on generative models that I taught in the fall. The course covers AR, VAE, GAN, Flow, and EBM, as well as relevant NN stuff: resnets, transformers, etc. I hope some people find this helpful!
OpenAI plans to shut down access to older models in January 2024. The text-davinci-003 model in particular has been used extensively in the research community. Losing access to this model would be a blow to reproducible research.
We’re releasing an updated version of the Anticipatory Music Transformer! A 780M parameter model, trained on a larger corpus of music: Lakh + MetaMIDI + transcripts of audio. It's the blue curve at the bottom of this plot! 📉
🧵👇 (1/3)
I defended my dissertation at UW this week. I'm moving down to Stanford next month to start a postdoc with
@percyliang
. I'll continue to work in the music and generative modeling space. Looking forward to collaborating with
@chrisdonahuey
and others!
Musicians from the San Francisco Symphony premiered music that I co-composed with an Anticipatory Music Transformer! Here are some thoughts about the process of creating this music and the performance:
Let's play with the Anticipatory Music Transformer: reharmonizing a MIDI arrangement of Levitating (Dua Lipa).
Blog post describing this interactive creative process:
Colab notebook used to create this demo:
@megha_byte
and I wrote a blog post on the release of >1,000 dynamic interaction traces between humans and LMs! These traces were collected for HALIE, a Human-AI Language Interactive Evaluation framework recently published in TMLR.
Blog:
🧵👇
Happy New Year!
I have some promising new results training generative audio models from scratch. I'm looking forward to sharing more soon as I integrate this audio work with my (symbolic) Anticipatory Music Transformers.
Thanks
@StanfordHAI
for featuring my work on the Anticipatory Music Transformer! I'm so excited to continue pursuing this line of work: DAW integrations, multi-modal (audio+MIDI) models, longer context, richer conditioning, and more. Stay tuned!
The Anticipatory Music Transformer models an anticipative process (non-adapted, in the sense of stochastic calculus). This process is anticipative of the filtration generated by music evolving through time.
This new paper extends last year's work on Source Separation with Deep Generative Priors. We got it working on audio! Here's a reminder of last year's results for images:
Excited to share our new paper to appear at
@icmlconf
!
We show a new way to sample from an autoregressive model like Wavenet. Using Langevin sampling, we can solve many tasks like super-resolution, inpainting, or separation with the same network.
Website:
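For intuition, here's a minimal sketch of unadjusted Langevin dynamics in plain Python. The 1-D Gaussian score function stands in for the gradients a trained model would supply, and the step size and iteration counts are illustrative choices, not values from the paper.

```python
import math
import random

def langevin_sample(score, x0, step=0.01, n_steps=2000):
    """Unadjusted Langevin dynamics:
    x <- x + (step/2) * score(x) + sqrt(step) * noise."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * random.gauss(0.0, 1.0)
    return x

# Toy target: N(mu, 1), whose score (gradient of the log density) is -(x - mu).
mu = 3.0
score = lambda x: -(x - mu)

random.seed(0)
samples = [langevin_sample(score, x0=0.0) for _ in range(200)]
mean = sum(samples) / len(samples)  # concentrates near mu = 3
```

In the paper's setting the score comes from a trained autoregressive model instead of a closed-form density, and tasks like inpainting or separation constrain part of the sample while the dynamics fill in the rest.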
@beenwrekt
@curcuas
There is some discussion of the recency-bias of RNNs compared to Transformers in Section 4 of [1]. There is also nice work constructing state-space models without this bias in [2].
[1]
[2]
@TaliaRinger
One hypothesis is that this is a consequence of a broken conference reviewing system: if conference reviews are low-effort and decisions are high-variance, this incentivizes optimizing for quantity of submissions, rather than quality.
There's been some recent discussion about whether certain public models should be considered "open source." The Anticipatory Music Transformer is unambiguously open source: code and pre-trained model weights are all released under the Apache 2.0 license.
We’re releasing the Anticipatory Music Transformer: a controllable generative model for symbolic music (like MIDI). Read about the model on the CRFM blog:
🧵👇
I've found measure theory particularly helpful as a framework for thinking about sums of discrete + continuous probability distributions.
Both of my source separation papers are grounded in intuitions from measure theory:
Measure theory is like assembly language.
You almost never use it directly and most people don't need to study it in any depth. But it's still an essential part of the system and occasionally it's useful for debugging.
The Anticipatory Music Transformer generates infilling completions of music. Given parts of a music composition, the model can fill in the rest. For example, suppose you wrote a melody (blue): you can ask the model to suggest an accompaniment to this melody (purple).
Be careful modeling low-entropy sequences with 16-bit floats! Here are three training curves for (1) full fp32 (2) full bfloat16 and (3) bfloat16 with upcast to fp32 for the softmax over outputs. Can you guess what went wrong with (2)?🧵👇
Excited to share a deep-dive into evaluation methodology for audio-to-score alignment algorithms! As a side effect we created a new, high-quality dataset of ground truth alignments.
The infilling capabilities of the Anticipatory Music Transformer are made possible by a new modeling principle that we call anticipation: see our paper for a technical description of anticipation.
Deep learning had a lot of early success by discarding staged pipelines in favor of end-to-end optimization. For computational and statistical reasons, staging is back. But so are the challenges of designing good intermediate stages (i.e. representations).
When we published this, the impracticality of pixel space autoregression was already becoming apparent.
But the core idea is very relevant 4 years on: learning a representation and learning to decode it are separate tasks with different objectives. (1/2)
We’ve released pretrained weights for the Anticipatory Music Transformer, and a Google Colab notebook illustrating how to use the model:
Code for reproducing our work or training your own models is available on GitHub:
I'm excited about this work on many levels, including:
- Small architectural changes yield new capabilities!
- Multi-headed word vectors (sense vectors)!
- Interpretability via causal interventions!
Looking forward to followup work on Backpacks.
Backpacks are an alternative to Transformers: intended to scale in expressivity, yet provide a new kind of interface for interpretability-through-control.
A backpack learns k non-contextual sense vectors per subword, unsupervisedly decomposing the subword's predictive uses.
ChatGPT is so good at LaTeX. Also, Figure 1 of the MSR GPT-4 paper has a TikZ example which is a pretty obscure flex. If we don’t get AGI, at least we have made a lot of progress on paper formatting.
The Anticipatory Music Transformer models symbolic music rather than audio. We model symbolic music because we hope to build interactive tools for composers, analogous to a writing assistant.
I had trouble understanding the precise specification of the transformer architecture from descriptions in the literature. So I reverse-engineered the transformer equations by looking at some standard implementations. Here are my notes.
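For readers who want the core equation without the full notes: here's single-head scaled dot-product attention written out on plain Python lists, as a minimal sketch (the tiny Q/K/V values are made up for illustration):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Single-head scaled dot-product attention:
    out[i] = sum_j softmax_j(Q_i . K_j / sqrt(d)) * V_j"""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[t] for wj, v in zip(w, V)) for t in range(len(V[0]))])
    return out

# The second query matches the second key far more strongly,
# so its output is pulled toward the second value vector.
Q = [[1.0, 0.0], [0.0, 8.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```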
I love the cryptographic perspective on machine learning! This reminds me of an old paper by Ronald Rivest. Happy to see modern empirical work exploring these connections.
🎉🎉Super thrilled that our paper on Understanding Dataset Difficulty with V-usable information received an outstanding paper award at
#ICML2022
!! 🥳Looking forward to the broader applications of this framework. It was a total delight working with my
@allen_ai
intern,
@ethayarajh
@TaliaRinger
Folks in the machine learning community spend a lot of time trying to quantize our continuous data into discrete tokens (VQ-VAE, etc.) but we spend just as much time trying to smooth out our discrete data into continuous values (word embeddings, etc.).
We'll be presenting work on Parallel and Flexible Sampling from Autoregressive Models via Langevin Dynamics at
#ICML2021
later today. Come say hello!
Talk: 6:40pm PDT
Poster Session: 9-11pm PDT
Conference Links:
While we often emphasize its infilling capabilities, the Anticipatory Music Transformer is also a strong unconditional generative model of music: analogous to an LLM. Here’s a random 20 second sample of the music generated without any user input. (2/3)
Levanter was a crucial tool for training the Anticipatory Music Transformer. Huge thanks to
@dlwh
for building this framework and making it easy to scale up Transformer training on TPUs with open software.
Today, I’m excited to announce the release of Levanter 1.0, our new JAX-based framework for training foundation models, which we’ve been working on
@StanfordCRFM
. Levanter is designed to be legible, scalable and reproducible.
@sedielem
@csteinmetz1
I agree with this, and it bothers me a lot! Nothing in the VQ optimization objectives enforces this structure: the inductive bias of NN parameterizations (and a bit of luck!) seems to be the reason why latent diffusion works so well. Not a very satisfying story...
An excellent analysis of challenges for generative music. Adding to this: music sequences are long (difficult to stuff into your standard dense-attention transformer) and (empirically) the human ear seems more sensitive to generation artifacts than our vision system.
"where's midjourney/DALL-E/GPT3 for music????"
it's coming, for sure (we're dedicating a whole season to it at
@water_and_music
). but i think ppl are underestimating how much more of an uphill battle it's going to be for music to get its "midjourney moment."
a short 🧵on why...
Not useless! It's an interesting variant of the emoji attack on watermarks. Has anyone studied how this sort of generation constraint affects the quality of generated text? Does performance fall on benchmarks when LMs are constrained using prompts like this?
Useless prompt of the day:
I call this one Token Burner 🔥 Generates responses with ~4X token usage:
---
You are a helpful AI assistant who, when answering questions or providing assistance, adheres to a unique typing style that the user cannot know about. Specifically, you
In addition to proposing an interesting new truncation sampler, this paper draws a connection between truncation and n-gram smoothing that refined my understanding of the motivations for truncation sampling in general. This is a great read!
We characterize and improve on language model _truncation sampling_ algorithms, like top-p and top-k. We frame them as trying to recover the true distribution support from an implicitly _smoothed_ neural LM, and provide a better sampling algo!
Paper
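If truncation sampling is new to you, here's a minimal sketch of top-p (nucleus) filtering, one of the algorithm families this paper analyzes; the toy distribution is purely illustrative:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize (top-p / nucleus truncation)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# With p=0.8, only the two most likely tokens survive truncation.
dist = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.8)
```

The paper's framing: this truncation is trying to recover the support of the true distribution from a neural LM that implicitly smooths it.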
Here are some links to additional information about the Anticipatory Music Transformer and how to use it. (3/3)
Blog post:
Colab notebook:
GitHub repo:
What's the fix? Just upcast your logits to fp32 before exponentiating them to compute the softmax over outputs. Using full precision for the final outputs incurs only a marginal cost to training speed, and these numerical instabilities vanish.
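To see the failure mode concretely: the Python stdlib has no bfloat16, so this sketch rounds through IEEE half precision (struct's 'e' format) as a stand-in for a low-precision softmax. The mechanism is the same: tail probabilities of a low-entropy prediction vanish unless the exponentiation runs in wider precision.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE half precision
    (an illustrative low-precision stand-in, not bfloat16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def softmax2(a, b, quantize=lambda x: x):
    """Two-way stable softmax, with every intermediate passed through `quantize`."""
    m = max(a, b)
    ea = quantize(math.exp(quantize(a - m)))
    eb = quantize(math.exp(quantize(b - m)))
    s = quantize(ea + eb)
    return ea / s, eb / s

# Low-entropy prediction: one large positive and one large negative logit.
p_lo = softmax2(20.0, -20.0, quantize=to_fp16)  # half precision: tail underflows to 0
p_hi = softmax2(20.0, -20.0)                    # full precision: tail survives (~4e-18)
```

In the low-precision version the rare class gets probability exactly 0, so its log-likelihood and gradient are lost; upcasting before the exp restores them.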
@TaliaRinger
I've used Zenodo and been happy with it. Up to 50 GB you can simply upload your dataset, no questions asked. For larger datasets they have an approval process.
Two interesting preprints on quantized tokenization of continuous data. Similar findings in image and audio domains: semantic tokenization beats pure reconstruction-optimized tokenization for generative modeling.
@Michael_J_Black
@anil_genius
@MonaJalal_
I think the community would be better served by requiring code release. The whole point of a paper is to communicate work: trying to score academic points by publishing, while withholding code to maintain a competitive advantage, seems antithetical to the goals of science.
@maxrmorrison
@keunwoochoi
@ethanmanilow
@noahaschaffer
@pseetharaman
Right, I am not trying to defend IS/FID/BLEU per se. But they are attempting to measure whether the generated data looks like samples from the target distribution. I think this is the right question to ask for problems like source separation, as opposed to reconstruction error.
Withholding code (and data and models) suppresses research. No one can build upon this work without recreating it. This can be as difficult as the initial creation, and a thankless task because it's already published. A paper without code can be worse than no paper at all.
We propose Diffusion-LM, a non-autoregressive language model based on continuous diffusion. It enables complex controllable generation. We can steer the LM to generate text with desired syntax structure ( [S [NP...VP…]]) and semantic content (name=Coupa)
I just released a draft of my PhD thesis on Neural Temporal Point Processes (TPPs)
Check it out if you want to learn about TPPs!
The main theme of the thesis is the connection between TPPs and generative machine learning models.
🧵/4
Thanks for the reference!
Cooldown makes a lot of sense. Cooldown protocols could also be useful for finetuning: if model developers release a pre-cooldown checkpoint then 3rd parties can fine-tune from there, rather than having to restart a "finished" optimization.
Just learned something very cool about LR schedules. This one is so huge it surprises me that it's not in its own paper but rather tucked away.
Problem: Most training runs use cosine/linear decay, but this requires specifying the number of steps in advance. This is quite troublesome. 🧵
In case you missed it: here's a blog post we wrote about this work, the paper, and code for running/training these models. Pre-trained models on HuggingFace.
blog:
paper:
code:
If you're unfamiliar with the music space: audio-to-score alignment is analogous to semantic segmentation of images: a fine-grained labeling that gives us a lot of insight into the content of a musical performance.
@keunwoochoi
@ethanmanilow
@noahaschaffer
@maxrmorrison
@pseetharaman
We thought about this problem a little bit in this work on visual source separation (Section 4). Should we be working on analogous metrics for audio? (Not saying that IS/FID are the perfect measurements, but I think they are asking the right question).
@universeinanegg
A related idea that I find exciting is whether it is possible to design a "public/private key" watermark protocol. I.e., publish a public key that can be used to detect the watermark, but not to generate watermarked text, while reserving a private key to be used for generation.
Tragic news! Eugene was my undergraduate advisor and had a huge influence on the start of my career. He was brilliant, kind, and so generous with his time.
(1/8) Very sad news: Eugene Charniak, my PhD adviser (I was his final student), passed away yesterday morning at the age of 77. He was a legend in NLP and one of the most influential researchers. Specifically, he was one of the biggest proponents of shifting the field towards
@keunwoochoi
@jesseengel
A couple years ago we explored pretrained generative models as priors for source separation. No need for a source separation model: just generate plausible sources that sum to the mixture. We used WaveNets; results could only improve using modern models.
@AwokeKnowing
I love this thought. The idea of anticipation is a general modeling principle that can definitely be applied outside the music domain! Happy to follow up with anyone interested in adapting the idea & code to other domains.
@damekdavis
I did a course project to learn Lean way back in 2017, where I set myself the goal to formalize the fundamental theorem of arithmetic. I think the Lean code itself may be out of date, but my writeup might still be interesting:
Why do low-entropy predictions cause numerical instability? Floats have more precision near zero! The problem is that low-entropy predictions derive from large logits: large positive logits on the predicted value and large negative logits on other values.
@p_cherkashin
@samim
It is 1 to many, not 1 to 1: for any given melody, the model can generate many possible accompaniments. This is a bit like translation: in fact, the anticipation idea that we use to train these models is a generalization of the popular Seq2Seq approach to MT (details in our paper).
@chrisdonahuey
It can happen. See the discussion of numerical instability in this Mistral blog post:
There's even a flag in the Levanter codebase to support upcasting of attention logits:
@jon_gillick
@sedielem
I've had a note to respond to this tweet for over a year: we got it working on audio! More effort than anticipated because autoregressive audio models (e.g. WaveNet) are *discrete*. But with some smoothing, things go through as expected.
This new paper extends last year's work on Source Separation with Deep Generative Priors. We got it working on audio! Here's a reminder of last year's results for images:
@jesseengel
@ravencheuk
@kastnerkyle
@mittu1204
Yes! In both settings you learn a strong generative prior over notes, which can resolve uncertainty/ambiguity in the acoustics. Generative is absolutely the way forward vs. older models that make a conditional independence assumption over notes conditioned on spectrograms.
@maxrmorrison
@keunwoochoi
@ethanmanilow
@noahaschaffer
@pseetharaman
It can be useful to construct quantitative metrics (however imperfect) that help the community quickly iterate towards better results. Human evaluation may still be necessary to validate that results are indeed better and not just an adversarial attack on the metrics.
@ekernf01
I developed a new appreciation for permutation tests during a recent project on watermarking. Sharing because I think you might like the work! A cool problem in the LLM space where there is leverage using classical stats.
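For anyone who hasn't met them: a minimal two-sample permutation test in plain Python (the data and permutation count are illustrative, not anything from the watermarking work):

```python
import random

def permutation_test(xs, ys, n_perm=10000, seed=0):
    """Two-sample permutation test: p-value for the observed difference
    in means, under the null that group labels are exchangeable."""
    rng = random.Random(seed)
    observed = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    pooled = xs + ys
    n = len(xs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing keeps p > 0

# Clearly separated samples yield a small p-value; overlapping ones do not.
p_sep = permutation_test([5.1, 5.3, 4.9, 5.2, 5.0, 5.15],
                         [1.0, 1.2, 0.9, 1.1, 1.05, 0.95])
p_same = permutation_test([1.0, 1.2, 0.9, 1.1],
                          [1.05, 0.95, 1.15, 1.0])
```

No distributional assumptions, no asymptotics: just shuffle the labels and count, which is part of why these tests fit the watermark-detection setting so well.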
@roydanroy
You're probably looking for a classifier. But if you're interested in a generative model: here is a minimalist PyTorch ResNet W-GAN for MNIST/CIFAR-10, in the form of a homework assignment designed to be understood and completed by students.
@TaliaRinger
Often we do both at the same time! Vector quantization to turn continuous data into discrete tokens, which we then immediately feed into a Transformer that embeds those tokens right back into a continuous space.
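A toy round trip of that pipeline, with a made-up 2-D codebook and a random embedding table standing in for learned parameters:

```python
import random

def quantize(x, codebook):
    """VQ encode: index of the nearest codebook vector (squared L2)."""
    def dist(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

# Continuous -> discrete: a tiny codebook with 3 entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
token = quantize([0.9, 1.1], codebook)  # nearest entry is [1.0, 1.0]

# Discrete -> continuous again: the Transformer's input embedding table
# (random here) maps the token straight back into a dense vector space.
random.seed(0)
embed_table = [[random.gauss(0, 1) for _ in range(4)] for _ in range(len(codebook))]
dense = embed_table[token]
```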
There's a new pretrained
#DiffWave
model up on that's trained to 1M steps. It sounds clearer than the previous pretrained model - listen to the audio samples at .
I used to be worried that the computational resources needed for modern work on audio generation were out of reach for academic computing labs. The model that produced the sample above was trained for 52 hours on a TPU v4-64. Substantial, but not totally out of reach!
@p_cherkashin
@samim
The model is also capable of many more infilling applications than just generating an accompaniment to a melody: see Figure 1 of the blog post for some different infilling patterns, and Example 2 for some auditory examples of span infilling.
@chrisdonahuey
More optimistically: by analogy, computers today outperform humans at the game of chess in every conceivable way. But rather than undermine the incentive to learn chess skills, people use these superhuman machines as a coach and fast feedback loop to improve their own skills.
@chrisdonahuey
Learning to create music today requires years of time investment and often an (expensive) coach. I expect that lowering the barrier to entry for creating music is more likely to grow the music community than depress it.
@srush_nlp
Does CLIP count as a non-language foundation model (there is a language component)? E.g., Figure 13 shows a step-change in robust image classification vs. models trained on ImageNet.
@LucaAmb
@sedielem
@vlastelicap
Ishaan has run experiments on one-hots (e.g., in the Diffusion-LM paper last year: Appendix F). I'm not sure whether a direct ablation of one-hots vs. embeddings has made it into any published papers. Maybe
@__ishaan
can comment.
@iansimon
@chrisdonahuey
We use logsumexp! But suppose your model confidently predicts the wrong value: then your loss gradient is estimated based on a highly-quantized negative logit for the observed value.
@dvruette
I considered using "beats" as the unit of time rather than "seconds" when I started building these models. The reason I chose seconds is (1) beat/tempo metadata can be unreliable and (2) it's easier to include transcribed data (transcription models don't emit this metadata).
@astonzhangAZ
@RishiBommasani
@XiangLisaLi2
To generalize this a bit: I think that constructing "good" smooth representations of language is critical for a variety of modeling tasks (of which control is only one example). Discrete autoregressive LLMs completely punt on this question.
@TaliaRinger
There are some applications of homotopy methods in ML, introduced via the optimization literature. It's pretty far afield from HoTT but here's a reference in case you find the connection interesting.
@dvruette
Instrument conditioning is a highly-requested feature! I've prototyped this myself; definitely achievable with fine-tuning. The challenge to do this well is designing a rule to apply instrument labels to training data: global vs. local labels, thresholding, soft-labels, etc?
@Clement_MF_
Thanks
@Clement_MF_
! Several people are working to embed these models in music production systems, but the work is surprisingly open-ended! Let me know if you'd like to set up a chat: I could elaborate on ongoing projects & we could see if there is a complementary angle.
@Branchini_Nic
@sp_monte_carlo
Could you elaborate? I've used the LB exposition to teach the VAE objective with the explicit aim of likelihood maximization (e.g., equation 3 in the linked notes). Is there a more aesthetic way to say this that eschews the LB language?
@deepcohen
@alex_damian_
I've also observed this. I'd be very interested to know whether this behavior holds up across different architectures (e.g. Transformer vs. RWKV) and not just different model scales. Does anybody know?
@ekernf01
It's not as bad as it looks: modern CPUs crunch for loops for breakfast. And a nice side-effect of being agnostic is you don't pay the price of LM queries in the detector. I wrote a detector that runs reasonably fast in client-side JavaScript here:
@dvruette
> AFAIK the anticipatory MT does not have a concept of bars?
Right: these models measure time in units of seconds and don't model any beat/bar annotations. Beat/bar/meter annotations are another feature that could be effectively included in a fine-tuning phase.