One of the key models in MusicLM is SoundStream, an audio codec. It made vocoders obsolete and recast audio generation as a token prediction task.
SoundStream is not open to the public, but a similar neural audio codec, Encodec, is completely open-source!
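a minimal sketch of pulling discrete tokens out of Encodec, going off its public README (the input file name below is made up; check the repo for the current API):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; decides how many codebooks are used

wav, sr = torchaudio.load("some_music.wav")  # hypothetical input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)  # list of (codes, scale) frames

codes = torch.cat([c for c, _ in encoded_frames], dim=-1)
print(codes.shape)  # [batch, n_codebooks, n_frames]: the "audio tokens"
```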
really well done, from SoundStream and AudioLM through MuLan to MusicLM 👏👏
the overall structure of MusicLM
= MuLan + AudioLM
= MuLan + w2v-BERT + SoundStream
MuLan is a text-music joint embedding model.
- contrastive training
- 44M music audio-text description pairs from "internet music videos" *cough cough* YouTube *cough cough*
- audio encoder: AST, the Audio Spectrogram Transformer
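a toy sketch of that contrastive training, assuming a CLIP-style InfoNCE loss between the audio tower and the text tower. this function is my illustration, not MuLan's actual code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP-style InfoNCE over a batch of paired (audio, text) embeddings.

    audio_emb, text_emb: [batch, dim], from the audio tower (e.g. an AST)
    and the text tower; matched pairs share a row index.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                   # [batch, batch] similarities
    labels = torch.arange(len(a), device=a.device)   # positives on the diagonal
    # symmetric: audio->text and text->audio
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```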
Last Friday was my last day after two years at Spotify.
Today I start at ByteDance AI Research.
(Based in Mountain View, California in principle, but joining remotely from NYC.)
I left ByteDance last Friday. It was quite a 1.8 years ❤️ (base-12)
I'm glad I got what I wanted: a novel and intense learning experience. I shipped quite a few things, worked on research back-end tools, and made some research impact.
Now, time to move on :)
🌱 We're hiring 2024 summer research interns on LLMs for drug discovery and biomedical applications. Join me, @stephenrra, @kchonyc, and other amazing people in NYC to work on the LLM product development of @PrescientDesign, @genentech ✨
Details:
🥳 PROPOSAL: Foley Sound Synthesis Challenge 🥳
There are enough challenges out there for speech and music. We propose one for "the other" kind of audio: sound. Or effects. Or, Foley.
We need to define the problem, dataset, and eval scheme. How? 🧵🧶
I summarized the differences between `tokenizers.Tokenizer`, `transformers.PreTrainedTokenizer`, and `transformers.PreTrainedTokenizerFast`. I even made a GitHub repo just to post this.
Ahem, ahem.
:
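the sketch version, for the timeline (see the HF docs for the full story): `tokenizers.Tokenizer` is the low-level Rust tokenizer, `transformers.PreTrainedTokenizer` is the slow pure-Python class, and `PreTrainedTokenizerFast` wraps a `tokenizers.Tokenizer` behind the usual `transformers` interface:

```python
from tokenizers import Tokenizer
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# low-level Rust tokenizer: returns an Encoding object; no padding/tensor niceties
raw = Tokenizer.from_pretrained("bert-base-uncased")
print(raw.encode("hello world").ids)

# wrap it to get the familiar transformers API (declare the special tokens)
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=raw,
    unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)
print(wrapped("hello world")["input_ids"])

# AutoTokenizer hands you the Fast wrapper when one exists, else the slow one
auto = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(auto))  # e.g. BertTokenizerFast, a PreTrainedTokenizerFast subclass
```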
I joined Gaudio Lab to, I'd dare say, pioneer some audio/music AI! 🥳
I'm more excited than ever :D
Oh, and I'll visit Seoul more often. Friends in 🇰🇷, catch up soon!
the "llama moment" has come to audio research today! i can't even imagine what we'll see come out of AudioCraft.
whatever you work on in music/audio, do consider using it, as much as you can. if you don't know what to do, think about what you could do with it and get a head start.
Today we're sharing details on AudioCraft, a new family of generative AI models built for generating high-quality, realistic audio & music from text. AudioCraft is a single code base that works for music, sound, compression & generation, all in the same place.
More details ⬇️
THIS IS BIG! All the music folks in Google DeepMind focus on one thing: AI music generation while NOT exploiting artists. Nothing is perfect, and there are probably still some holes in giving credit, but this is better than anything that came before, for sure.
Thrilled to share #Lyria, the world's most sophisticated AI music generation system. From just a text prompt, Lyria produces compelling music & vocals. Also: building new Music AI tools for artists to amplify creativity, in partnership w/ YT & the music industry
New AI music model alert! yes, again 😄
#SingSong, another music generation model by Google; @chrisdonahuey et al.
Ok, let me do another run of collecting followers. How does it work?
If you belong to an underrepresented group in any sense (gender, race, nationality, financial situation, etc.) and need help with any MIR issues, please just contact me: gnuchoi at the-email-starting-with-G-you-know-what-I-mean 😉
for #icassp2024 attendees, i'm open sourcing my `What to eat around COEX` list. originally written for @cwu307, but sharing it with a larger crowd to make the world a better place, reduce p(doom), etc.
📄+📄+📄+📄+📄+📄+📄 = 7 papers
🔥 MIR researchers at ByteDance (the SAMI team) got 7 papers accepted to #ISMIR2021 🔥
🧵 I'll introduce them here one by one :)
Hi people!
Me and @kchonyc's #ismir2019 paper, "Deep Unsupervised Drum Transcription" aka 🥁 DrummerNet, is here.
Paper -->
Blog post -->
Supplementary material -->
to recap, i find the whole roadmap really, really brilliant.
- because there's MuLan, they could use an audio-only dataset.
- because there's SoundStream, the music generation task was simplified to token generation, not waveform generation.
Ok, now (retrospectively, at a high level) it's kinda simple.
given a training item:
- extract MuLan tokens (M), w2v-BERT tokens (S), and SoundStream tokens (A)
- train a model for M → S
- train a model for [M;S] → A
both done by decoder-only transformers.
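not from the paper, but here's a tiny runnable sketch of those two stages, with random tokens standing in for the real MuLan / w2v-BERT / SoundStream outputs (all shapes and the toy decoder are made up; real SoundStream uses several codebooks, which i flatten away here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, vocab, d = 2, 1024, 128
M = torch.randint(0, vocab, (B, 12))   # MuLan (audio-tower) tokens: the conditioning
S = torch.randint(0, vocab, (B, 150))  # w2v-BERT semantic tokens
A = torch.randint(0, vocab, (B, 600))  # SoundStream acoustic tokens

class TinyDecoder(nn.Module):
    """Stand-in for a decoder-only transformer (a causal LM over token ids)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):
        x = self.emb(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.body(x, mask=mask, is_causal=True))

def next_token_loss(model, prefix, target):
    """Next-token cross-entropy, scored only on positions that predict `target`."""
    seq = torch.cat([prefix, target], dim=1)
    logits = model(seq[:, :-1])
    tgt_logits = logits[:, prefix.size(1) - 1:]
    return F.cross_entropy(tgt_logits.reshape(-1, vocab), target.reshape(-1))

stage1, stage2 = TinyDecoder(), TinyDecoder()
loss_semantic = next_token_loss(stage1, prefix=M, target=S)                     # M -> S
loss_acoustic = next_token_loss(stage2, prefix=torch.cat([M, S], 1), target=A)  # [M;S] -> A
```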
i'm teaching a class about AI at NYU, Spring 2024. it's "Deep Learning for Media", a course about AI for audio and visual content.
oof, i thought i had become an LLM person.
(it's not a job change, i'm covering one class this semester)
happy to have a nyu dot edu account back! 🎉
I joined @PrescientDesign recently. I distracted @kchonyc with music research circa 2016-2019. This time, he invited me to join his realm -- languages! I'm already having a lot of fun, knowing more is to come.
<shameless as always>
my papers are the 1st and 6th most-cited ISMIR papers of the last 5 years! 🔥🔥
heard it was mentioned at the #ismir2021 trivia organized by the titans @r4b1tt @urinieto. i think they should arXiv the trivia and cite my paper thx
AudioLM = w2v-BERT + SoundStream
w2v-BERT is..
- a BERT, but for audio. originally for speech; in AudioLM, an intermediate layer from a speech-pretrained model was used.
- it's "coarse" (a bitrate of 250 bps)
- it takes care of semantic information.
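per the AudioLM paper, those semantic tokens come from k-means over an intermediate w2v-BERT layer: each frame's embedding gets replaced by its nearest centroid id. a toy sketch with random features standing in for w2v-BERT (which isn't public); the real thing uses ~1k clusters, iirc:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frame_features = rng.normal(size=(2000, 1024))  # stand-in for [frames, dim] activations

# fit the codebook: 64 toy clusters here (AudioLM-scale would be ~1024)
kmeans = KMeans(n_clusters=64, n_init="auto", random_state=0).fit(frame_features)

def to_semantic_tokens(features):
    """One discrete token per frame: the id of the nearest k-means centroid."""
    return kmeans.predict(features)

print(to_semantic_tokens(frame_features[:10]))  # a coarse, low-bitrate sequence
```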
ByteDance/TikTok is hiring research scientists and software developers around music information retrieval and music/audio signal processing in Mountain View, US. Please hit me up! #ismir2020
we're hiring AI/LLM engineers!
- covering both pre-training and post-training tasks
- purely for product development, based on an *extensive understanding of LLMs*
- with real-world impact on drug discovery at Genentech
- no publications in sight
Frequency-aware CNNs. Oops, I was working on the same thing last summer but ran out of time after some experiments. It worked for music classification and source separation. Go try this!
i like textgrad. down to try it.
but i can't really say i like the way it's explained.. the paper/blogpost is purely an analogy to backpropagation, which is cool, but can you also just simply describe what it is..?
We're looking for a junior-level MIR researcher (perhaps a Master's or PhD) in Shanghai to work with me on music tagging and related problems. Expecting to hire ASAP. Please email me if you're interested!
It seems clear to me that TensorFlow developers don't deeply understand why researchers struggle with their product. Life is too short for most researchers to be very good at both Python and machine learning. TF adds yet another burden; PyTorch doesn't.
in the training set, no text label is needed, because we.. i mean, googlers.. have a pre-trained MuLan!
also, if you believe in the power of the neural codec, SoundStream, there's no need to train end-to-end with waveforms etc.! SoundStream tokens are good enough!
it took me 4 years to get started on the NYC / Brooklyn jazz scene. now i'm totally immersed in it, after attending ~100 shows in the last 2 years.
subscribe to my newsletter "JazzBuzz" and learn about this captivating world - people, music, venues!
TikTok 🎶 is hiring a research scientist in Music/ML at our 🇬🇧 London office 🔥 Join our SAMI team to work on Speech, Audio, and Music Intelligence with us :)
Please feel free to reach out to me with any questions 📧
*QUITE A FEW* papers were accepted to #ismir2021 from our team at ByteDance 🎉🎉🎉🎉🎉🎉🎉 I'll share more details once the proceedings are updated.
And yes, we're hiring 🔥🔥🔥🔥🔥🔥🔥
inference is straightforward.
do the same as in the training stage, except:
- use the MuLan *text* model, because we want *text*-to-music.
- after the SoundStream tokens are predicted, feed them to the SoundStream decoder to generate audio.
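continuing my toy sketch from the training thread (same made-up `stage1`/`stage2`/`M`; in the real system, M comes from the MuLan *text* tower, and the predicted acoustic tokens go through the SoundStream decoder):

```python
import torch

@torch.no_grad()
def generate_tokens(model, prefix, n_new, temperature=1.0):
    """Autoregressive sampling from a TinyDecoder (defined in the training sketch)."""
    seq = prefix
    for _ in range(n_new):
        logits = model(seq)[:, -1] / temperature       # next-token logits
        nxt = torch.multinomial(logits.softmax(-1), 1)
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, prefix.size(1):]

S_hat = generate_tokens(stage1, M, n_new=150)                         # M -> S
A_hat = generate_tokens(stage2, torch.cat([M, S_hat], 1), n_new=600)  # [M;S] -> A
# A_hat would then be decoded to a waveform by SoundStream's decoder
```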
Sheet Sage: lead sheets from music audio.
It leverages Jukebox for melody extraction.
Who'd submit this level of amazing work to a mere late-breaking/demo session? This guy → @chrisdonahuey
Long time no first-authoring! The Listen, Read, and Identify network (LRID-Net) identifies singing language by reading the metadata (title, album, artist) and listening to the audio.
Our paper about DCASE Challenge Task 7 - Foley Sound Synthesis was accepted to the DCASE Workshop 🥳
I can't make it to Finland 🇫🇮, but some of the authors will be there to tell you what we went through while organizing the first generative challenge at DCASE.
ByteDance 🎶 US Speech / Audio / Music research team is hiring research scientists extensively. If you're a graduating PhD this year, don't wait and just DM me! 🔥🔥
DawDreamer has gained many features recently, including pip installation. A new notebook shows how to load Ableton warp marker files, like in this video. Faust integration enables custom polyphonic instruments. Hopefully it's very useful for ML researchers and artists.
teaching "deep learning for media" at NYU was super fun! now, let me disseminate my students' final projects. these are really cool stuff.
they somehow made it in the vary last minute. i swear none of these was at this level just one week before ๐
anyways, ๐งต starts -
looking for an enthusiastic MLE/SWE who *knows* LLMs and their deployment, for our internal LLM serving through APIs and web interfaces. ideally, 1-2 yrs of industry experience + a master's-grad-level understanding of LLMs/ML. NYC. amazing team & great use cases.
DCASE Task 7 - Foley Sound Synthesis has finished. It was the very first generative audio AI challenge. I'm very happy to have organized such a successful event! 🎉
The longest-ever video of me speaking in public has become public: "Deep Learning with Audio Signals: Prepare, Process, Design, Expect" at @QConAI. In case me tweeting around you isn't enough.
look how shamelessly i'm included here! as always, it was great to connect with all the great researchers in MACLab, supervised by @juhan_nam, at @ISMIRConf.
This year, people from the Music and Audio Computing Lab at KAIST, led by @juhan_nam, participated in @ISMIRConf and presented our work through scientific programs, late-breaking demos, and music sessions!
The #ismir2019 poster repo now hosts 25 posters and has 38 stars. Would you please 'Like' this tweet if you've ever visited the repo and seen any posters there? I wanna know its impact. Thanks!
I've been an audio person for 10+ years. Let me tell you: you don't need 192 kHz / 24-bit or anything. If you don't like the audio quality from any legit music streaming service, it's NOT about the codec. Get a better connection, a quieter place, better earbuds.
Can't wait to share our new Text-to-Audio model, AudioLDM 🔊
This video shows the generation result with a simple text prompt: "A music made by xxx".
More demos coming soon! 😆
The paper will be available on arXiv next Monday! 🚀
Our model will be open-sourced soon! 🔥
Um, Spotify will definitely hire 2019 summer research interns for some fun MIR work, so please stay tuned! (i.e., don't say yes to others too soon 😉)