Research Scientist Intern at FAIR Meta. CS PhD student @UTAustin. Into speech/audio recognition and generation. Previously @uchicago Stats, @BNU_Official Math.
Announcing 𝐕𝐨𝐢𝐜𝐞𝐂𝐫𝐚𝐟𝐭💪
SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc.
VoiceCraft works on in-the-wild data such as movies, random videos and podcasts
We fully open source it at
The model weights are up!
We uploaded the biggest model (830M) and also a smaller model (330M).
Repo:
Run ./inference_tts.ipynb or ./inference_speech_editing.ipynb in the repo to try inference
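If you'd rather skim the idea than open a notebook: VoiceCraft is a neural codec language model, so zero-shot TTS amounts to encoding a short reference recording into discrete codec tokens and letting the model continue them for new text. A conceptual sketch only; `load_voicecraft`, `load_codec_tokenizer`, and their arguments are hypothetical placeholders, not the repo's actual API (the notebooks above are the real entry points):

```python
# Hypothetical sketch of codec-LM zero-shot TTS, NOT VoiceCraft's real API.
import torchaudio

model = load_voicecraft("830M")           # hypothetical loader
codec = load_codec_tokenizer()            # hypothetical EnCodec-style tokenizer

# A few seconds of the target voice serve as the acoustic prompt.
ref_wav, sr = torchaudio.load("reference.wav")
prompt_tokens = codec.encode(ref_wav, sr)

# The LM continues the prompt tokens, conditioned on the new transcript.
text = "VoiceCraft works on in-the-wild data."
gen_tokens = model.generate(text=text, audio_prompt=prompt_tokens)

# Decode the generated tokens back to a waveform.
torchaudio.save("generated.wav", codec.decode(gen_tokens), 16000)
```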
🎙️Whisper is essentially an audio-conditioned LLM. Can we prompt it to do unseen tasks?
🚀 Introducing PromptingWhisper!
We use simple prompts to adapt Whisper to unseen tasks zero-shot without any finetuning.
📄 Paper:
💻 Code:
Check out SLAM-LLM👇
It's a one-stop shop for using LLMs for all kinds of audio tasks: ASR, TTS, audio tagging, audio captioning, spatial audio reasoning, music, and more!
Why is Whisper so robust to background noise? Not because Whisper suppresses it, but because Whisper 𝐮𝐧𝐝𝐞𝐫𝐬𝐭𝐚𝐧𝐝𝐬 it!
Check out the amazing work by Yuan Gong @YGongND. They reveal this emergent capability of Whisper, and get SOTA *simultaneous* ASR + audio tagging
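You can probe the mechanism yourself: Whisper's frozen encoder output already carries information about background sounds, so even a linear classifier on top can do audio tagging. A minimal sketch with the openai-whisper package; the linear probe at the end is my illustration of the idea, not Gong's actual Whisper-AT head:

```python
import torch
import whisper  # openai-whisper

model = whisper.load_model("base")

# Standard Whisper front end: load, pad/trim to 30 s, log-mel features.
audio = whisper.load_audio("noisy_speech.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Frozen encoder features: (1, 1500, d_model) for the base model.
with torch.no_grad():
    feats = model.encoder(mel.unsqueeze(0))

# Illustrative linear probe for audio tagging; it would be trained on
# e.g. the 527 AudioSet classes. Whisper-AT's real head is more involved.
probe = torch.nn.Linear(feats.shape[-1], 527)
tag_logits = probe(feats.mean(dim=1))  # pool over time, then classify
```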
Best tutorial on diffusion models that I've seen, along with
Also, the two are complementary: the former uses the discrete Markov chain perspective, while the latter uses the ODE perspective (also more involved)
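For anyone who wants the punchline without watching: in the discrete-time view, the whole method reduces to a closed-form noising process plus a denoising regression (standard DDPM notation):

```latex
% Forward (noising) process: closed form at any step t
q(x_t \mid x_0) = \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\,x_0,\; (1-\bar\alpha_t)\,I\big),
\qquad \bar\alpha_t = \textstyle\prod_{s=1}^{t} \alpha_s

% Training: regress the noise out of a noised sample
\mathcal{L} = \mathbb{E}_{x_0,\,t,\,\epsilon \sim \mathcal{N}(0,I)}
\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\|^2\Big]
```

The ODE view instead transports samples along the probability-flow ODE, dx = [f(x,t) - (1/2) g(t)^2 grad_x log p_t(x)] dt, which is why it is the more involved of the two perspectives.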
Per several requests, here are the PowerPoint slides for the diffusion model tutorial.
1️⃣ Training:
2️⃣ Guidance:
3️⃣ Resolution:
4️⃣ Speed:
Explainer video:
VoiceCraft will be presented at ACL 2024 🔥🔥🔥
Since its release, we've added a significant number of features requested by the community, with the help of the community!
Try the interactive demo, Jupyter notebooks, command line, finetuning, and more, all at
Meet 𝐁𝐀𝐓🦇! An LLM that can perceive and reason about 𝐒𝐩𝐚𝐭𝐢𝐚𝐥 𝐒𝐨𝐮𝐧𝐝𝐬 in a 3D world 🔊🔊🔊
Accepted by ICML 2024, it's a nice symbiosis between classic signal processing and LLMs, led by the incredible @zszheng147
Website, paper, code, data:
𝐕𝐨𝐢𝐜𝐞𝐂𝐫𝐚𝐟𝐭 works well on recordings with diverse accents, emotions, styles, content, background noise, and recording conditions.
Demo:
Paper:
Code, model, data:
Introducing AV-SUPERB, a new benchmark for audio-visual models.
We found that audio-visual ML models are highly domain-specific: none of the existing models can master all tasks in our benchmark.
A lot of headroom for audio-visual research!
Paper:
VALL-E 2 is out 🔥🔥
VALL-E caused a 𝐩𝐚𝐫𝐚𝐝𝐢𝐠𝐦 𝐬𝐡𝐢𝐟𝐭 in TTS, but two issues remained: repetition and slow sampling. One year later, the same author, my friend and colleague @SanyuanChenAI, fixed the issues brilliantly in VALL-E 2
Excited to present my two papers on 𝗣𝗿𝗼𝗺𝗽𝘁𝗶𝗻𝗴 𝗪𝗵𝗶𝘀𝗽𝗲𝗿 and 𝗩𝗶𝘀𝘂𝗮𝗹𝗹𝘆 𝗚𝗿𝗼𝘂𝗻𝗱𝗲𝗱 𝗦𝗽𝗲𝗲𝗰𝗵 tomorrow (Aug 21st) at Interspeech. Both papers will be presented in *Forum Poster Area 4* during 11:00-13:00
#INTERSPEECH2023
Delighted to have 3 papers accepted to Interspeech:
𝐖𝐡𝐢𝐬𝐩𝐞𝐫 𝐩𝐫𝐨𝐦𝐩𝐭𝐢𝐧𝐠 for unseen tasks
Zero-shot cross-lingual hierarchical 𝐬𝐩𝐞𝐞𝐜𝐡 𝐬𝐞𝐠𝐦𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧
𝐕𝐢𝐝𝐞𝐨 𝐤𝐧𝐨𝐰𝐥𝐞𝐝𝐠𝐞 𝐚𝐜𝐪𝐮𝐢𝐬𝐢𝐭𝐢𝐨𝐧 for robotic manipulation
Paper & Code up soon
Introducing VG-HuBERT! Trained by matching English speech and images, it shows emergent syllable and word segmentation.
Surprisingly, VG-HuBERT generalizes zero-shot cross-lingually: it is also SotA on Estonian, Chinese, French, and German.
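"Trained by matching English speech and images" means a contrastive objective: embeddings of paired speech and images are pulled together, mismatched pairs pushed apart. A minimal symmetric InfoNCE sketch (my illustration of the training signal, not VG-HuBERT's exact loss):

```python
import torch
import torch.nn.functional as F

def speech_image_infonce(speech_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    speech_emb, image_emb: (batch, dim); row i of each is a matched pair.
    """
    s = F.normalize(speech_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = s @ v.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(s.shape[0])      # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```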
In this paper, we introduce:
a new task - spatial audio question answering
a new dataset - SpatialSoundQA
a new audio encoder - Spatial-AST
a new model - BAT
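The "classic signal processing" side of BAT's symbiosis refers to spatial cues such as inter-channel time differences, which a spatial audio encoder can exploit. An illustrative, self-contained GCC-PHAT time-delay estimator for a two-microphone pair (a sketch of the classic technique, not BAT's code):

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the time delay of y relative to x via GCC-PHAT.

    The phase transform whitens the cross-spectrum so the estimate is
    driven by phase (timing) rather than magnitude.
    """
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)  # delay in seconds

# A delay of d seconds between two mics spaced m metres apart implies an
# arrival angle theta with sin(theta) = d * 343.0 / m (speed of sound ~343 m/s).
```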
Join us for the Dynamic-SUPERB call-for-tasks event. Submit your innovative task to challenge speech foundation models that can understand task instructions. Let's push the boundaries of what speech foundation models can do!
Breaking!
Hugging Face's open-source reproduction of Stability's closed-source style-controlled text-to-speech model 𝐏𝐚𝐫𝐥𝐞𝐫-𝐓𝐓𝐒 is out 🎉🎉🎉
The released v0.1 is trained on 10k hours of data; the forthcoming v1 will be trained on 50k hours
Very interesting work by @Sanabria_RST and collaborators at Edinburgh, showing how difficult ASR is in real life (even for just English!).
An important work for democratizing speech technologies.
Very interesting: two papers both titled "Audio Mamba" came out from different institutions at (almost) the same time
Korean Audio Mamba:
Danish Audio Mamba:
Presenting 2 works at
#ICLR
tomorrow!
📄 Generative Pre-training for Speech with Flow Matching
📍 5/9 (Wed) Hall B #68, 10:45am-12:45pm
📄 Listen, Think, and Understand
📍 5/9 (Wed) Hall B #60, 4:30pm-6:30pm
Please stop by if you're interested! More details...👇
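For context on the first paper's framework: in its simplest linear-path form (the paper's exact variant may differ), the conditional flow-matching objective is just a regression of a velocity field onto straight-line displacements between noise and data:

```latex
x_t = (1-t)\,x_0 + t\,x_1, \qquad
x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\text{data}},\; t \sim \mathcal{U}[0,1]

\mathcal{L}_{\text{CFM}} = \mathbb{E}_{x_0, x_1, t}
\Big[\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2\Big]
```

Sampling then integrates dx/dt = v_theta(x, t) from t = 0 to 1, starting from Gaussian noise.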
@WenhuChen
Maybe it depends on the type of work you do. If your project involves serious model training and you only have limited, academic-level compute, having multiple projects ongoing might be a good idea to make better use of your time
EnCodec training code is also open-sourced and very easy to use!
Finally, everyone can train an audio tokenizer on their own data with their preferred model config
Today we open source the training code for our audio generation and compression research in AudioCraft and share new models.
With this release, we aim to give people the full recipe to play with our models and develop their own models!
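For the inference side, the pip-installable `encodec` package already makes tokenizing audio a few lines; the new AudioCraft release adds the training recipe on top. A usage sketch with the pretrained 24 kHz model (file path is mine):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz model; the target bandwidth picks how many residual
# codebooks are used (6 kbps -> 8 codebooks of 1024 entries here).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("speech.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)  # list of (codes, scale) pairs per chunk
codes = torch.cat([codes for codes, _ in frames], dim=-1)
print(codes.shape)  # (1, n_codebooks, n_steps): your discrete audio tokens
```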
Impressive!
EAT outperforms BEATs and AudioMAE on audio classification while being an order of magnitude more efficient.
Consider replacing your audio encoder with EAT!
🎉 Excited to share our latest work on audio pre-training: EAT - Efficient Audio Transformer!
🚀 Achieving SOTA on AudioSet-2M, AudioSet-20K, ESC-50, and SpeechCommands-v2, EAT boosts pre-training efficiency by 15x compared to previous models.
Robust speech models are no longer robust on this real-life English speech dataset (including robust w2v2 and Whisper!)
An interesting work and an exciting direction!
We show that the visually grounded self-supervised speech model VG-HuBERT exhibits emergent word discovery ability from raw speech signals (recovering as much as 40% of the words in a corpus). VG-HuBERT also significantly improves SotA on unsupervised word segmentation on ZeroSpeech and Buckeye.
The amount of support and feature requests is also a bit overwhelming for someone who switched to CS only recently and has only ever written shitty research code.
Excited to introduce SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
📊 SemantiCodec (50 tokens/second or 0.71 kbps) vs. previous methods (200 tokens/second or 2.0 kbps).
💡 Our study also reveals that SemantiCodec tokens hold richer semantic information.
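The bitrate arithmetic behind those numbers: bits/s = tokens/s × bits per token. A quick sanity check (the 1024-way codebook for "previous methods" is my assumption, matching typical EnCodec-style configs):

```python
import math

def kbps(tokens_per_sec, codebook_size):
    # bits/s = tokens/s * bits per token, reported in kbps
    return tokens_per_sec * math.log2(codebook_size) / 1000

print(kbps(200, 1024))   # 2.0 kbps: 200 tok/s at 10 bits each
print(0.71 * 1000 / 50)  # 14.2 bits carried per SemantiCodec token on average
```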
Project led by the amazing Alan Baade! We show that the Masked Autoencoder idea can be applied to the audio spectrogram domain. MAE-AST matches or outperforms prior self-supervised models on audio classification benchmarks, while being 3x faster and requiring 50% less memory
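The core MAE recipe transfers almost unchanged: patchify the spectrogram, drop a large fraction of patches, encode only the visible ones, and reconstruct the rest. A minimal masking sketch (illustrative; MAE-AST's actual architecture and ratios are in the paper):

```python
import torch

def random_mask_patches(patches, mask_ratio=0.75):
    """patches: (batch, n_patches, dim) from a patchified log-mel spectrogram.

    Keeps a random (1 - mask_ratio) subset per example. The encoder only
    sees the kept patches (hence the speed/memory savings); a light decoder
    later receives mask tokens at the dropped positions and reconstructs
    the original patches with an L2 loss.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    shuffle = torch.rand(b, n).argsort(dim=1)  # random permutation per example
    keep_idx = shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep_idx

# Example: 8 clips, 512 patches each, 768-dim patch embeddings
visible, keep_idx = random_mask_patches(torch.randn(8, 512, 768))
print(visible.shape)  # torch.Size([8, 128, 768])
```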
The Visually Grounded Speech paper is about the emergence of syllabic representations in a textless, visually grounded, self-supervised model - VG-HuBERT. In addition to its training language, English, VG-HuBERT can also handle Mandarin, Estonian, French, German...
@rdesh26
@csteinmetz1
@unilightwf
I actually tried ASR as the first exploration of MAE-style pretraining on audio, but it didn't perform as well as w2v2/HuBERT. Maybe I didn't try hard enough. Alan took it over and made it work on audio classification tasks
Episode 45 of The Thesis Review:
Luke Zettlemoyer (@LukeZettlemoyer), "Learning to Map Sentences to Logical Form"
We discuss his PhD thesis on semantic parsing, the evolution of NLP, foundational work on pretraining LLMs, and a lot more!
Prompting Whisper investigates the zero-shot task generalization capabilities of OpenAI's Whisper via prompt engineering. Three tasks are studied: AVSR, code-switched ASR, and unseen speech translation
Paper:
Why did @MetaAI open source their MusicGen model:
Generative models could be unfair competition for artists, but open sourcing them gives everyone a chance to understand, improve, compete, and collaborate with them
A nice perspective!
#GenerativeAI
MMS: Massively Multilingual Speech.
- Can do speech-to-text and text-to-speech in 1,100 languages.
- Can recognize 4,000 spoken languages.
- Code and models available under the CC-BY-NC 4.0 license.
- Half the word error rate of Whisper.
Code+Models:
Paper:
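For a quick taste of the released models, MMS ASR is usable through Hugging Face Transformers. A sketch based on the documented flow (model id and adapter mechanism per the MMS release; file path is mine):

```python
import torch
import torchaudio
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"  # the 1,100-language ASR checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Languages are switched by swapping the tokenizer vocab + a small adapter.
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

wav, sr = torchaudio.load("clip.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)

inputs = processor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(processor.decode(logits.argmax(dim=-1)[0]))  # CTC greedy decode
```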
@reach_vb
This is extremely exciting!
Given that many of these models can do multi-speaker TTS or even zero-shot TTS, is there any plan to benchmark those capabilities?
Looking Backward: Streaming Video-to-Video Translation with Feature Banks
This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods that use batches to process a limited number of frames,
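The "feature bank" of the title is a rolling cache of past-frame attention features that each new frame can attend to, which keeps outputs temporally consistent without batching frames. A toy sketch of the mechanism (my illustration of the idea, not StreamV2V's actual code):

```python
import torch

class FeatureBank:
    """Rolling cache of past-frame keys/values for streaming generation."""

    def __init__(self, max_frames=4):
        self.max_frames = max_frames
        self.keys, self.values = [], []

    def extend(self, k, v):
        # k, v: (tokens, dim) attention features of the current frame
        self.keys.append(k)
        self.values.append(v)
        self.keys = self.keys[-self.max_frames:]    # keep a bounded window
        self.values = self.values[-self.max_frames:]
        return torch.cat(self.keys), torch.cat(self.values)

def banked_attention(q, k, v, bank):
    """Current frame attends to itself plus the cached past frames."""
    k_all, v_all = bank.extend(k, v)
    attn = torch.softmax(q @ k_all.t() / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v_all

# Usage: one bank per attention layer, fed frame by frame as video streams in.
bank = FeatureBank(max_frames=4)
q = k = v = torch.randn(256, 64)  # toy single-frame features
out = banked_attention(q, k, v, bank)
```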
🚨BREAKING🚨
American Idol, vocal titan, frontman of Queen, superstar 𝐀𝐝𝐚𝐦 𝐋𝐚𝐦𝐛𝐞𝐫𝐭 just joined the Chinese singing competition
#Singer2024
🔥🔥🔥
How far can he go?
I'm thrilled to share that I graduated from UT Austin advised by Dr. Kristen Grauman and joined @StanfordSVL as a postdoc working with @drfeifei and @eadeli on multimodal perception and generation for humans, building on my thesis research on multimodal 3D scene understanding!🎓
#CVPR2023
❗Here are 6️⃣ interesting new papers you should know about from Meta AI, and how you can access them if you're not at the conference.
🧵
We present MusicGen: A simple and controllable music generation model. MusicGen can be prompted by both text and melody.
We release code (MIT) and models (CC-BY-NC) for open research, reproducibility, and for the music community:
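For reference, generating with the released models via the audiocraft package looks like this (paths and prompt text are mine; the second call shows the text + melody prompting mentioned above):

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=8)  # seconds of audio to generate

# Text-only prompting:
wav = model.generate(["lo-fi beat with warm piano"])

# Text + melody prompting: condition on the chroma of a reference tune.
melody, sr = torchaudio.load("melody.wav")
wav = model.generate_with_chroma(["lo-fi beat with warm piano"], melody[None], sr)

audio_write("musicgen_out", wav[0].cpu(), model.sample_rate, strategy="loudness")
```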
Personal update: I'll be starting in July 2024 as an Assistant Professor @UWCheritonCS and a Faculty Member @VectorInst! Looking forward to working with all the amazing folks!
Prospective students: if you are interested in NLP and/or comp. linguistics, please consider applying!
@begusgasper
Thanks for the great resource on language learning from raw audio! Babies acquire language also through joint learning from auditory and visual input, so here's a shameless plug for our work on word discovery in visually grounded speech models
@YGongND
Whisper is such a legendary model, with *emergent capabilities* still being discovered - emergent prompted zero-shot generalization (as reported in our work), and emergent audio understanding revealed in Gong's work
@inf800
We evaluated VoiceCraft on internet videos and podcasts, which feature diverse accents; the model handles them pretty well. Check out examples at
Here's my conversation with Mark Zuckerberg, his 3rd time on the podcast, but this time we talked in the Metaverse as photorealistic avatars. This was one of the most incredible experiences of my life. It really felt like we were talking in-person, but we were miles apart 🤯 It's
Episode 46 of The Thesis Review:
Yulia Tsvetkov (@tsvetshop), "Linguistic Knowledge in Data-Driven NLP"
We discuss Yulia's PhD work that combined ideas from linguistics and NLP, low-resource and multilingual NLP, plus a lot of great advice!
We tested Whisper on 🎬 audio-visual speech recognition, 🔀 code-switched speech recognition, and 🌐 speech translation on unseen language pairs.
Results are surprising and reveal fascinating 𝐞𝐦𝐞𝐫𝐠𝐞𝐧𝐭 𝐩𝐫𝐨𝐩𝐞𝐫𝐭𝐢𝐞𝐬 of Whisper!
@unilightwf
So true. I hope more "we investigate the interesting phenomenon of" papers receive the same appreciation as "we achieve SotA performance on" papers in the speech community (if both are solid research)
@SanyuanChenAI is the first/co-first author of 𝐖𝐚𝐯𝐋𝐌, 𝐁𝐄𝐀𝐓𝐬, and 𝐕𝐀𝐋𝐋-𝐄. He recently graduated with his PhD and 2,400 citations, and joined Meta
Sanyuan only has 2 followers on X (including me). Does he hold the record for the highest citation-to-follower ratio?
For AVSR, CLIP can be Whisper's eyes, allowing it to transcribe speech in videos more accurately
For CS-ASR and ST, changing just one special token in the prompt can boost performance significantly (e.g., by 45%)
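Concretely, "prompting" here means choosing the special tokens Whisper decodes with. With the openai-whisper package, the task and language tokens are exposed as `transcribe` options, so a prompt change is one argument change (the paper's exact prompt configurations for each task are more involved):

```python
import whisper  # openai-whisper

model = whisper.load_model("large")

# Default decoding: Whisper picks the language/task tokens itself.
default = model.transcribe("audio.wav")

# Prompted decoding: force specific task/language special tokens,
# e.g. steering translation behavior on an unseen setting.
prompted = model.transcribe("audio.wav", task="translate", language="de")
print(prompted["text"])
```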
It's extremely surprising that the English-trained VG-HuBERT can be directly applied to syllabic and word segmentation on unseen languages and achieve SotA, without any adaptation.
It could be useful for democratizing speech tech for zero-resource languages