🚨We investigate a new problem in our latest preprint: training multi-speaker TTS with speaker-anonymized data! The goal is to protect privacy in the era of giant speech generation models.
📰Paper:
🎵Samples:
🧵A thread:
Speech researchers... you need to read this masterpiece. It's phenomenal.
A Large-Scale Evaluation of Speech Foundation Models
All kudos to the mighty Leo
@leo19941227
, father of the great S3PRL toolkit!
Speech synthesis researchers, let's all read this work.
"It is therefore of utmost importance for speech synthesis researchers to be mindful of their choices of comparison systems and how they may affect MOS results."
Stop using Tacotron as your baseline.
🚨🚨The Singing Voice Conversion Challenge 2023 summary paper is out!
TL;DR:
✅Human level naturalness achieved by top teams!
❌Conversion similarity: still a long way to go!
Kudos to the team 🙌🙌
@lesterphv
@jiatongshi
@shaunliu231
Very sad to find that this new paper () on expressive S2ST did not cite my internship work in the same direction…
Anyway, I’m presenting this work at ICASSP next week. Come if you are interested!
My name is Wen-Chin Huang, a Ph.D. student at Nagoya Univ. I work on speech processing and deep learning. Interned twice at Meta. I speak Mandarin, English, and Japanese. I am on the 2024 job market, looking for research positions in Japan. Please reach out to me if you are interested!
Today, we presented the summary papers for the Singing Voice Conversion Challenge 2023 and the VoiceMOS Challenge 2023 at ASRU2023! Fully immersed in the great discussions 🗣️
Spoiler: Pretty certain that there will be future editions for these two challenges. Stay tuned ⚡️⚡️
It seems to be no coincidence that some of the strongest leaders in AI who manage large teams frequently do very low-level technical work.
Jeff Dean doing weekly IC (individual contributor) work while managing 3k+ people at Google Research is the canonical example, but I've…
Thank you for coming to my posters at ICASSP2023! Got reminded of how fun talking about research is (maybe more fun than doing it!). Not going to Dublin, so see you at ASRU, which will be in Taiwan 🇹🇼🇹🇼🇹🇼
🚨The Amazing
@erica_cooper
and I wrote an invited review paper on synthetic speech evaluation! Seriously ALL KUDOS to her 🙌🙌🙌 I cannot respect her more for doing so much survey on this topic 🔥
It is published in AST, a Japanese journal. Take a look!
"StyleTTS 2 advances the state-of-the-art by achieving a statistically significant CMOS of +1.07 (p ≪ 0.01) compared to NaturalSpeech."
I don't quite get it... If NaturalSpeech has human-level naturalness, what is "more natural than human"??
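For context, CMOS is a paired A/B comparison on a −3…+3 scale, so +1.07 means raters preferred one system by roughly one category on average. A rough sketch of how it's typically computed (the ratings below are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical CMOS ratings: each rater hears a pair (system A vs. system B)
# and scores from -3 (A much worse than B) to +3 (A much better than B).
scores = np.array([2, 1, 1, 0, 2, 1, 1, 2, 0, 1, 1, 2])  # made-up data

cmos = scores.mean()                   # mean preference for system A
t, p = stats.ttest_1samp(scores, 0.0)  # test whether the preference differs from 0
print(f"CMOS = {cmos:+.2f}, p = {p:.3g}")
```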
What… don’t use web-scale data? NVIDIA, the GPU manufacturer, tells us to cut GPU usage…?
“Oh no! OWSM required 64 A100 40GB GPUs to train! High resource requirement! We proposed to just train on…
128 A100 80GB GPUs 😀😀😀😀”
Sorry, my bad, I misunderstood you.
"Less is More: Accurate Speech Recognition & Translation without Web-Scale Data," Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin,…
🔥Fresh and HOT🔥
S3PRL-VC now has a HuggingFace Space demo!!
You can record your own voice and convert it to one of the four pre-defined speakers. Personally, I find it particularly interesting when the input is not English 😎
I am releasing the standalone version of S3PRL-VC!
S3PRL-VC aims to provide a platform to compare different self-supervised speech pretrained representations (S3PR?) in the application of voice conversion. Please try it out!
Repo:
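If you haven't used S3PRL before, extracting upstream representations looks roughly like this (my recollection of the s3prl hub interface; see the repo for exact usage):

```python
import torch
import s3prl.hub as hub

# Load a self-supervised upstream model from the S3PRL hub
upstream = getattr(hub, "wav2vec2")()
upstream.eval()

# Dummy 10-second utterance at 16 kHz; replace with your own waveform
wavs = [torch.randn(160000)]

with torch.no_grad():
    outputs = upstream(wavs)

# One (batch, frames, dim) tensor per layer; pick or combine layers for VC
for i, h in enumerate(outputs["hidden_states"]):
    print(f"layer {i}: {tuple(h.shape)}")
```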
💡New preprint on non-autoregressive sequence-to-sequence voice conversion (non-AR seq2seq VC) ‼️
We made seq2seq VC training fast and simple, and it works with as little as 5 minutes of parallel data!
Demo:
Code:
Paper:
Daily promotion. Now with a flyer!
Announcing the VoiceMOS Challenge 2024! VMC'24 has been accepted as a special session at SLT2024. There will be 3 tracks. The challenge tentatively starts on 4/10.
Registration form:
Website:
🚨With
@erica_cooper
we released the summary of the VoiceMOS Challenge 2023!
We focused on 0-SHOT evaluation of 3 domains:
🇫🇷 French TTS
🎵 Singing voice conversion
💥 Noisy/enhanced speech
If you are attending ASRU, consider coming to our special session!
Meanwhile, both papers were honored to be selected among the top 3% of papers at ASRU2023🎖️
Although we did not win the best paper award, we are grateful for the recognition from the reviewers and the TPC 🙌
🔥We warmly invite you to participate in the VoiceMOS Challenge 2024!
Yet another paper outperforming NaturalSpeech2...
"We conducted all subjective tests using 11 native judgers, with each metric consisting of 20 sentences per speaker."
How on earth did you get such low stds with only 11 listeners??
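A quick back-of-envelope sketch (all numbers made up) of why those stds look suspicious: with 11 listeners × 20 sentences you get 220 ratings, so even a large per-rating std shrinks to a tiny standard error of the mean, and the two are easy to confuse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 11 listeners x 20 sentences, 5-point MOS ratings
n_listeners, n_sentences = 11, 20
ratings = rng.integers(3, 6, size=(n_listeners, n_sentences))  # scores in {3, 4, 5}

mos = ratings.mean()
std = ratings.std(ddof=1)          # std of individual ratings: stays large (~0.8)
sem = std / np.sqrt(ratings.size)  # standard error of the mean: much smaller

print(f"MOS = {mos:.2f}, std = {std:.2f}, SEM = {sem:.2f}")
# A tiny reported "std" may actually be a standard error, not the rating std.
```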
🔥It’s official! Together with
@erica_cooper
we are holding the VoiceMOS challenge 2023! This year IT’S REAL… be prepared to tackle some real-world scenarios, with three tracks focusing on the Blizzard and singing voice conversion challenges, as well as speech enhancement data!
Announcing the VoiceMOS Challenge 2023!
Challenge website:
Register to participate:
This edition of the challenge will focus on real-world and challenging zero-shot out-of-domain mean opinion score prediction!
It’s really sad to see Google publishing papers like this. I really feel they just want to beat VALL-E: almost all experiments compare against it, as if it were the only TTS system worth comparing to right now.
The VoiceMOS Challenge, which I co-organized with
@erica_cooper
from
@yamagishilab
, was accepted as a special session at INTERSPEECH 2022! We are still welcoming participants so please don't hesitate to register!
My internship at Meta ended perfectly with a lovely dinner with my manager and his family (his wife and daughters, 1.5yo&4yo). A bit chaotic, a bit sweet, just like my internship!
After reading the reviews of the INTERSPEECH submissions in our lab, just can't wait to see how good those highly rated accepted papers are!! 😎😎😎 They've got to be novel, original, full of technical breakthroughs, and of course STATE-OF-THE-ART!!! 😎😎😎😎😎😎
I swear I’ve seen 10+ TTS papers with titles like this, just saying that it’s fast and lightweight without mentioning what technique was used. I’d rather read papers with bad jokes in the title.
So far all the prompt-based VC papers use VALL-E-style (or AudioLM-style, whatever you like) LLMs, so they all fall back to ASR+TTS-based VC. The only difference then becomes how the dataset is constructed. It's really getting boring.
After three years, the voice conversion challenge (VCC) is back🤩 This time we’re focusing on *singing* voice 🎤 hoping to attract
#SpeechProc
and
#MusicProc
people🔥 Registration starts today, see for more details 🙌🙌
We are proud to announce the first Singing Voice Conversion Challenge (SVCC2023)! Building on the success of the previous VCC events, we plan to further push the limits of voice conversion by now focusing on singing voices, which are more difficult to model than speech.
Being free from all the INTERSPEECH preparations (trip arrangements, poster/slides, etc.) gives me plenty of time to get an early lead on my ICASSP paper: currently at 2.5/4 pages 🎉
Check out our new paper on foreign accent conversion (FAC)! Accepted to APSIPA ASC 2023 🇹🇼
Demo:
Code:
Paper:
We found that none of the three most recent FAC methods is superior to the others 🤔
I will be presenting the joint research efforts on speech quality assessment by Erica and me over the past two years! Contents will include the latest results from VoiceMOS 2023. 🙌
Dr. Erica Cooper (National Institute of Informatics, Japan) & Mr. Wen-Chin Huang (Nagoya University, Japan) are the founders and organizers of the VoiceMOS Challenge, a shared task challenge for automatic opinion score prediction for synthesized speech.
ASR+TTS VC suffers from error propagation & cannot preserve non-verbal content, but IMO it's not the mainstream approach rn. Maybe that's why they did not provide any reference? S-MOS results not as good as expected. Also wanted to see the difference w.r.t. LM-VC.
📢I am releasing a standalone toolkit for seq2seq voice conversion.
For now, this repo only supports reproducing the results of my Voice Transformer Network paper, but the plan is to release code for my other works based on this seq2seq VC model.
Repo:
What I like about the
#SpeechProc
confs is that we encourage research rather than just pick out the “good” research. The ~50% accept rate makes the confs more like a platform for sharing research progress than a contest of “ah, this guy survived the brutal review process”.
I like this paper except that it was submitted to ACL... This team (no offense, they do very solid speech/audio research) likes to submit papers to the so-called “top conferences” (ICLR, NeurIPS…) where reviewers IMO probably know little about speech/audio...
MS, May 2022: our TTS is finally indistinguishable from natural recording! yay!
Also MS, Jan 2023: we're comparing our new TTS system, VALL-E, with a solid baseline! Its name is... YourTTS!!
As a speech synthesis guy: (1) signal metrics (PESQ, etc.) are not accepted as perceptual metrics by synthesis people; (2) I really hope to see TTS included in the applications rather than just ASR, ASV...; (3) IMO dataset mismatch is the biggest problem in most codecs, but it is not evaluated...
To my audio friends: is there an official definition of the real-time factor? I always thought that RTF < 1 means faster than real-time, but the EnCodec paper gave the complete opposite definition... (c.f., Sec. 4.6)
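For reference, the convention I've always assumed, sketched below: RTF = wall-clock processing time divided by the duration of the audio processed, so RTF < 1 is faster than real time (the `process_fn` here is a hypothetical stand-in):

```python
import time

def real_time_factor(process_fn, audio_duration_sec: float) -> float:
    """RTF under the common convention:
    RTF = wall-clock processing time / audio duration.
    RTF < 1.0 means the system runs faster than real time."""
    start = time.perf_counter()
    process_fn()  # e.g., synthesize or encode one utterance
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_sec

# Hypothetical usage: a 10-second utterance processed in 2 seconds
# rtf = real_time_factor(lambda: codec.encode(wav), audio_duration_sec=10.0)
# -> rtf == 0.2, i.e., 5x faster than real time
```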
I knew it... someday this work would come out... I don't know if they are aware that there is something called unit selection (or concatenation) in speech synthesis...
The Singing Voice Conversion Challenge (SVCC) 2023 is still open for registration! Come join us to push the boundary of singing voice conversion, the intersection between
#SpeechProc
and
#MusicProc
. See more details and register today at
"The existing wav2vec-based VC proposals, however, only use the last-layer wav2vec representations."
"...state-of-the-art, any-to-any VC models, AdaIN-VC, FragmentVC and S2VC."
Sigh...
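For the record, the standard alternative (popularized by SUPERB/S3PRL) is a learned weighted sum over all layers rather than just the last one; a minimal sketch:

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learnable weighted sum over all hidden layers of an SSL model,
    instead of taking only the last-layer representation."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):  # list of (batch, frames, dim) tensors
        stacked = torch.stack(hidden_states, dim=0)        # (layers, B, T, D)
        norm = torch.softmax(self.weights, dim=0)          # one weight per layer
        return (norm.view(-1, 1, 1, 1) * stacked).sum(0)   # (B, T, D)

# Usage with, e.g., 13 layers of dummy features:
layers = [torch.randn(2, 50, 768) for _ in range(13)]
fused = WeightedLayerSum(13)(layers)  # -> (2, 50, 768)
print(fused.shape)
```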
Today I will present my 3rd first-author paper at
@ieeeICASSP
:
LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech
Paper:
Thu, 12 May, 23:00 - 23:45 Singapore Time (UTC +8) @ Gather Area K
Come to our poster!
Attempts to re-train open-source VC toolkits on JVS, a Japanese corpus:
knn-vc ('23): bad quality
FreeVC ('22): bad conversion similarity
Diff-HierVC ('23): incomplete files, cannot even start training
DiffVC ('21): cannot synthesize meaningful speech
VQMIVC ('21): somehow okay