Research Scientist Intern at FAIR Meta. CS PhD student @UTAustin. Into speech/audio recognition and generation. Previously @uchicago Stats, @BNU_Official Math.
Announcing 𝐕𝐨𝐢𝐜𝐞𝐂𝐫𝐚𝐟𝐭💪
SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc.
VoiceCraft works on in-the-wild data such as movies, random videos and podcasts
We fully open source it at
The model weights are up!
We uploaded the biggest model (830M) and also a smaller model (330M).
Repo:
Run ./inference_tts.ipynb or ./inference_speech_editing.ipynb in the repo to try inference
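If you'd rather skim the idea than open a notebook: VoiceCraft is a neural codec language model, so zero-shot TTS amounts to encoding a short reference recording into discrete codec tokens and letting the model continue them for new text. A conceptual sketch only; `load_voicecraft`, `load_codec_tokenizer`, and their arguments are hypothetical placeholders, not the repo's actual API (the notebooks above are the real entry points):

```python
# Hypothetical sketch of codec-LM zero-shot TTS, NOT VoiceCraft's real API.
import torchaudio

model = load_voicecraft("830M")           # hypothetical loader
codec = load_codec_tokenizer()            # hypothetical EnCodec-style tokenizer

# A few seconds of the target voice serve as the acoustic prompt.
ref_wav, sr = torchaudio.load("reference.wav")
prompt_tokens = codec.encode(ref_wav, sr)

# The LM continues the prompt tokens, conditioned on the new transcript.
text = "VoiceCraft works on in-the-wild data."
gen_tokens = model.generate(text=text, audio_prompt=prompt_tokens)

# Decode the generated tokens back to a waveform.
torchaudio.save("generated.wav", codec.decode(gen_tokens), 16000)
```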
🎙️Whisper is essentially an audio-conditioned LLM. Can we prompt it to do unseen tasks?
🚀 Introducing PromptingWhisper!
We use simple prompts to adapt Whisper to unseen tasks zero-shot without any finetuning.
📄 Paper:
💻 Code:
Check out SLAM-LLM👇
It's a one-stop shop for using LLMs for all kinds of audio tasks: ASR, TTS, audio tagging, audio captioning, spatial audio reasoning, music, and more!
Why is Whisper so robust to background noise? Not because Whisper suppresses it, but because Whisper 𝐮𝐧𝐝𝐞𝐫𝐬𝐭𝐚𝐧𝐝𝐬 it!
Check out the amazing work by Yuan Gong @YGongND. They reveal this emergent capability of Whisper, and get SOTA *simultaneous* ASR + audio tagging
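You can probe the mechanism yourself: Whisper's frozen encoder output already carries information about background sounds, so even a linear classifier on top can do audio tagging. A minimal sketch with the openai-whisper package; the linear probe at the end is my illustration of the idea, not Gong's actual Whisper-AT head:

```python
import torch
import whisper  # openai-whisper

model = whisper.load_model("base")

# Standard Whisper front end: load, pad/trim to 30 s, log-mel features.
audio = whisper.load_audio("noisy_speech.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Frozen encoder features: (1, 1500, d_model) for the base model.
with torch.no_grad():
    feats = model.encoder(mel.unsqueeze(0))

# Illustrative linear probe for audio tagging; it would be trained on
# e.g. the 527 AudioSet classes. Whisper-AT's real head is more involved.
probe = torch.nn.Linear(feats.shape[-1], 527)
tag_logits = probe(feats.mean(dim=1))  # pool over time, then classify
```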
Best tutorial on diffusion models that I've seen, along with
Also, the two are complementary: the former uses the discrete Markov chain perspective, while the latter uses the ODE perspective (also more involved)
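For anyone who wants the punchline without watching: in the discrete-time view, the whole method reduces to a closed-form noising process plus a denoising regression (standard DDPM notation):

```latex
% Forward (noising) process: closed form at any step t
q(x_t \mid x_0) = \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\,x_0,\; (1-\bar\alpha_t)\,I\big),
\qquad \bar\alpha_t = \textstyle\prod_{s=1}^{t} \alpha_s

% Training: regress the noise out of a noised sample
\mathcal{L} = \mathbb{E}_{x_0,\,t,\,\epsilon \sim \mathcal{N}(0,I)}
\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\|^2\Big]
```

The ODE view instead transports samples along the probability-flow ODE, dx = [f(x,t) - (1/2) g(t)^2 grad_x log p_t(x)] dt, which is why it is the more involved of the two perspectives.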
Per several requests, here are the PowerPoint slides for the diffusion model tutorial.
1️⃣ Training:
2️⃣ Guidance:
3️⃣ Resolution:
4️⃣ Speed:
Explainer video:
VoiceCraft will be presented at ACL 2024 🔥🔥🔥
Since its release, we've added a significant number of features requested by the community, with the help of the community!
Try the interactive demo, Jupyter notebooks, command line, finetuning, and more, all at
Meet 𝐁𝐀𝐓🦇! An LLM that can perceive and reason about 𝐒𝐩𝐚𝐭𝐢𝐚𝐥 𝐒𝐨𝐮𝐧𝐝𝐬 in a 3D world 🔊🔊🔊
Accepted by ICML 2024, it's a nice symbiosis between classic signal processing and LLMs, led by the incredible @zszheng147
Website, paper, code, data:
𝐕𝐨𝐢𝐜𝐞𝐂𝐫𝐚𝐟𝐭 works well on recordings with diverse accents, emotions, styles, content, background noise, and recording conditions.
Demo:
Paper:
Code, model, data:
Introducing AV-SUPERB, a new benchmark for audio-visual models.
We found that audio-visual ML models are highly domain-specific: none of the existing models can master all tasks in our benchmark.
A lot of headroom for audio-visual research!
Paper:
VALL-E 2 is out 🔥🔥
VALL-E caused a 𝐩𝐚𝐫𝐚𝐝𝐢𝐠𝐦 𝐬𝐡𝐢𝐟𝐭 in TTS, but two issues remained: repetition and slow sampling. One year later, the same author, my friend and colleague @SanyuanChenAI, fixed the issues brilliantly in VALL-E 2
Excited to present my two papers on 𝗣𝗿𝗼𝗺𝗽𝘁𝗶𝗻𝗴 𝗪𝗵𝗶𝘀𝗽𝗲𝗿 and 𝗩𝗶𝘀𝘂𝗮𝗹𝗹𝘆 𝗚𝗿𝗼𝘂𝗻𝗱𝗲𝗱 𝗦𝗽𝗲𝗲𝗰𝗵 tomorrow (Aug 21st) at Interspeech. Both papers will be presented in *Forum Poster Area 4* during 11:00-13:00
#INTERSPEECH2023
Delighted to have 3 papers accepted to Interspeech:
𝐖𝐡𝐢𝐬𝐩𝐞𝐫 𝐩𝐫𝐨𝐦𝐩𝐭𝐢𝐧𝐠 for unseen tasks
Zero-shot cross-lingual hierarchical 𝐬𝐩𝐞𝐞𝐜𝐡 𝐬𝐞𝐠𝐦𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧
𝐕𝐢𝐝𝐞𝐨 𝐤𝐧𝐨𝐰𝐥𝐞𝐝𝐠𝐞 𝐚𝐜𝐪𝐮𝐢𝐬𝐢𝐭𝐢𝐨𝐧 for robotic manipulation
Paper & Code up soon
Introducing VG-HuBERT! Trained by matching English speech and images, it shows emergent syllable and word segmentation.
Surprisingly, VG-HuBERT generalizes zero-shot cross-lingually: it is also SotA on Estonian, Chinese, French, and German.
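"Trained by matching English speech and images" means a contrastive objective: embeddings of paired speech and images are pulled together, mismatched pairs pushed apart. A minimal symmetric InfoNCE sketch (my illustration of the training signal, not VG-HuBERT's exact loss):

```python
import torch
import torch.nn.functional as F

def speech_image_infonce(speech_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    speech_emb, image_emb: (batch, dim); row i of each is a matched pair.
    """
    s = F.normalize(speech_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = s @ v.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(s.shape[0])      # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```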
In this paper, we introduce:
a new task - spatial audio question answering
a new dataset - SpatialSoundQA
a new audio encoder - Spatial-AST
a new model - BAT
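The "classic signal processing" side of BAT's symbiosis refers to spatial cues such as inter-channel time differences, which a spatial audio encoder can exploit. An illustrative, self-contained GCC-PHAT time-delay estimator for a two-microphone pair (a sketch of the classic technique, not BAT's code):

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the time delay of y relative to x via GCC-PHAT.

    The phase transform whitens the cross-spectrum so the estimate is
    driven by phase (timing) rather than magnitude.
    """
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)  # delay in seconds

# A delay of d seconds between two mics spaced m metres apart implies an
# arrival angle theta with sin(theta) = d * 343.0 / m (speed of sound ~343 m/s).
```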
Join us for the Dynamic-SUPERB call-for-tasks event. Submit your innovative task to challenge speech foundation models that can understand task instructions. Let's push the boundaries of what speech foundation models can do!
Breaking!
Hugging Face's open-source reproduction of Stability's closed-source style-controlled text-to-speech model 𝐏𝐚𝐫𝐥𝐞𝐫-𝐓𝐓𝐒 is out 🎉🎉🎉
The released v0.1 is trained on 10k hours of data; the forthcoming v1 will be trained on 50k hours
Very interesting work by @Sanabria_RST and collaborators at Edinburgh, showing how difficult ASR is in real life (even for just English!).
An important work for democratizing speech technologies.
Very interesting: two papers both titled "Audio Mamba" came out from different institutions at (almost) the same time
Korean Audio Mamba:
Danish Audio Mamba:
Presenting 2 works at
#ICLR
tomorrow!
📄 Generative Pre-training for Speech with Flow Matching
📍 5/9 (Wed) Hall B #68, 10:45am-12:45pm
📄 Listen, Think, and Understand
📍 5/9 (Wed) Hall B #60, 4:30pm-6:30pm
Please stop by if you're interested! More details...👇
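For context on the first paper's framework: in its simplest linear-path form (the paper's exact variant may differ), the conditional flow-matching objective is just a regression of a velocity field onto straight-line displacements between noise and data:

```latex
x_t = (1-t)\,x_0 + t\,x_1, \qquad
x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\text{data}},\; t \sim \mathcal{U}[0,1]

\mathcal{L}_{\text{CFM}} = \mathbb{E}_{x_0, x_1, t}
\Big[\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2\Big]
```

Sampling then integrates dx/dt = v_theta(x, t) from t = 0 to 1, starting from Gaussian noise.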
@WenhuChen
Maybe it depends on the type of work you do. If your project involves serious model training and you only have limited, academic-level compute, having multiple projects ongoing might be a good idea to make better use of your time
EnCodec training code is also open-sourced and very easy to use!
Finally, everyone can train an audio tokenizer on their own data with their preferred model config
Today we open source the training code for our audio generation and compression research in AudioCraft and share new models.
With this release, we aim to give people the full recipe to play with our models and develop their own models!
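For the inference side, the pip-installable `encodec` package already makes tokenizing audio a few lines; the new AudioCraft release adds the training recipe on top. A usage sketch with the pretrained 24 kHz model (file path is mine):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz model; the target bandwidth picks how many residual
# codebooks are used (6 kbps -> 8 codebooks of 1024 entries here).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("speech.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)  # list of (codes, scale) pairs per chunk
codes = torch.cat([codes for codes, _ in frames], dim=-1)
print(codes.shape)  # (1, n_codebooks, n_steps): your discrete audio tokens
```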
Impressive!
EAT outperforms BEATs and AudioMAE on audio classification while being an order of magnitude more efficient.
Consider replacing your audio encoder with EAT!
🎉 Excited to share our latest work on audio pre-training: EAT - Efficient Audio Transformer!
🚀 Achieving SOTA on AudioSet-2M, AudioSet-20K, ESC-50, and SpeechCommands-v2, EAT boosts pre-training efficiency by 15x compared to previous models.
Robust speech models are no longer robust on this real-life English speech dataset (including robust w2v2 and Whisper!)
An interesting work and an exciting direction!
We show that the visually grounded self-supervised speech model VG-HuBERT exhibits emergent word discovery ability from raw speech signals (recovering as much as 40% of the words in a corpus). VG-HuBERT also significantly improves SotA on unsupervised word segmentation on ZeroSpeech and Buckeye.
The amount of support and feature requests is also a bit overwhelming for someone who switched to CS only recently and has only ever written shitty research code.
Excited to introduce SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
📊 SemantiCodec (50 tokens/second or 0.71 kbps) vs. previous methods (200 tokens/second or 2.0 kbps).
💡 Our study also reveals that SemantiCodec tokens hold richer semantic information.
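The bitrate arithmetic behind those numbers: bits/s = tokens/s × bits per token. A quick sanity check (the 1024-way codebook for "previous methods" is my assumption, matching typical EnCodec-style configs):

```python
import math

def kbps(tokens_per_sec, codebook_size):
    # bits/s = tokens/s * bits per token, reported in kbps
    return tokens_per_sec * math.log2(codebook_size) / 1000

print(kbps(200, 1024))   # 2.0 kbps: 200 tok/s at 10 bits each
print(0.71 * 1000 / 50)  # 14.2 bits carried per SemantiCodec token on average
```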
Project led by the amazing Alan Baade! We show that the Masked Autoencoder idea can be applied to the audio spectrogram domain. MAE-AST matches or outperforms prior self-supervised models on audio classification benchmarks, while being 3x faster and requiring 50% less memory
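The core MAE recipe transfers almost unchanged: patchify the spectrogram, drop a large fraction of patches, encode only the visible ones, and reconstruct the rest. A minimal masking sketch (illustrative; MAE-AST's actual architecture and ratios are in the paper):

```python
import torch

def random_mask_patches(patches, mask_ratio=0.75):
    """patches: (batch, n_patches, dim) from a patchified log-mel spectrogram.

    Keeps a random (1 - mask_ratio) subset per example. The encoder only
    sees the kept patches (hence the speed/memory savings); a light decoder
    later receives mask tokens at the dropped positions and reconstructs
    the original patches with an L2 loss.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    shuffle = torch.rand(b, n).argsort(dim=1)  # random permutation per example
    keep_idx = shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep_idx

# Example: 8 clips, 512 patches each, 768-dim patch embeddings
visible, keep_idx = random_mask_patches(torch.randn(8, 512, 768))
print(visible.shape)  # torch.Size([8, 128, 768])
```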
The Visually Grounded Speech paper is about the emergence of syllabic representations in a textless, visually grounded, self-supervised model - VG-HuBERT. In addition to its training language, English, VG-HuBERT can also handle Mandarin, Estonian, French, German...
@rdesh26
@csteinmetz1
@unilightwf
I actually tried ASR as the first exploration of MAE-style pretraining on audio, but it didn't perform as well as w2v2/HuBERT. Maybe I didn't try hard enough. Alan took it over and made it work on audio classification tasks
Episode 45 of The Thesis Review:
Luke Zettlemoyer (@LukeZettlemoyer), "Learning to Map Sentences to Logical Form"
We discuss his PhD thesis on semantic parsing, the evolution of NLP, foundational work on pretraining LLMs, and a lot more!
Prompting Whisper investigates the zero-shot task generalization capabilities of OpenAI's Whisper via prompt engineering. Three tasks are studied: AVSR, code-switched ASR, and unseen speech translation
Paper:
Why did @MetaAI open source their MusicGen model:
Generative models could be unfair competition for artists, but open sourcing them gives everyone a chance to understand, improve, compete, and collaborate with them
A nice perspective!
#GenerativeAI
MMS: Massively Multilingual Speech.
- Can do speech-to-text and text-to-speech in 1,100 languages.
- Can recognize 4,000 spoken languages.
- Code and models available under the CC-BY-NC 4.0 license.
- Half the word error rate of Whisper.
Code+Models:
Paper:
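For a quick taste of the released models, MMS ASR is usable through Hugging Face Transformers. A sketch based on the documented flow (model id and adapter mechanism per the MMS release; file path is mine):

```python
import torch
import torchaudio
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"  # the 1,100-language ASR checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Languages are switched by swapping the tokenizer vocab + a small adapter.
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

wav, sr = torchaudio.load("clip.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)

inputs = processor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(processor.decode(logits.argmax(dim=-1)[0]))  # CTC greedy decode
```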
@reach_vb
This is extremely exciting!
Given that many of these models can do multi-speaker TTS or even zero-shot TTS, is there any plan to benchmark those capabilities?
Looking Backward: Streaming Video-to-Video Translation with Feature Banks
This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods that use batches to process a limited number of frames,
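The "feature bank" of the title is a rolling cache of past-frame attention features that each new frame can attend to, which keeps outputs temporally consistent without batching frames. A toy sketch of the mechanism (my illustration of the idea, not StreamV2V's actual code):

```python
import torch

class FeatureBank:
    """Rolling cache of past-frame keys/values for streaming generation."""

    def __init__(self, max_frames=4):
        self.max_frames = max_frames
        self.keys, self.values = [], []

    def extend(self, k, v):
        # k, v: (tokens, dim) attention features of the current frame
        self.keys.append(k)
        self.values.append(v)
        self.keys = self.keys[-self.max_frames:]    # keep a bounded window
        self.values = self.values[-self.max_frames:]
        return torch.cat(self.keys), torch.cat(self.values)

def banked_attention(q, k, v, bank):
    """Current frame attends to itself plus the cached past frames."""
    k_all, v_all = bank.extend(k, v)
    attn = torch.softmax(q @ k_all.t() / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v_all

# Usage: one bank per attention layer, fed frame by frame as video streams in.
bank = FeatureBank(max_frames=4)
q = k = v = torch.randn(256, 64)  # toy single-frame features
out = banked_attention(q, k, v, bank)
```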
🚨BREAKING🚨
American Idol, vocal titan, frontman of Queen, superstar 𝐀𝐝𝐚𝐦 𝐋𝐚𝐦𝐛𝐞𝐫𝐭 just joined the Chinese singing competition
#Singer2024
🔥🔥🔥
How far can he go?
I'm thrilled to share that I graduated from UT Austin advised by Dr. Kristen Grauman and joined @StanfordSVL as a postdoc working with @drfeifei and @eadeli on multimodal perception and generation for humans, building on my thesis research on multimodal 3D scene understanding!🎓
#CVPR2023
❗Here are 6️⃣ interesting new papers you should know about from Meta AI, and how you can access them if you're not at the conference.
🧵
We present MusicGen: A simple and controllable music generation model. MusicGen can be prompted by both text and melody.
We release code (MIT) and models (CC-BY-NC) for open research, reproducibility, and for the music community:
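For reference, generating with the released models via the audiocraft package looks like this (paths and prompt text are mine; the second call shows the text + melody prompting mentioned above):

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=8)  # seconds of audio to generate

# Text-only prompting:
wav = model.generate(["lo-fi beat with warm piano"])

# Text + melody prompting: condition on the chroma of a reference tune.
melody, sr = torchaudio.load("melody.wav")
wav = model.generate_with_chroma(["lo-fi beat with warm piano"], melody[None], sr)

audio_write("musicgen_out", wav[0].cpu(), model.sample_rate, strategy="loudness")
```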
Personal update: I'll be starting in July 2024 as an Assistant Professor @UWCheritonCS and a Faculty Member @VectorInst! Looking forward to working with all the amazing folks!
Prospective students: if you are interested in NLP and/or comp. linguistics, please consider applying!
@begusgasper
Thanks for the great resource on language learning from raw audio! Babies acquire language also through joint learning from auditory and visual input, so here's a shameless plug for our work on word discovery in visually grounded speech models
@YGongND
Whisper is such a legendary model, with *emergent capabilities* still being discovered - emergent prompted zero-shot generalization (as reported in our work), and emergent audio understanding revealed in Gong's work
@inf800
We evaluated VoiceCraft on internet videos and podcasts, which feature diverse accents; the model handles them pretty well. Check out examples at
Here's my conversation with Mark Zuckerberg, his 3rd time on the podcast, but this time we talked in the Metaverse as photorealistic avatars. This was one of the most incredible experiences of my life. It really felt like we were talking in-person, but we were miles apart 🤯 It's
Episode 46 of The Thesis Review:
Yulia Tsvetkov (@tsvetshop), "Linguistic Knowledge in Data-Driven NLP"
We discuss Yulia's PhD work that combined ideas from linguistics and NLP, low-resource and multilingual NLP, plus a lot of great advice!
We tested Whisper on 🎬 audio-visual speech recognition, 🔀 code-switched speech recognition, and 🌐 speech translation on unseen language pairs.
Results are surprising and reveal fascinating 𝐞𝐦𝐞𝐫𝐠𝐞𝐧𝐭 𝐩𝐫𝐨𝐩𝐞𝐫𝐭𝐢𝐞𝐬 of Whisper!
@unilightwf
So true. I hope more "we investigate the interesting phenomenon of" papers receive the same appreciation as "we achieve SotA performance on" papers in the speech community (if both are solid research)
@SanyuanChenAI is the first/co-first author of 𝐖𝐚𝐯𝐋𝐌, 𝐁𝐄𝐀𝐓𝐬, and 𝐕𝐀𝐋𝐋-𝐄. He recently graduated with his PhD and 2,400 citations, and joined Meta
Sanyuan only has 2 followers on X (including me). Does he hold the record for the highest citation-to-follower ratio?
For AVSR, CLIP can be Whisper's eyes, allowing it to transcribe speech in videos more accurately
For CS-ASR and ST, changing just one special token in the prompt can boost performance significantly (e.g., by 45%)
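Concretely, "prompting" here means choosing the special tokens Whisper decodes with. With the openai-whisper package, the task and language tokens are exposed as `transcribe` options, so a prompt change is one argument change (the paper's exact prompt configurations for each task are more involved):

```python
import whisper  # openai-whisper

model = whisper.load_model("large")

# Default decoding: Whisper picks the language/task tokens itself.
default = model.transcribe("audio.wav")

# Prompted decoding: force specific task/language special tokens,
# e.g. steering translation behavior on an unseen setting.
prompted = model.transcribe("audio.wav", task="translate", language="de")
print(prompted["text"])
```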
It's extremely surprising that the English-trained VG-HuBERT can be directly applied to syllabic and word segmentation on unseen languages and achieve SotA, without any adaptation.
It could be useful for democratizing speech tech for zero-resource languages