Puyuan Peng Profile Banner
Puyuan Peng Profile
Puyuan Peng

@PuyuanPeng

1,123 Followers · 497 Following · 14 Media · 235 Statuses

Research Scientist Intern at FAIR, Meta. CS PhD student @UTAustin. Into speech/audio recognition and generation. Previously @uchicago Stats, @BNU_Official Math

New York, USA
Joined December 2019
Pinned Tweet
@PuyuanPeng
Puyuan Peng
4 months
Announcing VoiceCraft 🪄: SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc. VoiceCraft works on in-the-wild data such as movies, random videos, and podcasts. We fully open source it at
30
140
687
@PuyuanPeng
Puyuan Peng
3 months
The model weights are up! We uploaded the biggest model (830M) and a smaller model (330M). Repo: run ./inference_tts.ipynb or ./inference_speech_editing.ipynb in the folder to try inference
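For anyone who prefers to script that instead of opening Jupyter by hand, here is a minimal sketch that executes the TTS notebook headlessly with nbconvert (this assumes the repo, whose URL is elided above, is already cloned and its dependencies installed):

import subprocess

# Execute VoiceCraft's TTS inference notebook headlessly; the notebook name is
# taken from the tweet, the output filename is our own choice.
subprocess.run(
    ["jupyter", "nbconvert", "--to", "notebook", "--execute",
     "--output", "inference_tts_out.ipynb", "inference_tts.ipynb"],
    check=True,
)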
@PuyuanPeng
Puyuan Peng
4 months
Announcing VoiceCraft 🪄: SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc. VoiceCraft works on in-the-wild data such as movies, random videos, and podcasts. We fully open source it at
30
140
687
7
65
349
@PuyuanPeng
Puyuan Peng
3 months
We open sourced it 10 days ago, and it already has 3.1k stars 🚀🚀🚀
@PuyuanPeng
Puyuan Peng
4 months
Announcing VoiceCraft 🪄: SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc. VoiceCraft works on in-the-wild data such as movies, random videos, and podcasts. We fully open source it at
30
140
687
11
46
250
@PuyuanPeng
Puyuan Peng
1 year
๐ŸŽ™๏ธWhisper is essentially an audio-conditioned LLM. Can we prompt it to do unseen tasks? ๐Ÿš€ Introducing PromptingWhisper! We use simple prompts to adapt Whisper to unseen tasks zero-shot without any finetuning. ๐Ÿ“„ Paper: ๐Ÿ’ป Code:
2
26
153
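To make the prompting idea concrete, here is a minimal sketch using the openai-whisper package; the audio filename and prompt text are hypothetical, and this illustrates the general mechanism the paper studies rather than reproducing the paper's code:

import whisper

# Whisper's decoder is conditioned on special tokens (language, task) plus an
# optional text prompt; changing them steers the model zero-shot, no finetuning.
model = whisper.load_model("base")
result = model.transcribe(
    "example.wav",                        # hypothetical audio file
    language="zh",                        # sets the language special token
    task="transcribe",                    # or "translate"
    initial_prompt="以下是普通话的句子。",   # text prepended to the decoder context
)
print(result["text"])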
@PuyuanPeng
Puyuan Peng
2 months
Check out SLAM-LLM 🚀 It's a one-stop shop for using LLMs for all kinds of audio tasks: ASR, TTS, audio tagging, audio captioning, spatial audio reasoning, music, and more!
2
35
141
@PuyuanPeng
Puyuan Peng
1 year
Why is Whisper so robust to background noise? Not because Whisper suppresses it, but because Whisper understands it! Check out the amazing work by Yuan Gong @YGongND . They reveal this emergent capability of Whisper, and get SOTA *simultaneous* ASR + audio tagging
4
19
128
@PuyuanPeng
Puyuan Peng
2 months
Best tutorial on diffusion models that I've seen, along with . The two are complementary: the former uses the discrete Markov chain perspective, while the latter uses the ODE perspective (also more involved). Both views are sketched below this thread.
@jbhuang0604
Jia-Bin Huang
2 months
Per several requests, here are the PowerPoint slides for the diffusion model tutorial. 1️⃣ Training: 2️⃣ Guidance: 3️⃣ Resolution: 4️⃣ Speed: Explainer video:
2
48
229
1
19
86
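Since the tutorial links are elided above, here is the generic textbook statement of the two views, for orientation (standard formulations, not necessarily the notation used in either tutorial):

\[
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)
\qquad \text{(discrete Markov chain / DDPM view)}
\]
\[
\frac{dx}{dt} = f(x,t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)
\qquad \text{(probability-flow ODE view)}
\]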
@PuyuanPeng
Puyuan Peng
2 months
VoiceCraft will be presented at ACL 2024 🔥🔥🔥 Since its release, we've added a significant number of features requested by the community, with the help of the community! Try the interactive demo, jupyter notebooks, command line, finetuning, and more, all at
@PuyuanPeng
Puyuan Peng
4 months
Announcing VoiceCraft 🪄: SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc. VoiceCraft works on in-the-wild data such as movies, random videos, and podcasts. We fully open source it at
30
140
687
3
25
74
@PuyuanPeng
Puyuan Peng
2 months
Meet BAT 🦇! An LLM that can perceive and reason about Spatial Sounds in a 3D world 🚀🚀🚀 Accepted at ICML 2024, it's a nice symbiosis between classic signal processing and LLMs, led by the incredible @zszheng147 Website, paper, code, data:
2
4
49
@PuyuanPeng
Puyuan Peng
4 months
๐•๐จ๐ข๐œ๐ž๐‚๐ซ๐š๐Ÿ๐ญ works well on recordings with diverse accents, emotions, styles, content, background noise, recording conditions. Demo: Paper: code, model, data:
2
8
49
@PuyuanPeng
Puyuan Peng
10 months
Introducing AV-SUPERB, a new benchmark for audio-visual models. We found that audio-visual ML models are highly domain-specific: none of the existing models can master all tasks in our benchmark. A lot of headroom for audio-visual research! Paper:
1
8
36
@PuyuanPeng
Puyuan Peng
18 days
VALL-E 2 is out 🔥🔥 VALL-E caused a paradigm shift in TTS, but two issues remained: repetition and slow sampling. One year later, the same author, my friend and colleague @SanyuanChenAI , fixed the issues brilliantly in VALL-E 2
3
10
47
@PuyuanPeng
Puyuan Peng
11 months
Excited to present my two papers, on Prompting Whisper and Visually Grounded Speech, tomorrow (Aug 21st) at Interspeech. Both papers will be presented in *Forum Poster Area 4* during 11:00-13:00 #INTERSPEECH2023
2
1
37
@PuyuanPeng
Puyuan Peng
4 months
Excited to announce that I'll be interning at FAIR this summer with Wei-Ning Hsu @mhnt1580 , the author of HuBERT and Audiobox!
3
0
33
@PuyuanPeng
Puyuan Peng
4 months
@GozukaraFurkan No training needed. To clone or edit a voice, it only needs a 3-second reference of that voice during inference. A hypothetical sketch of this flow is below.
1
5
29
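A purely hypothetical sketch of that inference-time flow; the names below are invented for illustration and are not VoiceCraft's actual API (the repo's inference notebooks are the real entry points):

import torchaudio

def clone_voice(model, ref_wav_path, ref_transcript, target_text):
    # Hypothetical interface: condition generation on ~3 s of reference audio
    # plus its transcript, then continue decoding with the new target text.
    wav, sr = torchaudio.load(ref_wav_path)  # the ~3-second voice reference
    return model.generate(wav, sr, ref_transcript, target_text)  # hypothetical call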
@PuyuanPeng
Puyuan Peng
1 year
Delighted to have 3 papers accepted to Interspeech: Whisper prompting for unseen tasks; zero-shot cross-lingual hierarchical speech segmentation; video knowledge acquisition for robotic manipulation. Paper & code up soon
2
0
31
@PuyuanPeng
Puyuan Peng
1 year
Introducing VG-HuBERT! Trained by matching English speech and images, it shows emergent syllable and word segmentation. Surprisingly, VG-HuBERT generalizes zero-shot cross-lingually: also SotA on Estonian, Chinese, French, German.
1
3
28
@PuyuanPeng
Puyuan Peng
2 months
In this paper, we introduce:
a new task - spatial audio question answering
a new dataset - SpatialSoundQA
a new audio encoder - Spatial-AST
a new model - BAT
@PuyuanPeng
Puyuan Peng
2 months
Meet BAT 🦇! An LLM that can perceive and reason about Spatial Sounds in a 3D world 🚀🚀🚀 Accepted at ICML 2024, it's a nice symbiosis between classic signal processing and LLMs, led by the incredible @zszheng147 Website, paper, code, data:
2
4
49
1
5
25
@PuyuanPeng
Puyuan Peng
2 months
This is a BIG campaign for benchmarking generalized speech/audio models 🔥🔥🔥 Join us to welcome the GPT era for speech and audio 🚀🚀🚀
@HungyiLee2
Hung-yi Lee (李宏毅)
2 months
Join us for the Dynamic-SUPERB call-for-tasks event. Submit your innovative task to challenge the speech foundation models that can understand task instruction. Let's push the boundaries of what speech foundation models can do!
1
19
75
0
3
25
@PuyuanPeng
Puyuan Peng
3 months
Breaking! Hugging Face's open-source reproduction of Stability's closed-source style-controlled text-to-speech model Parler-TTS is out 🚀🚀🚀 The released v0.1 is trained on 10k hours of data; the forthcoming v1 will be trained on 50k hours
0
0
23
@PuyuanPeng
Puyuan Peng
1 year
Very interesting work by @Sanabria_RST and collaborators at Edinburgh, showing how difficult ASR is in real life (even for just English!). An important work for democratizing speech technologies.
3
4
19
@PuyuanPeng
Puyuan Peng
1 month
Very interesting: two papers, both titled "Audio Mamba", came out from different institutions at (almost) the same time. Korean Audio Mamba: Danish Audio Mamba:
@ArxivSound
arXiv Sound
1 month
``Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations,'' Sarthak Yadav, Zheng-Hua Tan,
1
3
42
1
1
16
@PuyuanPeng
Puyuan Peng
2 months
Cool works from the MIT speech group to be presented at ICLR
@alex_h_liu
Alexander H. Liu
2 months
Presenting 2 works at #ICLR tomorrow! 📃 Generative Pre-training for Speech with Flow Matching 📍 5/9 (Wed) Hall B #68 , 10:45am-12:45pm 📃 Listen, Think, and Understand 📍 5/9 (Wed) Hall B #60 , 4:30pm-6:30pm Please stop by if you're interested! More details...👇
2
12
61
1
3
16
@PuyuanPeng
Puyuan Peng
2 years
Had a nice beer after work
0
1
15
@PuyuanPeng
Puyuan Peng
2 months
@WenhuChen Maybe it depends on the type of work you do. If your project involves serious model training and you only have limited academic-level compute, having multiple ongoing projects might be a good idea to make better use of your time
1
0
15
@PuyuanPeng
Puyuan Peng
11 months
EnCodec training code is also open-sourced and very easy to use! Finally everyone can train an audio tokenizer on their own data with their preferred model config. A minimal tokenization sketch follows below.
@jadecopet
Jade Copet
11 months
Today we open source the training code for our audio generation and compression research in AudioCraft and share new models. With this release, we aim at giving people the full recipe to play with our models and develop their own models!
4
26
146
0
1
14
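As a quick taste of what an audio tokenizer does, here is a minimal inference sketch using the transformers port of EnCodec (this is not the AudioCraft training recipe the quoted tweet is about; the checkpoint name is the public facebook/encodec_24khz):

import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

audio = torch.randn(24000).numpy()  # stand-in for 1 s of real 24 kHz audio
inputs = processor(raw_audio=audio, sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
# encoded.audio_codes holds the discrete token ids, one stream per residual codebook
print(encoded.audio_codes.shape)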
@PuyuanPeng
Puyuan Peng
5 months
Impressive! EAT outperforms BEATs and AudioMAE on audio classification, while being an order of magnitude more efficient. Consider replacing your audio encoder with EAT!
@zszheng147
Zhisheng Zheng
6 months
🌟 Excited to share our latest work on audio pre-training: EAT - Efficient Audio Transformer! 🚀 Achieving SOTA on AudioSet-2M, AudioSet-20K, ESC-50, and SpeechCommands-v2, EAT boosts pre-training efficiency by 15x compared to previous models.
2
3
16
1
1
14
@PuyuanPeng
Puyuan Peng
1 year
Robust speech models are no longer robust on this real-life English speech dataset (including robust w2v2 and Whisper!) An interesting work and an exciting direction!
0
0
14
@PuyuanPeng
Puyuan Peng
2 years
This work has been accepted by Interspeech 2022! The code and model weights are available at . See you guys in Incheon in September!
@PuyuanPeng
Puyuan Peng
2 years
We show that a visually grounded, self-supervised speech model, VG-HuBERT, exhibits an emergent word discovery ability from raw speech signals (discovering as much as 40% of the words in a corpus). VG-HuBERT also significantly improves the SotA on unsupervised word segmentation on ZeroSpeech and Buckeye.
0
0
8
0
0
14
@PuyuanPeng
Puyuan Peng
4 months
@oleg__chomp Making it multilingual is our ongoing work
0
0
10
@PuyuanPeng
Puyuan Peng
11 months
@wjwwilliams @ISCAInterspeech @Speechmatics speechmatics has been doing amazingly well so far! Nothing cooler than being the live ASR provider for a top speech research conference!!
1
0
9
@PuyuanPeng
Puyuan Peng
3 months
the amount of support and feature requests is also a bit overwhelming for someone who switched to CS only recently, and has only ever written shitty research code.
0
0
8
@PuyuanPeng
Puyuan Peng
2 years
We show that a visually grounded, self-supervised speech model, VG-HuBERT, exhibits an emergent word discovery ability from raw speech signals (discovering as much as 40% of the words in a corpus). VG-HuBERT also significantly improves the SotA on unsupervised word segmentation on ZeroSpeech and Buckeye.
@Arxiv_Daily
arXiv Daily
2 years
Word Discovery in Visually Grounded, Self-Supervised Speech Models by @PuyuanPeng et al. #Computation #Language
0
0
5
0
0
8
@PuyuanPeng
Puyuan Peng
2 months
Amazing work Haohe! Explicitly leveraging semantic representations gives you a huge bump in both expressiveness and compressibility. The per-token arithmetic behind the quoted bitrates is checked below.
@LiuHaohe
Haohe Liu
2 months
Excited to introduce SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound 🎉 SemantiCodec (50 tokens/second or 0.71kbps) ≈ previous methods (200 tokens/second or 2.0kbps). 🎉 Our study also reveals that SemantiCodec tokens hold richer semantic information.
7
26
126
1
0
8
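The quoted figures imply a per-token bit budget, which is easy to sanity-check (arithmetic on the numbers in the quoted tweet only):

# bitrate = tokens/second x bits/token, so bits/token = bitrate / token rate
for name, tokens_per_sec, kbps in [("SemantiCodec", 50, 0.71),
                                   ("previous methods", 200, 2.0)]:
    bits_per_token = kbps * 1000 / tokens_per_sec
    print(f"{name}: {bits_per_token:.1f} bits/token")
# SemantiCodec: 14.2 bits/token; previous methods: 10.0 bits/token,
# i.e. a 4x lower token rate at the cost of slightly larger tokens.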
@PuyuanPeng
Puyuan Peng
4 months
@reach_vb @rdesh26 Yea let's do that!
0
0
7
@PuyuanPeng
Puyuan Peng
3 months
The 330M model in many cases performs just as well as the 830M one
1
0
7
@PuyuanPeng
Puyuan Peng
2 years
Our MAE-AST paper has been accepted by Interspeech 2022! Paper , code and model weights
@PuyuanPeng
Puyuan Peng
2 years
Project led by the amazing Alan Baade! We show that the Masked Autoencoder idea can be applied to the audio spectrogram domain. MAE-AST matches or outperforms prior self-supervised models on audio classification benchmarks, while being 3x faster and requiring 50% less memory. A toy sketch of the recipe follows below.
0
0
6
1
1
7
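To show the shape of the masked-autoencoder recipe the tweet refers to, here is a toy sketch on spectrogram patches (a generic MAE paraphrase, not the MAE-AST implementation; the sizes, mask ratio, and omission of positional embeddings are simplifications):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySpectrogramMAE(nn.Module):
    def __init__(self, patch_dim=256, dim=192, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(layer, num_layers=1)  # light decoder
        self.predict = nn.Linear(dim, patch_dim)

    def forward(self, patches):  # patches: (batch, num_patches, patch_dim)
        B, N, _ = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N).argsort(dim=1)
        keep, drop = perm[:, :n_keep], perm[:, n_keep:]
        take = lambda x, i: torch.gather(x, 1, i.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        latent = self.encoder(self.embed(take(patches, keep)))  # encode visible patches only
        mask = self.mask_token.expand(B, N - n_keep, -1)        # placeholders for masked slots
        decoded = self.decoder(torch.cat([latent, mask], dim=1))
        recon = self.predict(decoded[:, n_keep:])               # predict the masked patches
        return F.mse_loss(recon, take(patches, drop))

loss = ToySpectrogramMAE()(torch.randn(2, 64, 256))  # smoke test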
@PuyuanPeng
Puyuan Peng
1 month
Awesome work @ardasnck ! Couldn't think of a better domain for Mamba than audio
@ardasnck
Arda SENOCAK
1 month
Thanks for sharing our work @_akhaliq 🤩 Code is coming very soon 🐍🎙️
2
2
26
1
0
7
@PuyuanPeng
Puyuan Peng
1 month
Adam Lambert is coming to compete on Singer 2024. Will he sing Believe in the opening episode?
0
0
6
@PuyuanPeng
Puyuan Peng
11 months
The Visually Grounded Speech paper is about the emergence of syllabic representations in a textless, visually grounded, self-supervised model, VG-HuBERT. In addition to its training language, English, VG-HuBERT can also handle Mandarin, Estonian, French, German...
0
0
6
@PuyuanPeng
Puyuan Peng
2 years
Project led by the amazing Alan Baade! We show that the Masked Autoencoder idea can be applied to the audio spectrogram domain. MAE-AST matches or outperforms prior self-supervised models on audio classification benchmarks, while being 3x faster and requiring 50% less memory
@Arxiv_Daily
arXiv Daily
2 years
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer by Alan Baade et al. including @PuyuanPeng #Autoencoder #Computation
0
3
9
0
0
6
@PuyuanPeng
Puyuan Peng
4 months
@realmrfakename Thanks! Have been discussing the licensing issue, might change it in the coming days
0
0
6
@PuyuanPeng
Puyuan Peng
3 months
The repo also supports model training
0
0
6
@PuyuanPeng
Puyuan Peng
29 days
For the first time I felt confident about the essay topic; a pity it's nine years too late
1
0
7
@PuyuanPeng
Puyuan Peng
1 year
@rdesh26 @csteinmetz1 @unilightwf I actually tried ASR as the first exploration of MAE-style pretraining on audio, but it didn't perform as well compared to w2v2/HuBERT. Maybe I didn't try hard enough. Alan took it over and made it work on audio classification tasks
2
0
6
@PuyuanPeng
Puyuan Peng
1 year
Absolutely thrilled to see this podcast is still going on (and thriving!) It has been a source of motivation and wisdom for my own PhD journey
@thesisreview
The Thesis Review Podcast
1 year
Episode 45 of The Thesis Review: Luke Zettlemoyer ( @LukeZettlemoyer ), "Learning to Map Sentences to Logical Form" We discuss his PhD thesis on semantic parsing, the evolution of NLP, foundational work on pretraining LLMs, and a lot more!
3
27
139
0
0
5
@PuyuanPeng
Puyuan Peng
11 months
Prompting Whisper investigates the zero-shot task generalization capabilities of OpenAI's Whisper via prompt engineering. Three tasks are studied: AVSR, code-switched ASR, and unseen speech translation. Paper:
1
0
5
@PuyuanPeng
Puyuan Peng
1 year
Why did @MetaAI open source their MusicGen model? Generative models could be unfair competition for artists, but open sourcing them gives everyone a chance to understand, improve, compete, and collaborate with them. A nice perspective! #GenerativeAI
0
0
5
@PuyuanPeng
Puyuan Peng
1 year
The most beautiful world map I've ever seen. Incredible work @MetaAI !
@ylecun
Yann LeCun
1 year
MMS: Massively Multilingual Speech. - Can do speech2text and text2speech in 1100 languages. - Can recognize 4000 spoken languages. - Code and models available under the CC-BY-NC 4.0 license. - Half the word error rate of Whisper. Code+Models: Paper:
177
1K
5K
0
0
5
@PuyuanPeng
Puyuan Peng
4 months
@reach_vb This is extremely exciting! Given that many of these models can do multi-speaker TTS or even zero-shot TTS, any plan on benchmarking these capabilities?
0
0
3
@PuyuanPeng
Puyuan Peng
4 months
@paulo_zip have been discussing the licensing issue, might change it in the coming days
1
0
4
@PuyuanPeng
Puyuan Peng
1 year
super cool
@coqui_ai
coqui
1 year
Listen until the end 😉
0
3
16
0
1
4
@PuyuanPeng
Puyuan Peng
1 month
Check out the amazing streaming V2V model by my friend @LiangJeff95
@_akhaliq
AK
1 month
Looking Backward: Streaming Video-to-Video Translation with Feature Banks This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods using batches to process limited frames,
4
38
159
1
1
4
@PuyuanPeng
Puyuan Peng
1 month
🚨BREAKING🚨 American Idol star, vocal titan, frontman of Queen, superstar Adam Lambert just joined the Chinese singing competition #Singer2024 🔥🔥🔥 How far can he go?
1
2
4
@PuyuanPeng
Puyuan Peng
25 days
Congratulations Changan! Bright future ahead 🚀🚀🚀
@changan_vr
Changan Chen
25 days
I'm thrilled to share that I graduated from UT Austin advised by Dr. Kristen Grauman and joined @StanfordSVL as a postdoc working with @drfeifei and @eadeli on multimodal perception and generation for humans, building on my thesis research on multimodal 3D scene understanding! 🚀
5
1
130
0
0
4
@PuyuanPeng
Puyuan Peng
1 year
Very cool! 3 out of 6 highlighted papers are from collaborations between UT and Meta. UT Vision, let's go!!
@AIatMeta
AI at Meta
1 year
#CVPR2023: Here are 6️⃣ interesting new papers you should know about from Meta AI, and how you can access them if you're not at the conference. 🧵
12
94
450
0
0
4
@PuyuanPeng
Puyuan Peng
1 year
Kudos to Meta for open sourcing everything! Can't wait to try out the code 👻
@FelixKreuk
Felix Kreuk
1 year
We present MusicGen: A simple and controllable music generation model. MusicGen can be prompted by both text and melody. We release code (MIT) and models (CC-BY NC) for open research, reproducibility, and for the music community:
36
423
2K
0
0
4
@PuyuanPeng
Puyuan Peng
9 months
@jiatongshi Awesome news! Congrats to the team and the multilingual speech processing community!
0
0
3
@PuyuanPeng
Puyuan Peng
1 year
Awesome news! Congrats also to @kevingimpel and Karen! Keep the brilliant work coming @fredahshi
@fredahshi
Freda Shi
1 year
Personal update: I'll be starting in July 2024 as an Assistant Professor @UWCheritonCS and a Faculty Member @VectorInst ! Looking forward to working with all the amazing folks! Prospective students: if you are interested in NLP and/or comp. linguistics, please consider applying!
33
20
322
0
0
3
@PuyuanPeng
Puyuan Peng
9 months
@amypavel @mina1004h @yolohao Big congratulations on getting the UIST Best paper awards two years in a row!
0
0
3
@PuyuanPeng
Puyuan Peng
1 year
@begusgasper Thanks for the great resource on language learning from raw audio! Babies also acquire language through joint learning from auditory and visual input. Therefore, a shameless plug for our work on word discovery in visually grounded speech models
1
0
3
@PuyuanPeng
Puyuan Peng
11 months
@unilightwf None of VALL-E, SPEAR-TTS, NaturalSpeech 2 is open-sourced; how were they compared?
1
0
3
@PuyuanPeng
Puyuan Peng
1 year
@YGongND Whisper is such a legendary model, with *emergent capabilities* that keep being discovered: emergent prompted zero-shot generalization (as reported in our work ), and the emergent audio understanding capability revealed in Gong's work
1
0
3
@PuyuanPeng
Puyuan Peng
4 months
@wagieeacc Weights will be available by the end of March
1
0
3
@PuyuanPeng
Puyuan Peng
4 months
@ahmedashrafay will be available by the end of March
0
0
2
@PuyuanPeng
Puyuan Peng
4 months
@inf800 We evaluated VoiceCraft on internet videos and podcasts, which contain diverse accents; the model handles them pretty well. Check out examples at
0
0
2
@PuyuanPeng
Puyuan Peng
1 year
BatGPT, a disruptive model!
@agihippo
yi ๐Ÿฆ›
1 year
BatGPT from Wuhan University. For real?? 😲
54
92
870
0
0
2
@PuyuanPeng
Puyuan Peng
9 months
incredible
@lexfridman
Lex Fridman
9 months
Here's my conversation with Mark Zuckerberg, his 3rd time on the podcast, but this time we talked in the Metaverse as photorealistic avatars. This was one of the most incredible experiences of my life. It really felt like we were talking in-person, but we were miles apart 🤯 It's
4K
8K
52K
0
0
2
@PuyuanPeng
Puyuan Peng
11 months
Awesome news! It seems that one of my favorite academic podcasts just turned into a biweekly one (from an annual podcast 🍻)
@thesisreview
The Thesis Review Podcast
11 months
Episode 46 of The Thesis Review: Yulia Tsvetkov ( @tsvetshop ), "Linguistic Knowledge in Data-Driven NLP" We discuss Yulia's PhD work that combined ideas from linguistics and NLP, low-resource and multilingual NLP, and a lot of great advice!
1
10
54
0
0
2
@PuyuanPeng
Puyuan Peng
1 year
We tested Whisper on 🎬 audio-visual speech recognition, 🌎 code-switched speech recognition, and 🌍 speech translation on unseen language pairs. The results are surprising and reveal fascinating emergent properties of Whisper!
1
0
1
@PuyuanPeng
Puyuan Peng
3 months
@mzboito Lol, didn't know Psy also attends ICASSP
1
0
2
@PuyuanPeng
Puyuan Peng
1 year
@unilightwf So true. I hope more "we investigate the interesting phenomenon of" papers receive the same appreciation as "we achieve SotA performance on" papers in the speech community (provided both are solid research)
0
0
2
@PuyuanPeng
Puyuan Peng
7 months
0
0
1
@PuyuanPeng
Puyuan Peng
1 year
@SreyanG Congrats! Excited to learn the details!
0
0
1
@PuyuanPeng
Puyuan Peng
2 months
@hermanhwdong @umichsmtd Looking forward to the amazing works from your lab!
0
0
1
@PuyuanPeng
Puyuan Peng
1 year
@anuj_diwan @GoogleAI @DeepMind @ankurbpn Take a swim in the TPU pool 🏊‍♂️
0
0
1
@PuyuanPeng
Puyuan Peng
10 months
@kartik_goyal_ @ICatGT Congrats Kartik!!
0
0
1
@PuyuanPeng
Puyuan Peng
18 days
@SanyuanChenAI is the first/co-first author of WavLM, BEATs, and VALL-E. He recently graduated with his PhD and 2400 citations, and joined Meta. Sanyuan has only 2 followers on X (including me). Does he hold the record for the highest citation/follower ratio?
0
0
1
@PuyuanPeng
Puyuan Peng
1 year
For AVSR, CLIP can be Whisper's eyes, allowing it to transcribe speech in videos more accurately. For CS-ASR and ST, changing just one special token in the prompt can boost performance significantly, e.g., by 45%. A sketch of the token swap follows below.
1
0
1
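To make the one-token swap concrete, here is a minimal sketch with the Hugging Face transformers Whisper API (a generic illustration of swapping the language/task special tokens in the decoder prompt, not the paper's code; the 45% figure above is specific to the paper's setting):

from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Whisper's decoder prompt begins with special tokens such as
# <|startoftranscript|><|zh|><|transcribe|>; changing one token changes the task.
prompt_asr = processor.get_decoder_prompt_ids(language="zh", task="transcribe")
prompt_st = processor.get_decoder_prompt_ids(language="zh", task="translate")

# With input_features from processor(audio, sampling_rate=16000, return_tensors="pt"),
# decode with either prompt:
#   model.generate(input_features, forced_decoder_ids=prompt_asr)
#   model.generate(input_features, forced_decoder_ids=prompt_st)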
@PuyuanPeng
Puyuan Peng
1 year
link to Yuan Gong's paper
0
0
1
@PuyuanPeng
Puyuan Peng
1 year
@shinjiw_at_cmu Congratulations!
0
0
1
@PuyuanPeng
Puyuan Peng
10 months
0
0
1
@PuyuanPeng
Puyuan Peng
1 year
It's extremely surprising that the English-trained VG-HuBERT can be directly applied to syllabic and word segmentation on unseen languages and achieves SotA, without any adaptation. This could be useful for democratizing speech tech for zero-resource languages
1
0
1
@PuyuanPeng
Puyuan Peng
2 months
@giannis_daras Congratulations!
1
0
1
@PuyuanPeng
Puyuan Peng
18 days
This is joint work with folks at Microsoft Research, a legendary group that has produced so many foundational works in speech, NLP, and CV
0
0
1
@PuyuanPeng
Puyuan Peng
3 months
@chenwanch1 @mzboito Is he training a generative model on his own songs :P
0
0
1
@PuyuanPeng
Puyuan Peng
10 months
@PiotrZelasko AudioSet-balanced is a 22,000-example dataset for audio-visual sound event classification
0
0
1
@PuyuanPeng
Puyuan Peng
1 year
@RafaelValleArt Sounds like an amazing opportunity! Are you also looking for research interns?
1
0
1
@PuyuanPeng
Puyuan Peng
1 year
Joint work with Shang-Wen (Daniel) Li @ShangwenLi1 @MetaAI , Okko Räsänen @ojrasanen @TampereUni , Abdelrahman Mohamed @AbdoMohamedML at Rembrand, and David Harwath @UTCompSci
0
0
1
@PuyuanPeng
Puyuan Peng
1 year
Joint work between UT Austin and CMU, with Brian Yan @brianyan918 , Shinji Watanabe @shinjiw_at_cmu , and David Harwath
0
0
1
@PuyuanPeng
Puyuan Peng
1 year
@mzboito Awesome, thanks!
0
0
1
@PuyuanPeng
Puyuan Peng
10 months
@PiotrZelasko Flickr8k spoken captions is a small dataset suitable for the speech-image retrieval task
0
0
1