arXiv Sound @ArxivSound Twitter profile

Pinned Tweet

arXiv Sound

2 years

[IMPORTANT] arXiv Sound does not post some papers submitted to arXiv or . This is because they do not appear in the arXiv RSS feed. We apologize for the inconvenience.

1

0

5

arXiv Sound

@ArxivSound

3 months

``Analyzing Musical Characteristics of National Anthems in Relation to Global Indices,'' S M Rakib Hasan, Aakar Dhakal, Ms. Ayesha Siddiqua, Mohammad Mominur Rahman, Md Maidul Islam, Mohammed Arfat Raihan Chowdhury, S M Masfequier Rahman Swapno, SM Nuruz…

Analyzing Musical Characteristics of National Anthems in Relation...

Music plays a huge part in shaping peoples' psychology and behavioral patterns. This paper investigates the connection between national anthems and different global indices with computational...

arxiv.org

1

224

2K

arXiv Sound

@ArxivSound

4 years

``WaveGrad: Estimating Gradients for Waveform Generation. (arXiv:2009.00713v1 []),'' Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan,

0

22

90

arXiv Sound

@ArxivSound

3 years

``RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. (arXiv:2111.05011v1 [cs.LG]),'' Antoine Caillon, Philippe Esling,

1

3

52

arXiv Sound

@ArxivSound

11 months

``A Review of Differentiable Digital Signal Processing for Music & Speech Synthesis. (arXiv:2308.15422v1 []),'' Ben Hayes, Jordie Shier, György Fazekas, Andrew McPherson, Charalampos Saitis,

1

11

50

arXiv Sound

@ArxivSound

5 months

``OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification,'' Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe,

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech...

There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to...

arxiv.org

0

6

50

arXiv Sound

@ArxivSound

2 months

``Mamba in Speech: Towards an Alternative to Self-Attention,'' Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, Julien Epps,

Mamba in Speech: Towards an Alternative to Self-Attention

Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within...

arxiv.org

5

6

48

arXiv Sound

@ArxivSound

1 month

``LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning,'' Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana,

LibriTTS-P: A Corpus with Speaking Style and Speaker Identity...

We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We...

arxiv.org

1

9

47

arXiv Sound

@ArxivSound

1 month

``XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,'' Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber,

XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS they are limited to...

arxiv.org

0

10

47

arXiv Sound

@ArxivSound

1 year

``Moisesdb: A dataset for source separation beyond 4-stems. (arXiv:2307.15913v1 []),'' Igor Pereira, Felipe Araújo, Filip Korzeniowski, Richard Vogl,

1

23

45

arXiv Sound

@ArxivSound

9 months

``MelHuBERT: A simplified HuBERT on Mel spectrograms. (arXiv:2211.09944v2 [] UPDATED),'' Tzu-Quan Lin, Hung-yi Lee, Hao Tang,

0

4

45

arXiv Sound

@ArxivSound

2 years

``Style Transfer of Audio Effects with Differentiable Signal Processing. (arXiv:2207.08759v1 []),'' Christian J. Steinmetz, Nicholas J. Bryan, Joshua D. Reiss,

1

9

44

arXiv Sound

@ArxivSound

4 months

``Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data,'' Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov,

Extending Multilingual Speech Synthesis to 100+ Languages without...

Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual...

arxiv.org

0

11

42

arXiv Sound

@ArxivSound

1 month

``Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations,'' Sarthak Yadav, Zheng-Hua Tan,

Audio Mamba: Selective State Spaces for Self-Supervised Audio...

Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state...

arxiv.org

1

3

42

arXiv Sound

@ArxivSound

8 months

``WavMark: Watermarking for Audio Generation. (arXiv:2308.12770v2 [] UPDATED),'' Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, Furu Wei,

0

6

38

arXiv Sound

@ArxivSound

3 months

``Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness,'' Xincan Feng, Akifumi Yoshimoto,

Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness

Recent advancements in Natural Language Processing (NLP) have seen Large-scale Language Models (LLMs) excel at producing high-quality text for various purposes. Notably, in Text-To-Speech (TTS)...

arxiv.org

0

11

40

arXiv Sound

@ArxivSound

1 month

``VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers,'' Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei,

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot...

This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first...

arxiv.org

0

6

40

arXiv Sound

@ArxivSound

2 years

``NNSVS: A Neural Network-Based Singing Voice Synthesis Toolkit. (arXiv:2210.15987v1 []),'' Ryuichi Yamamoto, Reo Yoneyama, Tomoki Toda,

0

13

39

arXiv Sound

@ArxivSound

1 month

``YODAS: Youtube-Oriented Dataset for Audio and Speech,'' Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe,

YODAS: Youtube-Oriented Dataset for Audio and Speech

In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset comprising currently over 500k hours of speech data in more than 100...

arxiv.org

0

11

40

arXiv Sound

@ArxivSound

2 years

``Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer. (arXiv:2208.07282v1 []),'' Shahan Nercessian,

0

11

39

arXiv Sound

@ArxivSound

15 days

``mHuBERT-147: A Compact Multilingual HuBERT Model,'' Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu,

mHuBERT-147: A Compact Multilingual HuBERT Model

We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT...

arxiv.org

0

6

39

arXiv Sound

@ArxivSound

3 months

``Long-form music generation with latent diffusion,'' Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons,

Long-form music generation with latent diffusion

Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that by training...

arxiv.org

0

8

38

arXiv Sound

@ArxivSound

1 month

``Autoregressive Diffusion Transformer for Text-to-Speech Synthesis,'' Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, Haizhou Li,

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio...

arxiv.org

0

5

37

arXiv Sound

@ArxivSound

3 months

``WavLLM: Towards Robust and Adaptive Speech Large Language Model,'' Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei,

WavLLM: Towards Robust and Adaptive Speech Large Language Model

The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation....

arxiv.org

0

6

36

arXiv Sound

@ArxivSound

8 months

``Music ControlNet: Multiple Time-varying Controls for Music Generation. (arXiv:2311.07069v1 []),'' Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan,

2

3

34

arXiv Sound

@ArxivSound

1 month

``The Interspeech 2024 Challenge on Speech Processing Using Discrete Units,'' Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin,

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of...

arxiv.org

0

7

33

arXiv Sound

@ArxivSound

2 years

``Diffsound: Discrete Diffusion Model for Text-to-sound Generation. (arXiv:2207.09983v1 []),'' Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu,

0

8

34

arXiv Sound

@ArxivSound

1 month

``Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis,'' Hubert Siuzdak,

Vocos: Closing the gap between time-domain and Fourier-based...

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias...

arxiv.org

0

4

33

arXiv Sound

@ArxivSound

6 months

``Masked Audio Generation using a Single Non-Autoregressive Transformer. (arXiv:2401.04577v1 []),'' Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi,

0

6

33

arXiv Sound

@ArxivSound

4 years

``VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics. (arXiv:2010.02977v1 []),'' Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo,

1

10

32

arXiv Sound

@ArxivSound

10 months

``Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation. (arXiv:2309.08876v1 []),'' Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe,

0

5

32

arXiv Sound

@ArxivSound

11 months

``SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models. (arXiv:2308.16692v1 []),'' Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu,

0

3

31

arXiv Sound

@ArxivSound

4 months

``An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis,'' Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong,

An Empirical Study of Speech Language Models for...

Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning. A typical speech LM takes discrete semantic units as content and a short utterance as...

arxiv.org

0

6

32

arXiv Sound

@ArxivSound

3 years

``MP3net: coherent, minute-long music generation from raw audio with a simple convolutional GAN. (arXiv:2101.04785v1 []),'' Korneel van den Broek,

0

4

31

arXiv Sound

@ArxivSound

2 months

``SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound,'' Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley,

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to...

arxiv.org

0

5

32

arXiv Sound

@ArxivSound

18 days

``Exploring the Capability of Mamba in Speech Applications,'' Koichi Miyazaki, Yoshiki Masuyama, Masato Murata,

0

7

31

arXiv Sound

@ArxivSound

3 years

``Audio representations for deep learning in sound synthesis: A review. (arXiv:2201.02490v1 []),'' Anastasia Natsiou, Sean O'Leary,

0

9

31

arXiv Sound

@ArxivSound

2 months

``Benchmarking Representations for Speech, Music, and Acoustic Events,'' Moreno La Quatra, Alkis Koudounas, Lorenzo Vaiani, Elena Baralis, Luca Cagliero, Paolo Garza, Sabato Marco Siniscalchi,

Benchmarking Representations for Speech, Music, and Acoustic Events

Limited diversity in standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a...

arxiv.org

0

5

31

arXiv Sound

@ArxivSound

10 months

``Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning. (arXiv:2309.13860v2 [] UPDATED),'' Guanrou Yang, Ziyang Ma, Zhisheng Zheng, Yakun Song, Zhikang Niu, Xie Chen,

0

2

30

arXiv Sound

@ArxivSound

6 months

``StreamVC: Real-Time Low-Latency Voice Conversion. (arXiv:2401.03078v1 []),'' Yang Yang, Yury Kartynnik, Yunpeng Li, Jiuqiang Tang, Xing Li, George Sung, Matthias Grundmann,

0

9

30

arXiv Sound

@ArxivSound

2 years

``Multi-instrument Music Synthesis with Spectrogram Diffusion. (arXiv:2206.05408v2 [] UPDATED),'' Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, Jesse Engel,

0

3

30

arXiv Sound

@ArxivSound

8 months

``CREPE Notes: A new method for segmenting pitch contours into discrete notes. (arXiv:2311.08884v1 []),'' Xavier Riley, Simon Dixon,

0

5

30

arXiv Sound

@ArxivSound

2 months

``WavCraft: Audio Editing and Generation with Large Language Models,'' Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos,

WavCraft: Audio Editing and Generation with Large Language Models

We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft...

arxiv.org

0

6

30

arXiv Sound

@ArxivSound

25 days

``How Should We Extract Discrete Audio Tokens from Self-Supervised Models?,'' Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli,

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements,...

arxiv.org

0

4

30

arXiv Sound

@ArxivSound

11 months

``General Purpose Audio Effect Removal. (arXiv:2308.16177v1 []),'' Matthew Rice, Christian J. Steinmetz, George Fazekas, Joshua D. Reiss,

0

4

29

arXiv Sound

@ArxivSound

2 years

``Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform. (arXiv:2210.15975v1 []),'' Masaya Kawamura, Yuma Shirahata, Ryuichi Yamamoto, Kentaro Tachibana,

0

13

29

arXiv Sound

@ArxivSound

1 year

``InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt. (arXiv:2301.13662v1 []),'' Dongchao Yang, Songxiang Liu, Rongjie Huang, Guangzhi Lei, Chao Weng, Helen Meng, Dong Yu,

0

9

29

arXiv Sound

@ArxivSound

2 years

``Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis. (arXiv:2210.15964v1 []),'' Yuma Shirahata, Ryuichi Yamamoto, Eunwoo Song, Ryo Terashima, Jae-Min Kim, Kentaro Tachibana,

0

9

29

arXiv Sound

@ArxivSound

5 months

``Amphion: An Open-Source Audio, Music and Speech Generation Toolkit,'' Xueyao Zhang, Liumeng Xue, Yicheng Gu, Yuancheng Wang, Haorui He, Chaoren Wang, Xi Chen, Zihao Fang, Haopeng Chen, Junan Zhang, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai…

Amphion: An Open-Source Audio, Music and Speech Generation Toolkit

Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is...

arxiv.org

0

6

29

arXiv Sound

@ArxivSound

4 months

``The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data,'' Alice Baird, Rachel Manzelli, Panagiotis Tzirakis, Chris Gagne, Haoqi Li, Sadie Allen, Sander Dieleman, Brian Kulis, Shrikanth S. Narayanan, Alan Cowen,

The NeurIPS 2023 Machine Learning for Audio Workshop: Affective...

The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains. There are several valuable audio-driven ML tasks, from speech emotion...

arxiv.org

0

4

27

arXiv Sound

@ArxivSound

2 years

``Speech Enhancement and Dereverberation with Diffusion-based Generative Models. (arXiv:2208.05830v1 []),'' Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann,

0

4

28

arXiv Sound

@ArxivSound

10 months

``HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation. (arXiv:2210.12740v3 [] UPDATED),'' Chunhui Wang, Chang Zeng, Jun Chen, Xing He,

0

7

29

arXiv Sound

@ArxivSound

12 days

``Less is More: Accurate Speech Recognition & Translation without Web-Scale Data,'' Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin,…

Less is More: Accurate Speech Recognition & Translation...

Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on...

arxiv.org

0

2

28

arXiv Sound

@ArxivSound

10 months

``DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input. (arXiv:2309.07658v1 []),'' Nicolas Jonason, Xin Wang, Erica Cooper, Lauri Juvela, Bob L. T. Sturm, Junichi Yamagishi,

0

7

29

arXiv Sound

@ArxivSound

1 year

@ArxivSound is back!

0

1

29

arXiv Sound

@ArxivSound

3 years

``SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification. (arXiv:2103.16858v1 []),'' Helin Wang, Yuexian Zou, Wenwu Wang,

0

9

29

arXiv Sound

@ArxivSound

1 year

@ArxivSound is back (again)!

0

6

29

arXiv Sound

@ArxivSound

1 year

``Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers. (arXiv:2307.03183v1 []),'' Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass,

0

2

29

arXiv Sound

@ArxivSound

26 days

``MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model,'' Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe,

MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech...

Speech discrete representation has proven effective in various downstream applications due to its superior compression rate of the waveform, fast convergence during training, and compatibility...

arxiv.org

0

5

28

arXiv Sound

@ArxivSound

2 years

``Audio Self-supervised Learning: A Survey. (arXiv:2203.01205v1 []),'' Shuo Liu, Adria Mallol-Ragolta, Emilia Parada-Cabeleiro, Kun Qian, Xin Jing, Alexander Kathan, Bin Hu, Bjoern W. Schuller,

0

4

27

arXiv Sound

@ArxivSound

3 years

``Continual-wav2vec2: an Application of Continual Learning for Self-Supervised Automatic Speech Recognition. (arXiv:2107.13530v1 []),'' Samuel Kessler, Bethan Thomas, Salah Karout,

0

4

28

arXiv Sound

@ArxivSound

2 months

``Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model,'' Siyang Wang, Éva Székely,

Evaluating Text-to-Speech Synthesis from a Large Discrete...

Recent advances in generative language modeling applied to discrete speech tokens presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their...

arxiv.org

0

7

27

arXiv Sound

@ArxivSound

1 month

``Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning,'' Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon D…

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music...

Recent advances in text-to-music editing, which employ text queries to modify music (e.g. by changing its style or adjusting instrumental components), present unique challenges and opportunities...

arxiv.org

0

8

28

arXiv Sound

@ArxivSound

4 months

``Audiosockets: A Python socket package for Real-Time Audio Processing,'' Nicolas Shu, David V. Anderson,

Audiosockets: A Python socket package for Real-Time Audio Processing

There are many packages in Python which allow one to perform real-time processing on audio data. Unfortunately, due to the synchronous nature of the language, there lacks a framework which allows...

arxiv.org

1

3

28

arXiv Sound

@ArxivSound

9 days

``VAE-based Phoneme Alignment Using Gradient Annealing and SSL Acoustic Features,'' Tomoki Koriyama,

VAE-based Phoneme Alignment Using Gradient Annealing and SSL...

This paper presents an accurate phoneme alignment model that aims for speech analysis and video content creation. We propose a variational autoencoder (VAE)-based alignment model in which a...

arxiv.org

0

11

30

arXiv Sound

@ArxivSound

3 months

``Musical Word Embedding for Music Tagging and Retrieval,'' SeungHeon Doh, Jongpil Lee, Dasaem Jeong, Juhan Nam,

Musical Word Embedding for Music Tagging and Retrieval

Word embedding has become an essential means for text-based information retrieval. Typically, word embeddings are learned from large quantities of general and unstructured text data. However, in...

arxiv.org

0

6

27

arXiv Sound

@ArxivSound

5 months

``An Embarrassingly Simple Approach for LLM with Strong ASR Capacity,'' Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen,

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language...

arxiv.org

1

4

27

arXiv Sound

@ArxivSound

2 years

``End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation. (arXiv:2202.11301v1 []),'' Krishna Subramani, Jean-Marc Valin, Umut Isik, Paris Smaragdis, Arvindh Krishnaswamy,

1

8

27

arXiv Sound

@ArxivSound

1 month

``Seed-TTS: A Family of High-Quality Versatile Speech Generation Models,'' Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying…

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a...

arxiv.org

0

1

26

arXiv Sound

@ArxivSound

1 month

``4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders,'' Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe,

4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer,...

End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer...

arxiv.org

0

2

26

arXiv Sound

@ArxivSound

4 months

``NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models,'' Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, …

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec...

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately...

arxiv.org

1

9

25

arXiv Sound

@ArxivSound

5 months

``Diffusion Models for Audio Restoration,'' Jean-Marie Lemercier, Julius Richter, Simon Welker, Eloi Moliner, Vesa Välimäki, Timo Gerkmann,

0

8

26

arXiv Sound

@ArxivSound

3 years

``Neural HMMs are all you need (for high-quality attention-free TTS). (arXiv:2108.13320v3 [] UPDATED),'' Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter,

0

6

27

arXiv Sound

@ArxivSound

2 years

``Neural Vocoder is All You Need for Speech Super-resolution. (arXiv:2203.14941v1 []),'' Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang,

0

4

27

arXiv Sound

@ArxivSound

25 days

``NAST: Noise Aware Speech Tokenization for Speech Language Models,'' Shoval Messica, Yossi Adi,

NAST: Noise Aware Speech Tokenization for Speech Language Models

Speech tokenization is the task of representing speech signals as a sequence of discrete units. Such representations can be later used for various downstream tasks including automatic speech...

arxiv.org

0

6

26

arXiv Sound

@ArxivSound

8 months

``Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation. (arXiv:2311.04693v1 []),'' Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee,

0

5

26

arXiv Sound

@ArxivSound

3 years

``Music Demixing Challenge at ISMIR 2021. (arXiv:2108.13559v1 []),'' Yuki Mitsufuji, Giorgio Fabbro, Stefan Uhlich, Fabian-Robert Stöter,

0

5

25

arXiv Sound

@ArxivSound

3 years

``Tiny Transformers for Environmental Sound Classification at the Edge. (arXiv:2103.12157v1 []),'' David Elliott, Carlos E. Otero, Steven Wyatt, Evan Martino,

0

2

26

arXiv Sound

@ArxivSound

5 months

``BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data,'' Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud J…

BASE TTS: Lessons from building a billion-parameter Text-to-Speech...

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest...

arxiv.org

0

7

26

arXiv Sound

@ArxivSound

5 months

``OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer,'' Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee …

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models...

Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data...

arxiv.org

0

6

26

arXiv Sound

@ArxivSound

1 month

``Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech,'' Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter,

Should you use a probabilistic duration model in TTS? Probably!...

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The...

arxiv.org

0

8

25

arXiv Sound

@ArxivSound

2 years

``u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality. (arXiv:2207.07036v2 [] UPDATED),'' Wei-Ning Hsu, Bowen Shi,

0

4

24

arXiv Sound

@ArxivSound

3 years

``Multimodal Self-Supervised Learning of General Audio Representations. (arXiv:2104.12807v2 [] UPDATED),'' Luyu Wang, Pauline Luc, Adria Recasens, Jean-Baptiste Alayrac, Aaron van den Oord,

0

3

25

arXiv Sound

@ArxivSound

2 months

``EnCodecMAE: Leveraging neural codecs for universal audio representation learning,'' Leonardo Pepino, Pablo Riera, Luciana Ferrer,

EnCodecMAE: Leveraging neural codecs for universal audio...

The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To...

arxiv.org

0

5

25

arXiv Sound

@ArxivSound

3 months

``Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization,'' Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria,

Tango 2: Aligning Diffusion-based Text-to-Audio Generations...

Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by...

arxiv.org

0

1

25

arXiv Sound

@ArxivSound

6 months

``Accent-VITS:accent transfer for end-to-end TTS. (arXiv:2312.16850v2 [] UPDATED),'' Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, Lei Xie,

0

3

24

arXiv Sound

@ArxivSound

1 month

``LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes,'' Trung Dang, David Aponte, Dung Tran, Kazuhito Koishida,

LiveSpeech: Low-Latency Zero-shot Text-to-Speech via...

Prior works have demonstrated zero-shot text-to-speech by using a generative language model on audio tokens obtained via a neural audio codec. It is still challenging, however, to adapt them to...

arxiv.org

0

2

25

arXiv Sound

@ArxivSound

3 years

``Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling. (arXiv:2103.14574v1 []),'' Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Jia Ye, RJ Ryan, Yonghui Wu,

0

7

25

arXiv Sound

@ArxivSound

9 months

``One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition. (arXiv:2310.01688v1 []),'' Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini,

0

4

25

arXiv Sound

@ArxivSound

3 months

``SpeechAlign: Aligning Speech Generation to Human Preferences,'' Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu,

SpeechAlign: Aligning Speech Generation to Human Preferences

Speech language models have significantly advanced in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech...

arxiv.org

0

3

25

arXiv Sound

@ArxivSound

25 days

``GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities,'' Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha,

GAMA: A Large Audio-Language Model with Advanced Audio...

Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel...

arxiv.org

0

6

24

arXiv Sound

@ArxivSound

4 years

``Multiple F0 Estimation in Vocal Ensembles using Convolutional Neural Networks. (arXiv:2009.04172v1),'' Helena Cuesta, Brian McFee, Emilia Gómez,

0

5

23

arXiv Sound

@ArxivSound

4 months

``MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation,'' Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong,

MSLM-S2ST: A Multitask Speech Language Model for Textless...

There have been emerging research interest and advances in speech-to-speech translation (S2ST), translating utterances from one language to another. This work proposes Multitask Speech Language...

arxiv.org

0

4

24

arXiv Sound

@ArxivSound

7 months

``StyleSinger: Style Transfer for Out-Of-Domain Singing Voice Synthesis. (arXiv:2312.10741v1),'' Yu Zhang, Rongjie Huang, Ruiqi Li, JinZheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao,

0

4

24

arXiv Sound

@ArxivSound

1 year

``Foley Sound Synthesis at the DCASE 2023 Challenge. (arXiv:2304.12521v1),'' Keunwoo Choi, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, Shinosuke Takamichi,

Foley Sound Synthesis at the DCASE 2023 Challenge

The addition of Foley sound effects during post-production is a common technique used to enhance the perceived acoustic properties of multimedia content. Traditionally, Foley sound has been...

arxiv.org

0

8

24

arXiv Sound

@ArxivSound

29 days

``VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation,'' Yifeng Yu, Jiatong Shi, Yuning Wu, Shinji Watanabe,

VISinger2+: End-to-End Singing Voice Synthesis Augmented by...

Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice...

arxiv.org

0

4

24

arXiv Sound

@ArxivSound

2 years

``BigVGAN: A Universal Neural Vocoder with Large-Scale Training. (arXiv:2206.04658v1),'' Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon,

0

5

24

arXiv Sound

@ArxivSound

2 months

``AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining,'' Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley,

AudioLDM 2: Learning Holistic Audio Generation with...

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific...

arxiv.org

0

2

24

arXiv Sound

@ArxivSound

3 years

``BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation. (arXiv:2103.06695v1),'' Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino,

0

6

24

arXiv Sound

@ArxivSound

2 years

``Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data. (arXiv:2202.00097v1 [cs.LG]),'' Amir Shirian, Krishna Somandepalli, Tanaya Guha,

0

3

24

arXiv Sound

@ArxivSound

1 month

``DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation,'' Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan,

DITTO-2: Distilled Diffusion Inference-Time T-Optimization for...

Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time...

arxiv.org

0

6

24