Survey series: 2021_A Survey on Neural Speech Synthesis
Paper: 2106.15561.pdf (arxiv.org)

The paper reviews the state of the art in neural speech synthesis from two perspectives (the logical framework is shown in Figure 1):

Core components: text analysis, acoustic model, vocoder, and fully end-to-end models.
Advanced topics: fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS.

TTS core components

The authors propose a taxonomy organized around the core components of a neural TTS system. Each component performs a specific data transformation:
(1) The text analysis module converts text characters into phonemes or linguistic features.
(2) The acoustic model converts linguistic features, phonemes, or character sequences into acoustic features.
(3) The vocoder converts linguistic or acoustic features into a speech waveform.
(4) A fully end-to-end model converts a character or phoneme sequence directly into a speech waveform.
(A minimal code sketch of this pipeline is given at the end of this post.)

2021_A Survey on Audio Synthesis and Audio-Visual Multimodal Processing
Paper: 2108.00443.pdf (arxiv.org)

SOTA

2022_NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
Paper: 2205.04421v2.pdf (arxiv.org)

Classic TTS papers

2016_WAVENET: A GENERATIVE MODEL FOR RAW AUDIO
Paper: 1609.03499.pdf (arxiv.org)【3】【4】

The paper has four main strengths:
- WaveNet directly generates natural speech waveforms.
- It proposes a new architecture that can learn from and generate long raw-audio waveforms.
- A trained model can produce speech with a variety of characteristics, via conditioning.
- It also performs well on other kinds of audio generation, including music.

WaveNet model structure

WaveNet stacks 30 residual blocks. The input is an array of integers (quantized audio samples), which passes through the residual blocks in order, from the 1st block to the 30th. The outputs of the individual blocks are merged through skip connections, and the merged signal is used to produce the model's output. (A code sketch of this structure is given at the end of this post.)

2018_NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS (Tacotron 2)
Paper: 1712.05884.pdf (arxiv.org)

With the adoption of deep learning methods such as WaveNet and Tacotron, TTS has advanced rapidly. High-quality speech can now be generated from text by training directly on data, without a complex hand-built pipeline 【1】【2】.

The paper has three main features:
- It proposes an attention-based sequence-to-sequence TTS architecture (sketched at the end of this post).
- As an end-to-end model, it can be trained on paired text and audio alone, with no additional hand-engineered processing.
- It scores highly in MOS (mean opinion score) listening tests; the synthesis quality is good.

2017.3_Deep Voice: Real-time Neural Text-to-Speech
Paper: https://arxiv.org/abs/1702.07825

2017.5_Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Paper: https://arxiv.org/abs/1705.08947

2018_DEEP VOICE 3: SCALING TEXT-TO-SPEECH WITH CONVOLUTIONAL SEQUENCE LEARNING
Paper: https://arxiv.org/abs/1710.07654

References
【1】[Paper Review] Tacotron2 (joungheekim.github.io)
【2】[Speech Synthesis] Tacotron paper notes (hcnoh.github.io)
【3】[Paper Review] WaveNet (joungheekim.github.io)
【4】Understanding WaveNet architecture | by Satyam Kumar | Medium
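
Code sketches

To make the four data transformations in the taxonomy concrete, here is a minimal Python sketch of the modular pipeline (text analysis, acoustic model, vocoder) and its fully end-to-end counterpart. Every function body is a made-up placeholder (random arrays instead of real model outputs); only the interfaces between stages follow the taxonomy above.

```python
# A minimal sketch of the TTS pipelines described above. All function
# bodies are illustrative placeholders, not real models.
from typing import List

import numpy as np

def text_analysis(text: str) -> List[str]:
    """(1) Text analysis: characters -> phonemes / linguistic features.
    Placeholder: pretend every character maps to one 'phoneme'."""
    return list(text.lower())

def acoustic_model(phonemes: List[str]) -> np.ndarray:
    """(2) Acoustic model: phonemes -> acoustic features
    (e.g., an 80-band mel spectrogram, one frame per phoneme here)."""
    return np.random.randn(len(phonemes), 80)

def vocoder(mel: np.ndarray) -> np.ndarray:
    """(3) Vocoder: acoustic features -> waveform samples
    (here, a made-up 256 samples per mel frame)."""
    return np.random.randn(mel.shape[0] * 256)

def fully_end_to_end(text: str) -> np.ndarray:
    """(4) Fully end-to-end: characters -> waveform in one model.
    Sketched here as the composition of the stages above."""
    return vocoder(acoustic_model(text_analysis(text)))

wav = fully_end_to_end("hello world")
print(wav.shape)  # waveform samples
```

In practice each stage is a learned model (e.g., Tacotron 2 as the acoustic model and WaveNet as the vocoder), but the composition pattern is the same.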
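
The WaveNet structure described above can be sketched in the same spirit, assuming PyTorch. The parts taken from the text are the integer-sample input, the 30 residual blocks traversed in order, and the skip connections summed into the output; channel widths, the dilation cycle, and the class names are typical illustrative choices, not values quoted in this post.

```python
# A minimal sketch of a WaveNet-style stack of residual blocks with
# skip connections. Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int, skip_channels: int, dilation: int):
        super().__init__()
        # Dilated causal convolution: pad on the left only, so sample t
        # never sees samples later than t.
        self.pad = (2 - 1) * dilation  # kernel size 2
        self.conv_filter = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.conv_gate = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.res_1x1 = nn.Conv1d(channels, channels, 1)
        self.skip_1x1 = nn.Conv1d(channels, skip_channels, 1)

    def forward(self, x):
        h = nn.functional.pad(x, (self.pad, 0))
        # Gated activation unit: tanh(filter) * sigmoid(gate).
        h = torch.tanh(self.conv_filter(h)) * torch.sigmoid(self.conv_gate(h))
        skip = self.skip_1x1(h)           # branch merged with other blocks
        return x + self.res_1x1(h), skip  # residual connection to next block

class WaveNetSketch(nn.Module):
    def __init__(self, n_blocks: int = 30, channels: int = 64,
                 skip_channels: int = 128, n_classes: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_classes, channels)  # integer samples in
        # Dilations commonly cycle 1, 2, 4, ..., 512 across the stack.
        self.blocks = nn.ModuleList(
            [ResidualBlock(channels, skip_channels, 2 ** (i % 10))
             for i in range(n_blocks)])
        self.head = nn.Sequential(
            nn.ReLU(), nn.Conv1d(skip_channels, skip_channels, 1),
            nn.ReLU(), nn.Conv1d(skip_channels, n_classes, 1))

    def forward(self, samples):                  # samples: (batch, time) ints
        x = self.embed(samples).transpose(1, 2)  # -> (batch, channels, time)
        skips = 0
        for block in self.blocks:                # blocks 1..30 in sequence
            x, skip = block(x)
            skips = skips + skip                 # merge skip connections
        return self.head(skips)                  # logits over 256 sample values

model = WaveNetSketch()
wave = torch.randint(0, 256, (1, 1600))          # quantized audio samples
print(model(wave).shape)                         # torch.Size([1, 256, 1600])
```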
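
Finally, the attention-based seq-to-seq idea behind Tacotron 2: at each decoder step, an attention module scores the decoder state against every encoded text position and returns a weighted context vector that drives the next mel-frame prediction. The sketch below is plain content-based (Bahdanau) attention for illustration; Tacotron 2 itself uses a location-sensitive variant, and all dimensions and names here are assumptions.

```python
# A minimal sketch of content-based attention as used in attention-based
# seq-to-seq TTS. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, attn_dim, bias=False)
        self.memory_proj = nn.Linear(enc_dim, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory):
        # query:  (batch, dec_dim)           current decoder state
        # memory: (batch, src_len, enc_dim)  encoded text sequence
        energies = self.score(torch.tanh(
            self.query_proj(query).unsqueeze(1) + self.memory_proj(memory)))
        weights = torch.softmax(energies.squeeze(-1), dim=-1)  # text alignment
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
        return context, weights  # context feeds the next mel-frame prediction

attn = BahdanauAttention(enc_dim=512, dec_dim=1024, attn_dim=128)
text_encoding = torch.randn(2, 40, 512)   # 40 encoded characters
decoder_state = torch.randn(2, 1024)
context, weights = attn(decoder_state, text_encoding)
print(context.shape, weights.shape)       # (2, 512) (2, 40)
```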