Multiple emotional voice conversion in Vietnamese HMM-based speech synthesis using non-negative matrix factorization

Phung, Trung-Nghia

International Journal of Advanced and Applied Sciences

Int. j. adv. appl. sci.

EISSN: 2313-3724

Print ISSN: 2313-626X

Volume 4, Issue 8 (August 2017), Pages: 1-5

Title: Multiple emotional voice conversion in Vietnamese HMM-based speech synthesis using non-negative matrix factorization

Author(s): Trung-Nghia Phung *

Affiliation(s):

Thai Nguyen University of Information and Communication Technology, Thai Nguyen 25000, Vietnam

https://doi.org/10.21833/ijaas.2017.08.001

Abstract:

Most current text-to-speech (TTS) systems can synthesize only a single voice with neutral emotion. To synthesize different emotional voices, the system has to be retrained on the new emotional voices, and this training normally requires a large amount of emotional speech data, which is usually impractical to collect. State-of-the-art TTS based on hidden Markov models (HMMs), known as HMM-based TTS, can synthesize speech with various emotions by using speaker adaptation methods. However, the emotional voices produced by HMM-based TTS, whether synthesized directly or adapted, are “over-smooth”, so the detailed structures clearly linked to speaker emotion may be missing. Multiple voices can also be synthesized by combining voice conversion (VC) methods with HMM-based TTS. However, current VC methods still cannot synthesize target speech that keeps the detailed information related to the emotions of the target voice while using only a limited amount of target-voice data. In this paper, we propose combining exemplar-based emotional voice conversion with HMM-based TTS to synthesize multiple high-quality emotional voices from a small amount of target data. Evaluation results on a Vietnamese emotional speech corpus confirm the merits of the proposed method.
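The abstract does not spell out the conversion step, so the following is only a minimal sketch of how exemplar-based voice conversion with non-negative matrix factorization is commonly set up, not the paper's implementation. It assumes paired, time-aligned source and target exemplar dictionaries (`A_src`, `A_tgt`) of non-negative spectral frames; the function names, the KL-divergence multiplicative update, and the iteration count are all illustrative assumptions.

```python
import numpy as np

def estimate_activations(X, A, n_iter=200, eps=1e-12, seed=0):
    """Estimate non-negative activations H so that X ~= A @ H.

    The exemplar dictionary A is held fixed; only H is updated, using the
    standard multiplicative update for the KL divergence (an assumption,
    the abstract does not name the cost function).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[1], X.shape[1])) + eps   # random positive start
    col_sums = A.sum(axis=0)[:, None] + eps          # A^T 1, one value per exemplar
    for _ in range(n_iter):
        ratio = X / (A @ H + eps)                    # element-wise X / (A H)
        H *= (A.T @ ratio) / col_sums                # H <- H * (A^T ratio) / (A^T 1)
    return H

def convert(X_src, A_src, A_tgt, n_iter=200):
    """Convert source spectra by reusing source activations with target exemplars.

    A_src and A_tgt are hypothetical paired dictionaries: column k of each
    holds the same aligned frame spoken in the source and target style.
    """
    H = estimate_activations(X_src, A_src, n_iter=n_iter)
    return A_tgt @ H
```

In this setup the target dictionary only needs as many frames as there are exemplars, which is why such methods are attractive when little target data is available. In practice the spectra handed to `convert` would come from a vocoder analysis of the HMM-synthesized neutral speech, a step this sketch leaves out.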

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: HMM-based speech synthesis, Voice adaptation, Exemplar-based voice conversion, Non-negative matrix factorization, Emotional speech synthesis

Article History: Received 16 May 2017, Received in revised form 23 June 2017, Accepted 23 June 2017

Citation:

Phung TN (2017). Multiple emotional voice conversion in Vietnamese HMM-based speech synthesis using non-negative matrix factorization. International Journal of Advanced and Applied Sciences, 4(8): 1-5

http://www.science-gate.com/IJAAS/V4I8/Phung.html
