International Journal of Advanced and Applied Sciences

Int. j. adv. appl. sci.

EISSN: 2313-3724

Print ISSN: 2313-626X

Volume 4, Issue 8 (August 2017), Pages: 1-5

Title: Multiple emotional voice conversion in Vietnamese HMM-based speech synthesis using non-negative matrix factorization

Author(s): Trung-Nghia Phung


Thai Nguyen University of Information and Communication Technology, Thai Nguyen 25000, Vietnam



Most current text-to-speech (TTS) systems can synthesize only a single voice with neutral emotion. If different emotional voices are required, the system must be retrained on the new emotional voices, and this training normally requires a large amount of emotional speech data, which is usually impractical. The state-of-the-art TTS framework based on the Hidden Markov Model (HMM), known as HMM-based TTS, can synthesize speech with various emotions by using speaker-adaptation methods. However, the emotional voices both synthesized and adapted by HMM-based TTS are "over-smoothed": when these voices are over-smoothed, the detailed structures clearly linked to speaker emotion may be missing. Multiple voices can also be synthesized by combining voice conversion (VC) methods with HMM-based TTS, but current VC methods still cannot synthesize target speech that preserves the emotion-related details of the target voice while using only a limited amount of target data. In this paper, we propose combining exemplar-based emotional voice conversion with HMM-based TTS to synthesize multiple high-quality emotional voices from a small amount of target data. Evaluation results on a Vietnamese emotional speech corpus confirmed the merits of the proposed method.
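The exemplar-based conversion named in the abstract rests on non-negative matrix factorization: a source spectrogram is approximated as non-negative activations of a source exemplar dictionary, and the same activations are then applied to a paired target (emotional) dictionary. A minimal NumPy sketch of that idea on toy data follows; all names, dimensions, and the simple Euclidean multiplicative update are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def estimate_activations(X, D, n_iter=200, eps=1e-9):
    """Find non-negative H so that X ~ D @ H, with the exemplar
    dictionary D held fixed (multiplicative updates, Euclidean cost)."""
    rng = np.random.default_rng(0)
    H = rng.random((D.shape[1], X.shape[1]))  # non-negative init
    for _ in range(n_iter):
        # Lee-Seung style update; keeps H non-negative by construction
        H *= (D.T @ X) / (D.T @ D @ H + eps)
    return H

# Toy parallel dictionaries: each source exemplar is paired with a
# target-emotion exemplar (hypothetical data, for illustration only)
rng = np.random.default_rng(1)
n_bins, n_exemplars, n_frames = 8, 5, 12
D_src = rng.random((n_bins, n_exemplars))  # neutral-voice exemplars
D_tgt = rng.random((n_bins, n_exemplars))  # emotional-voice exemplars

X_src = D_src @ rng.random((n_exemplars, n_frames))  # source spectrogram
H = estimate_activations(X_src, D_src)               # decompose source
X_converted = D_tgt @ H  # reuse activations with target exemplars
```

Because the activations, not the spectra, carry over to the target dictionary, the converted speech can retain frame-level detail from only as many target exemplars as are available, which is why this family of methods suits the small-target-data setting discussed above.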

© 2017 The Authors. Published by IASE.

This is an open access article under the CC BY-NC-ND license.

Keywords: HMM-based speech synthesis, Voice adaptation, Exemplar-based voice conversion, Non-negative matrix factorization, Emotional speech synthesis

Article History: Received 16 May 2017, Received in revised form 23 June 2017, Accepted 23 June 2017

Digital Object Identifier:


Phung TN (2017). Multiple emotional voice conversion in Vietnamese HMM-based speech synthesis using non-negative matrix factorization. International Journal of Advanced and Applied Sciences, 4(8): 1-5


Aihara R, Takashima R, Takiguchi T, and Ariki Y (2012). GMM-based emotional voice conversion using spectrum and prosody features. American Journal of Signal Processing, 2(5): 134-138.
Beller G, Obin N, and Rodet X (2008). Articulation degree as a prosodic dimension of expressive speech. In the 4th International Conference on Speech Prosody, Campinas, Brazil.
Chappell DT and Hansen JH (1998). Speaker-specific pitch contour modeling and modification. In the IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Seattle, USA, 2: 885-888.
Gillett B and King S (2003). Transforming F0 contours. In the 8th European Conference on Speech Communication and Technology, Geneva, Switzerland.
Helander EE and Nurminen J (2007). A novel method for prosody prediction in voice conversion. In the IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Honolulu, USA: 509-512.
Kawahara H (1997). Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited. In the IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, Munich, Germany, 2: 1303-1306.
Lavner Y, Rosenhouse J, and Gath I (2001). The prototype model in speaker identification by human listeners. International Journal of Speech Technology, 4(1): 63-74.
Phan TS, Duong TC, Dinh AT, Vu TT, and Luong CM (2013). Improvement of naturalness for an HMM-based Vietnamese speech synthesis using the prosodic information. In the IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future, IEEE, Hanoi, Vietnam: 276-281.
Phung TN, Mai CL, and Akagi M (2012). A concatenative speech synthesis for monosyllabic languages with limited data. In the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, IEEE, Hollywood, USA: 1-10.
Nose T, Tachibana M, and Kobayashi T (2009). HMM-based style control for expressive speech synthesis with arbitrary speaker's voice using model adaptation. IEICE Transactions on Information and Systems, 92(3): 489-497.
Tokuda K, Zen H, and Black AW (2002). An HMM-based speech synthesis system applied to English. In the IEEE Workshop on Speech Synthesis: 227-230.
Toda T and Tokuda K (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems, 90(5): 816-824.
Wu Z, Virtanen T, Kinnunen T, Chng ES, and Li H (2013). Exemplar-based voice conversion using non-negative spectrogram deconvolution. In the 8th ISCA Speech Synthesis Workshop, Barcelona, Spain: 201-206.
Zen H, Nose T, Yamagishi J, Sako S, Masuko T, Black AW, and Tokuda K (2007). The HMM-based speech synthesis system (HTS) version 2.0. In the 6th ISCA Workshop on Speech Synthesis, Bonn, Germany: 294-299.