Abstract

Existing cross-lingual voice cloning approaches face several obvious drawbacks in real applications: 1) they need recordings from bilingual speakers, or a large amount of multi-speaker audio-text pairs; 2) they require a specifically designed method for sharing phonemes across different languages; 3) they add extra modules to encode speaker and language, which complicates the building pipeline and may be hard to train.
This paper proposes a novel cross-lingual voice cloning framework that utilizes bottleneck (BN) features obtained from a speaker-independent automatic speech recognition (ASR) system in the target language. First, we use audio-text pairs from a single speaker in the target language to train a latent prosody model, which captures the relationship between the text and the BN features. Then an acoustic model, which translates the BN features into acoustic features, is trained with multi-speaker audio data in the target language. Finally, the acoustic model is fine-tuned with the target speaker's speech in the original language, without the corresponding texts, since the BN features serve as the bridge. Our approach has the following advantages: 1) no recordings from bilingual speakers are required; 2) audio-text pairs are not required for acoustic model training; 3) no extra complicated modules are needed to encode speaker or language. Experimental results show that, with only a few minutes of audio from a new English speaker, our proposed system can synthesize this speaker's Mandarin speech with decent naturalness and speaker similarity.


Baseline Approach

We compare our proposed approach with a baseline approach on Mandarin speakers. The baseline is a speaker-adaptation training approach. The model is based on Tacotron2 [1], which predicts the acoustic features directly from the phoneme sequence. As shown in Figure 1, the generative model is first trained on a multi-speaker corpus and then fine-tuned with a few of the target speaker's audio-text pairs.



Figure 1. Block diagram of the baseline system.
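To make the baseline recipe concrete, below is a minimal PyTorch sketch of the two training stages. The tiny ToyTacotron model, the dummy batches, and the hyperparameters are illustrative assumptions, not the authors' implementation (a real Tacotron2 is an attention-based seq2seq model with teacher forcing and a stop-token predictor).

```python
# Minimal sketch of the baseline two-stage recipe. ToyTacotron is a toy
# stand-in for Tacotron2; data and hyperparameters are placeholders.
import torch
import torch.nn as nn

class ToyTacotron(nn.Module):
    """Toy phoneme-to-acoustic-feature regressor standing in for Tacotron2."""
    def __init__(self, n_phonemes=100, n_mels=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, phonemes):
        h, _ = self.rnn(self.embed(phonemes))
        return self.proj(h)

def run_stage(model, batches, lr):
    """One training stage: plain MSE regression to the target features."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for phonemes, mels in batches:
        loss = nn.functional.mse_loss(model(phonemes), mels)
        opt.zero_grad()
        loss.backward()
        opt.step()

model = ToyTacotron()

# Stage 1: train the generative model on the multi-speaker corpus
# (random tensors here in place of real phoneme/mel batches).
multi_speaker = [(torch.randint(0, 100, (8, 50)), torch.randn(8, 50, 80))
                 for _ in range(10)]
run_stage(model, multi_speaker, lr=1e-3)

# Stage 2: fine-tune on a few target-speaker audio-text pairs
# (e.g. 50 or 200 sentences), typically with a smaller learning rate.
target_speaker = [(torch.randint(0, 100, (2, 50)), torch.randn(2, 50, 80))
                  for _ in range(5)]
run_stage(model, target_speaker, lr=1e-4)
```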

Proposed Approach

The proposed voice cloning framework consists of two parts, a latent prosody model and an acoustic model, as shown in Figure 2. First, audio-text pairs from a single speaker in the target language are used to train a Tacotron2-based latent prosody model, which takes a text sequence as input and predicts the corresponding BN features with automatic time alignment. Second, the CBHG-structured [2] acoustic model, which translates BN features into acoustic features, is trained with multi-speaker audio data in the target language. For an unseen speaker, the acoustic model is fine-tuned using a few audio samples of this speaker without the need for corresponding texts, since the input BN features to the acoustic model come from the output of the BN extractor. In the synthesis stage, for any text in the target language, the corresponding BN features are predicted by the latent prosody model, and the acoustic model then predicts the acoustic features. Finally, given the acoustic features, a speaker-independent neural vocoder (e.g., LPCNet [3]) synthesizes speech that sounds like the target speaker's voice.


Figure 2. Block diagram of our proposed system.
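The sketch below walks through the proposed pipeline under the same caveats: LatentProsodyModel, AcousticModel, and extract_bn are simplified stand-ins for the Tacotron2-based latent prosody model, the CBHG acoustic model, and the trained ASR bottleneck extractor, and all tensors are random placeholders.

```python
# Minimal sketch of the proposed pipeline. All modules are simplified
# stand-ins; the shapes (256-dim BN features, 80-dim mels) are assumptions.
import torch
import torch.nn as nn

N_PHONEMES, N_BN, N_MEL, HIDDEN = 100, 256, 80, 256

class LatentProsodyModel(nn.Module):
    """Stand-in for the Tacotron2-based text -> BN-feature model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_PHONEMES, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.proj = nn.Linear(HIDDEN, N_BN)

    def forward(self, phonemes):
        h, _ = self.rnn(self.embed(phonemes))
        return self.proj(h)

class AcousticModel(nn.Module):
    """Stand-in for the CBHG-structured BN -> acoustic-feature model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_BN, HIDDEN), nn.ReLU(),
                                 nn.Linear(HIDDEN, N_MEL))

    def forward(self, bn):
        return self.net(bn)

def extract_bn(audio_frames):
    """Placeholder for the speaker-independent ASR bottleneck extractor."""
    return torch.randn(audio_frames.shape[0], audio_frames.shape[1], N_BN)

prosody, acoustic = LatentProsodyModel(), AcousticModel()

# Steps 1-2 (training, omitted here): fit `prosody` on single-speaker
# audio-text pairs against BN features extracted from that speaker's audio,
# and fit `acoustic` on multi-speaker (BN, acoustic-feature) pairs.

# Step 3 (adaptation): fine-tune `acoustic` on a few target-speaker
# utterances; no transcripts are needed because the BN inputs come from
# the extractor, not from text.
opt = torch.optim.Adam(acoustic.parameters(), lr=1e-4)
audio, mels = torch.randn(2, 50, 1), torch.randn(2, 50, N_MEL)  # dummy data
bn = extract_bn(audio)
loss = nn.functional.mse_loss(acoustic(bn), mels)
opt.zero_grad()
loss.backward()
opt.step()

# Synthesis: text -> BN -> acoustic features; a neural vocoder
# (e.g. LPCNet) then turns the features into a waveform.
phonemes = torch.randint(0, N_PHONEMES, (1, 30))
features = acoustic(prosody(phonemes))
```

Note that the adaptation step only updates the acoustic model; the text-facing latent prosody model stays fixed, which is why a few untranscribed utterances in the speaker's own language suffice for cross-lingual cloning.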


Audio samples

  • The generative model is trained with THCHS-30, an open Chinese speech corpus of 60 speakers. The audio samples include the original recordings of the target speakers and the synthesized speech; all synthesized speech is in Mandarin Chinese. Our results consist of four parts:
  • Parts 1 and 2 show the adaptation results for a Mandarin female and male speaker, respectively. We compare our approach and the baseline approach with different amounts of adaptation data.
  • Parts 3 and 4 show the adaptation results of our approach for an English female and male speaker, respectively, with different amounts of adaptation data.

Part 1. Mandarin female speaker

Recordings of target speaker | Baseline (50 sentences) | Our approach (50 sentences) | Baseline (200 sentences) | Our approach (200 sentences)

Part 2. Mandarin male speaker

Recordings of target speaker | Baseline (50 sentences) | Our approach (50 sentences) | Baseline (200 sentences) | Our approach (200 sentences)

Part 3. English female speaker

Recordings of target speaker | Our approach (50 sentences) | Our approach (200 sentences)

Part 4. English male speaker

Recordings of target speaker | Our approach (50 sentences) | Our approach (200 sentences)

References

[1] J. Shen, R. Pang, R. J. Weiss, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779–4783.

[2] Y. Wang, R. J. Skerry-Ryan, D. Stanton, et al., "Tacotron: Towards end-to-end speech synthesis," in INTERSPEECH, 2017, pp. 4006–4010.

[3] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5891–5895.