
Improving Nasal Electrolaryngeal Speech Using Voice Conversion

Abstract

Electrolaryngeal (EL) speech produced by laryngectomees using an electrolarynx has low intelligibility due to insufficient excitation signals for speech production and a fixed pitch. Although recent research on EL speech voice conversion (ELVC) has made good progress, it has mainly focused on cervical EL (CEL) speech. This study experiments on ELVC of speech produced by a novel nasal EL (NEL) device. Specifically, we evaluate the impact of using Mel-spectrogram and WavLM features as inputs to the ELVC system. We also propose a data augmentation method using text-to-speech (TTS) and exemplar-based VC. We find that while WavLM features have a significant effect on ELVC of CEL speech, the model using Mel-spectrograms performs better in both subjective and objective evaluations of ELVC of NEL speech, owing to the unique acoustic properties of NEL speech. In addition, NEL speech synthesized using Mel-spectrograms is closer to real NEL speech than NEL speech synthesized using WavLM features.
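
To make the two input representations concrete, here is a minimal extraction sketch. The checkpoint name ("microsoft/wavlm-large"), the sampling rate, and the Mel frame parameters below are illustrative assumptions, not necessarily the paper's settings.

```python
import torch
import torchaudio
from transformers import WavLMModel

wav, sr = torchaudio.load("nel_utterance.wav")           # hypothetical mono recording
wav = torchaudio.functional.resample(wav, sr, 16000)

# Option 1: 80-bin log-Mel-spectrogram frames (frame parameters are assumptions).
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80)
log_mel = torch.log(mel_fn(wav) + 1e-6)                  # shape (1, 80, T)

# Option 2: self-supervised WavLM features from a pretrained checkpoint.
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
wavlm.eval()
with torch.no_grad():
    feats = wavlm(wav).last_hidden_state                 # shape (1, T', 1024)
```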

Architecture and Experimental Settings

(Figure: ETN-VC and VTN-VC architectures)

  1. VTN-VC: two pretraining stages on a large-scale NL corpus, followed by a VC training stage on the EL dataset.

  2. ETN-VC: the same two pretraining stages on a large-scale NL corpus, then a VC pretraining stage on the sEL and EL corpora; the model was further fine-tuned on the EL dataset in the VC training stage. A runnable sketch of both recipes follows this list.
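
The two recipes differ only in whether a VC pretraining stage on simulated data precedes the final stage on real EL data. The orchestration sketch below uses hypothetical stand-ins for the model name, corpora, and the `train` stub; it is not the actual training code.

```python
def train(model, corpus, stage):
    """Stub for one training stage; a real stage would run seq2seq VC updates."""
    print(f"[{model}] {stage}: training on {corpus}")
    return model

def vtn_vc(model):
    model = train(model, "large-scale NL corpus", "pretraining 1")
    model = train(model, "large-scale NL corpus", "pretraining 2")
    return train(model, "parallel EL-NL dataset", "VC training")

def etn_vc(model):
    model = train(model, "large-scale NL corpus", "pretraining 1")
    model = train(model, "large-scale NL corpus", "pretraining 2")
    model = train(model, "sEL and EL corpora", "VC pretraining")
    return train(model, "parallel EL-NL dataset", "VC fine-tuning")

vtn_vc("seq2seq-VC")
etn_vc("seq2seq-VC")
```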

EL dataset

  1. CEL-NL: a parallel corpus of 320 utterances
  2. NEL-NL: a parallel corpus of 320 utterances

sEL dataset

The sNL/sEL data were generated from the above EL dataset using the proposed TTS and exemplar-based VC augmentation method, as sketched below.
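
Exemplar-based VC is commonly formulated with NMF: dictionaries of frame-aligned source and target spectral exemplars share a single activation matrix, so activations estimated against one dictionary can be replayed through the other. The sketch below assumes that formulation and uses random arrays as stand-ins for the real dictionaries and the TTS-generated sNL spectrogram; frame alignment and waveform resynthesis are omitted.

```python
import numpy as np

def nmf_activations(X, A, n_iter=200, eps=1e-8):
    """Solve X ~= A @ H for nonnegative activations H (multiplicative updates)."""
    rng = np.random.default_rng(0)
    H = rng.random((A.shape[1], X.shape[1]))
    for _ in range(n_iter):
        H *= (A.T @ X) / (A.T @ A @ H + eps)
    return H

# Random stand-ins: A holds NL exemplar frames, B the frame-aligned EL
# exemplars (both freq_bins x n_exemplars); X_snl is a TTS-generated sNL
# magnitude spectrogram. In reality these come from aligned recordings.
rng = np.random.default_rng(1)
A = rng.random((513, 500))
B = rng.random((513, 500))
X_snl = rng.random((513, 120))

H = nmf_activations(X_snl, A)   # activations of NL exemplars explaining sNL
X_sel = B @ H                   # swap in EL exemplars -> simulated sEL spectrogram
```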

Evaluation Metrics

The metrics fall into five categories (a sketch of how some of them can be computed follows this list):

  - Intelligibility: character error rate (CER) and sentence error rate (SER)
  - Speech quality
  - F0 (pitch)
  - Duration
  - Semantic consistency
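
A hedged sketch of two of these categories, assuming CER/SER are derived from ASR transcripts (via the jiwer package) and F0 comparisons from frame-level pitch tracks (via librosa.pyin); the paper's exact metric implementations may differ, and the transcripts and file name below are placeholders.

```python
import jiwer
import librosa
import numpy as np

# Hypothetical ASR references and hypotheses for two utterances.
refs = ["他捐了很多衣物給災區", "電視報導那裡發生地震"]
hyps = ["他捐了很多衣服給災區", "電視報導那裡發生地震"]

cer = np.mean([jiwer.cer(r, h) for r, h in zip(refs, hyps)])
ser = np.mean([r != h for r, h in zip(refs, hyps)])  # any error -> sentence counted wrong
print(f"CER {cer:.1%}  SER {ser:.1%}")

# Frame-level F0 track for pitch comparisons (file name is a placeholder).
y, sr = librosa.load("converted.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=50, fmax=400, sr=sr)
```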

Experimental Results

We evaluate the performance of VTN-VC on both CEL-to-NL and NEL-to-NL conversion, highlighting how the acoustic differences between CEL and NEL speech influence model behavior. To further analyze the characteristics of sNEL speech, we synthesized it using Mel-spectrogram and WavLM features and examined how well each reconstructs the key resonant regions of NEL speech. We also compare VTN-VC and ETN-VC on the NEL-to-NL task to assess the impact of different training strategies and feature representations.

The baseline CER and SER for unprocessed NEL speech are 90.8% and 89.5%, respectively, while the zero-shot Seed-VC model yields 92.3% and 84.8%. The figure below presents the results for NEL speech converted by the VTN-VC and ETN-VC models.

(Figure: AB test results for the VTN-VC and ETN-VC models)

Audio Samples

Sample 1 (CEL conversions use WavLM features; NEL conversions use Mel-spectrograms)

Model transcription: 他捐了很多衣物給災區 (Ta juan le hen duo yiwu gei zaiqu; "He donated a lot of clothing to the disaster area")
CEL speech
NEL speech
VTN-VC CEL
VTN-VC NEL
ETN-VC CEL
ETN-VC NEL
NL speech

Sample 2 (CEL conversions use WavLM features; NEL conversions use Mel-spectrograms)

Model transcription: 電視報導那裡發生地震 (Dian shi bao dao na li fa sheng di zhen; "The TV reported that an earthquake occurred there")
CEL speech
NEL speech
VTN-VC CEL
VTN-VC NEL
ETN-VC CEL
ETN-VC NEL
NL speech

Sample 3 (CEL conversions use WavLM features; NEL conversions use Mel-spectrograms)

Model transcription: 我把不用的傢俱送人了 (Wo ba bu yong de jia ju song ren le; "I gave away the furniture I no longer use")
CEL speech
NEL speech
VTN-VC CEL
VTN-VC NEL
ETN-VC CEL
ETN-VC NEL
NL speech

Sample 4 (CEL conversions use WavLM features; NEL conversions use Mel-spectrograms)

Model transcription: 昨天他向我借了三百塊 (Zuo tian ta xiang wo jie le san bai kuai; "Yesterday he borrowed three hundred dollars from me")
CEL speech
NEL speech
VTN-VC CEL
VTN-VC NEL
ETN-VC CEL
ETN-VC NEL
NL speech

Spectrograms

(Figure: spectrogram comparison)