Demo for "Attention-Guided Generative Adversarial Network for Whisper to Normal Speech Conversion"

Abstract: Whispered speech is a special mode of pronunciation produced without vocal cord vibration. Whispered speech contains no fundamental frequency, and its energy is about 20 dB lower than that of normal speech. Converting whispered speech into normal speech can improve speech quality and intelligibility. In this paper, a novel attention-guided generative adversarial network model incorporating an autoencoder, a Siamese neural network, and an identity mapping loss function for whisper-to-normal speech conversion (AGAN-W2SC) is proposed. The proposed method avoids the challenge of estimating the fundamental frequency of the normal voiced speech converted from whispered speech. Moreover, the proposed model is more amenable to practical applications because it does not need aligned speech features for training. Experimental results demonstrate that the proposed AGAN-W2SC achieves improved speech quality and intelligibility compared with dynamic-time-warping-based methods.
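The abstract names the building blocks of the model (adversarial training, an attention-guided generator, an identity mapping loss, and a Siamese network) but not their exact formulation. The sketch below is a minimal, hypothetical PyTorch illustration of how such a generator objective could be assembled; the AttentionGenerator class, the Siamese pairing strategy, and the weights lam_id / lam_sia are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of an attention-guided generator
# and a combined generator loss with adversarial, identity-mapping, and
# Siamese terms, as named in the abstract. Shapes, networks, and weights are
# assumptions chosen only to make the example self-contained and runnable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGenerator(nn.Module):
    """Toy encoder-decoder generator that also predicts an attention mask."""
    def __init__(self, feat_dim=36):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.decoder = nn.Linear(128, feat_dim)            # converted features
        self.attention = nn.Sequential(nn.Linear(128, feat_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        converted = self.decoder(h)
        mask = self.attention(h)                           # values in (0, 1)
        # Attention-guided output: modify only the regions the mask selects.
        return mask * converted + (1.0 - mask) * x

def generator_loss(G, D, S, whisper_a, whisper_b, normal,
                   lam_id=5.0, lam_sia=1.0):
    """Adversarial + identity-mapping + Siamese losses (weights are assumed)."""
    fake_a, fake_b = G(whisper_a), G(whisper_b)

    # 1) Least-squares adversarial term against the discriminator D.
    adv = torch.mean((D(fake_a) - 1.0) ** 2)

    # 2) Identity mapping: feeding real normal speech should leave it unchanged.
    idt = F.l1_loss(G(normal), normal)

    # 3) Siamese term (one common non-parallel choice, assumed here): the
    #    embedding-space difference between two whisper inputs is preserved
    #    after conversion, so no frame alignment between domains is needed.
    sia = F.mse_loss(S(fake_a) - S(fake_b), S(whisper_a) - S(whisper_b))

    return adv + lam_id * idt + lam_sia * sia
```

Keeping the Siamese term defined on pairs of inputs, rather than on aligned whisper/normal frame pairs, is consistent with the abstract's claim that training does not require aligned speech features.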

Comparison of different models (Whisper, Normal, GMM, BLSTM, CycleGAN, Ours):

Column key: A = Whisper, B = Normal, C = GMM, D = BLSTM, E = CycleGAN, F = Ours

1. test001.wav:  A  B  C  D  E  F
2. test002.wav:  A  B  C  D  E  F
3. test003.wav:  A  B  C  D  E  F