Speech enhancement is a challenging and critical task in speech processing research. In this paper, we propose a novel speech enhancement model based on Wasserstein generative adversarial networks, called WSEM. The proposed model operates on frame-level speech segments and uses an adjacent-frame extension mechanism to enforce the mapping from noisy speech to the clean target, which distinguishes it from related GAN-based models. We compare WSEM with related work on benchmark datasets under different signal-to-noise ratio (SNR) conditions. Experimental results show that WSEM performs comparably to state-of-the-art approaches in all tests and performs especially well in low-SNR environments.
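The abstract does not specify how adjacent frames are combined, so the following is only a minimal sketch of one common way to extend a frame with its neighbours: each frame is concatenated with a fixed number of frames on either side. The function name `extend_with_adjacent_frames`, the `context` parameter, and the edge-padding choice are illustrative assumptions, not details taken from WSEM.

```python
import numpy as np

def extend_with_adjacent_frames(frames, context=2):
    """Concatenate each frame with `context` neighbours on each side.

    frames: array of shape (num_frames, frame_dim).
    Returns an array of shape (num_frames, (2 * context + 1) * frame_dim).
    Assumption: edges are padded by repeating the first and last frame.
    """
    frames = np.asarray(frames)
    num_frames = frames.shape[0]
    # Repeat the boundary frames so every frame has a full set of neighbours.
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    # Shifted view i holds, for output row t, the frame at offset (i - context).
    shifted = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(shifted, axis=1)

# Four 2-dimensional frames, extended with one neighbour on each side.
frames = np.arange(8).reshape(4, 2)
extended = extend_with_adjacent_frames(frames, context=1)
```

With `context=1`, each output row holds the previous, current, and next frame side by side, so a model receiving it sees local temporal context while still predicting a single clean frame.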
Proc. SPIE. 11384, Eleventh International Conference on Signal Processing Systems
KEYWORDS: Signal to noise ratio, Detection and tracking algorithms, Data modeling, Denoising, Interference (communication), Feature extraction, Neural networks, Speech recognition, Performance modeling, Binary data
Achieving stationary speech enhancement in low signal-to-noise ratio (SNR) environments is a challenging problem. Because noise energy dominates noisy speech at low SNR levels, the many prominent random noises may cause a neural network to forget useful information learned early in training. Moreover, it is difficult for a single neural network to learn effective features for both speech and noise. Therefore, this paper uses multiple neural networks in two stages to discriminatively learn the features of a particular noise type and reduce the introduction of interference. Experimental results demonstrate that the proposed method yields consistently better source-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) scores than baseline models in low-SNR conditions. The results also indicate that the method can suppress the network's forgetting of early information.
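To make the low-SNR setting concrete, the sketch below scales a noise signal so that the mixture reaches a requested SNR, then measures a simplified distortion ratio. Both helpers (`mix_at_snr`, `simple_sdr_db`) are hypothetical names introduced for illustration. The ratio computed here is a plain energy ratio between the reference and the residual, which is an assumption made for brevity and not the full SDR decomposition used in standard evaluation toolkits.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the given SNR in dB."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(clean_power / scaled_noise_power)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise

def simple_sdr_db(reference, estimate):
    """Simplified distortion ratio: reference energy over residual energy, in dB."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    residual = estimate - reference
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))

# A synthetic tone stands in for speech; Gaussian noise stands in for interference.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=-5.0)
unprocessed_sdr = simple_sdr_db(clean, noisy)
```

At -5 dB the noise carries roughly three times the energy of the clean signal, which is the regime the abstract describes as noise-dominated. For the unprocessed mixture, the simplified ratio equals the input SNR, so any enhancement method would need to score above that value.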