Robust and Efficient Text-based Speech Editing in Noisy Contexts using Rectified Flow

0. Contents

Abstract
Baseline Models
Noisy Speech
Clean Speech
Ablation Study of Reflow-TSE

1. Abstract

Although text-based speech editing (TSE) has made remarkable progress on clean speech, much speech is recorded in noisy environments, and editing such noisy speech effectively remains a challenge. Background noise degrades the quality of the generated speech, and the edited segment fails to stay consistent with the surrounding noise context. We propose Reflow-TSE, a robust and efficient TSE model for noise-consistent speech editing. For noise robustness, 1) we design a noise condition module that extracts frame-level background noise sequences from noisy speech, and 2) we introduce an enhanced context-conditioned prediction module that predicts masked noise sequences along with the conventional duration and pitch, and uses these conditions to guide generation. For efficiency, 3) we introduce a rectified flow model that leverages speech context and the predicted conditions to achieve high-quality editing with few sampling steps. Experimental results show that with just 2 sampling steps, Reflow-TSE achieves context-consistent noisy speech editing, a capability absent in other models. On clean speech, Reflow-TSE matches or surpasses the baseline models.
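To illustrate why a rectified flow can get by with very few sampling steps, the following is a minimal sketch of fixed-step Euler sampling along a straight path. It is not the Reflow-TSE decoder: `make_straight_velocity` is a hypothetical toy stand-in for the learned velocity network, the context and noise conditions are omitted, and the array shape is arbitrary.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t never reaches 1, so the toy velocity below stays finite
        x = x + dt * velocity(x, t)
    return x

def make_straight_velocity(target):
    """Toy velocity field (assumption): every path is a straight line to `target`."""
    def velocity(x, t):
        # On a straight path x_t = (1 - t) * x0 + t * target, the constant
        # velocity target - x0 can be written as (target - x_t) / (1 - t).
        return (target - x) / (1.0 - t)
    return velocity

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 80))  # stand-in for a Mel-spectrogram segment
x0 = rng.standard_normal((4, 80))      # Gaussian starting point

velocity = make_straight_velocity(target)
x_2 = euler_sample(velocity, x0, num_steps=2)
x_8 = euler_sample(velocity, x0, num_steps=8)
```

When the paths are exactly straight, as in this toy field, 2 Euler steps land on the same endpoint as 8. A learned velocity field is only approximately straight, so in practice more steps can still change the result.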

Fig. 1: The architecture of Reflow-TSE.

2. Baseline Models

(1) Ground Truth: The waveform reconstructed by the HiFi-GAN^[1] vocoder from ground-truth Mel-spectrograms. [code]
(2) A³T^[2]: A³T adopts an alignment-aware acoustic-text pretraining method, reconstructing high-quality spectrograms from partially masked inputs. [code]
(3) FluentSpeech^[3]: FluentSpeech introduces diffusion models into a context-aware mask prediction network, iteratively refining the edited Mel-spectrogram based on contextual features. [code]
(4) FluentEditor^[4]: FluentEditor further improves editing fluency by adding acoustic and prosodic consistency training to FluentSpeech. [code]
(5) FluentSpeech-N: FluentSpeech with an added noise condition module and masked noise predictor, but without the rectified flow decoder.
(6) Mix Method: This approach first generates clean speech using Reflow-TSE, then resamples the noise waveform extracted by the noise extractor to match the speech length, and finally mixes the clean speech with the resampled noise.
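The Mix Method steps in (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the page does not say how the noise is resampled or at what level it is mixed, so linear interpolation and plain addition are assumed here, and `resample_to_length` and `mix` are hypothetical helper names.

```python
import numpy as np

def resample_to_length(noise, target_len):
    """Stretch or shrink a 1-D waveform to `target_len` samples (linear interpolation)."""
    noise = np.asarray(noise, dtype=np.float64)
    if len(noise) == target_len:
        return noise.copy()
    # Positions in the source waveform at which the output samples are read.
    src_pos = np.linspace(0.0, len(noise) - 1, num=target_len)
    return np.interp(src_pos, np.arange(len(noise)), noise)

def mix(clean_speech, noise):
    """Overlay the length-matched noise on the clean speech (assumed: simple sum)."""
    clean_speech = np.asarray(clean_speech, dtype=np.float64)
    return clean_speech + resample_to_length(noise, len(clean_speech))

# Toy waveforms: edited clean speech is longer than the extracted noise.
clean = np.ones(5)
noise = np.array([0.0, 2.0, 4.0])
mixed = mix(clean, noise)
```

Stretching the noise changes its time scale along with its length, so this is only a rough way to cover the edited utterance with the extracted noise.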

3. Noisy Speech

Demos of insertion and replacement operations on noisy speech using Reflow-TSE and other baseline models.
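The difference between the two operations can be made concrete by diffing the original and edited transcripts at the word level. The sketch below uses Python's standard `difflib`; it only illustrates how an edit is classified as an insertion or a replacement and is not part of the Reflow-TSE pipeline. The helpers `words` and `edit_ops` are hypothetical names.

```python
import difflib
import string

def words(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def edit_ops(original, edited):
    """Return (operation, original_words, edited_words) for each changed span."""
    a, b = words(original), words(edited)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

insertion = edit_ops(
    "In fact, they have the opposite effect.",
    "In fact, they often have the opposite effect.",
)
replacement = edit_ops(
    "The Chancellor will deal from a position of strength.",
    "The premier will deal from a position of strength.",
)
print(insertion)    # [('insert', [], ['often'])]
print(replacement)  # [('replace', ['chancellor'], ['premier'])]
```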

Insertion Operation

Ground Truth	Reflow-TSE 2-steps	Reflow-TSE 8-steps	A³T	FluentSpeech	FluentEditor	FluentSpeech-N	Mix Method

Original text: It is very unfair, and something should be done.
Edited text: It is very unfair for ours, and something should be done.

Original text: It is believed the size of the settlement has not been finalised.
Edited text: It is believed the size of the financial settlement has not been finalised.

Original text: In fact, they have the opposite effect.
Edited text: In fact, they often have the opposite effect.

Replacement Operation

Ground Truth	Reflow-TSE 2-steps	Reflow-TSE 8-steps	A³T	FluentSpeech	FluentEditor	FluentSpeech-N	Mix Method

Original text: The Chancellor will deal from a position of strength.
Edited text: The premier will deal from a position of strength.

Original text: The rainbow is a division of white light into many beautiful colors.
Edited text: The rainbow is a division of white light into many different colors.

Original text: It is still too early for any likely contenders to have emerged.
Edited text: It is still too early for any likely competitors to have emerged.

4. Clean Speech

Demos of insertion and replacement operations on clean speech using Reflow-TSE and other baseline models.

Insertion Operation

Ground Truth	Reflow-TSE 2-steps	Reflow-TSE 8-steps	A³T	FluentSpeech	FluentEditor	FluentSpeech-N

Original text: That proved enough to settle the home side down.
Edited text: That proved barely enough to settle the home side down.

Original text: You can cope with it.
Edited text: You can not cope with it.

Original text: The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.
Edited text: The Norsemen considered the beautiful and colorful rainbow as a bridge over which the gods passed from earth to their home in the sky.

Replacement Operation

Ground Truth	Reflow-TSE 2-steps	Reflow-TSE 8-steps	A³T	FluentSpeech	FluentEditor	FluentSpeech-N

Original text: It is believed the size of the settlement has not been finalised.
Edited text: It is believed the size of the transaction has not been finalised.

Original text: It does affect the staff and the prisoners.
Edited text: It does affect the jailer and the prisoners.

Original text: But this relative success has not been easy.
Edited text: But this limited victory has not been easy.

5. Ablation Study of Reflow-TSE

We designed two sets of ablation experiments, comparing the complete Reflow-TSE with a variant without noise modeling and with a variant without flow rectification. Bold text marks the segment reconstructed by the model.

Without Flow Rectification

Reflow-TSE	w/o Flow Rectification

These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.

That proved enough to settle the home side down.

The talks were announced yesterday by Brian Wilson, the foreign office minister.

That kind of growth is the important thing.

Without Noise Modeling

Reflow-TSE	w/o Noise Modeling

Everything is now in place for Manchester.

And they haven't taken the foot off the gas.

They do not work for Glasgow City Council.

A tribunal would then consider the seriousness of the incident.

[1] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), December 6-12, 2020, virtual, 2020.
[2] H. Bai, R. Zheng, J. Chen, M. Ma, X. Li, and L. Huang, “A3T: Alignment-aware acoustic and text pretraining for speech synthesis and editing,” in International Conference on Machine Learning (ICML 2022), 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, vol. 162. PMLR, 2022, pp. 1399–1411.
[3] Z. Jiang, Q. Yang, J. Zuo, Z. Ye, R. Huang, Y. Ren, and Z. Zhao, “FluentSpeech: Stutter-oriented automatic speech editing with context-aware diffusion models,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023. Association for Computational Linguistics, 2023, pp. 11655–11671.
[4] R. Liu, J. Xi, Z. Jiang, and H. Li, “FluentEditor: Text-based speech editing by considering acoustic and prosody consistency,” in Interspeech 2024, 2024, pp. 3435–3439.