

1. Abstract

Although text-based speech editing (TSE) has made remarkable progress on clean speech, much real-world speech is recorded in noisy environments, and editing such noisy speech effectively remains a challenge. Background noise degrades the quality of the generated speech, and the edited segment fails to stay consistent with the surrounding noise context. We propose Reflow-TSE, a robust and efficient TSE model for noise-consistent speech editing. For noise robustness, 1) we design a noise condition module that extracts frame-level background noise sequences from noisy speech, and 2) we introduce an enhanced context-conditioned prediction module that predicts the masked noise sequences along with the conventional duration and pitch, and uses these conditions to guide generation. For efficiency, 3) we introduce a rectified flow model that leverages the speech context and the predicted conditions to achieve high-quality editing with few sampling steps. Experimental results show that with just 2 sampling steps, Reflow-TSE achieves context-consistent noisy speech editing, a capability absent in other models. On clean speech, Reflow-TSE also matches or surpasses the baseline models.
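To illustrate why a rectified flow can work with very few sampling steps, here is a minimal sketch, not the Reflow-TSE implementation. It integrates a velocity field from noise (t = 0) to data (t = 1) with fixed-step Euler updates and keeps the unmasked context frames untouched. The names `euler_sample`, `edit_mel`, and `toy_velocity` are hypothetical, and the toy velocity field simply points straight at a known target, standing in for the trained, condition-guided network.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1
        x = x + dt * velocity_fn(x, t)
    return x

def edit_mel(context_mel, mask, velocity_fn, num_steps=2, seed=0):
    """Generate the masked frames and keep the unmasked context frames.

    context_mel: (frames, mel_bins) array; mask: (frames,) boolean array.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(context_mel.shape)
    generated = euler_sample(velocity_fn, noise, num_steps)
    return np.where(mask[:, None], generated, context_mel)

# Toy stand-in for the learned velocity network (assumption): a perfectly
# straight flow toward a fixed target spectrogram.
target = np.linspace(-1.0, 1.0, 24).reshape(6, 4)

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

context_mel = np.zeros((6, 4))
mask = np.array([False, False, True, True, False, False])
edited = edit_mel(context_mel, mask, toy_velocity, num_steps=2)
```

With a perfectly straight flow the velocity is constant along each path, so two Euler steps already land on the target in the masked frames. A trained model only approximates this, which is why the demos below compare 2-step and 8-step sampling.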

Fig. 1: The architecture of Reflow-TSE.

2. Baseline Models

(1) Ground Truth: The waveform reconstructed by the HiFi-GAN [1] vocoder from ground-truth Mel-spectrograms. [code]
(2) A3T [2]: A3T adopts an alignment-aware acoustic-text pretraining method, reconstructing high-quality spectrograms from partially masked inputs. [code]
(3) FluentSpeech [3]: FluentSpeech introduces diffusion models into a context-aware mask prediction network, iteratively refining the edited Mel-spectrogram based on contextual features. [code]
(4) FluentEditor [4]: FluentEditor further improves editing fluency by adding acoustic and prosodic consistency training to FluentSpeech. [code]
(5) FluentSpeech-N: FluentSpeech with the noise condition module and the masked noise predictor added, but without the rectified flow decoder.
(6) Mix Method: This approach first generates clean speech using Reflow-TSE, then resamples the noise waveform extracted by the noise extractor to match the speech length, and finally adds the noise to the clean speech.
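The last two steps of the Mix Method can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the noise is resampled or combined, so linear interpolation via `np.interp`, plain sample-wise addition, and clipping to [-1, 1] are all assumptions, and `resample_to_length` and `mix` are hypothetical helper names.

```python
import numpy as np

def resample_to_length(noise, target_len):
    """Stretch or shrink a 1-D noise waveform to target_len samples
    using linear interpolation (an assumed resampling method)."""
    src = np.linspace(0.0, 1.0, num=len(noise))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(dst, src, noise)

def mix(clean, noise):
    """Add length-matched noise to clean speech and clip to [-1, 1]."""
    stretched = resample_to_length(noise, len(clean))
    return np.clip(clean + stretched, -1.0, 1.0)

# Synthetic stand-ins for the edited clean speech and the extracted noise.
rng = np.random.default_rng(0)
clean = 0.1 * np.sin(np.linspace(0.0, 200.0, 16000))
noise = 0.01 * rng.standard_normal(12000)
mixed = mix(clean, noise)
```

Because the noise is stretched and added after editing, it is not conditioned on the generated speech, which is the difference from Reflow-TSE's noise-conditioned generation.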

3. Noisy Speech

Demos of insertion and replacement operations on noisy speech using Reflow-TSE and other baseline models.
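The two operation types can be told apart by a word-level comparison of the original and edited texts. The sketch below uses Python's standard `difflib.SequenceMatcher`; it only illustrates the distinction and is not the alignment procedure used by Reflow-TSE. The helper names `tokenize` and `edit_operations` are hypothetical.

```python
import difflib
import re

def tokenize(text):
    """Lowercase the text and keep only word tokens (punctuation dropped)."""
    return re.findall(r"[a-z']+", text.lower())

def edit_operations(original, edited):
    """Return the non-matching word spans as (tag, old_words, new_words)."""
    a, b = tokenize(original), tokenize(edited)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, a[i1:i2], b[j1:j2]))
    return ops

insertion = edit_operations(
    "It is very unfair, and something should be done.",
    "It is very unfair for ours, and something should be done.",
)
replacement = edit_operations(
    "The Chancellor will deal from a position of strength.",
    "The premier will deal from a position of strength.",
)
```

Here `insertion` holds a single `insert` span with the new words, and `replacement` holds a single `replace` span pairing the old word with the new one.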

Insertion Operation
Original text: It is very unfair, and something should be done.
Edited text: It is very unfair for ours, and something should be done.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N | Mix Method

Original text: It is believed the size of the settlement has not been finalised.
Edited text: It is believed the size of the financial settlement has not been finalised.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N | Mix Method

Original text: In fact, they have the opposite effect.
Edited text: In fact, they often have the opposite effect.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N | Mix Method

Replacement Operation
Original text: The Chancellor will deal from a position of strength.
Edited text: The premier will deal from a position of strength.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N | Mix Method

Original text: The rainbow is a division of white light into many beautiful colors.
Edited text: The rainbow is a division of white light into many different colors.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N | Mix Method

Original text: It is still too early for any likely contenders to have emerged.
Edited text: It is still too early for any likely competitors to have emerged.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N | Mix Method

4. Clean Speech

Demos of insertion and replacement operations on clean speech using Reflow-TSE and other baseline models.

Insertion Operation
Original text: That proved enough to settle the home side down.
Edited text: That proved barely enough to settle the home side down.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N

Original text: You can cope with it.
Edited text: You can not cope with it.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N

Original text: The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.
Edited text: The Norsemen considered the beautiful and colorful rainbow as a bridge over which the gods passed from earth to their home in the sky.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N

Replacement Operation
Original text: It is believed the size of the settlement has not been finalised.
Edited text: It is believed the size of the transaction has not been finalised.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N

Original text: It does affect the staff and the prisoners.
Edited text: It does affect the jailer and the prisoners.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N

Original text: But this relative success has not been easy.
Edited text: But this limited victory has not been easy.
Audio samples: Ground Truth | Reflow-TSE 2-steps | Reflow-TSE 8-steps | A3T | FluentSpeech | FluentEditor | FluentSpeech-N

5. Ablation Study of Reflow-TSE

Two sets of ablation experiments were designed: the complete Reflow-TSE is compared with a variant that removes noise modeling and with a variant that removes flow rectification. Bold text marks the part of each sentence reconstructed by the model.

Without Flow Rectification
Text: These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.
Audio samples: Reflow-TSE | w/o Flow Rectification

Text: That proved enough to settle the home side down.
Audio samples: Reflow-TSE | w/o Flow Rectification

Text: The talks were announced yesterday by Brian Wilson, the foreign office minister.
Audio samples: Reflow-TSE | w/o Flow Rectification

Text: That kind of growth is the important thing.
Audio samples: Reflow-TSE | w/o Flow Rectification

Without Noise Modeling
Text: Everything is now in place for Manchester.
Audio samples: Reflow-TSE | w/o Noise Modeling

Text: And they haven't taken the foot off the gas.
Audio samples: Reflow-TSE | w/o Noise Modeling

Text: They do not work for Glasgow City Council.
Audio samples: Reflow-TSE | w/o Noise Modeling

Text: A tribunal would then consider the seriousness of the incident.
Audio samples: Reflow-TSE | w/o Noise Modeling


[1] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[2] H. Bai, R. Zheng, J. Chen, M. Ma, X. Li, and L. Huang, “A3T: Alignment-aware acoustic and text pretraining for speech synthesis and editing,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, vol. 162. PMLR, 2022, pp. 1399–1411.
[3] Z. Jiang, Q. Yang, J. Zuo, Z. Ye, R. Huang, Y. Ren, and Z. Zhao, “FluentSpeech: Stutter-oriented automatic speech editing with context-aware diffusion models,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023. Association for Computational Linguistics, 2023, pp. 11655–11671.
[4] R. Liu, J. Xi, Z. Jiang, and H. Li, “FluentEditor: Text-based speech editing by considering acoustic and prosody consistency,” in Interspeech 2024, 2024, pp. 3435–3439.