Abstract

Voice conversion (VC) enables natural speech synthesis from minimal data; however, it also poses security risks such as identity theft and privacy breaches. To address this, we propose Mimic Blocker, an active defense mechanism that prevents VC models from extracting speaker characteristics while preserving audio quality. Our method combines adversarial training, an audio quality preservation strategy, and an attack strategy. Because it relies only on publicly available pretrained feature extractors, it provides model-agnostic protection. Furthermore, it enables self-supervised training using only the original speaker's speech. Experimental results demonstrate that our method achieves robust defense performance in both white-box and black-box scenarios. Notably, the proposed approach maintains audio quality by generating noise that is imperceptible to human listeners, enabling protection while retaining natural voice characteristics in practical applications.
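The core attack idea can be pictured as a projected-gradient search over a small additive waveform perturbation. The sketch below is illustrative only, not the paper's actual training pipeline: `encoder` stands for any publicly available pretrained speaker feature extractor, and the hyperparameters (`eps`, `steps`, `alpha`) are placeholder values.

```python
# Illustrative PGD-style sketch (assumed setup, not the paper's code):
# find an L-infinity-bounded perturbation that pushes the extracted
# speaker embedding away from the clean one, so a VC model conditioned
# on the perturbed audio cannot mimic the speaker.
import torch
import torch.nn.functional as F

def protect_waveform(wav, encoder, eps=2e-3, steps=50, alpha=2e-4):
    """wav: (batch, samples) waveform in [-1, 1].
    encoder: assumed pretrained extractor returning speaker embeddings."""
    clean_emb = encoder(wav).detach()
    delta = torch.zeros_like(wav, requires_grad=True)
    for _ in range(steps):
        adv_emb = encoder(wav + delta)
        # Ascend on embedding dissimilarity: lower cosine similarity
        # means the extractor no longer recognizes the speaker.
        loss = -F.cosine_similarity(adv_emb, clean_emb, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            # The L-infinity clamp is a simple stand-in for the paper's
            # audio quality preservation strategy: it keeps the injected
            # noise small enough to remain imperceptible.
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (wav + delta).detach()
```

Because the gradient comes from a public feature extractor rather than from any specific VC model, a perturbation found this way is model-agnostic in spirit, which is the property the abstract highlights.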

Model Architecture

Training and testing stages of Mimic Blocker. The red arrows represent the computational pathways used to calculate the loss functions.

Audio Demos

For an adversarial attack to be successful, the noise injected into the style audio (the adversarial waveform) should be imperceptible to human listeners, while the converted output (the adversarial output) should deviate maximally from the original speaker's vocal characteristics.
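These two success criteria can be made concrete with simple proxies: signal-to-noise ratio (SNR) for the imperceptibility of the injected noise, and speaker-embedding cosine similarity for deviation from the original voice. The snippet below is a hedged sketch; `speaker_encoder` is an assumed pretrained embedding model, not part of the released demo.

```python
# Quantitative proxies for the two criteria above (illustrative only).
import torch
import torch.nn.functional as F

def snr_db(clean, adversarial):
    """Higher SNR = quieter injected noise = harder to hear."""
    noise = adversarial - clean
    return 10.0 * torch.log10(clean.pow(2).sum() / noise.pow(2).sum())

def speaker_similarity(ref_wav, test_wav, speaker_encoder):
    """Lower cosine similarity = the converted output has drifted
    further from the original speaker's vocal characteristics."""
    return F.cosine_similarity(
        speaker_encoder(ref_wav), speaker_encoder(test_wav), dim=-1
    )
```

A successful defense shows a high SNR between the style waveform and the adversarial waveform, together with a low similarity between the original speaker's voice and the adversarial output.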

White Box

[Audio players: Samples 1–4. Each sample provides four clips: Style Waveform, Original Output, Adversarial Waveform, and Adversarial Output.]

Black Box

[Audio players: Samples 1–4. Each sample provides four clips: Style Waveform, Original Output, Adversarial Waveform, and Adversarial Output.]