Self-Supervised Fine-Tuning of Automatic Speech Recognition Systems against Signal Processing Attacks
In: ACM Asia Conference on Computer and Communications Security (2024)
Type: Conference
Abstract:
Automatic Speech Recognition (ASR) systems take audio signals as input and output the corresponding text transcriptions. The text is then used to execute commands and perform searches in several application domains, including security-critical applications such as smartphone assistants, smart home assistants, and self-driving car assistants. Signal processing attacks are one of the most recent types of attacks designed to fool ASR models. They exploit the feature extraction stage of the ASR pipeline and add perturbations to the audio. These attacks cause the ASR system to produce wrong transcriptions even though the attacked audio sounds similar to the original audio. Existing defences against adversarial attacks are neural networks that act as filters to remove attacks from audio waveforms. The heuristic-based objective function used to train these filter networks has a negative impact on performance. There is also a disconnect between the training objective function and the application objective function. We address these problems and propose a novel self-supervised fine-tuning algorithm to make existing ASR models robust to adversarial attacks. We evaluate our method extensively against signal processing attacks across four different scenarios, and it achieves the best results in three of the four.
Related Projects: