Training a linear transformation between speech encoders and LLMs enables LLMs to transcribe speech, and SLAM-ASR is one such recently proposed architecture. This paper examines its adaptability across three domains of varying difficulty, for both word- and phoneme-level transcription: read speech (LibriSpeech, easiest), meeting speech (AMI, medium), and post-stroke aphasic speech (AphasiaBank, most difficult). After studying cross-domain adaptability, our work explores transfer learning that seeds fine-tuning on the target domain with a model trained on a source domain. Results show that transferring from an easier to a harder domain offers little benefit, while the reverse appears to improve model robustness in the easier target domain. Our work also examines the impact of a phoneme encoder at the input and of fine-tuning with multiple single-task instructions on the phoneme and word transcription tasks. This work advances the adaptation of LLM-based ASR for atypical speech transcription.
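
To make the coupling concrete, the following is a minimal PyTorch sketch of such a trainable linear projector in the spirit of SLAM-ASR: frames from a frozen speech encoder are stacked for temporal downsampling and then linearly mapped into the LLM's embedding space, where they act as soft tokens alongside the text prompt. The class name, dimensions (1024-d encoder features, 4096-d LLM embeddings), and stacking factor are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class SpeechToLLMProjector(nn.Module):
    """Sketch of a trainable linear bridge from a frozen speech encoder
    to a frozen LLM's embedding space. All dimensions are hypothetical."""

    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096, k: int = 5):
        super().__init__()
        self.k = k  # downsampling factor: stack k consecutive encoder frames
        self.proj = nn.Linear(encoder_dim * k, llm_dim)  # the only trained weights

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, encoder_dim) from the frozen encoder
        b, t, d = speech_feats.shape
        t = t - t % self.k  # truncate so frames stack evenly into groups of k
        stacked = speech_feats[:, :t].reshape(b, t // self.k, d * self.k)
        return self.proj(stacked)  # (batch, t // k, llm_dim) "soft tokens"

# Usage: the projected embeddings would be concatenated with the prompt's
# token embeddings and fed to the LLM; only the projector receives gradients.
feats = torch.randn(2, 100, 1024)            # stand-in for encoder output
soft_tokens = SpeechToLLMProjector()(feats)  # shape: (2, 20, 4096)

Freezing both the encoder and the LLM while training only this projector is what makes the approach lightweight enough to re-run per domain, which is the setting the cross-domain and transfer-learning experiments above exploit.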