
SABATO MARCO SINISCALCHI

Using Cross-Attention for Conversational ASR over the Telephone

  • Authors: Dymbe, S.; Siniscalchi, S.M.; Svendsen, T.; Salvi, G.
  • Publication year: 2026
  • Type: Conference paper published in proceedings volume
  • OA Link: http://hdl.handle.net/10447/689285

Abstract

We present a neural architecture for speech recognition over the telephone. In telephone conversations, the two speakers are typically recorded on separate channels. Although this separation is mostly an advantage, it also removes important contextual information coming from the other speaker. Earlier approaches to this problem either 1) do not precisely model the temporal relationship between the two channels, or 2) give the model access to the other channel's context only in the form of text. We propose a Transformer model that applies cross-attention between the two channels of a telephone conversation and uses positional encodings that provide the model with the exact temporal relationship between the channels. Our empirical results on the Fisher, CallHome, and Switchboard datasets show that our model outperforms the HuBERT baseline by a significant margin. We also provide an analysis of the cross-attention maps, which sheds some light on the internal workings of the model.
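
The core idea of the abstract — cross-attention from one channel's frames into the other channel's frames, with positional encodings computed on a shared conversation timeline so the attention scores reflect the true temporal alignment — can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the authors' model: it uses a single attention head, no learned query/key/value projections, and plain sinusoidal encodings added to the features.

```python
import numpy as np

def sinusoidal_pe(positions, d):
    """Sinusoidal positional encodings for absolute frame indices.

    `positions` are frame indices on the *shared* conversation timeline,
    so encodings from the two channels are directly comparable in time.
    """
    positions = np.asarray(positions, dtype=float)
    pe = np.zeros((len(positions), d))
    div = np.exp(-np.log(10000.0) * np.arange(0, d, 2) / d)
    pe[:, 0::2] = np.sin(positions[:, None] * div)
    pe[:, 1::2] = np.cos(positions[:, None] * div)
    return pe

def cross_attention(q_feats, kv_feats, q_pos, kv_pos):
    """Single-head cross-attention from one channel into the other.

    q_feats:  (T_q, d) frames of the channel being recognized
    kv_feats: (T_kv, d) frames of the *other* speaker's channel
    q_pos/kv_pos: absolute frame indices on the shared timeline
    Returns (context, attention_weights).
    """
    d = q_feats.shape[-1]
    # Inject the shared-timeline positions so the attention scores
    # encode the accurate temporal relationship of the two channels.
    q = q_feats + sinusoidal_pe(q_pos, d)
    k = kv_feats + sinusoidal_pe(kv_pos, d)
    v = kv_feats
    scores = q @ k.T / np.sqrt(d)                       # (T_q, T_kv)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over kv frames
    return weights @ v, weights

# Toy usage: 5 frames of speaker A attending over 7 frames of speaker B.
rng = np.random.default_rng(0)
a_feats, b_feats = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
context, attn = cross_attention(a_feats, b_feats,
                                q_pos=np.arange(5), kv_pos=np.arange(7))
```

Inspecting `attn` row by row is the same kind of analysis as the cross-attention maps discussed in the abstract: each row shows which frames of the other speaker the model attends to at a given time step.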