
MARCO LA CASCIA

A Unified Attention-Based Model for Segmenting Compound Words in Sanskrit

  • Authors: Ali, I.; Presti, L.L.; Spano', I.; Cascia, M.L.
  • Publication year: 2025
  • Type: Contribution in conference proceedings published in a volume
  • OA Link: http://hdl.handle.net/10447/692830

Abstract

This paper proposes a novel deep learning-based multi-task approach to Sanskrit compound word splitting, a problem that poses significant challenges due to the complex morphology of the language. The model is trained to jointly solve two closely related tasks: character-level segmentation location prediction and segmented word generation with a sequence-to-sequence model. The proposed architecture consists of two branches and relies on a shared character-level embedding layer combined with bidirectional LSTMs to effectively capture character-level contextual variations. Each branch further leverages a Multi-Head Attention mechanism; experiments show that this mechanism improves character-level segmentation location prediction and guides the decoder when generating segmented words. Unlike state-of-the-art approaches, which adopt cascaded models to solve the two tasks independently, the proposed unified model initializes the decoder with a composite state that combines the encoder context with the location features computed by the segmentation branch, allowing the two tasks to support each other and improving the overall model accuracy. Experiments on a challenging Sanskrit compound dataset show that the proposed model outperforms traditional methods, achieving a location prediction accuracy of 93.51% and a splitting accuracy of 87.92%. The experiments further show that jointly modeling the two tasks not only advances the state of the art in splitting Sanskrit compounds, but also offers a richer understanding of the splitting process, paving the way for stronger morphological analysis in natural language processing and in digital humanities for the in-depth study of complex religious texts.
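The data flow described in the abstract (shared character embeddings, a two-branch design, attention within each branch, and a decoder initialized from a composite state) can be illustrated with a minimal NumPy sketch. All dimensions, the pooling choices, and the weight matrices below are hypothetical; learned components such as the BiLSTM and projection layers are replaced by random stand-ins, so this shows only the tensor plumbing, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_head_attention(q, k, v, n_heads):
    """Scaled dot-product attention split across heads.
    Learned Q/K/V projections are omitted for brevity."""
    d = q.shape[-1]
    dh = d // n_heads
    outs = []
    for h in range(n_heads):
        qh, kh, vh = (m[:, h * dh:(h + 1) * dh] for m in (q, k, v))
        scores = qh @ kh.T / np.sqrt(dh)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)          # softmax per query
        outs.append(w @ vh)
    return np.concatenate(outs, axis=-1)

# Hypothetical sizes: sequence length, embedding dim, hidden dim, heads
T, d_emb, d_hid, n_heads = 12, 32, 64, 4

# Shared character-level embedding layer feeding both branches
char_ids = rng.integers(0, 50, size=T)
embedding = rng.normal(size=(50, d_emb))
x = embedding[char_ids]                              # (T, d_emb)

# Stand-in for BiLSTM outputs: one contextual vector per character
enc = rng.normal(size=(T, d_hid))                    # (T, d_hid)

# Branch 1: per-character split / no-split logits via attention
attn_seg = multi_head_attention(enc, enc, enc, n_heads)
W_loc = rng.normal(size=(d_hid, 2))
loc_logits = attn_seg @ W_loc                        # (T, 2)

# Branch 2: the decoder's "composite" initial state combines the
# pooled encoder context with pooled location features
context = enc.mean(axis=0)                           # (d_hid,)
loc_feat = attn_seg.mean(axis=0)                     # (d_hid,)
decoder_init = np.concatenate([context, loc_feat])   # (2 * d_hid,)
```

The key point the sketch makes concrete is the composite state: instead of feeding the decoder only the encoder context (as a cascaded pipeline would), the segmentation branch's features are concatenated in, so gradients from the generation task also reach the location-prediction branch.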