Cross Lingual Embeddings for Clinical Text: A Statistical Framework for Validating Real and Synthetic Electronic Health Records
- Authors: Speciale Marco; Albano Alessandro; Sciandra Mariangela; Plaia Antonella
- Year of publication: 2025
- Type: Conference proceedings contribution published in a volume
- OA Link: http://hdl.handle.net/10447/684744
Abstract
The effective integration of real and synthetic clinical data in multiple languages is essential to advance healthcare research. In this study, we propose a statistical framework that leverages cross-lingual embeddings to validate semantic alignment between authentic Italian EHRs and synthetic English clinical notes. Using two state-of-the-art models, E5 and BGE, we encode the texts and employ Fuzzy C-Means clustering along with multidimensional scaling to assess their semantic coherence. Our analysis reveals distinct language-specific patterns alongside robust cross-lingual alignment, highlighting the promise of synthetic data augmentation in mitigating resource scarcity.
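The clustering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D vectors stand in for E5 or BGE embeddings, and the function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the cluster count, and the seed are all assumptions chosen for illustration. The multidimensional scaling step is not covered here.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=200, tol=1e-8, seed=0):
    """Plain Fuzzy C-Means on a list of vectors; returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Centers: means of the points weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total for d in range(dim)]
            )
        # Memberships: normalised inverse distances raised to 2 / (m - 1).
        new_u = []
        for p in points:
            dist = [max(math.dist(p, c), 1e-12) for c in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(inv)
            new_u.append([v / total for v in inv])
        shift = max(
            abs(new_u[i][k] - u[i][k]) for i in range(n) for k in range(n_clusters)
        )
        u = new_u
        if shift < tol:
            break
    return centers, u


# Hypothetical toy data: two topics, each present in both languages.
italian_real = [[0.90, 0.10], [1.00, 0.20], [0.10, 0.90], [0.20, 1.00]]
english_synth = [[0.95, 0.15], [1.05, 0.10], [0.15, 0.95], [0.10, 1.05]]
points = italian_real + english_synth
languages = ["it"] * len(italian_real) + ["en"] * len(english_synth)

centers, memberships = fuzzy_c_means(points, n_clusters=2)
# Hard label per document: the cluster with the highest membership.
labels = [row.index(max(row)) for row in memberships]

# Count how many documents of each language fall in each cluster.
mix = {k: {"it": 0, "en": 0} for k in range(2)}
for lang, label in zip(languages, labels):
    mix[label][lang] += 1
print(mix)
```

In this toy setting, a cluster that contains both Italian and English documents would be read as a sign of cross-lingual alignment, while a cluster dominated by one language would point to a language-specific pattern. With real embeddings, the soft memberships themselves, rather than only the hard labels, would likely carry most of the information.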