UniQA: an Italian and English Question-Answering Data Set Based on Educational Documents
- Autori: Irene Siragusa; Roberto Pirrone
- Anno di pubblicazione: 2024
- Tipologia: Contributo in atti di convegno pubblicato in volume
- OA Link: http://hdl.handle.net/10447/678364
Abstract
In this paper we introduce UniQA, a high-quality Question-Answering data set that comprehends more than 1k documents and nearly 14k QA pairs. UniQA has been generated in a semi-automated manner using the data retrieved from the website of the University of Palermo, covering information about the bachelor and master degree courses for the academic year 2024/2025. Data are both in Italian and English, thus making the data set suitable for QA and translation models. To assess the data, we propose a Retrieval Augmented Generation model based on Llama-3.1-instruct. UniQA can be found at https://github.com/CHILab1/UniQA.