SIMONA ESTER ROMBO

BioSet2Vec: extraction of k-mer dictionaries from multiple sets of biological sequences via big data technologies

Abstract

Background: In several contexts involving large collections of sets of biological sequences, a relevant problem is selecting significant groups of k-mers that characterize one set with respect to the others in the same collection.

Results: A software framework is proposed that implements a novel methodology for extracting k-mer dictionaries from multiple sets of biological sequences. It has been implemented using recent technologies for Big Data analytics, with the aim of supporting a variety of input datasets of any size. Two packages are provided. The first, BioFt, enables the extraction of recurrent patterns based on k-mer frequency and the computation of other metrics from information retrieval, here specialized for biological sequences. The second, BioSet2Vec, extends the functionality of BioFt by allowing the creation of dictionaries according to different criteria.

Conclusions: The framework has been validated on three case studies: (1) the characterization of different chromatin states; (2) the study of the association between different diseases and related genes; (3) the analysis of genomes of different organisms. All tests performed on the considered datasets have demonstrated the potential of the proposed approach.
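To make the idea of frequency-based k-mer dictionaries concrete, the following is a minimal single-machine sketch in plain Python. It is not the BioFt or BioSet2Vec implementation and does not use any Big Data framework. The function names `kmer_counts` and `tfidf_dictionaries`, the choice of TF-IDF as the information-retrieval metric, and the `top_n` selection criterion are all assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across all sequences of one set."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def tfidf_dictionaries(sets, k, top_n):
    """For each named set, return up to top_n k-mers ranked by TF-IDF.

    Hypothetical criterion: each set of sequences is treated as one
    "document", and k-mers occurring in every set score zero and are dropped.
    """
    counts = {name: kmer_counts(seqs, k) for name, seqs in sets.items()}
    n_sets = len(counts)

    # Document frequency: the number of sets in which each k-mer occurs.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    dictionaries = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores = {
            kmer: (n / total) * math.log(n_sets / df[kmer])
            for kmer, n in c.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        dictionaries[name] = [kmer for kmer, s in ranked[:top_n] if s > 0]
    return dictionaries


# Toy input: two sets of one sequence each (illustrative data only).
example = {"A": ["ACGTACGT"], "B": ["TTTTACGT"]}
result = tfidf_dictionaries(example, k=4, top_n=2)
```

In this toy input, the k-mers shared by both sets (such as `ACGT`) receive a score of zero, so each resulting dictionary contains only k-mers that distinguish its set from the other.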