RAFFAELE GIANCARLO

Distributed compressive genomics: Fundamental pattern matching primitives via Spark

  • Authors: Rocco, L.D.; Ferraro Petrillo, U.; Giancarlo, R.; Cattaneo, G.
  • Publication year: 2026
  • Type: Journal article
  • OA Link: http://hdl.handle.net/10447/692252

Abstract

Compressive genomics leverages compressed data representations to enhance the efficiency of bioinformatics tasks such as sequence comparison and search. Surprisingly, the fundamental operation of pattern matching on large DNA sequence collections remains unexplored in the realm of genomic analysis, even though distributed systems such as Spark offer the scalability necessary to process increasingly large genomic datasets efficiently. We present the first Spark-based implementations of the FM-Index and Compressed Boyer-Moore (CBM) algorithms, evaluating their performance and providing insights into their advantages for large-scale bioinformatics applications. A comprehensive experimental study demonstrates clear performance gains over uncompressed approaches. Furthermore, we introduce SparkGeco, a distributed compressive genomics software library designed to simplify the integration of the FM-Index and CBM algorithms into DNA sequence analysis pipelines within Apache Spark, thus supporting the development of efficient and scalable genomic analysis workflows. This work provides a concrete step towards high-performance, data-centric eScience solutions in computational biology.