RAFFAELE GIANCARLO

Textual data compression in computational biology: Algorithmic techniques.

Authors: Giancarlo, R; Scaturro, D; Utro, F
Year of publication: 2012
Type: Journal article
Keywords: Data Compression Theory and Practice, Alignment-free sequence comparison, Entropy, Huffman coding, Hidden Markov Models, Kolmogorov complexity, Lempel–Ziv compressors, Minimum Description Length principle, Pattern discovery in bioinformatics, Reverse engineering of biological networks, Sequence alignment
OA Link: http://hdl.handle.net/10447/69844

Abstract

A recent review [R. Giancarlo, D. Scaturro, F. Utro, Textual data compression in computational biology: a synopsis, Bioinformatics 25 (2009) 1575–1586] gave the first systematic organization and presentation of the impact of textual data compression on the analysis of biological data. It focused on the key areas of bioinformatics and computational biology where compression has been used, together with a technical presentation of how well-known notions from information theory have been adapted to work successfully on biological data. Rather surprisingly, the use of data compression is pervasive in computational biology. Building on that review, this companion review focuses on the computational methods involved in the use of data compression in computational biology. Indeed, although one would expect ad hoc adaptation of compression techniques to work on biological data, unifying and homogeneous algorithmic approaches are emerging. Moreover, given that experiments based on parallel sequencing are the future of biological research, data compression techniques are among a handful of candidates that seem able to deal successfully with the deluge of sequence data those experiments produce, although so far only in terms of storage and indexing, with analysis still a challenge. Therefore, the two reviews, which complement each other, should be a useful starting point for computer scientists who want to get acquainted with many of the computational challenges arising in computational biology in which core ideas of the information sciences are already having a substantial impact.