RAFFAELE GIANCARLO

ValWorkBench: An open source Java library for cluster validation, with applications to microarray data analysis.

Authors: R. Giancarlo; D. Scaturro; F. Utro
Year of publication: 2015
Type: Journal article
OA Link: http://hdl.handle.net/10447/137712

Abstract

Background: Cluster analysis is one of the best-known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. With the advent of high-throughput technologies, it has become central to the life sciences, e.g., in the classification of tumors. In cluster analysis, it is particularly relevant to assess cluster quality and to predict the number of clusters in a dataset, if any. The latter task is usually performed via internal validation measures. Despite their potentially important role, neither the use of classic internal validation measures nor the design of new ones specific to microarray data seems to have great prominence in bioinformatics, where attention is mostly given to clustering algorithms. There is therefore a growing need for simple, ready-to-use software tools and libraries dedicated to validation analysis. Unfortunately, the state of the art for such software presents several drawbacks, the main ones being the following: (a) the existing libraries, for several reasons, privilege the “user point of view”, resulting in very little access to, and reusability of, the main building blocks of the implemented functions; (b) cluster validation, and therefore also the corresponding existing software, has received very little attention in terms of both algorithmic design and engineering, with the effect that time performance limits the use of validation programs.

Results: To partially address at least these two drawbacks, we present ValWorkBench. It is the first open source library for cluster validation analysis that places the “developer point of view” on a par with the “user point of view”: it has been specifically designed to provide fast algorithms and full usability of all its building blocks, in addition to being readily usable via a GUI. It is also of use in assessing the quality of new clustering algorithms. Although the entire library has been validated in the context of gene expression data obtained by microarray technologies, it can be used for any kind of multidimensional data (e.g., RNA-seq).

Conclusions: We present the very first software development platform for cluster validation analysis that has been specifically designed to provide fast algorithms and full usability of all its building blocks. In this paper and the related supplementary material, the developer point of view is privileged by giving a detailed description of the library architecture and of the content of each Java class. For completeness, some relevant domains of application are also mentioned.