Publication | SALVATORE MASTRANGELO | Università degli Studi di Palermo

A pipeline for variants discovery using next-generation DNA sequencing data

Authors: Tolone, M; Sardina, MT; Di Gerlando, R; Mastrangelo, S; Sutera, AM; Portolano, B.
Publication year: 2017
Type: Proceedings
OA Link: http://hdl.handle.net/10447/241352

Abstract

Recent advances in next-generation sequencing (NGS) technology provide a cost-effective approach to large-scale resequencing of livestock samples for the study of several biological phenomena. NGS produces millions of short DNA sequences, which must be processed in an unbiased way to allow comprehensive searches for variation and to identify putative causative mutations for economically important traits. The aim of this work was to present a bioinformatics pipeline for variant discovery in the ovine genome. A total of 30 Valle del Belice dairy ewes were used for whole-genome sequencing of pooled libraries prepared with the Illumina Nextera kit. Paired-end sequencing was carried out on an 8-lane flow cell of the Illumina HiScanSQ platform, yielding a total of 1,159,664,912 reads of 101 bp. The left and right raw reads were separated into two files and converted to FASTQ format using CASAVA 1.8. The whole procedure was split into separate workflows to give end users more flexibility. One workflow checks the quality of the raw sequencing reads using FastQC and the FASTX-Toolkit, keeping bases with a Phred quality score greater than 20 and trimming poor-quality reads. Another workflow aligns the reads to the Ovis aries 3.1 reference genome using BWA-MEM with standard parameters. The resulting SAM file was converted to BAM format using SAMtools, then unmapped and duplicate reads were removed using the CleanSam and MarkDuplicates commands of Picard. Next, to obtain more accurate base qualities, the Genome Analysis Toolkit (GATK) was used to locally realign reads so that the number of mismatching bases due to indels is minimized across all reads (IndelRealigner) and to detect systematic errors in base quality scores (BaseRecalibrator). In the last workflow, SNPs and indels are identified using the mpileup command of SAMtools.
The resulting BCF file is passed to the “bcftools view” tool to be filtered and converted to VCF format. Finally, the SnpSift software was used for variant annotation. A total of 6,357,170 variants were discovered, of which 5,265,739 were SNPs and 1,091,431 were indels. About 77% of the SNPs were present in the Ovis aries dbSNP v147, while the remainder were novel. The discovered SNPs must be validated and could then be used for several applications, such as phylogenetic analysis, genome-wide association studies, or genomic selection.
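
The SNP and indel tallies reported above could be reproduced from an annotated VCF file along the following lines. This is only a minimal sketch, not the authors' code: the function names (classify_variant, tally_variants) and the toy records in TOY_VCF are hypothetical, and it assumes that the annotation step writes a dbSNP identifier into the VCF ID column for known variants and leaves "." for novel ones. The final check uses only the totals stated in the abstract.

```python
from collections import Counter
from typing import Iterable, List

VALID_BASES = set("ACGTN")


def classify_variant(ref: str, alts: List[str]) -> str:
    """Classify one VCF record as 'snp', 'indel' or 'other'.

    Simplifying assumption: a site is a SNP only if REF and every ALT
    allele are single bases; it is an indel if any ALT differs in length
    from REF; everything else (symbolic alleles, MNPs, '.') is 'other'.
    """
    alleles = [ref] + alts
    if any(not set(a.upper()) <= VALID_BASES for a in alleles):
        return "other"
    if all(len(a) == 1 for a in alleles):
        return "snp"
    if any(len(a) != len(ref) for a in alts):
        return "indel"
    return "other"


def tally_variants(vcf_lines: Iterable[str]) -> Counter:
    """Count SNPs and indels, splitting SNPs into known and novel.

    Assumption: a SNP is 'known' when its ID column is not '.'.
    """
    counts: Counter = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        variant_id, ref, alt = fields[2], fields[3], fields[4]
        kind = classify_variant(ref, alt.split(","))
        counts[kind] += 1
        if kind == "snp":
            counts["snp_known" if variant_id != "." else "snp_novel"] += 1
    return counts


# Hypothetical toy records, for illustration only (not data from the study).
TOY_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t40\tPASS\t.",
    "2\t300\t.\tGA\tG\t30\tPASS\t.",
    "2\t400\t.\tT\tTCC\t30\tPASS\t.",
]

counts = tally_variants(TOY_VCF)
print(dict(counts))

# Consistency check on the totals reported in the abstract.
REPORTED_SNPS = 5_265_739
REPORTED_INDELS = 1_091_431
REPORTED_TOTAL = 6_357_170
assert REPORTED_SNPS + REPORTED_INDELS == REPORTED_TOTAL
```

Applied to the reported figures, a share of about 77% known SNPs corresponds to roughly 4.05 million SNPs already in dbSNP and roughly 1.21 million novel ones. Multi-allelic sites that mix a single-base and a length-changing allele are counted as indels in this sketch, so counts from a real file may differ slightly from those of other tools.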