As the development of the second generation sequencing technology (NGS), research about the genetic variation can be realized by sequencing about the whole genome of an individual or re-sequencing about the target area. Single nucleotide polymorphism (SNP) is the most common form of genetic variation. SNP detection is to find a new polymorphism site and the known polymorphism alleles on the target area.
There is lots of SNP detection software working on NGS data, among which is the widely used SOAPsnp. It takes into account the quality of sequencing data and errors of alignment and experiment to use a Bayesian model based SNP detection algorithm for calculation of quality score of each base. These quality scores are used as the standard of consensus sequence calling. Combined with the prior probability of dbSNP allele, it gets a low error rate for low-depth sequencing.
GSNP is an implementation of SOAPsnp on GPU. It uses one GPU thread to process one independent site, and optimizes the program in two ways: 1. Use sparse data structure to store the aligned base so as to reduce memory overhead; 2. Develop customized compression algorithms to reduce I/O overhead.
With these optimizations, GSNP can get more than 40X speedup compared with SOAPsnp. Originally, SOAPsnp needs 3 days to process the whole human genome while GSNP only uses 2 hours to complete the same work.
1.2 Executable Program
2.1 Hardware Requirement
A machine equipped with a NVIDIA video card with compute capability 1.3 or above.2.2 Software Requirement
GSNP was developed under 64-bit Linux system. It needs the following software to compile:
gcc 4.1 (download) or above
Boost 1.46 (download) or above
CUDA Developer Drivers, CUDA Toolkit & GPU Computing SDK 3.2 (download) or above
Note: The versions of CUDA Toolkit and GPU Computing SDK must be the same.
1. Extract the source code.
|$ tar -xvf gsnp_v1.0.tar.gz|
2. Go into the extracted directory.
|$ cd gsnp|
3. Add the header file paths of CUDA Toolkit, GPU Computing SDK and Boost to the environment variable CPATH, and the library paths of CUDA Toolkit, GPU Computing SDK and Boost to the environment variable LIBRARY_PATH.
|$ export CPATH=$CUDA_TOOLKIT_INSTALL_PATH/include:$CPATH
$ export CPATH=$GPU_COMPUTING_SDK_INSTALL_PATH/C/common/inc:$CPATH
$ export CPATH=$BOOST_INSTALL_PATH/include:$CPATH
$ export LIBRARY_PATH=$CUDA_TOOLKIT_INSTALL_PATH/lib64:$LIBRARY_PATH
$ export LIBRARY_PATH=$GPU_COMPUTING_SDK_INSTALL_PATH/C/lib:$LIBRARY_PATH
$ export LIBRARY_PATH=$BOOST_INSTALL_PATH/lib:$LIBRARY_PATH
4. Make the executable program from the source code.
$ make all
4. HOW TO USE
4.1 Quick Start
For diploid genome resequencing:
|gsnp -i <Alignment.soap.sort.chrN> -d <chrN.fasta> -o <chrN.consensus> -r 0.00005 -e 0.0001 -t -u -L <Maximum Read Length> -M <chrN.mat>|
For monoploid genome resequencing:
|gsnp -i <Alignment.soap.sort.chrN> -d <chrN.fasta> -o <chrN.consensus> -r 0.0001 -t -u -L <Maximum Read Length> -M <chrN.mat> -m|
4.2 Command Line Options
|-i||<FILE>||Input SORTED SOAP alignment result（"sorted" means alignments of each chromosome are sorted first by chromosome name lexicographically and then by coordinates on each chromosome numerically）|
|-d||<FILE>||Reference DNA sequence in FASTA format|
|-o||<FILE>||Output consensus file|
4.2.2 Optional Parameters:(default in [ ])
|-z||<char>||ASCII character that stands for quality score==0 [@]|
FASTQ files generated by Illumina base-calling pipeline use ‘@’ as 0, but some institutes use ‘!’ as 0.
|-g||<double>||Global error dependency coefficient, 0.0(complete dependent)~1.0(complete independent) [0.9]|
|-p||<double>||PCR error dependency coefficient, 0.0(complete dependent)~1.0(complete independent) [0.5]|
|Sequencing errors are found slightly repeatable (once an error occur, additional errors also tend to occur) due to various reasons. Therefore, observations of sequencing errors are not complete independent.|
The main source of repeatable errors is believed to be PCR amplification in sequencing process. The proper values of the two parameters rely on wetlab process. Nonetheless, the default value generally work at most time.
|-r||<double>||Novel altHOM prior probability [0.0005]|
|-e||<double>||Novel HET prior probability [0.0010]|
|-r&-e are prior probabilities of homozygous SNPs (altHOM) and heterozygous SNPs (HET), which are used in Bayes formula calculation. Note these are prior probabilities of a new (novel) SNP. They are expected to be stringent. For different species, the two values should change if necessary.|
|-t||Set transition/transversion ratio to 2:1 in prior probability|
|-s||<FILE>||Pre-formatted known SNP information.|
The file consist of a lot of lines like this one:
chr1 201979756 1 1 0 0.161 0 0 0.839 rs568The columns from left to right are:
name of chromosome, coordinate on the chromosome, whether the SNP has allele frequency information (1 is true, 0 is false), whether the SNP is validated by experiment (1 is true, 0 is false), whether the SNP is actually an indel (1 is true, 0 is false), frequency of A, frequency of C, frequency of T, frequency of G, SNP id.
For known SNP sites that do not have allele frequency information, the frequency information can be arbitrarily determined as any positive values, which only imply what alleles have already been deposited in the database.
|-2||specify this option will REFINE SNP calling using known SNP information [Off]|
|-a||<double>||Validated HET prior, if no allele frequency known [0.1]|
|-b||<double>||Validated altHOM prior, if no allele frequency known [0.05]|
|-j||<double>||Unvalidated HET prior, if no allele frequency known [0.02]|
|-k||<double>||Unvalidated altHOM rate, if no allele frequency known [0.01]|
|-a,-b,-j and -k are related to using external SNP information to alter prior probabilities for SNP calling. GSNP will try using allele frequency information as prior probability in calling genotypes for each site. If the allele frequency information is absent, it will use the above 4 parameters as prior probability.|
|-u||Enable rank sum test (that check whether the two allele of a possible HET call have same sequencing quality) to give HET further penalty for better accuracy [Off]|
|-m||Enable monoploid calling mode, this will ensure all consensus as HOM and you probably should SPECIFY higher altHOM rate [Off]|
|-M||<FILE>||Output the quality calibration matrix; the matrix can be reused with -I if you rerun the program|
|-I||<FILE>||Input previous quality calibration matrix. It cannot be used simutaneously with -M|
|-L||<short>||Maximum length of read |
Note that: Once length of some reads exceeds the parameter will probably collapse the program.
|-C||Use the compression for the output, otherwise stored as cns text files [Off]|
|-Q||<short>||Maximum FASTQ quality score |
|-h||Display this help|
5. OUTPUT FORMAT
5.1 Text Format
|The result of GSNP has 17 columns:|
|01) Chromosome ID|
02) Coordinate on chromosome, start from 1
03) Reference genotype
04) Consensus genotype
05) Quality score of consensus genotype
06) Best base
07) Average quality score of best base
08) Count of uniquely mapped best base
09) Count of all mapped best base
10) Second best base
11) Average quality score of second best base
12) Count of uniquely mapped second best base
13) Count of all mapped second best base
14) Sequencing depth of the site
15) Rank sum test p_value
16) Average copy number of nearby region
17) Whether the site is a dbSNP
5.2 Compression Format
|GSNP adopts customized GPU-accelrated compression techniques to address the I/O. Such techniques provide a better compression ratio as well as higher compression/decompression speed than general data compression tools, such as gzip.|
|Users have two options to handle such compressed results.|
|1) Compressed result can be extracted chunk by chunk in-memory. /cpp/decompression.cpp demonstrates that how to use the decompression APIs to read compressed data directly.|
|2) The binary file "decompression" can be used to decompress the result. The generated results are stored in the exactly same format as SOAPsnp's text output.|