Fungenomics Courses

DNA variant discovery

Siyang Liu

Overview

In this exercise we start from a realigned and recalibrated alignment file of the individual NA12878 in BAM format and infer SNPs and INDELs from it.
We have two sets of BAMs: the full HiSeq 2000 sequencing alignment data set for NA12878 (~45x, high-coverage.bam) and one downsampled from it (~4.5x, low-coverage.bam). We will recall some of the knowledge we learned in the course and rethink the variant discovery process.

Below is a list of practices we will go through in the exercise.

  1. Observation of the realigned and recalibrated bam
  2. Coverage depth of an alignment file
  3. Calling SNPs and INDELs using GATK HaplotypeCaller for the high-coverage and low-coverage data sets
  4. Soft filtering (Variant recalibration) of SNPs and Indels

NA12878 and the pre-processing of the data

The sample we use is NA12878, of European ancestry. It is probably the most extensively sequenced sample in the world, and its cell-line DNA is available for purchase. These characteristics make it a popular validation resource for variant calling.

The Genome in a Bottle (GIAB) consortium has sequenced NA12878 to 45x coverage on the Illumina platform, with even more data to come in the future. The sequencing data we are working on were downloaded as NA12878_1.fastq.gz and NA12878_2.fastq.gz.

Considering the limited time, storage, and computation resources we have for the course, and to let you concentrate on the variant calling procedure, I performed in advance the pre-processing steps of the GATK workflow that we learnt about. Recall that pre-processing includes removing low-quality reads using SoapNuke, aligning the NA12878 reads to the human genome reference version b37 using bwa mem, sorting the alignment using samtools sort, removing potential PCR duplicates using samtools rmdup, realigning around potential INDEL regions, and recalibrating the base quality scores using GATK.

Finally, I downsampled the BAM from 45x to 4.5x coverage and restricted the exercise to variant calling on chromosome 22. The sizes of the analysis-ready full BAM (NA12878_realign_BQSR4.1p.chr22.bam) and the downsampled BAM (NA12878_realign_BQSR4.0.1p.chr22.bam) are 7.0G and 708M, respectively.

A few notes:
In general, analysing a full dataset of this scale on a typical server with 32G memory and 12 cores requires at least 3 days, but it can be accelerated using different strategies, e.g. a MapReduce cluster architecture or GPU/FPGA hardware.

We are using the BAM format instead of the CRAM format this time since you are required to view the alignment using IGV, which so far has not been upgraded to accept CRAM. However, note that CRAM will probably replace BAM in the near future, so it is good to get familiar with CRAM by yourself. It is as simple as supplying the reference to every samtools or GATK command.
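For example, viewing a region of a CRAM file only requires pointing samtools at the reference the file was written with (a minimal sketch; the CRAM file name here is hypothetical):

$ samtools view -T human_g1k_v37.fasta NA12878.cram 22:16050152-16050152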

Preparation

Create a working directory in your home directory and link the files needed for the exercise in the working directory.

To use GATK, the first thing is to get to know the bundle files, which include all the files required by GATK.

Java:

$ source /share/home/teacher1/software/java/8/load.sh

The two NA12878 BAMs:

/share/home/teacher1/20161128_exercise/Alignment/NA12878/NA12878_realign_BQSR4.1p.chr22.bam (high coverage bam)
/share/home/teacher1/20161128_exercise/Alignment/NA12878/NA12878_realign_BQSR4.0.1p.chr22.bam (downsampled by proportion 0.1)

The BAI files are the index files for the BAMs; they enable fast region queries. We need them for the samtools and GATK utilities.
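If an index is ever missing, you can generate one yourself with samtools index (a one-liner; it writes the .bai next to the BAM):

$ samtools index NA12878_realign_BQSR4.1p.chr22.bam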

The gatk engine:
/share/home/teacher1/software/GenomeAnalysisTK.jar

The gatk bundle:
/share/home/teacher1/database/gatk_bundle/2.8/b37

Some useful scripts:
/share/home/teacher1/bin/

The precomputed vcf called from NA12878_realign_BQSR4.1p.chr22.bam:
/share/home/teacher1/20161128_exercise/VariantCall/NA12878_realign_BQSR4.1p.chr22.raw.vcf

$ mkdir variant_calling
$ cd variant_calling
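For example, one way to link everything into the working directory (paths as listed above; the trailing * also picks up the .bai index files):

$ ln -s /share/home/teacher1/20161128_exercise/Alignment/NA12878/NA12878_realign_BQSR4.1p.chr22.bam* .
$ ln -s /share/home/teacher1/20161128_exercise/Alignment/NA12878/NA12878_realign_BQSR4.0.1p.chr22.bam* .
$ ln -s /share/home/teacher1/software/GenomeAnalysisTK.jar .
$ ln -s /share/home/teacher1/database/gatk_bundle/2.8/b37/* .
$ ln -s /share/home/teacher1/20161128_exercise/VariantCall/NA12878_realign_BQSR4.1p.chr22.raw.vcf .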

Some prior knowledge

If you find the exercise difficult, it may be because you lack some of the prior knowledge or skills for it. Don’t hesitate to ask teachers or classmates for help.

Here is a list of topics for which you can google the right material and learn by yourself.

  1. basic Linux commands such as mkdir, cd, ln, rsync, and nohup ... &
  2. bash shell script
  3. perl and R script

View the analysis-ready bam

In the course, we have studied the INDEL realignment and base quality score recalibration (BQSR) processes in the variant calling framework.

Let’s look into the difference between the original BAM and the realigned and recalibrated BAM. This requires a technique for viewing SAM/BAM/CRAM alignment files: samtools view. You can quickly extract all the reads overlapping a position by using the following command

$ samtools view $bam chr:startposition-endposition

NB: For CRAM files, you should add -T reference_file.

Questions

  1. What is the difference for read “ERR194147.706568380” at position 22:16050152-16050152 between the original bam and the realigned and recalibrated bam?
    Tips:
    The read in the original bam is pasted below:
    $ samtools view NA12878_rmdup.bam 22:16050152-16050152|grep ERR194147.706568380
    ERR194147.706568380 163 22 16050152 40 101M = 16050372 321 TATTGATCTTTTGTGACATGCACGTGGGTTCCCAGTAGCAAGAAACTAAAGGGTCGCAGGCCGGTTTCTGCTAATTTCTTTAATTCCAAGACAGTCTCAAA CCCFFFFFHHHHHJJJJJJJJJJJHJJJGHIJJJIHHJJJJJJJJJIGIIFHIEGIHIJHHHFDCDDDDDDDDDDEEDEDDDCDEDDEDDDDDDDDDDDDC NM:i:0 MD:Z:101 AS:i:101 XS:i:101 XA:Z:14,-19792749,101M,0; RG:Z:HiseqEAAAGAAA-98

You can use the samtools view command to check the alignment record of this read in the realigned and recalibrated bam.

$ samtools view NA12878_realign_BQSR4.1p.chr22.bam 22:16050152-16050152|grep ERR194147.706568380

    Answer: the base quality column changes, and BD and BI tags are added to the read, representing the per-base probabilities of deletion and insertion errors:
    $ samtools view NA12878_realign_BQSR4.1p.chr22.bam 22:16050152-16050152|grep ERR194147.706568380
    ERR194147.706568380 163 22 16050152 40 101M = 16050372 321 TATTGATCTTTTGTGACATGCACGTGGGTTCCCAGTAGCAAGAAACTAAAGGGTCGCAGGCCGGTTTCTGCTAATTTCTTTAATTCCAAGACAGTCTCAAA @?<@ABBAB@@?A@??>?????>:?@?>@?@???A?@A@@?A@???AA@ABAA?A;AAAAAA<AAAABCCBDCBCCCCDBBCCDCDEECEDCEECDDA??A XA:Z:14,-19792749,101M,0; MC:Z:101M BD:Z:JJIKPQPOMLEEMJIMMIMMNNIKLIMJMKKLJMONMNONKKJJBJLLJBKMILLKLMNMMMJJLJCJJMMNMKKKDKKKDKKKKKLNLMLNKQPPNKMKC MD:Z:101 RG:Z:HiseqEAAAGAAA-98 BI:Z:MMMMRRQPONGGPMLPOLNNOPKNNKMKPLLKKNPNOPQPLNMLFNOOLFNOKPPKNPPOPOLMQMGMNQPRPMNMGMNMHPNONNMQOQPROTRURMOMG NM:i:0 MQ:i:51 AS:i:101 XS:i:101
  2. What does the second column (the FLAG field) in the bam file mean?

    Tip: you can check the meaning via here

  3. If you are not in a hurry, ask yourself or discuss with the people around you what the rest of the columns and some of the tags mean (e.g. the CIGAR field).

    You can find most of the SAM format specification information via this link: SAMv1

The coverage depth information from the bam

Sequencing coverage and depth are important parameters that determine the quality of the variants for your sample and the DNA variant analysis strategy. Whenever we are exposed to a SAM/BAM/CRAM alignment file, we should ask ourselves what the coverage depth distribution is.

These statistics can be estimated from the SAM/BAM/CRAM alignment files using samtools depth coupled with some scripts.

Let’s generate two plots, one each for the high-coverage and downsampled BAMs, using the following commands (you can also submit the jobs to the background using nohup sh $shell &).

#for high coverage bam
$ nohup time samtools depth -a NA12878_realign_BQSR4.1p.chr22.bam >NA12878_realign_BQSR4.1p.chr22.depth && echo samtools depth done &
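And the analogous command for the downsampled BAM:

#for downsampled bam
$ nohup time samtools depth -a NA12878_realign_BQSR4.0.1p.chr22.bam >NA12878_realign_BQSR4.0.1p.chr22.depth && echo samtools depth done &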

Then you can use the following script in the /share/home/teacher1/bin directory to generate the following figures.

$ nohup perl coverage.depth.pl NA12878_realign_BQSR4.1p.chr22.depth 16410021 > NA12878_realign_BQSR4.1p.chr22.depth.dist & #16410021 is the gap size of chr22.
$ Rscript samtools_coveragedepth_stat.r NA12878_realign_BQSR4.1p.chr22.depth.dist NA12878_realign_BQSR4.1p.chr22.depth.dist.pdf

Questions

1. What is the approximate average coverage depth for the two data sets, respectively?

Around 45x and around 4.5x, as indicated in the following figures.

Figure 1. Coverage depth and cumulative coverage depth for the high-coverage BAM.

Figure 2. Coverage depth and cumulative coverage depth for the downsampled BAM.

2. How much of the genome (excluding N) is covered by more than 1 read and by more than 4 reads?

99% for the high-coverage BAM and 90% for the downsampled BAM.

3. Did you notice the parameters -q and -Q in samtools depth, and what do they mean?

By applying -q and -Q, we can get the effective coverage and depth for variant calling. By effective, we mean that only high-quality bases in reads with high mapping quality will be taken into account for variant discovery. By default, the base quality threshold is 10 in the GATK HaplotypeCaller.
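For example, you could re-estimate depth with quality cutoffs applied (a sketch; the thresholds 10 and 20 here are only illustrative choices):

#-q sets the minimum base quality, -Q the minimum mapping quality
$ samtools depth -a -q 10 -Q 20 NA12878_realign_BQSR4.1p.chr22.bam > NA12878_realign_BQSR4.1p.chr22.effective.depth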

Calling SNPs and INDELs using the GATK HaplotypeCaller

There are several state-of-the-art variant detectors that call SNPs and small INDELs from next-generation sequencing data. You can find most of them via omictools.
HaplotypeCaller, developed by the GATK team at the Broad Institute, is one of the most widely used and sophisticated approaches; we went through its algorithm in the course.

Now, let’s give it a try by applying the HaplotypeCaller to call variants from our bams and learn about a few metrics in the vcf file.

Running HaplotypeCaller on the full high-coverage BAM NA12878_realign_BQSR4.1p.chr22.bam takes around 92 minutes; therefore, I precomputed this for you.

It’s your turn to carry out the HaplotypeCaller analysis of the low-coverage BAM; it will take around 15 minutes on the training server. During this process, you can have a look at the VCF NA12878_realign_BQSR4.1p.chr22.raw.vcf (using less -S) for the high-coverage data and answer the first two questions.

NB: I don’t use absolute paths for the files used in the following commands. You can either ln -s the necessary files to your current directory or use the absolute paths in the commands.

ref=human_g1k_v37.fasta
downbam=NA12878_realign_BQSR4.0.1p.chr22.bam
dbsnp=dbsnp_138.b37.vcf
outdir=`pwd`
out=NA12878_realign_BQSR4.0.1p.chr22.raw.vcf
shname=NA12878_realign_BQSR4.0.1p.chr22.raw.hc.sh
echo "time java -Xmx15g -jar GenomeAnalysisTK.jar \
 -T HaplotypeCaller \
 -R $ref \
 -I $downbam \
 -D $dbsnp \
 -L 22 \
 -o $outdir/$out \
 -stand_call_conf 30 \
 --stand_emit_conf 10 \
 -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage" > $shname
nohup sh $shname &

Questions:

1. What do the parameters mean in the above command line?

The “-D” parameter (an abbreviation for “--dbsnp”) annotates the variants with known variants in the human dbSNP database.
The “-L” parameter restricts our analysis to chr22.
“-stand_call_conf” is the minimum Phred-scaled quality for a variant to pass filtering, and “-stand_emit_conf” is the minimum Phred-scaled quality for a variant to be reported at all.
We have annotated several metrics that are useful for filtration purposes using “-A”.
The full list of annotation metrics can be found in the “Annotation Modules” category of the GATK documentation. For example, the FisherStrand annotation detects whether the alternative or the reference allele is over- or under-represented on a specific strand (strand bias).
Remember that we can have HaplotypeCaller output all sites in gVCF format by adding: -emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000. This is the best practice when working with multiple samples: we perform genotype calling for each individual and then carry out joint calling using the individual gVCFs, as sketched below.
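A minimal sketch of that gVCF workflow (the sample file names are hypothetical, and the second step assumes one g.vcf per sample):

java -Xmx15g -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
 -R human_g1k_v37.fasta -I sample1.bam \
 -emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 \
 -o sample1.g.vcf
#joint genotyping across all per-sample gVCFs
java -Xmx15g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
 -R human_g1k_v37.fasta --variant sample1.g.vcf --variant sample2.g.vcf -o joint.vcf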

2. Can you describe the structure of a VCF file?

You should be able to tell the two main sections: the header section, whose lines start with “#”, and the actual variant call section.
The header gives you information about all the abbreviations in the file and is quite handy.
In the variant call part, you should be able to tell what each of the columns contains:
Chromosome, Position, dbSNP id, Reference base, Alternative base, Phred-scaled quality of the call, Filter status, Lots of info (the INFO field), Format of the next field, and then genotype information for our sample NA12878.
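For illustration, a data line has the following shape (the values in this record are invented for the example, not taken from our call set):

22  16050408  rs149201999  T  C  64.77  PASS  DP=5;QD=12.95;MQ=60.00  GT:AD:DP:GQ:PL  0/1:2,3:5:64:93,0,57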

3. When you have finished your HaplotypeCaller analysis, think about the differences between the low-coverage variant call set that you computed and the high-coverage variant call set that I precomputed.

Think about the following (a rough counting sketch follows the list):

  • How many SNP and INDEL calls did we get, respectively?
  • How many passed the quality threshold (QUAL > 30)?
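A rough counting sketch with basic shell tools (note the simple length test misclassifies multi-allelic sites, so treat the numbers as approximate):

$ grep -v '^#' NA12878_realign_BQSR4.0.1p.chr22.raw.vcf | awk 'BEGIN{snp=0;indel=0;q30=0} {if(length($4)==1 && length($5)==1) snp++; else indel++; if($6>30) q30++} END{print "SNPs:",snp,"INDELs:",indel,"QUAL>30:",q30}'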

4. How many known SNPs and INDELs did we get and how many of them are novel?

$ less NA12878_realign_BQSR4.0.1p.chr22.raw.vcf|perl -e 'while(<>){chomp;next if /^#/;@tmp=split;if($tmp[2] eq "."){$hash{novel}++;}else{$hash{known}++;}}foreach $k(keys %hash){print "$k\t$hash{$k}\n";}'
known 37201
novel 1133

Variant Quality Score Recalibration / Soft filtering of SNP

Now we come to the most exciting part, where we apply Gaussian mixture models to a few of the annotations and distinguish the false positive calls from the real ones. This process is called VQSR (variant quality score recalibration) and is one of the novel methods introduced by GATK, in contrast to the hard filtering that was normally adopted before 2008.

The recalibration requires training sites and truth sites. From the commands below, we can see that HapMap will be used as the truth set with the highest Phred-scaled prior, while the Omni and the 1000 Genomes high-confidence SNPs serve as training sites with intermediate priors. Finally, dbSNP is not included in the training but is used for visualization purposes, to evaluate how many variants are known or novel in the tranche plot.

Since the model relies on the features of known variants, and we know that the features of SNPs and INDELs differ, it is easy to understand why we run “VariantRecalibrator” for SNPs and INDELs separately.

First, run the VariantRecalibrator engine in SNP mode for the downsampled VCF using the commands below. Simultaneously or afterwards, run the VariantRecalibrator for the variants called from the high-coverage chr22 BAM of NA12878 by replacing the -input argument in the command line. You should be able to go through it in around 1 minute. When you finish, have a look at the output files and answer the following questions.

ref=human_g1k_v37.fasta
gatk=GenomeAnalysisTK.jar
rawvcf=NA12878_realign_BQSR4.0.1p.chr22.raw.vcf
hapmap=hapmap_3.3.b37.vcf
omni=1000G_omni2.5.b37.vcf
highconfi=1000G_phase1.snps.high_confidence.b37.vcf
dbsnp=dbsnp_138.b37.vcf
outdir=`pwd`
outname=`basename $rawvcf|sed -e 's/\.raw\.vcf//'`
tag=$outname.snprecal
#--maxGaussians 4: since we only have variants on chr22, we don't need the default 8 clusters
echo "time java -Xmx15g -jar $gatk -T VariantRecalibrator \
 -R $ref \
 -L 22 \
 -input $rawvcf \
 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
 -resource:omni,known=false,training=true,truth=true,prior=12.0 $omni \
 -resource:1000G,known=false,training=true,truth=false,prior=10.0 $highconfi \
 -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 $dbsnp \
 -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
 --maxGaussians 4 \
 -mode SNP \
 -recalFile $outdir/$outname.snps.recal \
 -tranchesFile $outdir/$outname.snps.tranches \
 -rscriptFile $outdir/$outname.snps.plots.R \
 && echo \"VariantRecalibrator done\" " > $tag.sh
nohup sh $tag.sh &

Questions

1. Did you see some errors emitted during the recalibration process? Why?

The errors arise because no variants with LOD scores less than -5.0 were found.
This occurs when too few raw variants are passed to the VariantRecalibrator engine to generate a good training data set. It is important to know that while the previous analyses, such as indel realignment, base quality score recalibration, and HaplotypeCaller, can be processed by chromosome or by specific regions, it is recommended to apply the VariantRecalibrator to whole-genome-scale variants so as to have enough data for training the model.

Hint: if you are interested, you can lower the -stand_call_conf and -stand_emit_conf values in the HaplotypeCaller step so that you get more variants, some of which will have LOD scores <= -5.0; or you can lower the LOD cutoff. This doesn’t matter too much; the point is that you understand that enough data is required for the recalibration model. Otherwise, we won’t get correct results.

2. less -S the *.tranches file (or open the tranche plot by rsync-ing the file to your local disk). What does tranche mean? Does the result make sense to you?

The truth sensitivity defines the tranches: 90 means that we accept all variants until we reach 90% of our truth set, 99 means that we accept all variants until we have reached 99% of our truth set, and so on. Blue bars are SNPs predicted to be true and orange ones predicted to be false. We use this plot to decide where to put the filtering threshold, e.g. how many more false SNPs are you likely to get by increasing from 99.0 to 99.9?

That the Ti/Tv ratio for the known variants is greater than the expected range (2.0 to 2.1) suggests the calls are overly conservative, while the lower Ti/Tv ratio for the novel variants may suggest false positives.

This result may be limited because we are not following the best practice of training the model with all the variants within the genome.

3. What do you think you can do, given such a result?

We can do the recalibration again on whole-genome-scale variants.
I have done this, but the result is still not correct, which may suggest some problem with the pipeline that is difficult to track down.

The NA12878_realign_BQSR4.1p.chr22.snps.plots.R script in the output directory is supposed to generate the two-dimensional projections of the Gaussian mixture distribution over pairs of the annotation metrics used in the recalibration process. However, it contains a bug, so you cannot simply run it with Rscript NA12878_realign_BQSR4.1p.chr22.snps.plots.R. If you want to see the plots, you can download the file from here.

Apply recalibration to SNPs

Let’s apply the recalibration to our SNP calls, trying a truth-sensitivity threshold of 99.9.

ref=human_g1k_v37.fasta
gatk=GenomeAnalysisTK.jar
rawvcf=NA12878_realign_BQSR4.0.1p.chr22.raw.vcf
outdir=`pwd`
outname=`basename $rawvcf|sed -e 's/\.raw\.vcf//'`
tag=$outname.AR
echo "time java -Xmx15g -jar $gatk -T ApplyRecalibration \
 -R $ref \
 -input $rawvcf \
 --ts_filter_level 99.9 \
 -tranchesFile $outdir/$outname.snps.tranches \
 -recalFile $outdir/$outname.snps.recal \
 -mode SNP \
 -o $outdir/$outname.snprecal.vcf \
 && echo \"ApplyRecalibration done\" " > $tag.sh
nohup sh $tag.sh &

Open the recalibrated file using less -S and check the FILTER column. You should now see that all SNPs that were classified at 99.9 or better are marked PASS, and the rest are marked LowQual or TruthSensitivityTranche99.90to100.00. We can now get all the trusted SNPs by taking the ones with PASS.
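For example, one way to pull out the trusted calls (header lines are kept so the output is still a valid VCF):

$ awk -F'\t' '/^#/ || $7=="PASS"' NA12878_realign_BQSR4.0.1p.chr22.snprecal.vcf > NA12878_realign_BQSR4.0.1p.chr22.snprecal.pass.vcf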

Questions

1. How many known and novel SNPs do we have that PASS the filtering?

However, keep in mind that even at the 99.0% sensitivity tranche we expect quite a number of false positive calls (the orange and orange-hatched bars in the plot). Let’s try to divide the calls into known and novel (based on dbSNP) and plot the quality distributions using R, starting for example from the extraction below.
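One way to extract the quality scores for the two groups before plotting (a sketch; the output file names are arbitrary):

$ grep -v '^#' NA12878_realign_BQSR4.0.1p.chr22.snprecal.vcf | awk '$7=="PASS" && $3!="."{print $6}' > known.qual
$ grep -v '^#' NA12878_realign_BQSR4.0.1p.chr22.snprecal.vcf | awk '$7=="PASS" && $3=="."{print $6}' > novel.qual

You can then compare the two distributions in R, e.g. with hist() or density().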

2. Is there any difference for the quality score between known and novel SNPs?

Soft-filtering of INDEL

The recalibration of the INDEL set is similar to that of SNPs; the difference lies in the training sets used. Carry out the following commands and answer the questions below. Finally, use a threshold from the recalibration process for the filtration of the INDELs: you decide the threshold yourself!

snprecalvcf=$outdir/$outname.snprecal.vcf
indel=Mills_and_1000G_gold_standard.indels.b37.vcf
outdir=`pwd`
outname=`basename $snprecalvcf|sed -e 's/\.vcf//'`
tag=$outname
#as for SNPs, chr22-only data is too sparse for the default 8 Gaussian clusters
echo "time java -Xmx15g -jar $gatk -T VariantRecalibrator \
 -R $ref \
 -input $snprecalvcf \
 -resource:mills,known=true,training=true,truth=true,prior=12.0 $indel \
 -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
 --maxGaussians 4 \
 -mode INDEL \
 -recalFile $outdir/$outname.indels.recal \
 -tranchesFile $outdir/$outname.indels.tranches \
 -rscriptFile $outdir/$outname.indels.plots.R \
 && echo \"VariantRecalibrator of INDEL done\" " > $tag.sh
nohup sh $tag.sh &

Questions

1. Do you notice the problem with the INDEL soft-filtering approach in GATK?

The training and truth sets are both Mills_and_1000G_gold_standard.indels.b37.vcf, which is problematic.

2. What threshold will you use for the INDEL recalibration? (When you have decided, apply it with ApplyRecalibration in INDEL mode; a sketch follows.)
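A sketch mirroring the SNP ApplyRecalibration step (the 99.0 threshold is only an example, and the file names assume the INDEL recalibration outputs above):

time java -Xmx15g -jar $gatk -T ApplyRecalibration \
 -R $ref \
 -input $snprecalvcf \
 --ts_filter_level 99.0 \
 -tranchesFile $outdir/$outname.indels.tranches \
 -recalFile $outdir/$outname.indels.recal \
 -mode INDEL \
 -o $outdir/$outname.indelrecal.vcf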