DNA variant discovery without solution

Siyang Liu

Overview

In this exercise we start from a realigned and recalibrated alignment file of individual NA12878 in BAM format and infer SNPs and INDELs from it.
We have two set of bams including the full Hiseq2000 sequencing alignment data set for NA12878 (~45x, high-coverage.bam) and the downsampled one from the high-coverage.bam (~4.5x, low-coverage.bam). We recall some of the knowledge that we learn from the course and rethink the variant discovery process.

Below is a list of practices we will go through in the exercise.

Observation of the realigned and recalibrated bam
Coverage depth of an alignment file
Calling SNPs and INDELs using GATK HaplotypeCaller for high-coverage and low-coverage data set
Soft filtering (Variant recalibration) of SNPs and Indels

NA12878 and the pre-processing of the data

The sample that we use is called NA12878 with european ancestry. It is probably the sample in the world that has most of the genome sequencing data with cell-line DNA available upon purchase. Those characteristics makes it a popular validation resource for variant calling.

The GenomeInABottle consortium GIAB has sequenced the NA12878 to 45x coverage using Illumina platform and even more in the future. The sequencing data that we are working on are downloaded from NA12878_1.fastq.gz and NA12878_2.fastq.gz.

Considering the limited time and the storage and computation resource that we have for the course and to allow you to have more concentrations on the variant calling procedure, I did the pre-processing step that we learnt about the GATK workflow in advance. Recall that the pre-processing includes removing low-quality reads using SoapNuke, aligning of the NA12878 reads to human genome reference version b37 using bwa mem, sorting the alignment using samtools sort, removing the potential PCR duplication using samtools rmdup, realigning around the potential INDEL region and recalibrating the base quality score using GATK.

Finally, I downsampled the bam files from 45x to 4.5x coverage and restrict the exercise on variant calling on chromosome 22. The total size of the analysis ready full bam (NA12878_realign_BQSR4.1p.chr22.bam) and the downsampled bam (NA12878_realign_BQSR4.0.1p.chr22.bam) are 7.0G and 708M respectively.

A few notes:
In general, the analysis of the full dataset like this scale in a typical server with 32G memory and 12 cores requires at least 3 days but it can be accelearted using different strategy utlizing a MapReduce cluster achitecture or GPU/FPGA hardwares.

We are using BAM format instead of the CRAM format this time since you are required to view the alignment using IGV which so far haven’t been upgraded to accept CRAM. However, note that CRAM will probably be replacing BAM in the near future and therefore, it is good to get familiar with CRAM by yourself. It’s simple by including the reference for all the samtools or GATK command.

Preparation

Create a working directory in your home directory and link the files needed for the exercise in the working directory.

For utilization of GATK, the first thing is to get to know about the bundle files which includes all the file required by GATK.

Java:

1	$ source /share/home/teacher1/software/java/8/load.sh

The two NA12878 BAMs:

1
2

/share/home/teacher1/20161128_exercise/Alignment/NA12878/NA12878_realign_BQSR4.1p.chr22.bam (high coverage bam)
/share/home/teacher1/20161128_exercise/Alignment/NA12878/NA12878_realign_BQSR4.0.1p.chr22.bam (downsampled by proportion 0.1)

The BAI files are the index file for bam to enable the fast query. We need them for samtools and gatk utilities.

The gatk engine:
/share/home/teacher1/software/GenomeAnalysisTK.jar

The gatk bundle:
/share/home/teacher1/database/gatk_bundle/2.8/b37

Some useful scripts:
/share/home/teacher1/bin/

The precomputed vcf called from NA12878_realign_BQSR4.1p.chr22.bam:
/share/home/teacher1/20161128_exercise/VariantCall/NA12878_realign_BQSR4.1p.chr22.raw.vcf

1 2	$ mkdir variant_calling $ cd variant_calling

Some prior knowledge

If you find the exercise difficult, it can be due to you lack some of the prior knowledge or skills for it. Don’t hestiate to ask teachers or classmates for help.

Here is a list of information that you can google find the right material and learn by yourself.

linux basic command like mkdir, cd, ln, rsync, nohup &
bash shell script
perl and R script

View the analysis ready bam

In the course, we have studied the INDEL realignment and base quality recalibration (BQSR) processes in the variant calling framework.

Let’s look into the difference between the original and the realigned and recalibrated bams. This requires a technique to view the SAM/BAM/CRAM alignment file using samtools view. You can quickly extract all the reads overlapping a the position by using the following command

1	$ samtools view $bam chr:startposition-endposition

NB: For cram file, you should add –T reference_file.

Questions

What are the difference for read “ERR194147.706568380” in position 22:16050152-16050152 between the original bam and the realigned and reclibrated bam?
what does the second column in the bam file mean?

Tip: you can check the meaning via here
If you are not in a hurry, ask yourself or discuss with the surroundings what the rest of the columns and some of the tags mean? (Cigar field)

You can find most of the samtools specification information via this link: SAMv1

The coverage depth information from the bam

Sequencing coverage and depth are important parameters deciding on the quality of the variants for your sample and the DNA variant analysis strategy. Whenever we are exposed to a SAM/BAM/CRAM alignment file, we should ask ourselves what the coverage depth distribution is.

This statistics can be estimated from the SAM/BAM/CRAM alignment files using the samtools depth coupled with some scripts.

Let’s generate two plots respectively for the high coverage and downsampled bam using the following commands (you can also use the bash shell to submit the jobs to the backend using nohup sh $shell &)

1 2	#for high coverage bam $ nohup time samtools depth -a NA12878_realign_BQSR4.1p.chr22.bam >NA12878_realign_BQSR4.1p.chr22.depth && echo samtools depth done &

Then you can use the following script in the /share/home/teacher1/bin directory to generate the following figures.

1
2
3

$ nohup perl coverage.depth.pl NA12878_realign_BQSR4.1p.chr22.depth 16410021 > NA12878_realign_BQSR4.1p.chr22.depth.dist & #16410021 is the gap size of chr22.
$ Rscript samtools_coveragedepth_stat.r NA12878_realign_BQSR4.1p.chr22.depth.dist NA12878_realign_BQSR4.1p.chr22.depth.dist.pdf

Questions

1. What is the approximate average coverage depth for the two data set respectively?

2. How much is the genome (excluding N) covered by more than 1 read and 4 read.

3. Did you notice the parameters -q and –Q in the samtools depth and what do they mean?

Calling SNP and INDELs using the GATK haplotypecaller

There are several state-of-the-art variant detectors which calls SNPs and small INDELs from next generation sequencing data. You can find most of them via omictools.
HaplotypeCaller is one of the most widely sophiscated approach developed by the GATK team in Broad institute where we have gone through the algorithms in the course.

Now, let’s give it a try by applying the HaplotypeCaller to call variants from our bams and learn about a few metrics in the vcf file.

The haplotypecaller for the full high coverage bam NA12878_realign_BQSR4.1p.chr22.bam takes around 92 minutes and therefore, I precomputed this for you.

It’s your turn to carry out the haplotypecaller analysis of the low-coverage bam – it will take around 15 minutes in the training server. During this process, you can have a look at the vcf NA12878_realign_BQSR4.1p.chr22.raw.vcf (using less -S) for the high coverage data and answer the first two questions.

NB: I don’t use absolute path before the files used in the following command. You can either ln -s the necessary files to your current directory or you place the absolute path in

ref=human_g1k_v37.fasta
downbam=NA12878_realign_BQSR4.0.1p.bam
dbsnp=dbsnp_138.b37.vcf
outdir=\`pwd\`
out=NA12878_realign_BQSR4.0.1p.chr22.raw.vcf
shname=NA12878_realign_BQSR4.0.1p.chr22.raw.hc.sh
echo \"time java -Xmx15g -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R $ref \
-I $downbam \
-D $dbsnp \
-L 22 \
-o $outdir/$out \
-stand_call_conf 30 \
--stand_emit_conf 10\
 -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage\" > $shname
nohup sh $shname &

Question:

1. What do the parameters mean in the above command line?

2. Can you distinguish the structure of a vcf file?

3. When you finished the your haplotypecaller analsyis, think about what the difference are between the low-coverage variant call set that you have computed and the high-coverage variant call set that I precomputed?

Think about

How many SNP and INDEL calls did we get respectively?
How many passed the quality score (>30 Qual)?

4. How many known SNPs and INDELs did we get and how many of them are novel?

Variant Quality Score Recalibration / Soft filtering of SNP

Now we come to the most exciting part where we will apply Gaussian mixture models on a few of the annotation and distinguish the false positive calls from the real ones. This process is called VQSR (variant quality score recalibration) and is one of the novel methods introduced by GATK in the field different from hard filtering that we normally adopted before 2008.

The recalibration requires training sites and true sites. From the commands below, we can see that the hapmap will be used as the true sites with the highest Phred-scaled prior while the Omni and the 1000 genome high confident SNP as the training sites with mediate prior. Finally ,the dbSNP will not be included in the training but used for visualization purpose to evaluation of how many variants will be known or novel in the tranche plot.

Since the model relies on features on known variants and we know that features for SNP and INDEL are different, it is easy to understand why we will run the “VariantRecalibrator” for SNP and INDEL, respectively.

First, you run the VariantRecalibrator Engine with a SNP mode for the downsampled vcf using the commands below. Simultaneously or afterwards, you run the VariantRecalibration for the high coverage chr22 alignment bam from NA12878 by replacing the –input arguments in the following command line. You should be able to go through it in around 1 min. When you finish, have a look at the output file and answer the following questions.

rawvcf=NA12878_realign_BQSR4.0.1p.chr22.raw.vcf
hapmap=hapmap_3.3.b37.vcf
omni=1000G_omni2.5.b37.vcf
highconfi=1000G_phase1.snps.high_confidence.b37.vcf
dbsnp=dbsnp_138.b37.vcf
outdir=\`pwd\`
outname=\`basename $rawvcf|sed -e 's/\.raw\.vcf//'\`
tag=$outname.snprecal
echo "time java -Xmx15g -jar $gatk -T VariantRecalibrator\
 -R $ref\
 -L 22\
 -input $rawvcf\
 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap\
 -resource:omini,known=false,training=true,truth=true,prior=12.0 $omni\
 -resource:1000G,known=false,training=true,truth=false,prior=10.0 $highconfi \
 -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 $dbsnp\
 -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP\
 --maxGaussians 4\ #Since we have only variants on chr22, we don’t need the default 8 clusters. 
 -mode SNP\
 -recalFile $outdir/$outname.snps.recal\
 -tranchesFile $outdir/$outname.snps.tranches\
 -rscriptFile $outdir/$outname.snps.plots.R\
 && echo \"VariantRecalibrator done\" " >$tag.sh
 nohup sh $tag.sh &

Questions

1. Did you see some errors emitting in the recalibration process? Why is it?

2. less –S the *tranche file (or open the tranche plot by rsync the file to your local disk), what does tranche mean? Does the result make sense to you?

3. What do you think you can do provided such a result?

Apply recalibration to SNPs

Lets apply the recalibration to our SNP calls. Lets try with a threshold of 99.9

ref=human_g1k_v37.fasta
rawvcf=NA12878_realign_BQSR4.0.1p.chr22.raw.vcf
outdir=`pwd`
outname=`basename $rawvcf|sed -e 's/\.raw\.vcf//'`
tag=$outname.AR
echo "
time java -Xmx15g -jar $gatk -T ApplyRecalibration\
 -R $ref \
 -input $rawvcf\
 --ts_filter_level 99.9\
 -tranchesFile $outdir/$outname.snps.tranches\
 -recalFile $outdir/$outname.snps.recal\
 -mode SNP\
 -o $outdir/$outname.snprecal.vcf \
 && echo \"ApplyRecalibration done\"
" >$tag.sh
nohup sh $tag.sh &

Open the recalibrated file using less –S and check the Filter column, you should now see that all SNPs that were classified at 99.9 or better are all termed PASS and the rest are termed LowQual, TruthSensitivityTranche99.90to100.00. We can now get all the trusted SNPs by taking the ones with PASS.

Questions

1. How many known and novel SNPs do we have that PASS the filtering?

2. Is there any difference for the quality score between known and novel SNPs?

Soft-filtering of INDEL

The recalibration of the INDEL set is similar to SNP and the difference relies on what training set that you will use. Carry out the following commands and answer the questions below. Finally you use the threshold from the recalibration process for filtration of the INDEL - you decide by yourself!

snprecalvcf=$outdir/$outname.snprecal.vcf
indel=Mills_and_1000G_gold_standard.indels.b37.vcf
outdir=`pwd`
outname=`basename $snprecalvcf|sed -e 's/\.vcf//'`
tag=$outname
echo \"time java -Xmx15g -jar $gatk -T VariantRecalibrator\
 -R $ref \
 -input $snprecalvcf \
 -resource:mills,known=true,training=true,truth=true,prior=12.0 $indel\
 -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum\
 -mode INDEL\
 -recalFile $outdir/$outname \
 -tranchesFile $outdir/$outname \
 -rscriptFile $outdir/$outname.indel.plots.R\
 && echo \"VariantRecalibrator of INDEL done \">$tag.sh

DNA variant discovery without solution

Overview

NA12878 and the pre-processing of the data

Preparation

Some prior knowledge

View the analysis ready bam

Questions

The coverage depth information from the bam

Questions

1. What is the approximate average coverage depth for the two data set respectively?

2. How much is the genome (excluding N) covered by more than 1 read and 4 read.

3. Did you notice the parameters -q and –Q in the samtools depth and what do they mean?

Calling SNP and INDELs using the GATK haplotypecaller

Question:

1. What do the parameters mean in the above command line?

2. Can you distinguish the structure of a vcf file?

3. When you finished the your haplotypecaller analsyis, think about what the difference are between the low-coverage variant call set that you have computed and the high-coverage variant call set that I precomputed?

4. How many known SNPs and INDELs did we get and how many of them are novel?

Variant Quality Score Recalibration / Soft filtering of SNP

Questions

1. Did you see some errors emitting in the recalibration process? Why is it?

2. less –S the *tranche file (or open the tranche plot by rsync the file to your local disk), what does tranche mean? Does the result make sense to you?

3. What do you think you can do provided such a result?

Apply recalibration to SNPs

Questions

1. How many known and novel SNPs do we have that PASS the filtering?

2. Is there any difference for the quality score between known and novel SNPs?

Soft-filtering of INDEL

Questions

1. Do you notice that the problem for the indel soft-filtering approach by GATK?

2. What threashold will you use for the INDEL recalibration?