See this overview.
Missing variants due to a candidate not being generated:
There are multiple reasons that DeepVariant may not call a variant. It is
important to first determine whether DeepVariant proposed a candidate variant.
A candidate requires at least 2 reads supporting the variant, as well as a
minimum fraction of supporting reads (0.12 for SNPs and PacBio indels, 0.06 for
Illumina indels). All sites generated as candidates are written to the VCF
file, so if there is no row in the VCF file for the variant in question, a
candidate was not made. However, within these sites, certain possible alleles
may have been pruned from reporting. To see all alleles, you may add
--debug_output_all_candidates=ALT in the postprocess_variants step.
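For example, one quick way to check whether any candidate row exists at a
position is to query the output VCF directly. This is a minimal sketch assuming
bcftools is installed, the output is an indexed output.vcf.gz, and
chr20:10001019 stands in for your variant's coordinates:

```bash
# Print any candidate record at the position (requires a .tbi/.csi index).
bcftools view -H output.vcf.gz chr20:10001019-10001019
# No output means no candidate was generated at that position.
```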
To increase the sensitivity of DeepVariant at these sites, you may adjust the
following parameters, shown here with their defaults:

```
--make_examples_extra_args="vsc_min_count_snps=2,vsc_min_fraction_snps=0.12,vsc_min_count_indels=2,vsc_min_fraction_indels=0.06"
```
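As a sketch of how these fit into a full run, here is a hypothetical invocation
with loosened thresholds (the paths and the specific values are illustrative,
not recommendations):

```bash
BIN_VERSION="1.6.1"
docker run \
  -v "YOUR_INPUT_DIR":"/input" \
  -v "YOUR_OUTPUT_DIR":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=/input/ref.fasta \
  --reads=/input/sample.bam \
  --output_vcf=/output/sample.vcf.gz \
  --make_examples_extra_args="vsc_min_count_snps=1,vsc_min_fraction_snps=0.05"
```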
It is sometimes also the case that realignment of the reads within DeepVariant
changes or reduces the evidence supporting the variant. To check for this, try
using the --norealign_reads flag to turn off realignment temporarily. Note that
we generally don't recommend turning off the realigner for Illumina data,
because the realigner improves accuracy overall.
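If you are invoking make_examples through run_deepvariant, the boolean can be
passed through --make_examples_extra_args. This sketch assumes the extra-args
spelling realign_reads=false, following the key=value pattern used elsewhere in
this FAQ:

```
# Temporarily disable read realignment (assumed extra-args spelling).
--make_examples_extra_args="realign_reads=false"
```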
There is also the option to output the realigned reads, e.g. to inspect the new
alignments in IGV. See the "What is the realigner and how does it work?" section
for instructions.
Missing variants where a candidate is generated:
If a candidate is made but is called as reference (either 0/0 or ./.), it means
that the neural network processed the genomic region but, based on all of its
learned experience from training data, decided that the highest-probability
state for the position was non-variant. Some of the reasons that DeepVariant
may suspect a false positive are: strand bias in reads, low mapping quality in
reads, low base quality in reads, and overall low coverage.
In addition, there is another pattern that causes DeepVariant to suspect
variant positions, one which can initially seem counterintuitive to human
observers. It occurs when a dense set of variants appears on one haplotype
while the other haplotype is fully reference; humans often perceive this as a
clearly heterozygous position being missed. DeepVariant seems to have learned
that this signature often indicates a segmental duplication, copy number
variant, or structural variant, where multiple copies of similar genomic
regions map to the same reference location. In this case, it may be worthwhile
to inspect the region to see whether it has elevated coverage, and whether you
can identify more than 2 haplotypes among the overlapping reads. If you can,
this suggests that the region may carry a copy number variation. Some analysis
of this was presented at AGBT in the poster
“Uncaptured segmental duplication creates artifacts in workflows using GRCh37”.
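To check a suspect region for elevated coverage, something like the following
sketch can help. It assumes samtools is installed, a sorted and indexed BAM,
and placeholder coordinates:

```bash
# Mean depth over the suspect region; compare against a nearby control region.
samtools depth -a -r chr1:5000000-5010000 sample.bam \
  | awk '{ sum += $3 } END { if (NR > 0) print sum / NR }'
```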
This pattern of undercalling positions at high variant density may affect
variant-dense non-human species (those with a variant density of >1 in 40
positions). For an analysis of this, please see our blog
“Improved non-human variant calling using species-specific DeepVariant models”.
If none of these reasons seem applicable, there could be some other reason
DeepVariant determined the position is not variant. You can catalog the variant
position and its support. The way to improve variant calling at these positions
is to train new models, but be aware that training is already a balance between
reducing false negatives and false positives, and it may not be possible to
call variants like the one you are seeing without increasing overall false
positives by a greater amount.
See this blog post.
If the actual depth in a particular region is greater than the pileup image
height, DeepVariant randomly downsamples reads until the image has been filled
up. For the default DeepVariant models (height 100), an image can accommodate at
most 95 reads in a given region (5 rows are reserved for the reference
sequence).
You may be able to successfully run our pretrained models with a different
pileup image height (via --pileup_image_height in make_examples.py), depending
on the new height, but we generally do not recommend using different image
heights at training and inference time. If you wish to use a different pileup
image height, we recommend retraining a new model with images of that height.
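For example, if you do retrain, passing a different height through
run_deepvariant could look like the following sketch (the value 300 is
arbitrary):

```
--make_examples_extra_args="pileup_image_height=300"
```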
If you are working with extremely high coverage sequencing data for applications
such as somatic sequencing, we recommend using a somatic caller instead of
DeepVariant, which is a germline caller.
We do not recommend using DeepVariant for somatic calling. We do have a
prototype implementation for somatic calling, which can take a tumor and a
normal BAM and call subclonal variants. However, we don't yet have enough
confidence that the available truth sets come from a sufficiently diverse
sampling of cancers and mutational profiles for us to be certain of releasing
something of high quality. We're watching developments in the area of these
truth sets and hope to be able to further develop the somatic caller in the
future.
DeepVariant has previously been applied to plant species. In the case of rice,
there was good evidence of high accuracy. You can see some results in this blog
post. However, these rice genomes were diploid, with a variant density similar
to that of humans.
DeepVariant is currently written as a diploid variant caller, so if the plant
species you are working with is polyploid, it is not yet clear how DeepVariant
will perform. Even with retraining, DeepVariant can only produce variant calls
that are homozygous alternate, heterozygous, or homozygous reference, which
don't have much meaning in, for example, a tetraploid genome.
See this blog post.
In general, we recommend running DeepVariant using Docker for the simplest
setup. If you are building from source because you want to experiment with
changes to the codebase, we still recommend Docker. You can clone the
DeepVariant repo, modify the source code, and build a Docker image with your
changes using the provided Dockerfile.
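A minimal sketch of that workflow (the image tag is arbitrary, and you may need
-f if the Dockerfile in your checkout has a different name):

```bash
git clone https://github.com/google/deepvariant.git
cd deepvariant
# ...modify the source code...
docker build -t deepvariant:custom .  # builds from the provided Dockerfile
```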
This often happens because of the way Docker works: input and output
directories have to be mounted, and files are then referred to by their mounted
locations, which can be confusing. To check that files are visible inside the
Docker container, you can run ls inside the container. For example, using the
setup shown in the README and looking inside the /input volume:
```bash
BIN_VERSION="1.6.1"
docker run \
  -v "YOUR_INPUT_DIR":"/input" \
  -v "YOUR_OUTPUT_DIR":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  ls /input
```
Mounting directories with Docker can be confusing. One trick to make this
simpler is to mount both sides as your $HOME, so the paths are the same inside
and outside the Docker container.
```bash
echo $HOME  # See what your home directory is first.
ls $HOME
BIN_VERSION="1.6.1"
sudo docker run \
  -v "${HOME}":"${HOME}" \
  google/deepvariant:"${BIN_VERSION}" \
  ls $HOME
```
Since the DeepVariant v0.9 release, we recommend
"Best practices for multi-sample variant calling with DeepVariant".
For calling specifically on duos or trios, we introduced DeepTrio in v1.1.
We have been observing the following message when running on GPU since we moved
platforms from Slim to Keras:

```
2023-10-20 22:21:03.818638: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1278] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
```

We have tested and confirmed that this does not affect GPU usage or inference,
so you can continue running DeepVariant without worrying about this message.
16GB. In our tests, we observed the model occupying 16GB of GPU memory.
No. We have moved from Slim to Keras. All models before v1.6.0 were trained on
the Slim platform, so they are not compatible with v1.6.0 and later.
No. Although possible, we have not yet implemented multi-GPU inference.
No. TensorFlow's Estimator API does provide support for running training on
multiple GPUs through the use of a DistributionStrategy. However,
DistributionStrategy cannot be used with exponential moving average (EMA), which
is present in the DeepVariant codebase.
From the DeepVariant 2018 manuscript:
Mapped reads are preprocessed using an error-tolerant, local
De-Bruijn-graph-based read assembly procedure that realigns them according to
their most likely derived haplotype. Candidate windows across the genome are
selected for reassembly by looking for any evidence of possible genetic
variation, such as mismatching or soft clipped bases. The selection criteria
for a candidate window are very permissive so that true variation is unlikely
to be missed. All candidate windows across the genome are considered
independently. De Bruijn graphs are constructed using multiple fixed k-mer
sizes (from 20 to 75, inclusive, with increments of 5) out of the reference
genome bases for the candidate window, as well as all overlapping reads. Edges
are given a weight determined by how many times they are observed in the
reads. We trim any edges with weight less than three, except that edges found
in the reference are never trimmed. Candidate haplotypes are generated by
traversing the assembly graphs and the top two most likely haplotypes are
selected that best explain the read evidence. The likelihood function used to
score haplotypes is a traditional pair HMM with fixed parameters that do not
depend on base quality scores. This likelihood function assumes that each read
is independent. Finally, each read is then realigned to its most likely
haplotype. This procedure updates both the position and the CIGAR string for
each read.
Local realignment is not performed for long reads (PacBio and other similar
technologies). The realigner step can optionally be switched off using
--norealign_reads.
There is also the option to output the realigned reads, e.g. to inspect the new
alignments in IGV. This can be done by passing the following parameters:

```
--make_examples_extra_args="emit_realigned_reads=true,realigner_diagnostics=/output/realigned_reads"
```
Note that this is meant for debugging and produces a BAM file for every
candidate variant, which can result in millions of tiny BAM files. When using
this, narrow down the DeepVariant run with --regions to just the variants you
want to inspect more closely.
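Putting this together, a debugging run restricted to a single candidate region
might look like the following sketch (paths and coordinates are placeholders):

```bash
BIN_VERSION="1.6.1"
docker run \
  -v "YOUR_INPUT_DIR":"/input" \
  -v "YOUR_OUTPUT_DIR":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=/input/ref.fasta \
  --reads=/input/sample.bam \
  --output_vcf=/output/sample.vcf.gz \
  --regions="chr20:10001000-10002000" \
  --make_examples_extra_args="emit_realigned_reads=true,realigner_diagnostics=/output/realigned_reads"
```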
How are the AD and DP values calculated?
In order to perform variant calling efficiently, DeepVariant partitions the
genome into chunks (set by --partition_size) and reads at most a fixed number
of reads into each partition (set by --max_reads_per_partition). By default,
--partition_size is set to 1000 and --max_reads_per_partition is set to 1500.
The AD and DP values are based on the read depths constrained by
--max_reads_per_partition.
For example, if you have a depth of 2000x at a given site, DeepVariant will
subsample 1500 reads, and DP or AD will be capped at 1500. If you want to
calculate the true AD and DP values in high-depth regions, you can set
--max_reads_per_partition=0 to calculate AD and DP using all reads. In
practice, capping reads per partition reduces runtimes with little or no impact
on accuracy.
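When invoking through run_deepvariant, the setting above would be passed via
--make_examples_extra_args, for example:

```
# Use all reads when computing AD/DP (at the cost of longer runtimes).
--make_examples_extra_args="max_reads_per_partition=0"
```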
This is a known issue that we don't currently address. Please see
https://github.com/google/deepvariant/issues/505 for more context.
Please see the answers provided by Paul Grosu in this issue thread. We thank
Paul for providing a detailed description and reasoning.
TMPDIR issue when running with Singularity:
If you have issues with TMPDIR when running with Singularity, try adding this
to your command:

```bash
export TMPDIR="$PWD/tmp_dir"
```

See https://github.com/google/deepvariant/issues/524#issuecomment-1067597987.
/mnt/ issue with Singularity:
Users have reported that their setup sometimes uses /mnt/, which exists in our
Docker image and has caused an issue in Singularity. You can use -B in
Singularity to avoid this issue. See
https://github.com/google/deepvariant/issues/530#issuecomment-1076923302 for
more details.
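For example, you can bind a host directory over /mnt so the image's own /mnt
does not interfere. This is a sketch with a placeholder host path:

```bash
BIN_VERSION="1.6.1"
singularity exec \
  -B "YOUR_HOST_DIR":/mnt \
  docker://google/deepvariant:"${BIN_VERSION}" \
  ls /mnt
```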