<img src="docs/images/dv_logo.png" width=50% height=50%>

[![release](https://img.shields.io/badge/release-v1.8-green?logo=github)](https://github.com/google/deepvariant/releases)
[![announcements](https://img.shields.io/badge/announcements-blue)](https://groups.google.com/d/forum/deepvariant-announcements)
[![blog](https://img.shields.io/badge/blog-orange)](https://goo.gl/deepvariant)

DeepVariant is a deep learning-based variant caller that takes aligned reads (in
BAM or CRAM format), produces pileup image tensors from them, classifies each
tensor using a convolutional neural network, and finally reports the results in
a standard VCF or gVCF file.

DeepVariant supports germline variant-calling in diploid organisms.

**DeepVariant case-studies for germline variant calling:**

*   NGS (Illumina or Element) data for either a
    [whole genome](docs/deepvariant-case-study.md) or
    [whole exome](docs/deepvariant-exome-case-study.md).
*   PacBio HiFi data
    [PacBio case study](docs/deepvariant-pacbio-model-case-study.md).
*   Oxford Nanopore R10.4.1
    [Simplex case study](docs/deepvariant-ont-r104-simplex-case-study.md),
    [Duplex case study](docs/deepvariant-ont-r104-duplex-case-study.md).
*   Complete Genomics
    [T7 case study](docs/deepvariant-complete-t7-case-study.md),
    [G400 case study](docs/deepvariant-complete-g400-case-study.md).
*   Pangenome-mapping-based case-study:
    [vg case study](docs/deepvariant-vg-case-study.md).
*   RNA data for
    [PacBio Iso-Seq/MAS-Seq case study](docs/deepvariant-masseq-case-study.md)
    and [Illumina RNA-seq case study](docs/deepvariant-rnaseq-case-study.md).
*   Hybrid PacBio HiFi + Illumina WGS, see the
    [hybrid case study](docs/deepvariant-hybrid-case-study.md).

**Pangenome-aware DeepVariant case-studies:**

*   Pangenome-aware DeepVariant WGS (Illumina or Element):
    [Mapped with BWA](docs/pangenome-aware-wgs-bwa-case-study.md),
    [Mapped with VG](docs/pangenome-aware-wgs-vg-case-study.md).
*   Pangenome-aware DeepVariant WES (Illumina or Element):
    [Mapped with BWA](docs/pangenome-aware-wes-bwa-case-study.md).

We have also adapted DeepVariant for somatic calling. See the
[DeepSomatic](https://github.com/google/deepsomatic) repo for details.

Please also note:

*   DeepVariant currently supports variant calling on organisms where the
    ploidy/copy-number is two. This is because the genotypes supported are
    hom-alt, het, and hom-ref.
*   The models included with DeepVariant are only trained on human data. For
    other organisms, see the
    [blog post on non-human variant-calling](https://google.github.io/deepvariant/posts/2018-12-05-improved-non-human-variant-calling-using-species-specific-deepvariant-models/)
    for some possible pitfalls and how to handle them.

## DeepTrio

DeepTrio is a deep learning-based trio variant caller built on top of
DeepVariant. DeepTrio extends DeepVariant's functionality, allowing it to
utilize the power of neural networks to predict genomic variants in trios or
duos. See [this page](docs/deeptrio-details.md) for more details and
instructions on how to run DeepTrio.

DeepTrio supports germline variant-calling in diploid organisms for the
following types of input data:

*   NGS (Illumina) data for either
    [whole genome](docs/deeptrio-wgs-case-study.md) or whole exome.
*   PacBio HiFi data, see the
    [PacBio case study](docs/deeptrio-pacbio-case-study.md).

Please also note:

*   All DeepTrio models were trained on human data.
*   It is possible to use DeepTrio with only 2 samples (child, and one parent).
*   External tool [GLnexus](https://github.com/dnanexus-rnd/GLnexus) is used to
    merge output VCFs.

## How to run DeepVariant

We recommend using our Docker solution. The command will look like this:

```
BIN_VERSION="1.8.0"
docker run \
  -v "YOUR_INPUT_DIR":"/input" \
  -v "YOUR_OUTPUT_DIR:/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \ **Replace this string with exactly one of the following [WGS,WES,PACBIO,ONT_R104,HYBRID_PACBIO_ILLUMINA]**
  --ref=/input/YOUR_REF \
  --reads=/input/YOUR_BAM \
  --output_vcf=/output/YOUR_OUTPUT_VCF \
  --output_gvcf=/output/YOUR_OUTPUT_GVCF \
  --num_shards=$(nproc) \ **This will use all your cores to run make_examples. Feel free to change.**
  --vcf_stats_report=true \ **Optional. Creates VCF statistics report in html file. Default is false.**
  --disable_small_model=true \ **Optional. Disables the small model from make_examples stage. Default is false.**
  --logging_dir=/output/logs \ **Optional. This saves the log output for each stage separately.**
  --haploid_contigs="chrX,chrY" \ **Optional. Heterozygous variants in these contigs will be re-genotyped as the most likely of reference or homozygous alternates. For a sample with karyotype XY, it should be set to "chrX,chrY" for GRCh38 and "X,Y" for GRCh37. For a sample with karyotype XX, this should not be used.**
  --par_regions_bed="/input/GRCh3X_par.bed" \ **Optional. If --haploid_contigs is set, then this can be used to provide PAR regions to be excluded from genotype adjustment. Download links to these files are available on this page.**
  --dry_run=false **Default is false. If set to true, commands will be printed out but not executed.**
```

For details on X, Y support, please see
[DeepVariant haploid support](docs/deepvariant-haploid-support.md) and the case
study in
[DeepVariant X, Y case study](docs/deepvariant-xy-calling-case-study.md). You
can download the PAR bed files from here:
[GRCh38_par.bed](https://storage.googleapis.com/deepvariant/case-study-testdata/GRCh38_PAR.bed),
[GRCh37_par.bed](https://storage.googleapis.com/deepvariant/case-study-testdata/GRCh37_PAR.bed).
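
The Docker invocation above can also be assembled in a script, so that the
optional haploid flags are added only for XY samples. The following is a minimal
sketch and not part of DeepVariant: `build_deepvariant_cmd` and its karyotype
argument are hypothetical, the `YOUR_*` values are the same placeholders used
above, and it assumes a GRCh38 reference with the PAR file saved as
`/input/GRCh38_PAR.bed`. It only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: build the run_deepvariant command as an argument list so optional
# flags can be appended conditionally. Prints the command instead of running it.

build_deepvariant_cmd() {
  karyotype="$1"   # hypothetical argument: "XX" or "XY"
  bin_version="1.8.0"

  set -- docker run \
    -v "YOUR_INPUT_DIR:/input" \
    -v "YOUR_OUTPUT_DIR:/output" \
    "google/deepvariant:${bin_version}" \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref=/input/YOUR_REF \
    --reads=/input/YOUR_BAM \
    --output_vcf=/output/YOUR_OUTPUT_VCF \
    --output_gvcf=/output/YOUR_OUTPUT_GVCF

  # Haploid re-genotyping of chrX/chrY applies to XY samples only.
  # Contig names and the PAR file name assume GRCh38.
  if [ "$karyotype" = "XY" ]; then
    set -- "$@" \
      --haploid_contigs=chrX,chrY \
      --par_regions_bed=/input/GRCh38_PAR.bed
  fi

  # Print the arguments separated by spaces.
  # Swap these two lines for "$@" to execute the command.
  printf '%s ' "$@"
  printf '\n'
}

build_deepvariant_cmd XY
```

Calling `build_deepvariant_cmd XX` leaves out both haploid flags, matching the
note above that `--haploid_contigs` should not be used for XX samples.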

To see all flags you can use, run:
`docker run google/deepvariant:"${BIN_VERSION}"`

If you're using GPUs, or want to use Singularity instead, see
[Quick Start](docs/deepvariant-quick-start.md) for more details.

If you are running on a machine with a GPU, an experimental mode is available
that runs the `make_examples` stage on the CPU while the `call_variants` stage
runs on the GPU at the same time.
For more details, refer to the [Fast Pipeline case study](docs/deepvariant-fast-pipeline-case-study.md).

For more information, also see:

*   [Full documentation list](docs/README.md)
*   [Detailed usage guide](docs/deepvariant-details.md) with more information on
    the input and output file formats and how to work with them.
*   [Best practices for multi-sample variant calling with DeepVariant](docs/trio-merge-case-study.md)
*   [(Advanced) Training tutorial](docs/deepvariant-training-case-study.md)
*   [DeepVariant's Frequently Asked Questions, FAQ](docs/FAQ.md)

## How to cite

If you're using DeepVariant in your work, please cite:

[A universal SNP and small-indel variant caller using deep neural networks. *Nature Biotechnology* 36, 983–987 (2018).](https://rdcu.be/7Dhl) <br/>
Ryan Poplin, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander Ku, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pegah T. Afshar, Sam S. Gross, Lizzie Dorfman, Cory Y. McLean, and Mark A. DePristo.<br/>
doi: https://doi.org/10.1038/nbt.4235

Additionally, if you are generating multi-sample calls using our
[DeepVariant and GLnexus Best Practices](docs/trio-merge-case-study.md), please
cite:

[Accurate, scalable cohort variant calls using DeepVariant and GLnexus.
_Bioinformatics_ (2021).](https://doi.org/10.1093/bioinformatics/btaa1081)<br/>
Taedong Yun, Helen Li, Pi-Chuan Chang, Michael F. Lin, Andrew Carroll, and Cory
Y. McLean.<br/>
doi: https://doi.org/10.1093/bioinformatics/btaa1081

## Why Use DeepVariant?

*   **High accuracy** - DeepVariant won the 2020
    [PrecisionFDA Truth Challenge V2](https://precision.fda.gov/challenges/10/results)
    for All Benchmark Regions in the ONT, PacBio, and Multiple Technologies
    categories, and the 2016
    [PrecisionFDA Truth Challenge](https://precision.fda.gov/challenges/truth/results)
    for best SNP Performance. DeepVariant maintains high accuracy across data
    from different sequencing technologies, prep methods, and species. At
    [lower coverage](https://google.github.io/deepvariant/posts/2019-09-10-twenty-is-the-new-thirty-comparing-current-and-historical-wgs-accuracy-across-coverage/),
    using DeepVariant makes an especially large difference. See
    [metrics](docs/metrics.md) for the latest accuracy numbers on each of the
    sequencing types.
*   **Flexibility** - Out-of-the-box use for
    [PCR-positive](https://ai.googleblog.com/2018/04/deepvariant-accuracy-improvements-for.html)
    samples and
    [low quality sequencing runs](https://blog.dnanexus.com/2018-01-16-evaluating-the-performance-of-ngs-pipelines-on-noisy-wgs-data/),
    and easy adjustments for
    [different sequencing technologies](https://google.github.io/deepvariant/posts/2019-01-14-highly-accurate-snp-and-indel-calling-on-pacbio-ccs-with-deepvariant/)
    and
    [non-human species](https://google.github.io/deepvariant/posts/2018-12-05-improved-non-human-variant-calling-using-species-specific-deepvariant-models/).
*   **Ease of use** - No filtering is needed beyond setting your preferred
    minimum quality threshold.
*   **Cost effectiveness** - With a single non-preemptible n1-standard-16
    machine on Google Cloud, it costs ~$11.8 to call a 30x whole genome and
    ~$0.89 to call an exome. With preemptible pricing, the cost is $2.84 for a
    30x whole genome and $0.21 for whole exome (not considering preemption).
*   **Speed** - See [metrics](docs/metrics.md) for the runtime of all supported
    datatypes on a 96-core CPU-only machine<sup>[(1)](#myfootnote1)</sup>.
    Multiple options for acceleration exist.
*   **Usage options** - DeepVariant can be run via Docker or binaries, on
    either on-premises hardware or in the cloud, with support for hardware
    accelerators like GPUs and TPUs.

<a name="myfootnote1">(1)</a>: Time estimates do not include mapping.
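
The "minimum quality threshold" in the Ease of use item can be applied with
standard text tools. The following is a minimal sketch, assuming the threshold
is applied to the VCF `QUAL` column (column 6) of an uncompressed VCF. The
`filter_by_qual` helper and the value 20 are illustrative only, not a
recommendation from DeepVariant.

```shell
# Keep VCF header lines plus records whose QUAL (column 6) is >= the threshold.
# Records with a missing QUAL (".") are dropped.
filter_by_qual() {
  min_qual="$1"
  awk -F '\t' -v min="$min_qual" '/^#/ || ($6 != "." && $6 + 0 >= min + 0)'
}

# Example (placeholder file names):
#   filter_by_qual 20 < YOUR_OUTPUT_VCF > filtered.vcf
```

For a compressed VCF, decompress it on the way in, for example
`gzip -dc YOUR_OUTPUT_VCF.gz | filter_by_qual 20`.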

## How DeepVariant works

![Stages in DeepVariant](docs/images/inference_flow_diagram.svg)

For more information on the pileup images and how to read them, please see the
["Looking through DeepVariant's Eyes" blog post](https://google.github.io/deepvariant/posts/2020-02-20-looking-through-deepvariants-eyes/).

DeepVariant relies on [Nucleus](https://github.com/google/nucleus), a library of
Python and C++ code for reading and writing data in common genomics file formats
(like SAM and VCF) designed for painless integration with the
[TensorFlow](https://www.tensorflow.org/) machine learning framework. Nucleus
was built with DeepVariant in mind and open-sourced separately so it can be used
by anyone in the genomics research community for other projects. See this blog
post on
[Using Nucleus and TensorFlow for DNA Sequencing Error Correction](https://google.github.io/deepvariant/posts/2019-01-31-using-nucleus-and-tensorflow-for-dna-sequencing-error-correction/).

## DeepVariant Setup

### Prerequisites

*   Unix-like operating system (cannot run on Windows)
*   Python 3.10

### Official Solutions

Below are the official solutions provided by the
[Genomics team in Google Health](https://health.google/health-research/).

Name                                                | Description
:-------------------------------------------------: | -----------
[Docker](docs/deepvariant-quick-start.md)           | This is the recommended method.
[Build from source](docs/deepvariant-build-test.md) | DeepVariant comes with scripts to build it on Ubuntu 20.04. To build and run on other Unix-based systems, you will need to modify these scripts.
Prebuilt Binaries                                   | Available at [`gs://deepvariant/`](https://console.cloud.google.com/storage/browser/deepvariant). These are compiled to use SSE4 and AVX instructions, so you will need a CPU (such as Intel Sandy Bridge) that supports them. You can check the `/proc/cpuinfo` file on your computer, which lists these features under "flags".
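
The `/proc/cpuinfo` check mentioned for the prebuilt binaries can be scripted.
The following is a minimal sketch for Linux, assuming the relevant flag names
are `sse4_1`, `sse4_2`, and `avx` as the kernel reports them (the table above
only says "SSE4 and AVX"). `check_cpu_flags` is a hypothetical helper, not a
DeepVariant tool.

```shell
# Report whether each flag appears on the first "flags" line of a
# cpuinfo-style file (defaults to /proc/cpuinfo).
check_cpu_flags() {
  cpuinfo="${1:-/proc/cpuinfo}"
  flags_line="$(grep -m 1 '^flags' "$cpuinfo")"
  for flag in sse4_1 sse4_2 avx; do
    # -w matches whole words, so "avx2" does not count as "avx".
    if printf '%s\n' "$flags_line" | grep -qw "$flag"; then
      echo "$flag: yes"
    else
      echo "$flag: no"
    fi
  done
}

# Example:
#   check_cpu_flags
```

Any flag reported as `no` suggests the prebuilt binaries may not run on that
machine, in which case Docker or building from source are the other options
listed above.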

## Contribution Guidelines

Please [open a pull request](https://github.com/google/deepvariant/compare) if
you wish to contribute to DeepVariant. Note that we have not set up the
infrastructure to merge pull requests externally. If you agree, we will test and
submit the changes internally and mention your contributions in our
[release notes](https://github.com/google/deepvariant/releases). We apologize
for any inconvenience.

If you have any difficulty using DeepVariant, feel free to
[open an issue](https://github.com/google/deepvariant/issues/new). If you have
general questions not specific to DeepVariant, we recommend that you post on a
community discussion forum such as [BioStars](https://www.biostars.org/).

## License

[BSD-3-Clause license](LICENSE)

## Acknowledgements

DeepVariant happily makes use of many open source packages. We would like to
specifically call out a few key ones:

*   [Boost Graph Library](http://www.boost.org/doc/libs/1_65_1/libs/graph/doc/index.html)
*   [abseil-cpp](https://github.com/abseil/abseil-cpp) and
    [abseil-py](https://github.com/abseil/abseil-py)
*   [pybind11](https://github.com/pybind/pybind11)
*   [GNU Parallel](https://www.gnu.org/software/parallel/)
*   [htslib & samtools](http://www.htslib.org/)
*   [Nucleus](https://github.com/google/nucleus)
*   [numpy](http://www.numpy.org/)
*   [SSW Library](https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library)
*   [TensorFlow](https://www.tensorflow.org/)

We thank all of the developers and contributors to these packages for their
work.

## Disclaimer

This is not an official Google product.

NOTE: the content of this research code repository (i) is not intended to be a
medical device; and (ii) is not intended for clinical use of any kind, including
but not limited to diagnosis or prognosis.