Diff of /README.md [000000] .. [0a8e41]

Switch to unified view

a b/README.md
1
# NeuSomatic: Deep convolutional neural networks for accurate somatic mutation detection
2
3
NeuSomatic is based on deep convolutional neural networks for accurate somatic mutation detection. With properly trained models, it can robustly perform across sequencing platforms, strategies, and conditions. NeuSomatic summarizes and augments sequence alignments in a novel way and incorporates multi-dimensional features to capture variant signals effectively. It is not only a universal but also accurate somatic mutation detection method.
4
5
For more information contact us at bioinformatics.red@roche.com
6
7
## Publication
8
If you use NeuSomatic in your work, please cite the following papers:
9
10
Sayed Mohammad Ebrahim Sahraeian, Ruolin Liu, Bayo Lau, Karl Podesta, Marghoob Mohiyuddin, Hugo Y. K. Lam, <br/> 
11
[Deep convolutional neural networks for accurate somatic mutation detection. Nature Communications 10: 1041, (2019). <br/> 
12
doi: https://doi.org/10.1038/s41467-019-09027-x](https://doi.org/10.1038/s41467-019-09027-x)
13
14
Sayed Mohammad Ebrahim Sahraeian, Li Tai Fang, Marghoob Mohiyuddin, Huixiao Hong, Wenming Xiao, <br/> 
15
[Robust cancer mutation detection with deep learning models derived from tumor-normal sequencing data. bioRxiv (2019): 667261. <br/> 
16
doi: https://doi.org/10.1101/667261](https://doi.org/10.1101/667261)
17
18
19
## Example Input Matrix
20
![Example input](resources/toy_example.png)
21
22
## Table of Contents
23
**[Availability](#availability)**<br>
24
**[NeuSomatic Docker Image](#neusomatic-docker-image)**<br>
25
**[Required Inputs](#required-inputs)**<br>
26
**[Quick Test](#quick-test)**<br>
27
**[Example Usage](#example-usage)**<br>
28
**[Ensemble mode](#ensemble-mode)**<br>
29
**[Creating Training Data](#creating-training-data)**<br>
30
**[Trained Network Models](#trained-network-models)**<br>
31
**[HPC run](#hpc-run)**<br>
32
**[License](#license)**<br>
33
34
35
## Availability
36
37
NeuSomatic is written in Python and C++ and requires a Unix-like environment to run. It has been sucessfully tested on CentOS 7. Its deep learning framework is implemented using PyTorch 1.1.0 to enable GPU acceleration for training/testing.
38
39
NeuSomatic first scans the genome to identify candidate variants and extract alignment information. 
40
The binary for this step can be obtained at `neusomatic/bin` folder by running `./build.sh` (which requires cmake 3.13.2 and g++ 5.4.0).
41
42
Python 3.7 and the following Python packages must be installed:
43
* pytorch 1.1.0
44
* torchvision 0.3.0
45
* pybedtools 0.8.0
46
* pysam 0.15.2
47
* zlib 1.2.11
48
* numpy 1.15.4
49
* scipy 1.2.0
50
* imageio 2.5.0
51
* biopython 1.73
52
53
It also depends on the following packages:
54
* cudatoolkit 9.0 (if you want to use GPU)
55
* tabix 0.2.6
56
* bedtools 2.27.1
57
* samtools 1.9
58
59
You can install these packages using [anaconda](https://www.anaconda.com/download)/[miniconda](https://conda.io/miniconda.html) :
60
```
61
conda install zlib=1.2.11 numpy=1.15.4 scipy=1.2.0 cmake=3.13.2 imageio=2.5.0
62
conda install pysam=0.15.2 pybedtools=0.8.0 samtools=1.9 tabix=0.2.6 bedtools=2.27.1 biopython=1.73 -c bioconda
63
conda install pytorch=1.1.0 torchvision=0.3.0 cudatoolkit=9.0 -c pytorch
64
```
65
Then you can export the conda paths as:
66
```
67
export PATH="/PATH/TO/CONDA/bin:$PATH"
68
export LD_LIBRARY_PATH="/PATH/TO/CONDA/lib:$LD_LIBRARY_PATH"
69
```
70
g++ 5.4.0 can also be obained as `sudo apt-get install gcc-5 g++-5`.
71
72
## NeuSomatic Docker Image
73
74
The docker image with all the packages installed (CPU-only) can be found at https://hub.docker.com/r/msahraeian/neusomatic/ 
75
76
To use GPU (in `train.py` and `call.py` steps), you should use conda environment to locally install required packages as shown above.
77
78
The dockerfile is also available at `docker/Dockerfile` for local build.
79
80
Examples on how to use the docker image are shown at `test/docker_test.sh`.
81
82
## Required Inputs
83
84
For training mode, the following inputs are required:
85
* tumor `.bam` alignment file 
86
* normal `.bam` alignment file
87
* training region `.bed` file
88
* truth somatic variant `.vcf` file
89
90
For calling mode, the following inputs are required:
91
* tumor `.bam` alignment file 
92
* normal `.bam` alignment file
93
* call region `.bed` file
94
* trained model `.pth` file
95
96
For the region `.bed` files, if you don't have any preferred target regions for training/calling, you can use the whole genome as the target region. Example bed files of major chromosomes for human hg38, hg19, and b37 references can be found at [resources](resources).
97
98
## Quick Test
99
Testing the preprocessing, calling, and postprocessing steps:
100
```
101
cd test
102
./run_test.sh
103
```
104
The outputs at `test/example/work_standalone/NeuSomatic_standalone.vcf` and `test/example/work_ensemble/NeuSomatic_ensemble.vcf` for stand-alone and ensemble modes should look like `test/NeuSomatic_standalone.vcf` and `test/NeuSomatic_ensemble.vcf`, respectively.
105
106
Similarly, you can test docker image as:
107
```
108
cd test
109
./docker_test.sh
110
```
111
112
## Example Usage
113
114
For training:
115
1. Preprocess step in train mode (scan the alignments, find candidates, generate input matrix dataset)
116
```
117
python preprocess.py \
118
    --mode train \
119
    --reference GRCh38.fa \
120
    --region_bed region.bed \
121
    --tumor_bam tumor.bam \
122
    --normal_bam normal.bam \
123
    --work work_train \
124
    --truth_vcf truth.vcf \
125
    --min_mapq 10 \
126
    --number_threads 10 \
127
    --scan_alignments_binary ../bin/scan_alignments
128
```
129
2. Train network
130
```
131
python train.py \
132
    --candidates_tsv work_train/dataset/*/candidates*.tsv \
133
    --out work_train \
134
    --num_threads 10 \
135
    --batch_size 100 
136
```
137
If you want to continue training starting from a pretrained model you can use `--checkpoint`.
138
139
For testing:
140
1. Preprocess step in call mode (scan the alignments, find candidates, generate input matrix dataset)
141
```
142
python preprocess.py \
143
    --mode call \
144
    --reference GRCh38.fa \
145
    --region_bed region.bed \
146
    --tumor_bam tumor.bam \
147
    --normal_bam normal.bam \
148
    --work work_call \
149
    --min_mapq 10 \
150
    --number_threads 10 \
151
    --scan_alignments_binary ../bin/scan_alignments
152
```
153
2. Call variants
154
```
155
python call.py \
156
    --candidates_tsv work_call/dataset/*/candidates*.tsv \
157
    --reference GRCh38.fa \
158
    --out work_call \
159
    --checkpoint work_train/some_checkpoint.pth \
160
    --num_threads 10 \
161
    --batch_size 100 
162
```
163
3. Postprocess step (resolve long INDEL sequences, report vcf)
164
```
165
python postprocess.py \
166
    --reference GRCh38.fa \
167
    --tumor_bam tumor.bam \
168
    --pred_vcf work_call/pred.vcf \
169
    --candidates_vcf work_call/work_tumor/filtered_candidates.vcf \
170
    --output_vcf work_call/NeuSomatic.vcf \
171
    --work work_call 
172
```
173
Here, the final NeuSomatic prediction is reported at `work_call/NeuSomatic.vcf`.
174
175
NeuSomatic will use GPUs in train/call steps if they are avilable. To use specific GPUs for train/call steps, you can set the environment variable `CUDA_VISIBLE_DEVICES` as:
176
177
```
178
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py ...
179
```
180
or
181
```
182
CUDA_VISIBLE_DEVICES=0,1,2,3 python call.py ...
183
```
184
185
To run in CPU mode you can disable accessing to GPU by exporting `CUDA_VISIBLE_DEVICES=`.
186
187
## Ensemble mode
188
NeuSomatic can be used universally as a stand-alone somatic mutation detection method or with an ensemble of existing methods. NeuSomatic currently supports outputs from MuTect2, MuSE, Strelka2, SomaticSniper, VarDict, and VarScan2. For ensemble mode, the ensembled outputs of different somatic callers (as a single `.tsv` file) should be prepared and inputed using `--ensemble_tsv` argument in `preprocess.py` and `postprocess.py` . 
189
190
There are two alternative ways to prepare this file:
191
192
1. **Dockerized solution** for running all of the individual somatic callers (MuTect2, MuSE, Strelka2, SomaticSniper, VarDict, and VarScan2), and a wrapper that combines their output is explained at [ensemble_docker_pipelines](https://github.com/bioinform/neusomatic/tree/master/ensemble_docker_pipelines).
193
194
2. Alternartively, if you don't want to use docker and already have the output of all somatic callers, you can prepare the `.tsv` file using the `SomaticSeq.Wrapper.sh` script [here](https://github.com/bioinform/somaticseq/blob/master/SomaticSeq.Wrapper.sh) (you may need to install the necessary dependencies for this script, explained [here](https://github.com/bioinform/somaticseq)).
195
196
    For instance:
197
    ```
198
    SomaticSeq.Wrapper.sh \
199
    --output-dir output \
200
    --genome-reference GRCh38.fa \
201
    --tumor-bam tumor.bam \
202
    --normal-bam normal.bam \
203
     -mutect2 MuTect2.vcf \
204
    --varscan-snv VarScan2.snp.vcf \
205
    --varscan-indel VarScan2.indel.vcf \
206
     -sniper SomaticSniper.vcf \
207
    --vardict VarDict.vcf \
208
    --muse MuSE.vcf \
209
    --strelka-snv somatic.snvs.vcf.gz \
210
    --strelka-indel somatic.indels.vcf.gz \
211
    --inclusion-region region.bed \
212
     -dbsnp dbsnp.GRCh38.vcf \
213
     -gatk GenomeAnalysisTK.jar
214
    ```
215
216
    Then, in the output directory, do:
217
    ```
218
    cat <(cat Ensemble.s*.tsv |grep CHROM|head -1) \
219
        <(cat Ensemble.s*.tsv |grep -v CHROM) | sed "s/nan/0/g" > ensemble_ann.tsv
220
    ```
221
    and provide `ensemble_ann.tsv` as `--ensemble_tsv` argument in `preprocess.py` and `postprocess.py`.
222
223
224
### NOTE: 
225
226
* To train or call in the ensemble mode you should use `--ensemble` argument in `train.py` and `call.py`.
227
228
## Creating Training Data
229
The best performance can be obtained when the network is trained on your input dataset. You can creat training data for your input data using [BAMSurgeon](https://github.com/adamewing/bamsurgeon) that can spike in *in silico* somatic mutations into existing BAM files. The dockerized piplines and complete documentation on how to creat training sets can be found [here](https://github.com/bioinform/somaticseq/tree/master/utilities/dockered_pipelines/bamSimulator).
230
231
You can then used the synthetic tumor/normal pair and the known *in silico* spiked mutations (as truth set) to train the NeuSomatic network as shown in [Example Usage](#example-usage).
232
233
234
235
## Trained Network Models
236
We provide a set of trained NeuSomatic network models for general purpose usage. Users should note that these models are trained for sepcific settings and are not supposed to work perfectly for all circumestances.
237
238
The SEQC-II pretrained models are the recommended NeuSomatic models and are analyzed in detail in [Sahraeian et al. 2019](https://doi.org/10.1101/667261).
239
240
The following models can be found at `neusomatic/models` folder:
241
242
243
### Latest models
244
Model                                              | Mode         | Training Information                                                        
245
---------------------------------------------------|---------------|-----------------------------------------------------------------------
246
`NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth` |  Stand-alone  | SEQC-II (SEQC-WGS-Spike model) (trained on 20 WGS replicate pairs with in silico somatic mutations of 1%-100% AF, matched with both 95%N and 100%N, Illumina HiSeq and NovaSeq, BWA-MEM,  ~40x-220x)
247
`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth`   |  Ensemble     | SEQC-II (SEQC-WGS-Spike model) (trained on 20 WGS replicate pairs with in silico somatic mutations of 1%-100% AF, matched with both 95%N and 100%N, Illumina HiSeq and NovaSeq, BWA-MEM,  ~40x-220x, 5 callers used: MuTect2, Strelka2, MuSE, SomaticSniper, VarDict)
248
`NeuSomatic_v0.1.4_standalone_SEQC-WGS-GT50-SpikeWGS10.pth` |  Stand-alone  | SEQC-II (SEQC-WGS-GT50-SpikeWGS10 model) (trained on combination of two datasets: (1) 50% of the genome for 24 real tumor-normal SEQC-II replicates using the HighConf truth set annotation, with multiple purity settings of 100T-100N/10T-100N/10T-95N, 1%-100% AF and (2) 10% of data used in `NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth` model. Illumina HiSeq and NovaSeq, BWA-MEM,  ~40x-390x)
249
`NeuSomatic_v0.1.4_ensemble_SEQC-WGS-GT50-SpikeWGS10.pth` |  Ensemble     | SEQC-II (SEQC-WGS-GT50-SpikeWGS10 model) (trained on combination of two datasets: (1) 50% of the genome for 24 real tumor-normal SEQC-II replicates using the HighConf truth set annotation, with multiple purity settings of 100T-100N/10T-100N/10T-95N, 1%-100% AF and (2) 10% of data used in `NeuSomatic_v0.1.4_ensemble_SEQC-WGS-Spike.pth` model. Illumina HiSeq and NovaSeq, BWA-MEM,  ~40x-390x, 5 callers used: MuTect2, Strelka2, MuSE, SomaticSniper, VarDict)
250
251
### Older models
252
Model                                              | Mode         | Training Information                                                        
253
---------------------------------------------------|---------------|-----------------------------------------------------------------------
254
`NeuSomatic_v0.1.3_standalone_Dream3.pth` |  Stand-alone  | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM,  ~30x)
255
`NeuSomatic_v0.1.3_ensemble_Dream3.pth`   |  Ensemble     | WGS Dream Challenge Stage 3 (trained on multiple purity settings: 100T-100N/50T-100N/70T-95N/50T-95N/25T-95N, Illumina, BWA-MEM,  ~30x, 6 callers used: MuTect2, Strelka2, MuSE, SomaticSniper, VarDict, VarScan2)
256
`NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth` |  Stand-alone  | WGS Dream Challenge Stage 3 (70% tumor and 95% normal purities, Illumina, BWA-MEM,  ~30x) 
257
`NeuSomatic_v0.1.0_ensemble_Dream3_70purity.pth`   |  Ensemble     | WGS Dream Challenge Stage 3 (70% tumor and 95% normal purities, Illumina, BWA-MEM,  ~30x, 6 callers used: MuTect2, Strelka2, MuSE, SomaticSniper, VarDict, VarScan2) 
258
`NeuSomatic_v0.1.0_standalone_WEX_100purity.pth`   |  Stand-alone  | WEX (100% tumor and normal purities, Illumina, BWA-MEM,  ~125x)
259
`NeuSomatic_v0.1.0_ensemble_WEX_100purity.pth`     |  Ensemble     | WEX (100% tumor and normal purities, Illumina, BWA-MEM,  ~125x, 6 callers used: MuTect2, Strelka2, MuSE, SomaticSniper, VarDict, VarScan2)
260
261
262
263
## HPC run
264
For distributed data processing on HPC platforms, the input regions can be splitted to smaller sub-regions as follows:
265
```
266
python split_bed.py \
267
    --region_bed region.bed \
268
    --output work \
269
    --num_splits 10
270
```
271
Then, each sub-region can be processed as follows (for instance on SGE cluster):
272
```
273
for i in {0..10}
274
do
275
    qsub -pe smp 24 \
276
    "python preprocess.py \
277
    --mode call [or --mode train]
278
    --reference GRCh38.fa --tumor_bam tumor.bam --normal_bam normal.bam \
279
    --region_bed work/splits/region_${i}.bed \
280
    --work work/work_${i} \
281
    --min_mapq 10 --number_threads 24 \
282
    --scan_alignments_binary ../bin/scan_alignments"
283
done
284
```
285
Then the candiate `.tsv` files to be used for train/call can be found at `work_*/dataset/*/candidates*.tsv`.
286
287
## License
288
NeuSomatic is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
289
290