NeuSomatic can be used universally as a stand-alone somatic mutation detection method or with an ensemble of existing methods. In the ensemble mode NeuSomatic currently supports outputs from MuTect2, MuSE, Strelka2, SomaticSniper, VarDict, and VarScan2. For ensemble mode, the ensembled outputs of different somatic callers (as a single .tsv
file) should be prepared and inputed using --ensemble_tsv
argument in preprocess.py
.
The following steps use docker pipelines to prepare the ensemble .tsv
file.
This is an adaptation of SomaticSeq's scripts.
qsub
command, such as Sun Grid Engine.To run the individual somatic callers (MuTect2, MuSE, Strelka2, SomaticSniper, VarDict, and VarScan2), you can use the following command that create 10 (if we set splits
to 10) equal-size regions in 10 bed files, and parallelize the jobs into 10 regions.
prepare_callers_scripts.sh \
--normal-bam /ABSOLUTE/PATH/TO/normal_sample.bam \
--tumor-bam /ABSOLUTE/PATH/TO/tumor_sample.bam \
--human-reference /ABSOLUTE/PATH/TO/ref.fa \
--output-dir /ABSOLUTE/PATH/TO/output \
--dbsnp /ABSOLUTE/PATH/TO/dbSNP.vcf \
--splits 10 \
--selector /ABSOLUTE/PATH/TO/region.bed \
--mutect2 --somaticsniper --vardict --varscan2 --muse --strelka --wrapper
NOTE: If you want to use Singulariy instead of Docker, please use --singularity
argument.
This command will create sub-folders with the following sctructure for the split regions:
output
├── genome.bed
├── 1
├── 2
├── .
├── .
├── .
├── i # i'th folder
│ ├── i.bed # i'th sub-region
│ ├── logs
│ │ ├── strelka.cmd # Strelka2 run script for region i.bed
│ │ ├── muse.cmd # MuSE run script for region i.bed
│ │ ├── mutect2.cmd # MuTect2 run script for region i.bed
│ │ ├── vardict.cmd # VarDict run script for region i.bed
│ │ └── varscan2.cm # VarScan2 run script for region i.bed
│ └── Wrapper
│ └── logs
│ └── wrapper.cmd # Wrapper script to combine results
├── .
├── .
├── .
├── 10
└── logs
└── somaticsniper.cmd # SomaticSniper run script for whole region genome.bed
You should first run all individual callers .cmd
run scripts for all regions. For instance with qsub
command:
for tool in strelka muse mutect2 vardict varscan2
do
for i in {1..10}
do
for script in output/${i}/logs/${tool}.cmd
do
qsub $script
done
done
done
qsub output/logs/somaticsniper.cmd
Once all these scripts finished successfully, the respective VCF files for each tool will be available under each region sub-folder.
Then, you can run wrapper.cmd scripts to combine calls made by each caller:
for i in {1..10}
do
qsub output/${i}/Wrapper/logs/wrapper.cmd
done
This will generate "Ensemble.sSNV.tsv" and "Ensemble.sINDEL.tsv" under each region subfolder (e.g. under 'output/{1,2,3,...}/Wrapper/').
Now you can combine these files to generate ensemble_ann.tsv
file.
cat <(cat output/*/Wrapper/Ensemble.s*.tsv |grep CHROM|head -1) \
<(cat output/*/Wrapper/Ensemble.s*.tsv |grep -v CHROM) | sed "s/nan/0/g" > ensemble_ann.tsv
and provide ensemble_ann.tsv
as --ensemble_tsv
argument in preprocess.py
.
prepare_callers_scripts.sh can prepare dockerized somatic mutation calling jobs. The following options are available:
* --normal-bam
/ABSOLUTE/PATH/TO/normal_sample.bam (Required)
* --tumor-bam
/ABSOLUTE/PATH/TO/tumor_sample.bam (Required)
* --output-dir
/ABSOLUTE/PATH/TO/OUTPUT_DIRECTORY (Required)
* --human-reference
/ABSOLUTE/PATH/TO/human_reference.fa (Required)
* --dbsnp
/ABSOLUTE/PATH/TO/dbsnp.vcf (Required for MuSE and LoFreq)
* --cosmic
/ABSOLUTE/PATH/TO/cosmic.vcf (Optional)
* --selector
/ABSOLUTE/PATH/TO/Capture_region.bed (Optional. Will create genome.bed from the .fa.fai file when not specified.)
* --exclude
/ABSOLUTE/PATH/TO/Blacklist_Region.bed (Optional)
* --min-af
(Optional. The minimum VAF cutoff for VarDict. Defulat is 0.05.)
* --splits
N (Optional: evenly split the genome into N BED files. Default = 12).
* --mutect2
(Optional flag to invoke MuTect2)
* --varscan2
(Optional flag to invoke VarScan2)
* --somaticsniper
(Optional flag to invoke SomaticSniper)
* --vardict
(Optional flag to invoke VarDict)
* --muse
(Optional flag to invoke MuSE)
* --strelka
(Optional flag to invoke Strelka)
* --wrapper
(Optional flag to invoke wrapper to combine outputs of all callers).
* --exome
(Optional flag for Strelka and MuSE when data is WES)
* --mutect2-arguments
(Extra parameters to pass onto Mutect2, e.g., --mutect2-arguments '--initial_tumor_lod 3.0 --log_somatic_prior -5.0 --min_base_quality_score 20')
* --mutect2-filter-arguments
(Extra parameters to pass onto FilterMutectCalls)
* --varscan-arguments
(Extra parameters to pass onto VarScan2)
* --varscan-pileup-arguments
(Extra parameters to pass onto samtools mpileup that creates pileup files for VarScan)
* --somaticsniper-arguments
(Extra parameters to pass onto SomaticSniper)
* --vardict-arguments
(Extra parameters to pass onto VarDict)
* --muse-arguments
(Extra parameters to pass onto MuSE)
* --strelka-config-arguments
(Extra parameters to pass onto Strelka's config command)
* --strelka-run-arguments
(Extra parameters to pass onto Strekla's run command)
* --wrapper-arguments
(Extra parameters to pass onto SomaticSeq.Wrapper.sh)
* --singularity
(Optional flag to use Singulariy instead of Docker)
--singularity
argument in prepare_callers_scripts
.Done at 2017/10/30 29:03:02
), and it will only do so when the task is completed with an exit code of 0. It can be a quick way to check if a task is done, by looking at the final line of the standard error log file.output/{1,2,3,...}
./ABSOLUTE/PATH/TO/dbSNP.vcf
, there also needs to be dbSNP.vcf.idx
, dbSNP.vcf.gz
, and dbSNP.vcf.gz.tbi
present at the same directory because MuSE and LoFreq are expecting them.