This section contains examples and also reproducibility process of our analysis. It is assumed that DeepMod has been successfully installed. If not, please install it and its required packages first according to the instruction. It is also assumed that the virtual environment mdeepmod
is used. Please remove source activate mdeepmod
if virtual environment is not used in your system.
To prepare to run DeepMod, It is assumed that there is a directory which has a DeepMod directory for our DeepMod tool, a data directory for storing Nanopore sequencing data, and a ref directory for reference data.
Please used bwa
to index the genome before runnning DeepMod
The E. coli reference fasta could be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/556503834. We assumed the reference file name under the ref directory is 'Ecoli_k12_mg1655.fasta' and indexed by bwa
Hg38 was used for human Nanopore sequencing data. We assumed the fasta human file name under the ref directory is 'hg38' and indexed by bwa
DeepMod needs a group of FAST5 files generated by Nanopore sequencer after basecalling. Thus, before you run DeepMod, you might need to run Albacore
first on your Nanopore data. When you run Albacore
on your Nanopore data, please make sure that fastq,fast5
is used so that event
information can be stored in FAST5 files.
The Nanopore data set are large, and thus users need to contact the original authors1 to get the downloading URL.
Please download Nanopore sequencing data for con1, con2 and CG motif with SSSl, and untar them into seperate sub-folder under the data directory, and assumed their sub-folder names are Control_lib1, Control_lib3 and meth10_lib3 respectively.
mkdir ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth10_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID Cgsss --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID con1 --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID con2 --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
The three commands might take ~41 minutes, ~84 minutes and ~120 minutes to be done. After that, you will found the results under the directory of ecoli_pred with --FileID as the sub-folder name. The results are in the bed format with the file names as mod_pos.NC_000913.3-.C.bed and mod_pos.NC_000913.3+.C.bed. The detail of the methylation prediction for each long read is also provided.
One can run DeepMod on other datasets with the commands below.
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth8_lib2/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID gCgc --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth9_lib2/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID Cgmpe --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
The following commands then can be used to calculate average precision and AUC values of DeepMod.
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/Cgmpe ref/Ecoli_k12_mg1655.fasta Cg 0 '' -1 -1 ecoli_pred/Cgmpe/ ecoli_pred/con1,ecoli_pred/con2
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/Cgsss ref/Ecoli_k12_mg1655.fasta Cg 0 '' -1 -1 ecoli_pred/Cgsss/ ecoli_pred/con1,ecoli_pred/con2
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/gCgc ref/Ecoli_k12_mg1655.fasta gCgc 1 '' -1 -1 ecoli_pred/gCgc/ ecoli_pred/con1,ecoli_pred/con2
The commands above will generate AP plots and AUC plots under the directory of ecoli_pred/Cgmpe/, ecoli_pred/Cgsss/, ecoli_pred/gCgc/ which are performance as shown in Figure 2 (a), (c) and (d).
You need contact the original authors1 to get the downloading URL.
Please download Nanopore sequencing data for con1, con2 and gAtc motif, tcgA and gaAttc, and untar them into seperate sub-folder under the data directory, and assumed their sub-folder names are Control_lib1, Control_lib3, meth11_lib3, meth1_lib1 and meth4_lib1 respectively. If you have run Example 1, you might already have the dataset for con1 and con2.
mkdir ecoli_pred/ (if not exist)
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth11_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID gAtc --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth1_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID tcgA --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth4_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID gaAttc --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID con1a --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID con2a --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
The following commands then can be used to calculate average precision and AUC values of DeepMod.
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/gAtc ref/Ecoli_k12_mg1655.fasta gAtc 1 '' 1000000 2000000 ecoli_pred/gAtc/ ecoli_pred/con1a,ecoli_pred/con2a
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/tcgA ref/Ecoli_k12_mg1655.fasta tcgA 3 '' 1000000 2000000 ecoli_pred/tcgA/ ecoli_pred/con1a,ecoli_pred/con2a
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/gaAttc ref/Ecoli_k12_mg1655.fasta gaAttc 2 '' 1000000 2000000 ecoli_pred/gaAttc/ ecoli_pred/con1a,ecoli_pred/con2a
The commands above will generate AP plots and AUC plots under the directory of ecoli_pred/gAtc/, ecoli_pred/tcgA/, ecoli_pred/gaAttc/ which are performance as shown in Figure 3 (a) and the supplementary Figure 6.
You might need to Na12878 Nanopore sequencing data to download fast5 files. Please note that the whole dataset is ~30TB.
Since it is very large for NA12878 Nanopore sequencing data, users can run each of tar files (each chromomsome has 1 to 9 tar files) separately to speed up the detection process. An example of running DeepMod on a template tar file is given below:
mkdir na12878_pred
time python DeepMod/bin/DeepMod.py detect --wrkBase data/chr1/tar1 --Ref ref/hg38.fasta --FileID chr1_tar1 --modfile DeepMod/train_mod/rnn_conmodC_P100wd21_f7ne1u0_4/mod_train_conmodC_P100wd21_f3ne1u0 --threads 15 --outFolder na12878_pred/
Then, the following command can be used to merge all results in Step 2.
python DeepMod/tools/sum_chr_mod.py na12878_pred/ C na12878_C
Then, the results will be under the directory of na12878_pred/ and the result file names start with na12878_C and end with '.bed' in a bed format. The results are grouped by chromosomes.
This step is to consder the cluster effect of 5mC in human genome. To do that, a CpG index in a human genome will be generated.
python DeepMod/tools/generate_motif_pos.py ref/hg38.fa genome_motif/C C CG 0
The results are under the directory of genome_motif/C.
After that, a second deep learning process will be used to consider cluster effect.
python DeepMod/tools/hm_cluster_predict.py na12878_pred/na12878_C genome_motif/C
The script will take all a12878_pred/na12878_C.chr[12....].C.bed as input, and output the bed files with the file name format of a12878_pred/na12878_C_clusterCpG.chr[12....].C.bed
To evaluate DeepMod's performance on NA12878, users might use bisulfite sequencing results from https://www.encodeproject.org/experiments/ENCSR890UQO/. Due to the heterogeneity of sequenced samples, completely methylated and completely un-methylated bases could be used for the evaluation: a genomic position of a base was considered to be completely methylated if its methylation percentage >=90% in both replicates of bisulfite sequencing with coverage>=c (c could be 1, 5 or 10), and to be completely un-methylated if its methylation percentage is 0% in both replicates.
The modification detection on HX1 can be run in the similar way to that on NA12878 but with different --modfile
: DeepMod/train_mod/rnn_f7_wd21_chr1to10_4/mod_train_f7_wd21_chr1to10.