[c66173]: / docs / Reproducibility.md

Download this file

109 lines (89 with data), 10.0 kB

Demo and Examples

This section contains examples and also reproducibility process of our analysis. It is assumed that DeepMod has been successfully installed. If not, please install it and its required packages first according to the instruction. It is also assumed that the virtual environment mdeepmod is used. Please remove source activate mdeepmod if virtual environment is not used in your system.

To prepare to run DeepMod, It is assumed that there is a directory which has a DeepMod directory for our DeepMod tool, a data directory for storing Nanopore sequencing data, and a ref directory for reference data.

Reference genomes

Please used bwa to index the genome before runnning DeepMod

E. coli reference genome

The E. coli reference fasta could be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/556503834. We assumed the reference file name under the ref directory is 'Ecoli_k12_mg1655.fasta' and indexed by bwa

Human reference genome

Hg38 was used for human Nanopore sequencing data. We assumed the fasta human file name under the ref directory is 'hg38' and indexed by bwa

Nanopore data

DeepMod needs a group of FAST5 files generated by Nanopore sequencer after basecalling. Thus, before you run DeepMod, you might need to run Albacore first on your Nanopore data. When you run Albacore on your Nanopore data, please make sure that fastq,fast5 is used so that event information can be stored in FAST5 files.

Run on E. coli data

The Nanopore data set are large, and thus users need to contact the original authors1 to get the downloading URL.

Example 1: 5mC detection

Step 1. datasets

Please download Nanopore sequencing data for con1, con2 and CG motif with SSSl, and untar them into seperate sub-folder under the data directory, and assumed their sub-folder names are Control_lib1Control_lib3 and meth10_lib3 respectively.

Step 2.

mkdir ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth10_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID Cgsss --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID con1 --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID con2 --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/

The three commands might take ~41 minutes, ~84 minutes and ~120 minutes to be done. After that, you will found the results under the directory of ecoli_pred with --FileID as the sub-folder name. The results are in the bed format with the file names as mod_pos.NC_000913.3-.C.bed and mod_pos.NC_000913.3+.C.bed. The detail of the methylation prediction for each long read is also provided.

One can run DeepMod on other datasets with the commands below.

time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth8_lib2/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID gCgc --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth9_lib2/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID Cgmpe --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/

Step 3.

The following commands then can be used to calculate average precision and AUC values of DeepMod.

python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/Cgmpe ref/Ecoli_k12_mg1655.fasta Cg 0  '' -1 -1 ecoli_pred/Cgmpe/ ecoli_pred/con1,ecoli_pred/con2
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/Cgsss ref/Ecoli_k12_mg1655.fasta Cg 0  '' -1 -1 ecoli_pred/Cgsss/ ecoli_pred/con1,ecoli_pred/con2
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/gCgc ref/Ecoli_k12_mg1655.fasta gCgc 1  '' -1 -1 ecoli_pred/gCgc/ ecoli_pred/con1,ecoli_pred/con2

The commands above will generate AP plots and AUC plots under the directory of ecoli_pred/Cgmpe/, ecoli_pred/Cgsss/, ecoli_pred/gCgc/ which are performance as shown in Figure 2 (a), (c) and (d).

Example 2: 6mA detection

Step 1. datasets

You need contact the original authors1 to get the downloading URL.

Please download Nanopore sequencing data for con1, con2 and gAtc motif, tcgA and gaAttc, and untar them into seperate sub-folder under the data directory, and assumed their sub-folder names are Control_lib1Control_lib3, meth11_lib3, meth1_lib1 and meth4_lib1 respectively. If you have run Example 1, you might already have the dataset for con1 and con2.

Step 2.

mkdir ecoli_pred/ (if not exist)
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth11_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID gAtc --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth1_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID tcgA --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth4_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID gaAttc --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID con1a --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID con2a --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/

Step 3.

The following commands then can be used to calculate average precision and AUC values of DeepMod.

python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/gAtc ref/Ecoli_k12_mg1655.fasta gAtc 1  '' 1000000 2000000 ecoli_pred/gAtc/ ecoli_pred/con1a,ecoli_pred/con2a
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/tcgA ref/Ecoli_k12_mg1655.fasta tcgA 3  '' 1000000 2000000 ecoli_pred/tcgA/ ecoli_pred/con1a,ecoli_pred/con2a
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/gaAttc ref/Ecoli_k12_mg1655.fasta gaAttc 2  '' 1000000 2000000 ecoli_pred/gaAttc/ ecoli_pred/con1a,ecoli_pred/con2a

The commands above will generate AP plots and AUC plots under the directory of ecoli_pred/gAtc/, ecoli_pred/tcgA/, ecoli_pred/gaAttc/ which are performance as shown in Figure 3 (a) and the supplementary Figure 6.

Example 3: Detect 5mC on Na12878

Step 1. datasets

You might need to Na12878 Nanopore sequencing data to download fast5 files. Please note that the whole dataset is ~30TB.

Step 2.

Since it is very large for NA12878 Nanopore sequencing data, users can run each of tar files (each chromomsome has 1 to 9 tar files) separately to speed up the detection process. An example of running DeepMod on a template tar file is given below:

mkdir na12878_pred
time python DeepMod/bin/DeepMod.py detect --wrkBase data/chr1/tar1 --Ref ref/hg38.fasta --FileID chr1_tar1 --modfile DeepMod/train_mod/rnn_conmodC_P100wd21_f7ne1u0_4/mod_train_conmodC_P100wd21_f3ne1u0 --threads 15 --outFolder na12878_pred/

Step 3.

Then, the following command can be used to merge all results in Step 2.

python DeepMod/tools/sum_chr_mod.py na12878_pred/ C na12878_C

Then, the results will be under the directory of na12878_pred/ and the result file names start with na12878_C and end with '.bed' in a bed format. The results are grouped by chromosomes.

Step 4 (optional)

This step is to consder the cluster effect of 5mC in human genome. To do that, a CpG index in a human genome will be generated.

python DeepMod/tools/generate_motif_pos.py ref/hg38.fa genome_motif/C C CG 0

The results are under the directory of genome_motif/C.

After that, a second deep learning process will be used to consider cluster effect.

python DeepMod/tools/hm_cluster_predict.py na12878_pred/na12878_C genome_motif/C

The script will take all a12878_pred/na12878_C.chr[12....].C.bed as input, and output the bed files with the file name format of a12878_pred/na12878_C_clusterCpG.chr[12....].C.bed

Step 5

To evaluate DeepMod's performance on NA12878, users might use bisulfite sequencing results from https://www.encodeproject.org/experiments/ENCSR890UQO/. Due to the heterogeneity of sequenced samples, completely methylated and completely un-methylated bases could be used for the evaluation: a genomic position of a base was considered to be completely methylated if its methylation percentage >=90% in both replicates of bisulfite sequencing with coverage>=c (c could be 1, 5 or 10), and to be completely un-methylated if its methylation percentage is 0% in both replicates.

The modification detection on HX1 can be run in the similar way to that on NA12878 but with different --modfile: DeepMod/train_mod/rnn_f7_wd21_chr1to10_4/mod_train_f7_wd21_chr1to10.

Reference

  1. Stoiber MH, et al. De novo Identification of DNA Modifications Enabled by Genome-Guided Nanopore Signal Processing. bioRxiv 10.1101/094672, (2017).