|
a |
|
b/docs/Reproducibility.md |
|
|
1 |
# Demo and Examples |
|
|
2 |
This section contains examples and also reproducibility process of our analysis. It is assumed that DeepMod has been successfully installed. If not, please install it and its required packages first according to the [instruction](https://github.com/WGLab/DeepMod/blob/master/docs/Install.md). It is also assumed that the virtual environment `mdeepmod` is used. Please remove `source activate mdeepmod` if virtual environment is not used in your system. |
|
|
3 |
|
|
|
4 |
To prepare to run DeepMod, It is assumed that there is a directory which has a *DeepMod* directory for our DeepMod tool, a *data* directory for storing Nanopore sequencing data, and a *ref* directory for reference data. |
|
|
5 |
|
|
|
6 |
## Reference genomes |
|
|
7 |
Please used `bwa` to index the genome before runnning `DeepMod` |
|
|
8 |
### E. coli reference genome |
|
|
9 |
The E. coli reference fasta could be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/556503834. We assumed the reference file name under the *ref* directory is 'Ecoli_k12_mg1655.fasta' and indexed by `bwa` |
|
|
10 |
|
|
|
11 |
### Human reference genome |
|
|
12 |
Hg38 was used for human Nanopore sequencing data. We assumed the fasta human file name under the *ref* directory is 'hg38' and indexed by `bwa` |
|
|
13 |
|
|
|
14 |
## Nanopore data |
|
|
15 |
DeepMod needs a group of FAST5 files generated by Nanopore sequencer after basecalling. Thus, before you run DeepMod, you might need to run `Albacore` first on your Nanopore data. When you run `Albacore` on your Nanopore data, please make sure that `fastq,fast5` is used so that `event` information can be stored in FAST5 files. |
|
|
16 |
|
|
|
17 |
## Run on E. coli data |
|
|
18 |
The Nanopore data set are large, and thus users need to contact the original authors<sup>1</sup> to get the downloading URL. |
|
|
19 |
### Example 1: 5mC detection |
|
|
20 |
#### Step 1. datasets |
|
|
21 |
Please download Nanopore sequencing data for *con1*, *con2* and *CG* motif with SSSl, and untar them into seperate sub-folder under the *data* directory, and assumed their sub-folder names are *Control_lib1*, *Control_lib3* and *meth10_lib3* respectively. |
|
|
22 |
|
|
|
23 |
#### Step 2. |
|
|
24 |
``` |
|
|
25 |
mkdir ecoli_pred/ |
|
|
26 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth10_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID Cgsss --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
27 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID con1 --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
28 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID con2 --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
29 |
``` |
|
|
30 |
The three commands might take ~41 minutes, ~84 minutes and ~120 minutes to be done. After that, you will found the results under the directory of *ecoli_pred* with *--FileID* as the sub-folder name. The results are in the *bed* format with the file names as *mod_pos.NC_000913.3-.C.bed* and *mod_pos.NC_000913.3+.C.bed*. The detail of the methylation prediction for each long read is also provided. |
|
|
31 |
|
|
|
32 |
One can run DeepMod on other datasets with the commands below. |
|
|
33 |
``` |
|
|
34 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth8_lib2/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID gCgc --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
35 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth9_lib2/ --Ref ref/Ecoli_k12_mg1655.fasta --FileID Cgmpe --modfile DeepMod/train_mod/rnn_sinmodC_P100wd21_f7ne1u0_4/mod_train_sinmodC_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
36 |
``` |
|
|
37 |
|
|
|
38 |
#### Step 3. |
|
|
39 |
The following commands then can be used to calculate average precision and AUC values of DeepMod. |
|
|
40 |
``` |
|
|
41 |
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/Cgmpe ref/Ecoli_k12_mg1655.fasta Cg 0 '' -1 -1 ecoli_pred/Cgmpe/ ecoli_pred/con1,ecoli_pred/con2 |
|
|
42 |
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/Cgsss ref/Ecoli_k12_mg1655.fasta Cg 0 '' -1 -1 ecoli_pred/Cgsss/ ecoli_pred/con1,ecoli_pred/con2 |
|
|
43 |
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/gCgc ref/Ecoli_k12_mg1655.fasta gCgc 1 '' -1 -1 ecoli_pred/gCgc/ ecoli_pred/con1,ecoli_pred/con2 |
|
|
44 |
``` |
|
|
45 |
The commands above will generate AP plots and AUC plots under the directory of *ecoli_pred/Cgmpe/*, *ecoli_pred/Cgsss/*, *ecoli_pred/gCgc/* which are performance as shown in Figure 2 (a), (c) and (d). |
|
|
46 |
|
|
|
47 |
### Example 2: 6mA detection |
|
|
48 |
#### Step 1. datasets |
|
|
49 |
You need contact the original authors<sup>1</sup> to get the downloading URL. |
|
|
50 |
|
|
|
51 |
Please download Nanopore sequencing data for *con1*, *con2* and *gAtc* motif, *tcgA* and *gaAttc*, and untar them into seperate sub-folder under the *data* directory, and assumed their sub-folder names are *Control_lib1*, *Control_lib3*, *meth11_lib3*, *meth1_lib1* and *meth4_lib1* respectively. If you have run Example 1, you might already have the dataset for *con1* and *con2*. |
|
|
52 |
|
|
|
53 |
#### Step 2. |
|
|
54 |
``` |
|
|
55 |
mkdir ecoli_pred/ (if not exist) |
|
|
56 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth11_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID gAtc --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
57 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth1_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID tcgA --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
58 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/meth4_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID gaAttc --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
59 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib1/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID con1a --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
60 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/Control_lib3/ --Ref ref/Ecoli_k12_mg1655.fasta --Base A --FileID con2a --modfile DeepMod/train_mod/rnn_conmodA_P100wd21_f7ne1u0_4/mod_train_conmodA_P100wd21_f3ne1u0 --threads 15 --outFolder ecoli_pred/ |
|
|
61 |
``` |
|
|
62 |
#### Step 3. |
|
|
63 |
The following commands then can be used to calculate average precision and AUC values of DeepMod. |
|
|
64 |
``` |
|
|
65 |
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/gAtc ref/Ecoli_k12_mg1655.fasta gAtc 1 '' 1000000 2000000 ecoli_pred/gAtc/ ecoli_pred/con1a,ecoli_pred/con2a |
|
|
66 |
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/tcgA ref/Ecoli_k12_mg1655.fasta tcgA 3 '' 1000000 2000000 ecoli_pred/tcgA/ ecoli_pred/con1a,ecoli_pred/con2a |
|
|
67 |
python DeepMod/tools/cal_EcoliDetPerf.py ecoli_pred/gaAttc ref/Ecoli_k12_mg1655.fasta gaAttc 2 '' 1000000 2000000 ecoli_pred/gaAttc/ ecoli_pred/con1a,ecoli_pred/con2a |
|
|
68 |
``` |
|
|
69 |
The commands above will generate AP plots and AUC plots under the directory of *ecoli_pred/gAtc/*, *ecoli_pred/tcgA/*, *ecoli_pred/gaAttc/* which are performance as shown in Figure 3 (a) and the supplementary Figure 6. |
|
|
70 |
|
|
|
71 |
## Example 3: Detect 5mC on Na12878 |
|
|
72 |
### Step 1. datasets |
|
|
73 |
You might need to [Na12878 Nanopore sequencing data](https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-genome/rel_3_4.md) to download fast5 files. Please note that the whole dataset is ~30TB. |
|
|
74 |
|
|
|
75 |
### Step 2. |
|
|
76 |
Since it is very large for NA12878 Nanopore sequencing data, users can run each of tar files (each chromomsome has 1 to 9 tar files) separately to speed up the detection process. An example of running DeepMod on a template tar file is given below: |
|
|
77 |
``` |
|
|
78 |
mkdir na12878_pred |
|
|
79 |
time python DeepMod/bin/DeepMod.py detect --wrkBase data/chr1/tar1 --Ref ref/hg38.fasta --FileID chr1_tar1 --modfile DeepMod/train_mod/rnn_conmodC_P100wd21_f7ne1u0_4/mod_train_conmodC_P100wd21_f3ne1u0 --threads 15 --outFolder na12878_pred/ |
|
|
80 |
``` |
|
|
81 |
|
|
|
82 |
### Step 3. |
|
|
83 |
Then, the following command can be used to merge all results in Step 2. |
|
|
84 |
``` |
|
|
85 |
python DeepMod/tools/sum_chr_mod.py na12878_pred/ C na12878_C |
|
|
86 |
``` |
|
|
87 |
Then, the results will be under the directory of *na12878_pred/* and the result file names start with *na12878_C* and end with '.bed' in a bed format. The results are grouped by chromosomes. |
|
|
88 |
|
|
|
89 |
### Step 4 (optional) |
|
|
90 |
This step is to consder the cluster effect of 5mC in human genome. To do that, a CpG index in a human genome will be generated. |
|
|
91 |
``` |
|
|
92 |
python DeepMod/tools/generate_motif_pos.py ref/hg38.fa genome_motif/C C CG 0 |
|
|
93 |
``` |
|
|
94 |
The results are under the directory of *genome_motif/C*. |
|
|
95 |
|
|
|
96 |
After that, a second deep learning process will be used to consider cluster effect. |
|
|
97 |
``` |
|
|
98 |
python DeepMod/tools/hm_cluster_predict.py na12878_pred/na12878_C genome_motif/C |
|
|
99 |
``` |
|
|
100 |
The script will take all *a12878_pred/na12878_C.chr[12....].C.bed* as input, and output the bed files with the file name format of *a12878_pred/na12878_C_clusterCpG.chr[12....].C.bed* |
|
|
101 |
|
|
|
102 |
### Step 5 |
|
|
103 |
To evaluate DeepMod's performance on NA12878, users might use bisulfite sequencing results from https://www.encodeproject.org/experiments/ENCSR890UQO/. Due to the heterogeneity of sequenced samples, completely methylated and completely un-methylated bases could be used for the evaluation: a genomic position of a base was considered to be completely methylated if its methylation percentage >=90% in both replicates of bisulfite sequencing with coverage>=c (c could be 1, 5 or 10), and to be completely un-methylated if its methylation percentage is 0% in both replicates. |
|
|
104 |
|
|
|
105 |
The modification detection on HX1 can be run in the similar way to that on NA12878 but with different `--modfile`: *DeepMod/train_mod/rnn_f7_wd21_chr1to10_4/mod_train_f7_wd21_chr1to10*. |
|
|
106 |
|
|
|
107 |
## Reference |
|
|
108 |
1. Stoiber MH, et al. De novo Identification of DNA Modifications Enabled by Genome-Guided Nanopore Signal Processing. bioRxiv 10.1101/094672, (2017). |