# DeepDrug3D

DeepDrug3D is a tool that predicts whether a protein pocket is ATP-, Heme-, or other-binding, given the binding residue numbers and the protein structure.

If you find this tool useful, please star this repo and cite our paper :)

Pu L, Govindaraj RG, Lemoine JM, Wu HC, Brylinski M (2019) DeepDrug3D: Classification of ligand-binding pockets in proteins with a convolutional neural network. PLOS Computational Biology 15(2): e1006718. https://doi.org/10.1371/journal.pcbi.1006718

This README file is written by Limeng Pu.

<p align="center">
    <img width="400" height="400" src="./image/1a2sA.png">
</p>

An example of a generated binding grid, PDB ID: 1a2sA, atom type: C.ar. Red indicates low potentials while blue indicates high potentials.

# Change Log

**This is a newer version of the implementation. Since many people are interested in visualizing the output of the grid generation, as in the image above, I have decided to separate the data-generation module from the training/prediction module. Another reason for this iteration is that the dligand-linux program used for the potential calculation requires 32-bit Linux, while <em>Pytorch</em> requires 64-bit Linux. This conflict leads to different errors depending on the order in which the two are installed. The deep learning library has also been changed from <em>Keras</em> to <em>Pytorch</em>.**

# Prerequisites

1. System requirement: Linux (the DFIRE potential calculation only runs on Linux; tested on <em>Red Hat Enterprise Linux 6</em>).
2. The data-generation module dependencies are provided in `./DataGeneration/environment.yml`. Please change line 9 of the file according to your system. To install all the dependencies, run `conda env create -f environment.yml`.
3. The learning module requires <em>Pytorch</em>. To install it, refer to https://pytorch.org/get-started/locally. A quick sanity check is sketched after this list.

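After installing <em>Pytorch</em>, a short check like the one below (not part of the package, just a convenience) confirms that it is importable and whether a CUDA GPU is visible:

<pre><code># Optional check: Pytorch import and GPU visibility.
import torch

print("Pytorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
</code></pre>
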
# Usage

The package provides data-generation, prediction, and training modules.

1. Data generation

This module generates data for training/prediction and provides intermediate results for visualization. All files are under `./DataGeneration`. The DFIRE potential calculation uses the program `./DataGeneration/dligand-linux`, described in "A Knowledge-Based Energy Function for Protein−Ligand, Protein−Protein, and Protein−DNA Complexes" by Zhang et al., because it is written in Fortran and therefore faster than our own Python implementation.

To generate the binding grid data, run

<pre><code>python voxelization.py --f example.pdb --a example_aux.txt --o results --r 15 --n 31 --p</code></pre>

  - `--f` input pdb file path.
  - `--a` input auxiliary file path, containing the binding residue numbers and, optionally, the center of the ligand. An example of the auxiliary file is provided in `example_aux.txt`.
  - `--r` the radius of the spherical grid.
  - `--n` the number of points along each dimension of the spherical grid.
  - `--o` output folder path.
  - `--p` or `--s` whether to calculate the potential or not. If not, only the binary occupancy grid will be returned, i.e., the shape of the grid only. Default: yes (`--p`).

Several files will be saved, including `example_transformed.pdb` (coordinate-transformed pdb file), `example_transformed.mol2` (coordinate-transformed mol2 file for the calculation of the DFIRE potential), `example.grid` (grid representation of the binding pocket for visualization), and `example.h5` (numpy array of the voxel representation).

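The internal layout of `example.h5` is not spelled out here, so the sketch below (an illustration, not part of the package) simply lists the datasets the file contains and prints their shapes; it assumes the top-level entries are datasets:

<pre><code># Sketch: inspect the generated voxel file with h5py.
# No dataset key is assumed; we just list whatever is stored at the top level.
import h5py
import numpy as np

with h5py.File("example.h5", "r") as f:
    for key in f.keys():
        voxels = np.array(f[key])
        print(key, voxels.shape, voxels.dtype)
</code></pre>
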
To visualize the output binding pocket grid, run

<pre><code>python visualization.py --i example.grid --c 0</code></pre>

  - `--i` input binding pocket grid file path.
  - `--c` channel to visualize. Note that if you passed `--s` in the previous step, the channel number `--c` has to be 0.

An output `example_grid.pdb` will be generated for visualization. Note that this pocket grid matches the transformed protein `example_transformed.pdb`.

2. Prediction

This module classifies the target binding pocket as an ATP-, Heme-, or other-type pocket, i.e., it predicts which type of ligand the pocket tends to bind. The trained model is available at `https://osf.io/enz69/`. All files are under `./Learning`.

To use the prediction module, run

<pre><code>python predict.py --f example.h5 --m path_to_the_trained_model</code></pre>

  - `--f` input h5 file path.
  - `--m` path to the trained model weights.

The output would be something like

71
<pre><code>The probability of pocket provided binds with ATP ligands: 0.3000
71
<pre><code>The probability of pocket provided binds with ATP ligands: 0.3000
72
The probability of pocket provided binds with Heme ligands: 0.2000
72
The probability of pocket provided binds with Heme ligands: 0.2000
73
The probability of pocket provided binds with other ligands: 0.5000
73
The probability of pocket provided binds with other ligands: 0.5000
74
</code></pre>
74
</code></pre>
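To score many pockets at once, `predict.py` can simply be called in a loop. The snippet below is only a sketch: the folder `voxels/` and the weights filename `model.pth` are placeholders for your own data folder and the downloaded model file.

<pre><code># Sketch: run predict.py once per voxel file in a folder.
# "voxels" and "model.pth" are placeholder paths, not files shipped with this repo.
import subprocess
from pathlib import Path

model_path = "model.pth"
for h5_file in sorted(Path("voxels").glob("*.h5")):
    print(f"=== {h5_file.name} ===")
    subprocess.run(
        ["python", "predict.py", "--f", str(h5_file), "--m", model_path],
        check=True,
    )
</code></pre>
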
3. Training

In order to train our model on your own dataset, you first have to convert your data, which will be pdb files, to the voxel representation of the protein-ligand binding grid. The data conversion procedure is described above. The module runs a random 5-fold cross validation. All related results, including loss, accuracy, and model weights, will be saved. All files are under `./Learning`.

The training module can be run as

<pre><code>python train.py --path path_to_your_data_folder --lpath path_to_your_label_file --bs batch_size --lr initial_learning_rate --epoch number_of_epochs --opath output_folder_path</code></pre>

  - `--path` path to the folder that contains all the voxel data.
  - `--lpath` label file path. The file should be a comma-separated file with no header. The first column is the filename and the second column is the class (starting from 0). An example is provided in `./Learning/labels`; a sketch of loading such a file is shown after this list.
  - `--bs`, `--lr`, `--epoch` are the hyperparameters of the model. Recommended values are 64, 1e-5, and 50, respectively.
  - `--opath` output folder path. If no output location is provided, a `logs` folder will be created under the current working directory to store everything.

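As a quick check that a label file matches the expected layout, it can be loaded with <em>pandas</em>. This is only a sketch; it assumes the example file `./Learning/labels` and the column meaning described above (filename, then integer class starting from 0).

<pre><code># Sketch: load the label file described above and summarize the classes.
# Assumed format: no header, column 0 = voxel filename, column 1 = class id.
import pandas as pd

labels = pd.read_csv("./Learning/labels", header=None, names=["filename", "class_id"])
print(labels.head())
print(labels["class_id"].value_counts())
</code></pre>
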
# Dataset

We provide the dataset we used for training at https://osf.io/enz69/, which contains the voxel representations of the ATP, Heme, and other classes, along with the class label file.