# DeepDrug3D
|
|
DeepDrug3D is a tool that predicts whether a protein pocket binds ATP, Heme, or other ligands, given the binding residue numbers and the protein structure.
|
|
If you find this tool useful, please star this repo and cite our paper :)
|
|
Pu L, Govindaraj RG, Lemoine JM, Wu HC, Brylinski M (2019) DeepDrug3D: Classification of ligand-binding pockets in proteins with a convolutional neural network. PLOS Computational Biology 15(2): e1006718. https://doi.org/10.1371/journal.pcbi.1006718
|
|
This README file was written by Limeng Pu.
|
|
<p align="center">
<img width="400" height="400" src="./image/1a2sA.png">
</p>
|
|
An example of a generated binding grid (PDB ID: 1a2sA, atom type: C.ar). Red indicates low potentials and blue indicates high potentials.
|
|
# Change Log
|
|
**This is a newer version of the implementation. Since many people are interested in visualizing the output of the grid generation, as in the image above, I've decided to separate the data-generation module from the training/prediction module. Another reason for this iteration is that the dligand-linux program used for the potential calculation requires 32-bit Linux, while <em>PyTorch</em> requires 64-bit Linux; this conflict produces different errors depending on the order in which you install them. Also, the deep learning library has been changed from <em>Keras</em> to <em>PyTorch</em>.**
|
|
# Prerequisites
|
|
1. System requirement: Linux (the DFIRE potential calculation only runs on Linux; tested on <em>Red Hat Enterprise Linux 6</em>).
|
|
2. The data-generation module dependencies are provided in `./DataGeneration/environment.yml`. Please change line 9 in the file according to your system, then install all the dependencies with `conda env create -f environment.yml`.
|
|
3. The learning module requires <em>PyTorch</em>. To install it, refer to https://pytorch.org/get-started/locally.
|
|
# Usage
|
|
The package provides data-generation, prediction, and training modules.
|
|
1. Data generation
|
|
This module generates data for training/prediction and provides intermediate results for visualization. All files are under `./DataGeneration`. The DFIRE potential calculation uses the program `./DataGeneration/dligand-linux`, described in "A Knowledge-Based Energy Function for Protein-Ligand, Protein-Protein, and Protein-DNA Complexes" by Zhang et al., since it is written in Fortran and is faster than our own Python implementation.
|
|
To generate the binding grid data, run
|
|
<pre><code>python voxelization.py --f example.pdb --a example_aux.txt --o results --r 15 --n 31 --p</code></pre>
|
|
- `--f` input pdb file path.
- `--a` input auxiliary file path, containing the binding residue numbers and, optionally, the center of the ligand. An example auxiliary file is provided in `example_aux.txt`.
- `--r` the radius of the spherical grid.
- `--n` the number of points along each dimension of the spherical grid.
- `--o` output folder path.
- `--p` or `--s` whether to calculate the potential or not. If not (`--s`), only the binary occupancy grid is returned, i.e., the shape of the grid only. Default: yes (`--p`).
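As a rough illustration of the grid parameters, the sketch below builds a spherical grid with `--r 15` and `--n 31` in plain numpy. It mirrors the description of the options above, not the exact implementation in `voxelization.py`.

```python
import numpy as np

# Illustrative only: a spherical grid of radius 15 (--r 15) sampled
# at 31 points per axis (--n 31); points outside the sphere are masked.
r, n = 15.0, 31
axis = np.linspace(-r, r, n)                 # 31 points from -15 to 15
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
inside = x**2 + y**2 + z**2 <= r**2          # boolean spherical mask

print(axis[1] - axis[0])   # grid spacing: 1.0
print(inside.shape)        # (31, 31, 31)
```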
|
|
Several files will be saved, including `example_transformed.pdb` (coordinate-transformed pdb file), `example_transformed.mol2` (coordinate-transformed mol2 file used for the DFIRE potential calculation), `example.grid` (grid representation of the binding pocket for visualization), and `example.h5` (numpy array of the voxel representation).
|
|
To visualize the output binding pocket grid, run
|
|
<pre><code>python visualization.py --i example.grid --c 0</code></pre>
|
|
- `--i` input binding pocket grid file path.
- `--c` channel to visualize. Note that if you passed `--s` in the previous step, the channel number `--c` has to be 0.
|
|
An output file `example_grid.pdb` will be generated for visualization. Note that this pocket grid matches the transformed protein `example_transformed.pdb`.
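If you want to post-process the grid points yourself, the coordinates in `example_grid.pdb` can be read from the standard fixed PDB columns. The record below is made up for illustration; the actual atom and residue names written by the visualization script may differ.

```python
# A made-up HETATM record in standard PDB fixed-column format; real
# lines in example_grid.pdb may use different atom/residue names.
line = "HETATM    1  C   GRD A   1      12.345   0.500  -7.250"

# PDB columns 31-38, 39-46, 47-54 hold x, y, z (0-based slices below).
x = float(line[30:38])
y = float(line[38:46])
z = float(line[46:54])
print(x, y, z)  # 12.345 0.5 -7.25
```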
|
|
2. Prediction
|
|
This module classifies the target binding pocket as an ATP-, Heme-, or other-type pocket, i.e., which type of ligand it tends to bind. The trained model is available at https://osf.io/enz69/. All files are under `./Learning`.
|
|
To use the prediction module, run
|
|
<pre><code>python predict.py --f example.h5 --m path_to_the_trained_model</code></pre>
|
|
- `--f` input h5 file path.
- `--m` path to the trained model weights.
|
|
The output will look something like
|
|
<pre><code>The probability of pocket provided binds with ATP ligands: 0.3000
The probability of pocket provided binds with Heme ligands: 0.2000
The probability of pocket provided binds with other ligands: 0.5000
</code></pre>
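The three probabilities form a distribution over the classes, so picking the predicted class is just an argmax. A minimal sketch, using the numbers from the example output above:

```python
# Probabilities copied from the example output above.
probs = {"ATP": 0.3000, "Heme": 0.2000, "other": 0.5000}

predicted = max(probs, key=probs.get)          # class with highest probability
assert abs(sum(probs.values()) - 1.0) < 1e-6   # sanity check: sums to 1
print(predicted)  # other
```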
|
|
3. Training
|
|
To train the model on your own dataset, you first have to convert your data, i.e., PDB files, to the voxel representation of the protein-ligand binding grid. The data conversion procedure is described above. The module runs a random 5-fold cross-validation. All related results, including loss, accuracy, and model weights, will be saved. All files are under `./Learning`.
|
|
The training module can be run as
|
|
<pre><code>python train.py --path path_to_your_data_folder --lpath path_to_your_label_file --bs batch_size --lr initial_learning_rate --epoch number_of_epochs --opath output_folder_path</code></pre>
|
|
- `--path` path to the folder containing all the voxel data.
- `--lpath` label file path. The file should be a comma-separated file with no header. The first column is the filename and the second column is the class (starting from 0). An example is provided in `./Learning/labels`.
- `--bs`, `--lr`, `--epoch` are the hyperparameters of the model. Recommended values are 64, 1e-5, and 50, respectively.
- `--opath` output folder path. If no output location is provided, a `logs` folder will be created under the current working directory to store everything.
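For reference, a label file in the format described above can be produced with the standard `csv` module. The filenames and class indices below are made up for illustration.

```python
import csv

# Made-up (filename, class) pairs; class indices start from 0 as required.
samples = [("1a2sA.h5", "0"), ("2xyzB.h5", "1"), ("3abcC.h5", "2")]

# Write a comma-separated label file with no header.
with open("labels", "w", newline="") as f:
    csv.writer(f).writerows(samples)

# Read it back to confirm the format round-trips.
with open("labels", newline="") as f:
    assert [tuple(row) for row in csv.reader(f)] == samples
```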
|
|
# Dataset
|
|
We provide the dataset used for training at https://osf.io/enz69/. It contains the voxel representations of the ATP, Heme, and other classes, along with the class label file.