# DeepDrug3D

DeepDrug3D is a tool that predicts whether a protein pocket is ATP-, Heme-, or other-binding, given the binding residue numbers and the protein structure.

If you find this tool useful, please star this repo and cite our paper :)

Pu L, Govindaraj RG, Lemoine JM, Wu HC, Brylinski M (2019) DeepDrug3D: Classification of ligand-binding pockets in proteins with a convolutional neural network. PLOS Computational Biology 15(2): e1006718. https://doi.org/10.1371/journal.pcbi.1006718

This README file is written by Limeng Pu.

<p align="center">
    <img width="400" height="400" src="./image/1a2sA.png">
</p>

An example of a generated binding grid, PDB ID: 1a2sA, atom type: C.ar. Red indicates low potentials while blue indicates high potentials.

# Change Log

**This is a newer version of the implementation. Since many people are interested in visualizing the output of the grid generation, as in the image above, I have decided to separate the data-generation module from the training/prediction module. Another reason for this iteration is that the dligand-linux program used for the potential calculation requires 32-bit Linux, while <em>Pytorch</em> requires 64-bit Linux. This conflict leads to different errors depending on the order in which the two are installed. The deep learning library has also been changed from <em>Keras</em> to <em>Pytorch</em>.**

# Prerequisites

1. System requirement: Linux (the DFIRE potential calculation only runs on Linux; tested on <em>Red Hat Enterprise Linux 6</em>).
2. The data-generation module dependencies are provided in `./DataGeneration/environment.yml`. Please change line 9 of the file according to your system. To install all the dependencies, run `conda env create -f environment.yml`.
3. The learning module requires <em>Pytorch</em>. To install it, refer to https://pytorch.org/get-started/locally. A quick sanity check is sketched after this list.

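After installing <em>Pytorch</em>, a short check like the one below (not part of the package, just a convenience) confirms that it is importable and whether a CUDA GPU is visible:

<pre><code># Optional check: Pytorch import and GPU visibility.
import torch

print("Pytorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
</code></pre>
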
# Usage

The package provides data-generation, prediction, and training modules.

1. Data generation

This module generates data for training/prediction and provides intermediate results for visualization. All files are under `./DataGeneration`. The DFIRE potential calculation uses the program `./DataGeneration/dligand-linux`, described in "A Knowledge-Based Energy Function for Protein−Ligand, Protein−Protein, and Protein−DNA Complexes" by Zhang et al., because it is written in Fortran and therefore faster than our own Python implementation.

To generate the binding grid data, run

<pre><code>python voxelization.py --f example.pdb --a example_aux.txt --o results --r 15 --n 31 --p</code></pre>

  - `--f` input pdb file path.
  - `--a` input auxiliary file path, containing the binding residue numbers and, optionally, the center of the ligand. An example of the auxiliary file is provided in `example_aux.txt`.
  - `--r` the radius of the spherical grid.
  - `--n` the number of points along each dimension of the spherical grid.
  - `--o` output folder path.
  - `--p` or `--s` whether to calculate the potential or not. If not, only the binary occupancy grid will be returned, i.e., the shape of the grid only. Default: yes (`--p`).

Several files will be saved, including `example_transformed.pdb` (coordinate-transformed pdb file), `example_transformed.mol2` (coordinate-transformed mol2 file for the calculation of the DFIRE potential), `example.grid` (grid representation of the binding pocket for visualization), and `example.h5` (numpy array of the voxel representation).

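The internal layout of `example.h5` is not spelled out here, so the sketch below (an illustration, not part of the package) simply lists the datasets the file contains and prints their shapes; it assumes the top-level entries are datasets:

<pre><code># Sketch: inspect the generated voxel file with h5py.
# No dataset key is assumed; we just list whatever is stored at the top level.
import h5py
import numpy as np

with h5py.File("example.h5", "r") as f:
    for key in f.keys():
        voxels = np.array(f[key])
        print(key, voxels.shape, voxels.dtype)
</code></pre>
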
To visualize the output binding pocket grid, run

<pre><code>python visualization.py --i example.grid --c 0</code></pre>

  - `--i` input binding pocket grid file path.
  - `--c` channel to visualize. Note that if you passed `--s` in the previous step, the channel number `--c` has to be 0.

An output `example_grid.pdb` will be generated for visualization. Note that this pocket grid matches the transformed protein `example_transformed.pdb`.

2. Prediction

This module classifies the target binding pocket as an ATP-, Heme-, or other-type pocket, i.e., it predicts which type of ligand the pocket tends to bind. The trained model is available at `https://osf.io/enz69/`. All files are under `./Learning`.

To use the prediction module, run

<pre><code>python predict.py --f example.h5 --m path_to_the_trained_model</code></pre>

  - `--f` input h5 file path.
  - `--m` path to the trained model weights.

The output would be something like

71
<pre><code>The probability of pocket provided binds with ATP ligands: 0.3000
71
<pre><code>The probability of pocket provided binds with ATP ligands: 0.3000
72
The probability of pocket provided binds with Heme ligands: 0.2000
72
The probability of pocket provided binds with Heme ligands: 0.2000
73
The probability of pocket provided binds with other ligands: 0.5000
73
The probability of pocket provided binds with other ligands: 0.5000
74
</code></pre>
74
</code></pre>
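To score many pockets at once, `predict.py` can simply be called in a loop. The snippet below is only a sketch: the folder `voxels/` and the weights filename `model.pth` are placeholders for your own data folder and the downloaded model file.

<pre><code># Sketch: run predict.py once per voxel file in a folder.
# "voxels" and "model.pth" are placeholder paths, not files shipped with this repo.
import subprocess
from pathlib import Path

model_path = "model.pth"
for h5_file in sorted(Path("voxels").glob("*.h5")):
    print(f"=== {h5_file.name} ===")
    subprocess.run(
        ["python", "predict.py", "--f", str(h5_file), "--m", model_path],
        check=True,
    )
</code></pre>
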
3. Training

In order to train our model on your own dataset, you first have to convert your data, which will be pdb files, to the voxel representation of the protein-ligand binding grid. The data conversion procedure is described above. The module runs a random 5-fold cross validation. All related results, including loss, accuracy, and model weights, will be saved. All files are under `./Learning`.

The training module can be run as

<pre><code>python train.py --path path_to_your_data_folder --lpath path_to_your_label_file --bs batch_size --lr initial_learning_rate --epoch number_of_epochs --opath output_folder_path</code></pre>

  - `--path` path to the folder that contains all the voxel data.
  - `--lpath` label file path. The file should be a comma-separated file with no header. The first column is the filename and the second column is the class (starting from 0). An example is provided in `./Learning/labels`; a sketch of loading such a file is shown after this list.
  - `--bs`, `--lr`, `--epoch` are the hyperparameters of the model. Recommended values are 64, 1e-5, and 50, respectively.
  - `--opath` output folder path. If no output location is provided, a `logs` folder will be created under the current working directory to store everything.

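As a quick check that a label file matches the expected layout, it can be loaded with <em>pandas</em>. This is only a sketch; it assumes the example file `./Learning/labels` and the column meaning described above (filename, then integer class starting from 0).

<pre><code># Sketch: load the label file described above and summarize the classes.
# Assumed format: no header, column 0 = voxel filename, column 1 = class id.
import pandas as pd

labels = pd.read_csv("./Learning/labels", header=None, names=["filename", "class_id"])
print(labels.head())
print(labels["class_id"].value_counts())
</code></pre>
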
# Dataset

We provide the dataset we used for training at https://osf.io/enz69/, which contains the voxel representations of the ATP, Heme, and other classes, along with the class label file.