Diff of /README.md [a621b4] .. [df2647]


<div class="title" align=center>
    <h1>💊DrugGPT</h1>
    <div>A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins</div>
    <br/>
    <p>
        <img src="https://img.shields.io/github/license/LIYUESEN/druggpt">
        <img src="https://img.shields.io/badge/python-3.8-blue">
        <a href="https://colab.research.google.com/drive/1DBJWuAQc1Tl-SiIk6QWcXvBAWHQ01_kw">
        <img src="https://colab.research.google.com/assets/colab-badge.svg"></a>
        <img src="https://img.shields.io/github/stars/LIYUESEN/druggpt?style=social">
    </p>
</div>

## 💥 NEWS
**2024/08/11** We're excited to announce a new feature, Ligand Energy Minimization, now available in our latest release. Additionally, explore our new tool, druggpt_min_multi.py, designed specifically for efficient energy minimization of multiple ligands.  
**2024/07/30** All wet-lab validations have been completed, confirming that DrugGPT possesses ligand optimization capabilities.  
**2024/05/16** Wet-lab experiments confirm DrugGPT's ability to design ligands with new scaffolds from scratch and to repurpose existing ligands. Ligand optimization remains under evaluation. Stay tuned for more updates!  
**2024/05/16** The version has been upgraded to druggpt_v1.2, featuring new atom number control capabilities. Due to compatibility issues, the webui has been removed.  
**2024/04/03** Version upgraded to druggpt_v1.1, enhancing stability and adding a webui. Future versions will feature atom number control in molecules. Stay tuned.  
**2024/03/31** After careful consideration, I plan to create new repositories named druggpt_toolbox and druggpt_train to store post-processing tool scripts and training scripts, respectively. This repository should focus primarily on the generation of drug candidate molecules.  
**2024/03/31** I've decided to create a branch named druggpt_v1.0 for the current version since it is a stable release. Subsequently, I will continue to update the code.  
**2024/01/18** This project is now under experimental evaluation to confirm its actual value in drug research. Please continue to follow us!  

## 🚩 Introduction
DrugGPT presents a ligand design strategy based on GPT, an autoregressive language model, focusing on chemical space exploration and the discovery of ligands for specific proteins. Deep learning language models have shown significant potential in domains such as protein design and biomedical text analysis, which motivates the DrugGPT approach.

In this study, we train the DrugGPT model on a large body of protein-ligand binding data, aiming to discover novel molecules that can bind to specific proteins. This strategy not only improves the efficiency of ligand design but also offers a fast and effective route for drug development, opening new possibilities for the pharmaceutical domain.

## 📥 Deployment
### Clone
```shell
git clone https://github.com/LIYUESEN/druggpt.git
cd druggpt
```
Alternatively, click *Code > Download ZIP* on the GitHub page to download the repository.
### Create Python virtual environment
```shell
conda create -n druggpt python=3.8
conda activate druggpt
```
### Install PyTorch and other requirements
```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install datasets transformers scipy scikit-learn psutil
conda install conda-forge/label/cf202003::openbabel
```
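
Because `drug_generator.py` defaults to the `cuda` device, it can be useful to check after installation whether a CUDA-enabled PyTorch is actually visible. The snippet below is a minimal sketch of such a check. The helper name `pick_device` is hypothetical and is not part of DrugGPT; it uses only `torch.cuda.is_available()` and the standard library.

```python
import importlib.util


def pick_device(preferred: str = "cuda") -> str:
    """Return `preferred` if it looks usable, otherwise fall back to "cpu".

    Hypothetical helper for illustration only; not part of DrugGPT.
    """
    if preferred != "cuda":
        return preferred
    # Avoid an ImportError if PyTorch has not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    # True only when PyTorch was built with CUDA and a GPU is visible.
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(pick_device())
```

If this prints `cpu`, passing `-d cpu` to the generator is probably the safer choice.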
## 🗝 How to use
### 💻 Run from the command line
Run [drug_generator.py](https://github.com/LIYUESEN/druggpt/blob/main/drug_generator.py) with the parameters below.

Parameters:
- `-p` | `--pro_seq`: Input a protein amino acid sequence.
- `-f` | `--fasta`: Input a FASTA file including one protein amino acid sequence.

  Specify only one of `-p` and `-f`.
- `-l` | `--ligand_prompt`: Input a ligand prompt.
- `-n` | `--number`: The expected number of molecules to be generated.
- `-d` | `--device`: Hardware device to be used. Default is 'cuda'.
- `-o` | `--output`: Output directory for generated molecules. Default is './ligand_output/'.
- `-b` | `--batch_size`: The number of molecules to be generated per batch. Reduce this value if you run out of memory. Default is 16.
- `-t` | `--temperature`: Adjusts the randomness of text generation; higher values produce more diverse outputs. Default is 1.0.
- `--top_k`: The number of highest probability tokens to be considered for top-k sampling. Default is 9.
- `--top_p`: The cumulative probability threshold (0.0 - 1.0) for top-p (nucleus) sampling. Sampling is restricted to the smallest set of most probable tokens whose cumulative probability reaches this threshold. Default is 0.9.
- `--min_atoms`: Minimum number of non-H atoms allowed for generation. Default is None.
- `--max_atoms`: Maximum number of non-H atoms allowed for generation. Default is 35.
- `--no_limit`: Disable the default max atoms limit.

If the `-l` | `--ligand_prompt` option is used, the `--max_atoms` and `--min_atoms` parameters are ignored.
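
To make the interaction of `--temperature`, `--top_k`, and `--top_p` concrete, here is a simplified, standard-library-only sketch of how such filters are commonly combined when sampling the next token. It is an illustration under stated assumptions, not the code DrugGPT runs: the function names `filter_logits` and `sample` are hypothetical, and the order of operations (temperature, then top-k, then top-p) follows a common convention rather than anything confirmed by this README.

```python
import math
import random


def filter_logits(logits, temperature=1.0, top_k=9, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering.

    Hypothetical helper for illustration; defaults mirror the README's CLI defaults.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature: values above 1 flatten the distribution, values below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-k: keep the k most probable tokens and renormalise them.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    top_mass = sum(probs[i] for i in ranked)
    ranked_probs = [(i, probs[i] / top_mass) for i in ranked]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i, p in ranked_probs:
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise the survivors so their probabilities sum to 1.
    mass = sum(p for _, p in kept)
    return {i: p / mass for i, p in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_logits(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    # With top_k=3 and top_p=0.8, only the two most probable tokens survive.
    print(filter_logits([2.0, 1.0, 0.5, -1.0], top_k=3, top_p=0.8))
```

Lowering `top_k` or `top_p` narrows the candidate set and makes the output more conservative; raising the temperature spreads probability over more tokens.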

### 🌎 Run in Google Colab
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1DBJWuAQc1Tl-SiIk6QWcXvBAWHQ01_kw)
## 🔬 Example usage 
- If you want to input a protein FASTA file
    ```shell
    python drug_generator.py -f BCL2L11.fasta -n 50
    ```
- If you want to input the amino acid sequence of the protein
    ```shell
    python drug_generator.py -p MAKQPSDVSSECDREGRQLQPAERPPQLRPGAPTSLQTEPQGNPEGNHGGEGDSCPHGSPQGPLAPPASPGPFATRSPLFIFMRRSSLLSRSSSGYFSFDTDRSPAPMSCDKSTQTPSPPCQAFNHYLSAMASMRQAEPADMRPEIWIAQELRRIGDEFNAYYARRVFLNNYQAAEDHPRMVILRLLRYIVRLVWRMH -n 50
    ```
    
- If you want to provide a prompt for the ligand  
    ```shell
    python drug_generator.py -f BCL2L11.fasta -l COc1ccc(cc1)C(=O) -n 50
    ```
    
- Note: On Linux, enclose the ligand prompt in single quotes (''), because the shell would otherwise interpret the parentheses in the SMILES string.  
    ```shell
    python drug_generator.py -f BCL2L11.fasta -l 'COc1ccc(cc1)C(=O)' -n 50
    ```
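
When the generator is launched from another Python script, the quoting issue can be sidestepped by passing the arguments as a list instead of a shell string. The sketch below assumes only the flags documented above (`-f`, `-l`, `-n`); the helper `build_command` is a hypothetical name used for illustration.

```python
import shlex


def build_command(fasta, ligand_prompt=None, number=50):
    """Build the argument list for drug_generator.py from flags in this README.

    Hypothetical helper for illustration only; not part of DrugGPT.
    """
    cmd = ["python", "drug_generator.py", "-f", fasta, "-n", str(number)]
    if ligand_prompt is not None:
        cmd += ["-l", ligand_prompt]
    return cmd


cmd = build_command("BCL2L11.fasta", ligand_prompt="COc1ccc(cc1)C(=O)")

# shlex.join quotes only the arguments a POSIX shell would misread,
# which here is the SMILES string containing parentheses.
print(shlex.join(cmd))
# prints: python drug_generator.py -f BCL2L11.fasta -n 50 -l 'COc1ccc(cc1)C(=O)'

# To actually run it without a shell (no quoting needed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` bypasses the shell entirely, so the SMILES string reaches the script unchanged on any platform.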
## ✉️ Contact
Yuesen Li      lisen2286@gmail.com  

Yungang Xu     yungang.xu@xjtu.edu.cn

## 📝 How to reference this work
DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins

Yuesen Li, Chengyi Gao, Xin Song, Xiangyu Wang, Yungang Xu, Suxia Han

bioRxiv 2023.06.29.543848; doi: [https://doi.org/10.1101/2023.06.29.543848](https://doi.org/10.1101/2023.06.29.543848)

[![DOI](https://img.shields.io/badge/DOI-10.1101/2023.06.29.543848-blue)](https://doi.org/10.1101/2023.06.29.543848)
## ⚖ License
[GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.html)

Changes from a/README.md to b/README.md: in a/README.md each of the following lines began with a blockquote marker (`> `), which is removed in b/README.md. All other lines are identical in both revisions.

- Line 34: Or you can just click Code>Download ZIP to download this repo.
- Line 54: Only one of -p and -f should be specified.
- Line 67: If the `-l` | `--ligand_prompt` option is used, the `--max_atoms` and `--min_atoms` parameters will be disregarded.