Diff of /README.md [f2203b] .. [607b0b]

# Biophysics & AI

This repository collects my experiments applying AI to proteins and genomics. The work centers on analyzing large biological datasets, predicting protein structures and protein interactions, and modeling genomic sequences. Along the way it has deepened my understanding of the underlying biology and shown how much AI can contribute to the study of the fundamental building blocks of life.



## protein-ligand interaction
To-be-moved

## molecular property prediction (small protein/ligand: binding)
To-be-moved

## molecule generation (small protein/ligand)
To-be-moved

## ProteinStructurePredictionCNN

Jian Zhou and Olga G. Troyanskaya (2014) - "Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction" - https://arxiv.org/pdf/1403.1347.pdf

Sheng Wang et al. (2016) - "Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields" - https://arxiv.org/pdf/1512.00843.pdf

This project follows the two papers above. The first, the deep supervised and convolutional generative stochastic network of Zhou and Troyanskaya (referred to here as DCGSN), combines convolutional neural networks (CNNs) with generative stochastic networks (GSNs) to capture complex dependencies and patterns in protein sequences; the paper reported state-of-the-art accuracy for secondary structure prediction at the time of publication. The second, Deep Convolutional Neural Fields (DeepCNF), is used to refine the secondary structure prediction: it combines deep convolutional neural networks, which extract informative features from the protein sequence, with a graphical model over the predicted labels. Accurate secondary structure predictions are, in turn, a useful input for building whole protein structures.
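To make the convolutional part concrete, here is a minimal, standard-library-only sketch of how a single 1D convolution layer can map an amino-acid sequence to per-residue secondary structure labels. It is not the DCGSN or DeepCNF architecture and not this repository's model: the weights are random and untrained, the three-state alphabet `HEC` (helix, strand, coil) and every function name are assumptions made for illustration, and the kernel width is assumed to be odd.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
CLASSES = "HEC"                       # assumed 3-state labels: helix, strand, coil


def one_hot(sequence):
    """Encode each residue as a 20-dimensional one-hot row (unknown residues stay all-zero)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1.0
        rows.append(row)
    return rows


def conv1d_same(features, kernels, biases):
    """Zero-padded 1D convolution over sequence positions.

    features: length x in_channels
    kernels:  out_channels x width x in_channels (width assumed odd)
    returns:  length x out_channels
    """
    width = len(kernels[0])
    half = width // 2
    length = len(features)
    output = []
    for pos in range(length):
        scores = []
        for kernel, bias in zip(kernels, biases):
            total = bias
            for offset in range(width):
                src = pos + offset - half
                if 0 <= src < length:  # positions outside the sequence act as zero padding
                    total += sum(w * x for w, x in zip(kernel[offset], features[src]))
            scores.append(total)
        output.append(scores)
    return output


def predict_states(sequence, kernels, biases):
    """Pick the highest-scoring class for every residue."""
    scores = conv1d_same(one_hot(sequence), kernels, biases)
    return "".join(CLASSES[max(range(len(CLASSES)), key=row.__getitem__)] for row in scores)


# Untrained, randomly initialised filters: one 5-residue window per output class.
rng = random.Random(0)
WIDTH = 5
kernels = [[[rng.uniform(-1, 1) for _ in AMINO_ACIDS] for _ in range(WIDTH)] for _ in CLASSES]
biases = [0.0] * len(CLASSES)

prediction = predict_states("MKTAYIAKQR", kernels, biases)
print(prediction)  # one label per residue; meaningless until the filters are trained
```

A real model would stack several such layers, learn the weights from labeled data, and usually add evolutionary profile features alongside the one-hot encoding.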

- [ ] Transformer (with attention) implementations (AA sequence)

## CodonCraft ProGen: Precision Translation Model for Optimal Bacterial Expression (with Attention)

0. Basic LSTM model (RNN)

1. BERT (with Attention)

2. GPT-esque Transformer (uni-directional)

- [x] project: code moved from private git
- [ ] data needs to be anonymized
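For orientation, the task these models address can be illustrated with a simple non-neural baseline: back-translate a protein by choosing, for each amino acid, the codon that occurs most often in a reference coding sequence. This is only a sketch under stated assumptions, not the CodonCraft models. The helper names and the tiny reference sequence are made up for the example, and only the standard genetic code table is taken as given.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons(dna):
    """Split a coding sequence into complete codons, dropping any trailing partial codon."""
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]


def translate(dna):
    """Translate DNA to protein ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[codon] for codon in codons(dna))


def codon_usage(reference_cds):
    """Count how often each codon is used for each amino acid in a reference sequence."""
    usage = defaultdict(Counter)
    for codon in codons(reference_cds):
        usage[CODON_TABLE[codon]][codon] += 1
    return usage


def back_translate(protein, usage):
    """Choose the most frequent reference codon per residue.

    Residues never seen in the reference fall back to the first codon
    in the genetic code table that encodes them.
    """
    fallback = {}
    for codon, aa in CODON_TABLE.items():
        fallback.setdefault(aa, codon)
    chosen = []
    for aa in protein:
        if usage.get(aa):
            chosen.append(usage[aa].most_common(1)[0][0])
        else:
            chosen.append(fallback[aa])
    return "".join(chosen)


# Made-up reference sequence, for illustration only (not real expression data).
reference = "ATGAAAAAAAAGCTGCTGCTTTAA"
usage = codon_usage(reference)
dna = back_translate("MKLW", usage)
print(dna, translate(dna))
```

A frequency baseline like this ignores context between neighboring codons, which is the kind of dependency sequence models such as LSTMs and attention-based Transformers are designed to capture.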


# Data

## CodonCraft

https://www.ncbi.nlm.nih.gov/home/develop/api/
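As a rough sketch of how sequence data could be pulled from NCBI, the snippet below builds a request for the E-utilities `efetch` endpoint and parses the FASTA text it returns. It assumes E-utilities is the API in use (the link above only points to NCBI's general developer page); the function names are hypothetical, and the accession shown in the usage note is just an example. Nothing here contacts the network unless `fetch_fasta` is called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an efetch URL that returns the record as plain-text FASTA."""
    query = urlencode({"db": db, "id": accession, "rettype": rettype, "retmode": "text"})
    return f"{EFETCH_URL}?{query}"


def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, upper-casing the sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def fetch_fasta(accession):
    """Download and parse one record (performs a network request)."""
    with urlopen(build_efetch_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))


print(build_efetch_url("NC_000913.3"))
```

Calling `fetch_fasta("NC_000913.3")` would download the corresponding nucleotide record; NCBI asks clients to keep request rates low and to use an API key for heavier use.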

## ProteinStructurePredictions

CullPDB53 dataset (6125 proteins): a non-redundant set of protein structures from the Protein Data Bank (PDB). https://www.rcsb.org/

CB513 dataset: often used to benchmark protein secondary structure prediction. https://www.princeton.edu/~jzthree/datasets/ICML2014/

CASP datasets: the Critical Assessment of Structure Prediction (CASP) datasets are used for protein structure prediction and related tasks. http://predictioncenter.org/

CAMEO test proteins (6 months): the CAMEO (Continuous Automated Model EvaluatiOn) test proteins are used to evaluate protein structure prediction. http://www.cameo3d.org/sp/6-months/

JPRED training and test data (1338 training and 149 test proteins): training and test data for protein secondary structure prediction. http://www.compbio.dundee.ac.uk/jpred4/about.shtml
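Secondary structure labels derived from PDB structures are commonly given in the eight DSSP states, while many benchmarks report three-state results. The sketch below shows one widely used reduction (H, G, I to helix; E, B to strand; everything else to coil). Other reductions exist, so treat this mapping and the function name as assumptions rather than the convention used by any particular dataset above.

```python
# One common 8-state -> 3-state reduction of DSSP labels (an assumption; variants exist).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",            # helices
    "E": "E", "B": "E",                      # strands and bridges
    "T": "C", "S": "C", "L": "C", "C": "C",  # turns, bends, loops
    "-": "C",                                # DSSP blank / unassigned
}


def reduce_to_three_states(labels):
    """Map a string of 8-state labels to H/E/C; raises KeyError on unknown symbols."""
    return "".join(EIGHT_TO_THREE[state] for state in labels)


print(reduce_to_three_states("HGIEBTSL"))  # → HHHEECCC
```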

# Project Structure

```
project_root/
|-- data/
|   |-- raw/              # Raw data files
|   |-- processed/        # Processed and preprocessed data
|   |-- dataset.py        # Custom dataset classes and data loading utilities
|
|-- models/
|   |-- architecture.py   # Model architecture definition
|   |-- loss.py           # Custom loss functions
|   |-- metrics.py        # Evaluation metrics
|   |-- train.py          # Training script
|   |-- predict.py        # Inference script
|
|-- utils/
|   |-- helpers.py        # Utility functions
|   |-- visualization.py  # Visualization functions
|
|-- config/
|   |-- config.yaml       # Configuration file for hyperparameters
|
|-- notebooks/            # Jupyter notebooks for experimentation and analysis
|
|-- experiments/
|   |-- experiment_1/     # Directory for experiment 1 (can have multiple experiments)
|       |-- logs/         # TensorBoard logs, training/validation metrics
|       |-- saved_models/ # Saved model checkpoints
|
|-- requirements.txt       # Python dependencies file
|-- README.md              # Project documentation
```
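As an illustration of the kind of function `models/metrics.py` might hold, here is a sketch of per-residue accuracy (often called Q3 or Q8, depending on the label alphabet). The function name, the `ignore` symbol for masked positions, and the choice to return 0.0 when nothing is scored are all assumptions, not the repository's actual code.

```python
def per_residue_accuracy(predicted, observed, ignore="-"):
    """Fraction of residues whose predicted label matches the observed label.

    Positions whose observed label equals `ignore` (e.g. padding or
    unresolved residues) are excluded from the score.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    scored = [(p, o) for p, o in zip(predicted, observed) if o != ignore]
    if not scored:
        return 0.0
    return sum(p == o for p, o in scored) / len(scored)


print(per_residue_accuracy("HHEC", "HHCC"))  # → 0.75
```

Averaging this value over all proteins in a test set, or pooling residues across proteins first, gives slightly different numbers, so a write-up should state which one is reported.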

# Note

I provide sample anonymized data (50 MB; `<project>/data/processed/`) for running and testing the models. Code to curate and preprocess your own complete dataset will also be provided. These models are proofs of concept (POCs) that can be scaled and tuned for particular use cases.
