![Banner Image](ProteinStructurePredictionCNN/images/banner-bpg.png)

# Biophysics & AI
This repository collects my experiments applying artificial intelligence to proteins and genomics. The work centers on analyzing complex biological data, predicting protein structures and protein interactions, and modeling genomic sequences. Along the way it has deepened my understanding of the underlying biology and shown me how much AI can contribute to our understanding of the fundamental building blocks of life.
## protein-ligand interaction
To-be-moved

## molecular property prediction (small protein/ligand: binding)
To-be-moved

## molecule generation (small protein/ligand)
To-be-moved
## ProteinStructurePredictionCNN

Jian Zhou and Olga G. Troyanskaya (2014) - "Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction" - https://arxiv.org/pdf/1403.1347.pdf

Sheng Wang et al. (2016) - "Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields" - https://arxiv.org/pdf/1512.00843.pdf

This project follows the deep convolutional generative stochastic network (DCGSN) approach, which combines convolutional neural networks (CNNs) with generative stochastic networks (GSNs) to predict the secondary structure of proteins from their sequences. The convolutional layers capture local patterns and longer-range dependencies along the protein sequence, which the cited work reports as improving prediction accuracy. Deep Convolutional Neural Fields (DeepCNF) are then used to refine the secondary structure prediction. DeepCNF combines deep convolutional neural networks, which extract informative features from the sequence, with a graphical model (a conditional random field) that accounts for dependencies between neighboring labels. Accurate secondary structure predictions are a step toward building whole protein structures and, ultimately, toward understanding protein function and interactions.

- [ ] Transformer (with attention) implementations (AA sequence)
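
To make the convolutional idea above concrete, here is a minimal sketch in plain Python. It is not the DCGSN or DeepCNF model and not the code in this repository: it assumes a single untrained convolutional layer over one-hot encoded residues that emits one of eight secondary structure labels per position. The names `one_hot`, `conv1d_same`, `predict_q8`, and `random_kernels`, the window width of 11, and the label alphabet are illustrative assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
Q8_LABELS = "HGIEBTSL"                # assumed 8-class alphabet, L = loop/other


def one_hot(sequence):
    """Encode each residue as a 20-dim one-hot row (unknown residues stay all zero)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence.upper():
        row = [0.0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1.0
        rows.append(row)
    return rows


def conv1d_same(features, kernels, bias):
    """Slide each kernel along the sequence with zero padding (output length = input length).

    features: L x C_in, kernels: C_out x K x C_in (K odd), bias: C_out.
    This is the cross-correlation that deep learning libraries call convolution.
    """
    length = len(features)
    width = len(kernels[0])
    half = width // 2
    output = []
    for pos in range(length):
        scores = []
        for k, kernel in enumerate(kernels):
            total = bias[k]
            for offset in range(width):
                src = pos + offset - half
                if 0 <= src < length:  # positions outside the sequence count as zeros
                    total += sum(w * x for w, x in zip(kernel[offset], features[src]))
            scores.append(total)
        output.append(scores)
    return output


def predict_q8(sequence, kernels, bias):
    """Pick the highest-scoring label at every residue position."""
    scores = conv1d_same(one_hot(sequence), kernels, bias)
    return "".join(Q8_LABELS[max(range(len(row)), key=row.__getitem__)] for row in scores)


def random_kernels(n_out, width, n_in, seed=0):
    """Untrained placeholder weights; a real model would learn these from data."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
             for _ in range(width)] for _ in range(n_out)]


kernels = random_kernels(len(Q8_LABELS), 11, len(AMINO_ACIDS))
bias = [0.0] * len(Q8_LABELS)
prediction = predict_q8("MKTAYIAKQR", kernels, bias)  # one label per residue
```

Because the weights are random, the predicted labels are meaningless; the sketch only shows how a window over neighboring residues turns into a per-residue label.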
## CodonCraft ProGen: Precision Translation Model for Optimal Bacterial Expression (with Attention)

0. Basic LSTM model (RNN)

1. BERT (with Attention)

2. GPT-esque Transformer (uni-directional)

- [x] project: code moved from private git
- [ ] data needs to be anonymized
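
The sequence models listed above all need coding sequences turned into integer tokens. The sketch below shows one plausible way to do that at the codon level. It is an assumption for illustration, not the preprocessing used in this project: the special tokens, the vocabulary order, and the function names `split_codons`, `encode`, and `decode` are all hypothetical.

```python
from itertools import product

SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>", "<unk>"]      # assumed special tokens
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 DNA codons
CODON_SET = set(CODONS)
VOCAB = {token: i for i, token in enumerate(SPECIAL_TOKENS + CODONS)}
INVERSE_VOCAB = {i: token for token, i in VOCAB.items()}


def split_codons(cds):
    """Split a coding sequence into triplets; RNA input (U) is mapped to DNA (T)."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def encode(cds):
    """Map a coding sequence to token ids, wrapped in <bos> ... <eos>."""
    ids = [VOCAB["<bos>"]]
    ids += [VOCAB.get(codon, VOCAB["<unk>"]) for codon in split_codons(cds)]
    ids.append(VOCAB["<eos>"])
    return ids


def decode(ids):
    """Map token ids back to a DNA string, dropping special tokens."""
    tokens = (INVERSE_VOCAB[i] for i in ids)
    return "".join(token for token in tokens if token in CODON_SET)
```

Codons containing ambiguous bases such as `N` fall back to `<unk>`, so `decode` cannot recover them. A codon-level vocabulary keeps the reading frame explicit, which is one reason to prefer it over single-nucleotide tokens for an expression-focused model.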
# Data

## CodonCraft

https://www.ncbi.nlm.nih.gov/home/develop/api/
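
As a rough illustration of pulling sequences through the NCBI APIs linked above, the sketch below builds an E-utilities `efetch` request and parses the FASTA text it returns. It assumes the `nuccore` database and plain FASTA output; the helper names `efetch_url`, `parse_fasta`, and `fetch_fasta` are hypothetical and are not part of this repository.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore"):
    """Build an efetch URL that requests one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def parse_fasta(text):
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def fetch_fasta(accession, db="nuccore"):
    """Download one record over the network and parse it (needs internet access)."""
    with urlopen(efetch_url(accession, db), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Only `fetch_fasta` touches the network. NCBI publishes usage guidelines and rate limits for E-utilities, so a real download script should follow them.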
## ProteinStructurePredictions

CullPDB53 Dataset (6125 proteins): The CullPDB53 dataset is a non-redundant set of protein structures from the Protein Data Bank (PDB). https://www.rcsb.org/

The CB513 dataset is often used for protein secondary structure prediction. https://www.princeton.edu/~jzthree/datasets/ICML2014/

The Critical Assessment of Structure Prediction (CASP) datasets are used for protein structure prediction and related tasks. http://predictioncenter.org/

CAMEO Test Proteins (6 months): The CAMEO (Continuous Automated Model EvaluatiOn) test proteins are used for protein structure prediction evaluation. http://www.cameo3d.org/sp/6-months/

JPRED Training and Test Data (1338 training and 149 test proteins): The JPRED dataset provides training and test data for protein secondary structure prediction. http://www.compbio.dundee.ac.uk/jpred4/about.shtml
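
Secondary structure benchmarks like those above are usually scored per residue, in either eight classes (Q8) or three (Q3). The sketch below assumes one common reduction from eight classes to three and a simple accuracy function. The label alphabet (with `L` for loop), the mapping, and the function names are assumptions; other reductions are also in use, and this is not necessarily the metric implemented in this repository.

```python
# One common 8-to-3 class reduction (assumed here):
# helices -> H, strands and bridges -> E, everything else -> C (coil).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "L": "C",
}


def reduce_q8(labels):
    """Collapse an 8-class label string into 3 classes."""
    return "".join(Q8_TO_Q3[label] for label in labels)


def accuracy(predicted, observed):
    """Fraction of residues whose predicted label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    if not observed:
        raise ValueError("label strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def q3_accuracy(predicted_q8, observed_q8):
    """Q3 accuracy computed from 8-class predictions and observations."""
    return accuracy(reduce_q8(predicted_q8), reduce_q8(observed_q8))
```

Q3 accuracy is never lower than Q8 accuracy for the same predictions, because confusions within a group (for example `H` versus `G`) stop counting as errors after the reduction.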
# Project Structure

```
project_root/
|-- data/
|   |-- raw/              # Raw data files
|   |-- processed/        # Processed and preprocessed data
|   |-- dataset.py        # Custom dataset classes and data loading utilities
|
|-- models/
|   |-- architecture.py   # Model architecture definition
|   |-- loss.py           # Custom loss functions
|   |-- metrics.py        # Evaluation metrics
|   |-- train.py          # Training script
|   |-- predict.py        # Inference script
|
|-- utils/
|   |-- helpers.py        # Utility functions
|   |-- visualization.py  # Visualization functions
|
|-- config/
|   |-- config.yaml       # Configuration file for hyperparameters
|
|-- notebooks/            # Jupyter notebooks for experimentation and analysis
|
|-- experiments/
|   |-- experiment_1/     # Directory for experiment 1 (can have multiple experiments)
|       |-- logs/         # TensorBoard logs, training/validation metrics
|       |-- saved_models/ # Saved model checkpoints
|
|-- requirements.txt       # Python dependencies file
|-- README.md              # Project documentation
```
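
The layout above implies that each experiment gets its own `logs/` and `saved_models/` directories. A small helper along these lines could create them before training starts. The function name `experiment_dirs` is hypothetical and the helper is only a sketch of the convention, not code from this repository.

```python
from pathlib import Path


def experiment_dirs(project_root, name):
    """Create and return the logs/ and saved_models/ directories for one experiment."""
    base = Path(project_root) / "experiments" / name
    dirs = {"logs": base / "logs", "saved_models": base / "saved_models"}
    for path in dirs.values():
        # parents=True builds experiments/<name>/ as needed; exist_ok makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

For example, `experiment_dirs(".", "experiment_2")` would add a second experiment next to `experiment_1/` without touching existing checkpoints.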
# Note

I provide sample anonymized data (50 MB; `<project>/data/processed/`) to run and test the models. Code to curate and preprocess your own complete dataset will also be provided. These models are proofs of concept (POCs) that can be scaled and tuned for particular use cases.