# AI_Genomics Project

This repository contains the implementation and integration of two powerful genomics models: GET (Gene Expression Transformer) and AlphaFold.

## Project Structure
```
AI_Genomics/
├── models/
│   ├── get_model/          # GET model implementation (175MB)
│   │   ├── tutorials/      # Jupyter notebooks for data processing and model usage
│   │   │   ├── prepare_pbmc.ipynb     # Data processing tutorial
│   │   │   ├── finetune_pbmc.ipynb    # Model fine-tuning tutorial
│   │   │   ├── predict_atac.ipynb     # ATAC prediction demo
│   │   │   └── pretrain_pbmc.ipynb    # Pre-training tutorial
│   │   ├── get_model/     # Core model implementation
│   │   └── env.yml        # Conda environment specification
│   └── alphafold/         # AlphaFold implementation (34MB)
│       ├── data/          # Symbolic link to alphafold_data
│       ├── configs/       # Model configurations
│       └── checkpoints/   # Model checkpoints
├── experiments/
│   ├── get_experiments/   # GET experiment scripts and results
│   └── af_experiments/    # AlphaFold experiment scripts and results
├── utils/                 # Shared utility functions
├── notebooks/             # Jupyter notebooks for analysis
└── docs/                  # Documentation and model mindmaps
```
## Model Data Locations

### AlphaFold Data
Required data includes:
  - Sequence databases (UniRef90, BFD, MGnify)
  - Structure templates (PDB70)
  - Parameter files
  - Model weights

Note: AlphaFold data setup will be done separately following the official installation guide.
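Once the databases have been fetched with the official helper (`scripts/download_all_data.sh <DOWNLOAD_DIR>` in the AlphaFold repository), the shared download location still has to be wired into the repository layout described above. A minimal sketch, assuming a hypothetical storage path (`/tmp/alphafold_data_demo` here; substitute your cluster's data directory):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical shared storage location for the downloaded databases;
# replace with your cluster's actual path.
ALPHAFOLD_DATA="/tmp/alphafold_data_demo"
mkdir -p "$ALPHAFOLD_DATA"

# Link the shared data directory into models/alphafold/data,
# matching the "Symbolic link to alphafold_data" entry in the tree above.
mkdir -p models/alphafold
ln -sfn "$ALPHAFOLD_DATA" models/alphafold/data

# Verify the link resolves to the shared location
readlink models/alphafold/data
```

Using a symlink keeps the multi-terabyte databases on shared storage while the repository path stays stable for configs and scripts.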
### GET Model Data
The GET model requires the following data preparation steps:

1. PBMC Data Processing:
   - Follow the tutorial in `models/get_model/tutorials/prepare_pbmc.ipynb`
   - The data processing pipeline includes:
     - Peak sorting (chr1, chr2, chr3 order)
     - Count matrix preparation
     - Quality checks (>3M sequencing depth recommended)

2. Model Training Data:
   - Fine-tuning: Follow `models/get_model/tutorials/finetune_pbmc.ipynb`
   - ATAC prediction: Use `models/get_model/tutorials/predict_atac.ipynb`
   - Pre-training: Reference `models/get_model/tutorials/pretrain_pbmc.ipynb`
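The peak-sorting step above can be sketched outside the notebook with GNU `sort` (the file names are hypothetical; the tutorial operates on the PBMC peak set):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy BED-like peak file (chrom, start, end), deliberately out of order
cat > peaks.bed <<'EOF'
chr10 500 1000
chr2 300 800
chr1 100 600
chr1 50 400
EOF

# -k1,1V sorts chromosomes in natural order (chr1, chr2, ..., chr10),
# -k2,2n then sorts numerically by start coordinate within each chromosome
sort -k1,1V -k2,2n peaks.bed > peaks.sorted.bed

# chr1 rows come first, chr10 last
cat peaks.sorted.bed
```

Note that a plain lexicographic sort would place chr10 before chr2; the `-V` (version) key avoids that.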
## Setup and Installation

1. Clone the repository:
```bash
git clone https://github.com/[your-username]/AI_Genomics.git /home/caom/AI_Genomics
cd /home/caom/AI_Genomics
```
2. Create and activate a conda environment:
```bash
conda create -n ai_genomics python=3.8
conda activate ai_genomics
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Set up model-specific requirements:
   - GET Model:
     ```bash
     cd models/get_model
     conda env create -f env.yml
     conda activate get
     ```
   - AlphaFold:
     ```bash
     # Create AlphaFold conda environment
     cd models/alphafold
     conda create -n alphafold python=3.10
     conda activate alphafold

     # Install JAX with CUDA support
     pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

     # Install other dependencies
     conda install -y -c conda-forge openmm=7.5.1 pdbfixer
     conda install -y -c bioconda hmmer hhsuite kalign2
     pip install -r docker/requirements.txt

     # Download genetic databases and model parameters
     # (This will be done separately following cluster-specific storage guidelines)
     ```

     Note: We use a conda environment instead of Docker on the cluster because it provides:
     - Better integration with the SLURM job scheduler
     - Direct access to the cluster's optimized CUDA libraries
     - Improved performance without Docker virtualization overhead
     - Better resource management
     - Direct access to cluster storage
## Computational Resources

### GPU Access
To access GPU resources for model training and inference, use the following SLURM command:

```bash
srun -p general --pty -t 120:00:00 --cpus-per-task=32 --mem=64G --gres=gpu:a100:2 /bin/bash
```

This command requests:
- Partition: `general`
- Time limit: 120 hours (SLURM reads `120:00:00` as hours:minutes:seconds; `120:00` would mean 120 minutes)
- CPUs: 32 cores
- Memory: 64GB
- GPUs: 2 NVIDIA A100 GPUs
- An interactive bash session

Note: Adjust the resources (time, CPU, memory, GPU) based on your specific needs.
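For non-interactive runs, the same resource request can be expressed as a batch script (the log path and training command below are placeholders, not part of this repository):

```bash
#!/bin/bash
#SBATCH -p general
#SBATCH -t 120:00:00
#SBATCH --cpus-per-task=32
#SBATCH --mem=64G
#SBATCH --gres=gpu:a100:2
#SBATCH -o logs/%x-%j.out

# conda activate needs the shell hook in non-interactive batch shells
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate ai_genomics

# Placeholder; replace with your actual training entry point
python your_training_script.py
```

Submit with `sbatch your_script.sh`; `%x-%j` in the log path expands to the job name and job ID.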
## Version Control

This repository uses Git for version control. Important files:
- `.gitignore`: Excludes large data files, model checkpoints, and environment-specific files
- `.gitattributes`: Handles large file storage using Git LFS
- `requirements.txt`: Lists all Python dependencies
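A starting point for the `.gitignore` described above might look like this (the patterns are illustrative, not the repository's actual file):

```
# Large data files and model artifacts
alphafold_data/
models/alphafold/data
checkpoints/
*.ckpt
*.pt

# Environment-specific files
__pycache__/
.ipynb_checkpoints/
*.egg-info/
```

Anything matching these patterns stays out of Git history; files that must be versioned despite their size go through Git LFS via `.gitattributes` instead.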
## Contributing

1. Create a new branch for your feature
2. Make your changes
3. Submit a pull request

## License

Please refer to the original licenses of [AlphaFold](https://github.com/google-deepmind/alphafold) and [GET](https://github.com/GET-Foundation/get_model) models.