# Understanding AlphaFold Output Files

This document explains the output files that AlphaFold generates and what each one tells you about a predicted protein structure.

## Output Directory Structure

The output directory contains four groups of files. Names containing `model_[1-5]` exist once per model:

### 1. Model Predictions
- `unrelaxed_model_[1-5]_pred_0.pdb`: Initial predicted structure in PDB format
- `unrelaxed_model_[1-5]_pred_0.cif`: Initial predicted structure in mmCIF format
- `relaxed_model_[1-5]_pred_0.pdb`: Relaxed (energy-minimized) structure in PDB format
- `relaxed_model_[1-5]_pred_0.cif`: Relaxed structure in mmCIF format

### 2. Confidence Metrics
- `confidence_model_[1-5]_pred_0.json`: Per-residue confidence scores
  - Contains three parallel arrays with one entry per residue:
    - `residueNumber`: Position in the sequence
    - `confidenceScore`: pLDDT score (0-100)
    - `confidenceCategory`: Confidence category (D=Disordered/very low, L=Low, M=Medium, H=High)
  - pLDDT interpretation:
    - < 50: Very low confidence, often disordered (D)
    - 50-70: Low confidence (L)
    - 70-90: Medium confidence, backbone generally reliable (M)
    - > 90: High confidence (H)
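
As a minimal sketch of how these arrays can be used, the following summarizes a confidence file that has already been parsed from JSON. The helper name `summarize_confidence` and the sample values are illustrative and not part of AlphaFold; the cut-offs of 50 and 70 follow the pLDDT interpretation above, and the sample omits `confidenceCategory` because the summary does not need it.

```python
import json


def summarize_confidence(data):
    """Summarize per-residue pLDDT from a parsed confidence JSON dict."""
    scores = data["confidenceScore"]
    numbers = data["residueNumber"]
    return {
        "mean_plddt": sum(scores) / len(scores),
        # Fraction of residues at or above the pLDDT 70 cut-off.
        "fraction_ge_70": sum(s >= 70 for s in scores) / len(scores),
        # Residue numbers that fall below the pLDDT 50 cut-off.
        "residues_below_50": [n for n, s in zip(numbers, scores) if s < 50],
    }


# Made-up sample data with the same array names as the real file.
sample = json.loads(
    '{"residueNumber": [1, 2, 3, 4],'
    ' "confidenceScore": [45.0, 65.0, 85.0, 95.0]}'
)
summary = summarize_confidence(sample)
print(summary)
```

For a real run, parse the file with `json.load` and pass the resulting dictionary to the same function.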

### 3. Ranking and Analysis
- `ranked_[0-4].pdb/.cif`: The five models sorted by ranking confidence (mean pLDDT for single-chain predictions), with `ranked_0` being the best
- `ranking_debug.json`: The per-model scores used for ranking and the resulting model order
- `relax_metrics.json`: Metrics from the structure relaxation step
- `timings.json`: Time in seconds spent in each stage of the prediction
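
The link between each `ranked_N` file and the model that produced it can be recovered from `ranking_debug.json`. The sketch below assumes the file holds an `order` list of model names (best first) plus one dictionary of per-model scores, keyed `plddts` in single-chain runs; check your own file, since the score key can differ between pipelines. `describe_ranking` and the sample values are hypothetical.

```python
import json


def describe_ranking(ranking):
    """Return (ranked file, model name, score) tuples, best model first."""
    order = ranking["order"]
    # Assumption: the only key besides "order" holds the per-model scores.
    score_key = next(key for key in ranking if key != "order")
    scores = ranking[score_key]
    return [
        (f"ranked_{rank}.pdb", name, scores[name])
        for rank, name in enumerate(order)
    ]


# Made-up sample in the assumed layout.
sample = json.loads("""
{
  "plddts": {"model_1_pred_0": 88.0, "model_2_pred_0": 89.5, "model_3_pred_0": 87.2},
  "order": ["model_2_pred_0", "model_1_pred_0", "model_3_pred_0"]
}
""")
rows = describe_ranking(sample)
for row in rows:
    print(row)
```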

### 4. Intermediate Data
- `features.pkl`: Input features computed from the sequence, alignments, and templates
- `result_model_[1-5]_pred_0.pkl`: Raw prediction results
- `msas/`: Directory containing the multiple sequence alignments (and template search hits) used for prediction

## Key Files for Analysis

1. **Best Model**: Check `ranked_0.pdb` first. It is the model AlphaFold ranked highest by its own confidence estimate
2. **Confidence Assessment**: Review `confidence_model_[1-5]_pred_0.json` to understand prediction reliability
   - Scores above 90 suggest very reliable predictions
   - Lower scores, especially below 50, may indicate flexible or disordered regions

## File Formats

1. **PDB Files** (.pdb):
   - Legacy fixed-column text format for protein structures
   - Opens directly in molecular visualization software (PyMOL, VMD, etc.)
   - Contains atomic coordinates and basic metadata

2. **mmCIF Files** (.cif):
   - Current standard format of the Protein Data Bank
   - Stores data as named categories and tables, so it can carry more metadata than PDB format
   - Has no fixed-column limits, so it handles large structures (more than 99,999 atoms or 62 chains) that PDB format cannot

## Using the Output

1. **Structure Analysis**:
   - Use `ranked_0.pdb` for your primary analysis
   - Compare with other ranked models to assess structural variability
   - Pay attention to regions with high confidence scores

2. **Quality Assessment**:
   - Check confidence scores to identify reliable regions
   - Look for consistently high-confidence regions across models
   - Be cautious about interpreting low-confidence regions

3. **Visualization**:
   - Color structure by pLDDT score to highlight reliable regions
   - Compare multiple models to understand structural flexibility
   - Focus analysis on high-confidence regions

## Visualizing PDB Files

To visualize the predicted protein structures (PDB files), you have several options:

1. **Desktop Applications**:
   - **PyMOL**: Professional-grade molecular visualization (recommended)
     - Download from: https://pymol.org/
     - Features:
       - High-quality rendering
       - Powerful analysis tools
       - Script automation support
       - Can color by B-factor (confidence scores in AlphaFold)

   - **UCSF Chimera**: Free academic visualization tool
     - Download from: https://www.cgl.ucsf.edu/chimera/
     - Good for structure analysis and comparison

   - **VMD**: Specialized in molecular dynamics
     - Download from: https://www.ks.uiuc.edu/Research/vmd/
     - Excellent for trajectory analysis

2. **Web-Based Viewers**:
   - **Mol***: Modern web-based viewer
     - Access via: https://molstar.org/viewer/
     - Just drag and drop your PDB file

   - **NGL Viewer**: Lightweight web viewer
     - Access via: http://nglviewer.org/ngl/
     - Good for quick visualization

3. **Analyzing ranked_0.pdb**:
   - This is AlphaFold's best prediction
   - The B-factor column (columns 61-66 of each ATOM record, the last numeric field) contains pLDDT confidence scores
   - In PyMOL, you can color by B-factor to visualize confidence:
     ```python
     # PyMOL commands (type at the PyMOL prompt or save in a .pml script)
     load ranked_0.pdb
     # With the rainbow palette, blue marks the lowest pLDDT and red the highest
     spectrum b, rainbow
     ```

4. **Best Practices**:
   - Always start by viewing `ranked_0.pdb`
   - Color the structure by confidence scores
   - Compare multiple models to understand flexibility
   - Look for regions with high confidence scores (>90)
   - Be cautious about interpreting low-confidence regions

5. **Important Features to Look For**:
   - Secondary structure elements (helices, sheets)
   - Overall fold and domain organization
   - Regions of high vs. low confidence
   - Potential flexible regions (varying between models)
   - Biologically important sites or motifs
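
Because pLDDT is stored in the B-factor column, the scores can also be read without a viewer. The sketch below relies only on the fixed column positions of PDB ATOM records (atom name in columns 13-16, residue number in 23-26, B-factor in 61-66). The helper `plddt_per_residue`, the format string, and the three sample records are made up for illustration.

```python
# Fixed-width layout of a PDB ATOM record, used here only to build sample lines.
ATOM_FMT = (
    "ATOM  {serial:5d} {name:<4s} {res:3s} {chain}{resseq:4d}    "
    "{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}"
)


def plddt_per_residue(pdb_text):
    """Map residue number to pLDDT, read from the B-factor of each CA atom."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


# Three synthetic records: one N atom (ignored) and two CA atoms.
sample_pdb = "\n".join([
    ATOM_FMT.format(serial=1, name=" N  ", res="MET", chain="A", resseq=1,
                    x=0.0, y=0.0, z=0.0, occ=1.0, b=85.5),
    ATOM_FMT.format(serial=2, name=" CA ", res="MET", chain="A", resseq=1,
                    x=1.5, y=0.0, z=0.0, occ=1.0, b=85.5),
    ATOM_FMT.format(serial=3, name=" CA ", res="LYS", chain="A", resseq=2,
                    x=5.3, y=0.0, z=0.0, occ=1.0, b=42.25),
])

scores = plddt_per_residue(sample_pdb)
low_confidence = [resnum for resnum, score in scores.items() if score < 50]
print(scores, low_confidence)
```

For a real structure, read the text of `ranked_0.pdb` and pass it to `plddt_per_residue`.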

## Understanding Pickle (.pkl) Files

AlphaFold writes two kinds of pickle files, which hold the detailed prediction data as Python dictionaries of arrays:

1. **result_model_[1-5]_pred_0.pkl**:
   Contains the raw prediction outputs including:
   - `distogram`: Predicted distance distributions between residue pairs
   - `experimentally_resolved`: Predictions of whether each atom would be resolved in an experimental structure
   - `masked_msa`: Predictions for masked positions of the multiple sequence alignment
   - `predicted_lddt`: Raw predictions for local confidence, from which `plddt` is derived
   - `structure_module`: Final atom positions and masks
   - `plddt`: Per-residue confidence scores (0-100)
   - `ranking_confidence`: Overall model confidence used for ranking

2. **features.pkl**:
   Contains input features used for prediction:
   - `sequence`: The input protein sequence
   - `aatype`: Amino acid type encodings
   - `msa`: Multiple sequence alignment data
   - `template_*`: Information about structural templates
   - `residue_index`: Numbering of residues
   - `domain_name`: Identifier taken from the description line of the input FASTA file

### Key Metrics from Pickle Files

1. **Confidence Scores (pLDDT)**:
   - Range: 0-100
   - Higher is better
   - Your results show:
     - Average score: 89.16 (Very good)
     - Range: 65.29 - 93.71
     - Most residues have high confidence (>70)

2. **Multiple Sequence Alignment (MSA)**:
   - Your protein had 3,303 aligned sequences
   - This is a deep alignment, which generally supports accurate prediction
   - More diverse alignments generally improve prediction quality

3. **Structure Module**:
   - Contains final atomic coordinates
   - Includes all heavy backbone and side-chain atoms, stored in 37 fixed atom slots per residue with a mask marking which atoms are present
   - Shape: (34 residues, 37 atoms per residue, 3 coordinates)
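
These figures can be recomputed from the two pickle files. The following is a sketch, assuming the dictionaries hold NumPy-compatible arrays under the keys `plddt`, `structure_module` (with a nested `final_atom_positions` entry), and `msa`. The helpers `load_pickle` and `summarize_prediction` are hypothetical, the arrays below are synthetic stand-ins, and depending on the AlphaFold version, unpickling a real result file may also require `jax` to be installed.

```python
import pickle

import numpy as np


def load_pickle(path):
    """Load one AlphaFold pickle file (only open files you trust)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize_prediction(result, features):
    """Collect headline numbers from a result dict and a features dict."""
    plddt = np.asarray(result["plddt"], dtype=float)
    coords = np.asarray(result["structure_module"]["final_atom_positions"])
    msa = np.asarray(features["msa"])
    return {
        "mean_plddt": float(plddt.mean()),
        "min_plddt": float(plddt.min()),
        "max_plddt": float(plddt.max()),
        "num_msa_sequences": int(msa.shape[0]),  # rows of the MSA array
        "coords_shape": tuple(coords.shape),     # (residues, 37, 3)
    }


# Synthetic stand-ins so the sketch runs without real AlphaFold output.
fake_result = {
    "plddt": np.array([60.0, 90.0, 90.0]),
    "structure_module": {"final_atom_positions": np.zeros((3, 37, 3))},
}
fake_features = {"msa": np.zeros((5, 3), dtype=np.int32)}

summary = summarize_prediction(fake_result, fake_features)
print(summary)
```

With real data, call `load_pickle` on `result_model_1_pred_0.pkl` and `features.pkl` and pass both dictionaries to `summarize_prediction`.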

### Analyzing Pickle Files

To analyze these files, you can use the provided `inspect_pkl.py` script:
```bash
python inspect_pkl.py
```

This will show:
- Data structure of predictions
- Confidence scores
- Sequence information
- Template usage
- MSA statistics

## Why Five Models?

AlphaFold generates five models for each prediction for several important reasons:

1. **Sampling Different Conformations**:
   - Proteins can exist in multiple stable conformations
   - Different models may capture different possible structural states
   - Differences between models can help identify flexible or dynamic regions of the protein

2. **Confidence Assessment**:
   - Agreement between models indicates prediction reliability
   - Regions that vary between models may be:
     - Naturally flexible
     - Able to adopt multiple conformations
     - Harder to predict accurately

3. **Model Architecture**:
   - Each model uses a different set of trained network parameters
   - The parameter sets come from separate training runs (in the single-chain pipeline, models 1 and 2 use structural templates while models 3-5 do not)
   - This ensemble approach improves prediction robustness

4. **Ranking System**:
   - AlphaFold ranks the five models based on predicted confidence
   - `ranked_0` represents the most confident prediction
   - Comparing ranks helps identify the most likely structure

5. **Scientific Best Practice**:
   - Multiple models follow the scientific principle of ensemble sampling
   - Helps avoid over-relying on a single prediction
   - Variation between models gives a rough, qualitative indication of uncertainty

When analyzing results, it's important to:
- Start with the highest-ranked model (`ranked_0`)
- Compare models to identify consistent and variable regions
- Consider all models when the confidence scores are similar
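
One way to make the model comparison concrete is to compare C-alpha distances between two models, which avoids having to superimpose them first. For each residue, the sketch below averages how much its distances to all other residues change between the models; larger values flag regions that differ. This is an illustrative technique chosen for simplicity, not something AlphaFold reports, and the function names and coordinates are hypothetical.

```python
import math


def ca_coords(pdb_text):
    """Extract C-alpha coordinates from PDB text using fixed columns."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def per_residue_deviation(coords_a, coords_b):
    """Mean absolute change in CA-CA distances for each residue."""
    if len(coords_a) != len(coords_b):
        raise ValueError("models must have the same number of residues")
    n = len(coords_a)
    deviations = []
    for i in range(n):
        diffs = [
            abs(math.dist(coords_a[i], coords_a[j])
                - math.dist(coords_b[i], coords_b[j]))
            for j in range(n)
            if j != i
        ]
        deviations.append(sum(diffs) / len(diffs) if diffs else 0.0)
    return deviations


# Synthetic three-residue example: the last residue moves between models.
model_a = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 3.0, 0.0)]
deviation = per_residue_deviation(model_a, model_b)
print(deviation)
```

For real models, pass the text of two ranked PDB files through `ca_coords` and compare the resulting coordinate lists.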

## Performance Metrics

The `timings.json` file provides detailed information about:
- MSA generation time
- Feature processing time
- Model prediction time
- Structure relaxation time

This can be useful for:
- Optimizing future runs
- Understanding computational requirements
- Identifying bottlenecks in the prediction pipeline
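
To find bottlenecks, the stages can be sorted by duration. This is a sketch, assuming `timings.json` is a flat mapping from stage name to a number of seconds; the stage names and values below are sample data for illustration, and `slowest_stages` is a hypothetical helper.

```python
import json


def slowest_stages(timings, top=3):
    """Return (stage, seconds, share of total) for the slowest stages."""
    total = sum(timings.values()) or 1.0  # avoid division by zero
    ranked = sorted(timings.items(), key=lambda item: item[1], reverse=True)
    return [(name, seconds, seconds / total) for name, seconds in ranked[:top]]


# Made-up sample in the assumed layout.
sample = json.loads("""
{
  "features": 1200.0,
  "predict_and_compile_model_1_pred_0": 300.0,
  "relax_model_1_pred_0": 100.0
}
""")

stages = slowest_stages(sample, top=2)
for name, seconds, share in stages:
    print(f"{name}: {seconds:.0f} s ({share:.0%} of total)")
```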