b/docs/alphafold_output_explanation.md

# Understanding AlphaFold Output Files

This document explains the output files generated by AlphaFold and their significance for protein structure prediction.

## Output Directory Structure

The output directory contains several types of files for each model prediction:

### 1. Model Predictions
- `unrelaxed_model_[1-5]_pred_0.pdb`: Initial predicted structure in PDB format
- `unrelaxed_model_[1-5]_pred_0.cif`: Initial predicted structure in mmCIF format
- `relaxed_model_[1-5]_pred_0.pdb`: Relaxed (energy-minimized) structure in PDB format
- `relaxed_model_[1-5]_pred_0.cif`: Relaxed structure in mmCIF format

### 2. Confidence Metrics
- `confidence_model_[1-5]_pred_0.json`: Per-residue confidence scores
  - Contains three key arrays:
    - `residueNumber`: Position in the sequence (1-based)
    - `confidenceScore`: pLDDT score (0-100)
    - `confidenceCategory`: Confidence category (D=Disordered, L=Low, M=Medium, H=High)
  - pLDDT interpretation (category letter in parentheses):
    - < 50: Very low confidence, often disordered (D)
    - 50-70: Low confidence (L)
    - 70-90: Confident (M)
    - 90-100: Very high confidence (H)
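
As a minimal sketch of how a confidence file could be summarized, the snippet below reads the `residueNumber` and `confidenceScore` arrays described above and counts residues per pLDDT band. The function names and the band labels are illustrative choices, not part of AlphaFold.

```python
import json
from collections import Counter
from pathlib import Path


def load_confidence(path):
    """Read a confidence JSON file and return (residue_numbers, scores)."""
    data = json.loads(Path(path).read_text())
    return data["residueNumber"], data["confidenceScore"]


def plddt_band(score):
    """Map a pLDDT score to the threshold band it falls in."""
    if score < 50:
        return "<50"
    if score < 70:
        return "50-70"
    if score < 90:
        return "70-90"
    return ">=90"


def summarize_confidence(scores):
    """Return the mean pLDDT and the number of residues in each band."""
    bands = Counter(plddt_band(s) for s in scores)
    return sum(scores) / len(scores), dict(bands)


# Example usage (the file name follows the pattern listed above):
# residues, scores = load_confidence("confidence_model_1_pred_0.json")
# mean_plddt, bands = summarize_confidence(scores)
```

Counting residues per band gives a quick feel for how much of the chain is worth interpreting before opening a structure viewer.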

### 3. Ranking and Analysis
- `ranked_[0-4].pdb/.cif`: Models sorted by confidence, with 0 being the best
- `ranking_debug.json`: The confidence value used to rank each model and the resulting order
- `relax_metrics.json`: Metrics from the structure relaxation step
- `timings.json`: Time spent in each stage of the prediction
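
To see which model became `ranked_0`, the ranking file can be read directly. The sketch below assumes the layout written by recent AlphaFold releases: an `order` list of model names plus a single dictionary of per-model scores (stored under `plddts` for monomer runs and `iptm+ptm` for multimer runs). Check your own file if the keys differ; `load_ranking` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_ranking(path):
    """Return a list of (rank, model_name, score) tuples, best model first.

    Assumes an "order" list of model names plus exactly one other key
    holding a {model_name: score} dictionary.
    """
    data = json.loads(Path(path).read_text())
    score_key = next(key for key in data if key != "order")
    scores = data[score_key]
    return [(rank, name, scores[name]) for rank, name in enumerate(data["order"])]


# Example usage:
# for rank, name, score in load_ranking("ranking_debug.json"):
#     print(f"ranked_{rank}: {name} ({score:.2f})")
```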

### 4. Intermediate Data
- `features.pkl`: Input features extracted from the sequence
- `result_model_[1-5]_pred_0.pkl`: Raw prediction results
- `msas/`: Directory containing multiple sequence alignments used for prediction

## Key Files for Analysis

1. **Best Model**: Check `ranked_0.pdb` first; it is the model AlphaFold ranks highest by its own confidence estimate
2. **Confidence Assessment**: Review `confidence_model_[1-5]_pred_0.json` to understand prediction reliability
   - High confidence scores (>90) suggest very reliable predictions
   - Lower scores may indicate flexible or disordered regions

## File Formats

1. **PDB Files** (.pdb):
   - Long-established fixed-column text format for protein structures
   - Easily viewable in molecular visualization software (PyMOL, VMD, etc.)
   - Contains atomic coordinates and basic metadata

2. **mmCIF Files** (.cif):
   - Current standard archive format for protein structures
   - Table-based layout that can carry richer metadata than the PDB format
   - Has no fixed-column limits, so it handles large structures better

## Using the Output

1. **Structure Analysis**:
   - Use `ranked_0.pdb` for your primary analysis
   - Compare with other ranked models to assess structural variability
   - Pay attention to regions with high confidence scores

2. **Quality Assessment**:
   - Check confidence scores to identify reliable regions
   - Look for consistently high-confidence regions across models
   - Be cautious about interpreting low-confidence regions

3. **Visualization**:
   - Color structure by pLDDT score to highlight reliable regions
   - Compare multiple models to understand structural flexibility
   - Focus analysis on high-confidence regions

## Visualizing PDB Files

To visualize the predicted protein structures (PDB files), you have several options:

1. **Desktop Applications**:
   - **PyMOL**: Professional-grade molecular visualization (recommended)
     - Download from: https://pymol.org/
     - Features:
       - High-quality rendering
       - Powerful analysis tools
       - Script automation support
       - Can color by B-factor (confidence scores in AlphaFold)

   - **UCSF Chimera**: Free academic visualization tool
     - Download from: https://www.cgl.ucsf.edu/chimera/
     - Good for structure analysis and comparison

   - **VMD**: Specialized in molecular dynamics
     - Download from: https://www.ks.uiuc.edu/Research/vmd/
     - Excellent for trajectory analysis

2. **Web-Based Viewers**:
   - **Mol\***: Modern web-based viewer
     - Access via: https://molstar.org/viewer/
     - Just drag and drop your PDB file

   - **NGL Viewer**: Lightweight web viewer
     - Access via: http://nglviewer.org/ngl/
     - Good for quick visualization

3. **Analyzing ranked_0.pdb**:
   - This is AlphaFold's best prediction
   - The B-factor column (columns 61-66 of each ATOM record) contains pLDDT confidence scores
   - In PyMOL, you can color by B-factor to visualize confidence:
     ```python
     # PyMOL commands
     load ranked_0.pdb
     spectrum b, rainbow  # Colors structure by confidence scores (blue = low, red = high)
     ```

4. **Best Practices**:
   - Always start by viewing `ranked_0.pdb`
   - Color the structure by confidence scores
   - Compare multiple models to understand flexibility
   - Look for regions with high confidence scores (>90)
   - Be cautious about interpreting low-confidence regions

5. **Important Features to Look For**:
   - Secondary structure elements (helices, sheets)
   - Overall fold and domain organization
   - Regions of high vs. low confidence
   - Potential flexible regions (varying between models)
   - Biologically important sites or motifs
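
Outside a viewer, the same pLDDT values can be pulled straight from the B-factor column. The following is a small sketch that relies only on the fixed-column PDB layout (chain ID in column 22, residue number in columns 23-26, B-factor in columns 61-66); the function names and the default threshold of 70 are illustrative.

```python
def read_plddt_from_pdb(pdb_path):
    """Return {(chain_id, residue_number): pLDDT} read from the B-factor column."""
    plddt = {}
    with open(pdb_path) as handle:
        for line in handle:
            if not line.startswith("ATOM"):
                continue
            chain_id = line[21]
            residue_number = int(line[22:26])
            # Columns 61-66 hold the B-factor. AlphaFold writes the same pLDDT
            # on every atom of a residue, so the first atom seen is enough.
            plddt.setdefault((chain_id, residue_number), float(line[60:66]))
    return plddt


def low_confidence_residues(plddt, threshold=70.0):
    """List the residues whose pLDDT falls below `threshold`."""
    return [key for key, score in plddt.items() if score < threshold]


# Example usage:
# scores = read_plddt_from_pdb("ranked_0.pdb")
# print(low_confidence_residues(scores))
```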

## Understanding Pickle (.pkl) Files

AlphaFold generates two types of pickle files that contain detailed prediction data:

1. **result_model_[1-5]_pred_0.pkl**:
   Contains the raw prediction outputs including:
   - `distogram`: Predicted distance distributions between residue pairs
   - `experimentally_resolved`: Predictions of whether each atom would be resolved in an experimental structure
   - `masked_msa`: Output of the masked-MSA prediction head
   - `predicted_lddt`: Raw predictions from which the local confidence (pLDDT) is computed
   - `structure_module`: Final atom positions and masks
   - `plddt`: Per-residue confidence scores (0-100)
   - `ranking_confidence`: Overall model confidence used for ranking

2. **features.pkl**:
   Contains input features used for prediction:
   - `sequence`: The input protein sequence
   - `aatype`: Amino acid type encodings
   - `msa`: Multiple sequence alignment data (one row per aligned sequence)
   - `template_*`: Information about structural templates
   - `residue_index`: Numbering of residues
   - `domain_name`: Name of the input sequence, taken from its description line
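
A quick way to get oriented in either pickle is to list its keys with the shape of each array. The sketch below is one possible approach; `describe`, `summarize`, and `load_pickle` are hypothetical helper names. Two assumptions apply: the top-level object is a dictionary, and the environment can import whatever array library the file was written with (typically NumPy). Only unpickle files you generated yourself or otherwise trust.

```python
import pickle


def describe(value):
    """Return shape and dtype for array-like values, the type name otherwise."""
    shape = getattr(value, "shape", None)
    if shape is not None:
        return f"array shape={tuple(shape)} dtype={getattr(value, 'dtype', '?')}"
    return type(value).__name__


def summarize(obj, prefix=""):
    """Flatten a (possibly nested) dict into {dotted_key: description}."""
    summary = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            summary.update(summarize(value, prefix=f"{name}."))
        else:
            summary[name] = describe(value)
    return summary


def load_pickle(path):
    """Load a pickle file from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Example usage:
# for name, description in summarize(load_pickle("features.pkl")).items():
#     print(f"{name}: {description}")
```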

### Key Metrics from Pickle Files

1. **Confidence Scores (pLDDT)**:
   - Range: 0-100
   - Higher is better
   - Your results show:
     - Average score: 89.16 (Very good)
     - Range: 65.29 - 93.71
     - Most residues have high confidence (>70)

2. **Multiple Sequence Alignment (MSA)**:
   - Your protein had 3,303 aligned sequences
   - This is a reasonably deep alignment, which generally supports accurate prediction
   - More diverse alignments generally improve prediction quality

3. **Structure Module**:
   - Contains final atomic coordinates
   - Includes backbone and side chain atoms (37 atom slots per residue; slots a residue does not use are masked out)
   - Shape: (34 residues, 37 atoms per residue, 3 coordinates)
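
The three metrics above can be recomputed from the loaded pickles with a few lines. This is a sketch rather than the code that produced the numbers quoted here. It assumes `result` and `features` are the dictionaries loaded from `result_model_*_pred_0.pkl` and `features.pkl`, and that the coordinates sit under `structure_module` → `final_atom_positions`. That nested key name is an assumption to verify against your own file.

```python
def plddt_stats(result):
    """Return (mean, minimum, maximum) of the per-residue pLDDT values."""
    plddt = [float(x) for x in result["plddt"]]
    return sum(plddt) / len(plddt), min(plddt), max(plddt)


def msa_depth(features):
    """Number of aligned sequences, i.e. the number of rows in `msa`."""
    return len(features["msa"])


def coordinate_shape(result):
    """Shape of the final coordinates, expected as (residues, 37, 3)."""
    positions = result["structure_module"]["final_atom_positions"]
    return tuple(getattr(positions, "shape", ()))


# Example usage, with `result` and `features` already unpickled:
# mean, low, high = plddt_stats(result)
# print(f"pLDDT mean={mean:.2f} range={low:.2f}-{high:.2f}")
# print("MSA depth:", msa_depth(features))
# print("Coordinates:", coordinate_shape(result))
```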

### Analyzing Pickle Files

To analyze these files, you can use the provided `inspect_pkl.py` script:
```bash
python inspect_pkl.py
```

This will show:
- Data structure of predictions
- Confidence scores
- Sequence information
- Template usage
- MSA statistics

## Why Five Models?

AlphaFold generates five models for each prediction for several important reasons:

1. **Sampling Different Conformations**:
   - Proteins can exist in multiple stable conformations
   - Different models may occasionally capture different structural states, although AlphaFold is not designed to sample conformational ensembles
   - Disagreement between models helps identify flexible or dynamic regions of the protein

2. **Confidence Assessment**:
   - Agreement between models indicates prediction reliability
   - Regions that vary between models may be:
     - Naturally flexible
     - Able to adopt multiple conformations
     - Harder to predict accurately

3. **Model Architecture**:
   - Each model uses a separately trained set of neural network parameters
   - Models are trained independently with different random seeds
   - This ensemble approach improves prediction robustness

4. **Ranking System**:
   - AlphaFold ranks the five models based on predicted confidence (mean pLDDT for monomer predictions)
   - `ranked_0` represents the most confident prediction
   - Comparing ranks helps identify the most likely structure

5. **Scientific Best Practice**:
   - Multiple models follow the scientific principle of ensemble sampling
   - Helps avoid over-relying on a single prediction
   - Gives a rough indication of the uncertainty in the prediction

When analyzing results, it's important to:
- Start with the highest-ranked model (`ranked_0`)
- Compare models to identify consistent and variable regions
- Consider all models when the confidence scores are similar
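
One way to compare two ranked models is to superpose them and look at the per-residue C-alpha distance. The sketch below uses NumPy and the standard Kabsch superposition. It is an illustrative approach rather than anything AlphaFold itself provides, and it assumes both files describe the same single chain with the same number of residues. Function names are hypothetical.

```python
import numpy as np


def read_ca_coords(pdb_path):
    """Return an (N, 3) array of C-alpha coordinates from a PDB file."""
    coords = []
    with open(pdb_path) as handle:
        for line in handle:
            # Atom name is in columns 13-16; x, y, z are in columns 31-54.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                coords.append([float(line[30:38]), float(line[38:46]), float(line[46:54])])
    return np.array(coords)


def superpose(mobile, target):
    """Rigidly superpose `mobile` onto `target` (Kabsch algorithm)."""
    target_center = target.mean(axis=0)
    p = mobile - mobile.mean(axis=0)
    q = target - target_center
    u, _, vt = np.linalg.svd(p.T @ q)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    return p @ rotation + target_center


def per_residue_deviation(mobile, target):
    """Per-residue C-alpha distance (in angstroms) after superposition."""
    return np.linalg.norm(superpose(mobile, target) - target, axis=1)


# Example usage:
# best = read_ca_coords("ranked_0.pdb")
# other = read_ca_coords("ranked_1.pdb")
# print(per_residue_deviation(other, best).round(2))
```

Residues with large deviations are candidates for the variable regions discussed above. A global superposition can exaggerate differences when whole domains move relative to each other, so treat the numbers as a first look.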

## Performance Metrics

The `timings.json` file provides detailed information about:
- MSA generation time
- Feature processing time
- Model prediction time
- Structure relaxation time

This can be useful for:
- Optimizing future runs
- Understanding computational requirements
- Identifying bottlenecks in the prediction pipeline
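
To spot the bottleneck quickly, the stages can be sorted by duration. This sketch assumes `timings.json` is a flat mapping from stage name to seconds; `slowest_stages` is a hypothetical helper name.

```python
import json
from pathlib import Path


def slowest_stages(timings_path, top=3):
    """Return the `top` (stage, seconds) pairs with the longest run time."""
    timings = json.loads(Path(timings_path).read_text())
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)[:top]


# Example usage:
# for stage, seconds in slowest_stages("timings.json"):
#     print(f"{stage}: {seconds / 60:.1f} min")
```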