This document explains the output files generated by AlphaFold and their significance for protein structure prediction.
The output directory contains several types of files for each model prediction:
unrelaxed_model_[1-5]_pred_0.pdb
: Initial predicted structure in PDB format
unrelaxed_model_[1-5]_pred_0.cif
: Initial predicted structure in mmCIF format
relaxed_model_[1-5]_pred_0.pdb
: Relaxed (energy-minimized) structure in PDB format
relaxed_model_[1-5]_pred_0.cif
: Relaxed structure in mmCIF format
confidence_model_[1-5]_pred_0.json
: Per-residue confidence scores
residueNumber
: Position in the sequence
confidenceScore
: pLDDT score (0-100)
confidenceCategory
: Confidence category (L=Low, M=Medium, H=High)
pLDDT > 90: Very high confidence (H)
ranked_[0-4].pdb/.cif
: Models sorted by confidence, with 0 being the best
ranking_debug.json
: Details about the model ranking process
relax_metrics.json
: Metrics from the structure relaxation step
timings.json
: Performance metrics for different stages of prediction
features.pkl
: Input features extracted from the sequence
result_model_[1-5]_pred_0.pkl
: Raw prediction results
msas/
: Directory containing multiple sequence alignments used for prediction
Start with ranked_0.pdb: this is AlphaFold's best prediction
Check confidence_model_[1-5]_pred_0.json to understand prediction reliability
PDB Files (.pdb):
Contains atomic coordinates and basic metadata
mmCIF Files (.cif):
Use ranked_0.pdb for your primary analysis
Pay attention to regions with high confidence scores
Quality Assessment:
Be cautious about interpreting low-confidence regions
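As a rough illustration of how the confidence_model_[1-5]_pred_0.json files could be screened for low-confidence regions, the sketch below summarizes one such record. It assumes the three fields (residueNumber, confidenceScore, confidenceCategory) are stored as parallel per-residue lists, which this document does not state explicitly. The function names and the cutoff of 50 are illustrative choices, not part of AlphaFold.

```python
import json
from collections import Counter


def load_confidence(path):
    """Read one confidence JSON file into a dict."""
    with open(path) as handle:
        return json.load(handle)


def summarize_confidence(data, low_threshold=50.0):
    """Summarize a confidence record.

    Assumes `data` holds three parallel per-residue lists; the layout and
    the default threshold are assumptions made for this example.
    """
    numbers = data["residueNumber"]
    scores = data["confidenceScore"]
    categories = data["confidenceCategory"]
    if not (len(numbers) == len(scores) == len(categories)):
        raise ValueError("per-residue lists differ in length")
    if not scores:
        raise ValueError("no residues found")
    return {
        "mean_plddt": sum(scores) / len(scores),
        "category_counts": dict(Counter(categories)),
        # Residue numbers whose pLDDT falls below the chosen cutoff.
        "low_residues": [n for n, s in zip(numbers, scores) if s < low_threshold],
    }
```

A typical call would be `summarize_confidence(load_confidence("confidence_model_1_pred_0.json"))`. The residues listed under `low_residues` are the ones to interpret with caution.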
Visualization:
To visualize the predicted protein structures (PDB files), you have several options:
PyMOL: Professional-grade molecular visualization (recommended)
UCSF Chimera: Free academic visualization tool
VMD: Specialized in molecular dynamics
Web-Based Viewers:
Mol*: Modern web-based viewer
NGL Viewer: Lightweight web viewer
Analyzing ranked_0.pdb:
In PyMOL, you can color by B-factor to visualize confidence:
```python
# PyMOL commands
load ranked_0.pdb
# Color by the B-factor column, which holds the per-residue confidence scores
spectrum b, rainbow
```
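The same confidence values can also be read outside a viewer. The sketch below pulls the B-factor column out of the ATOM records of a PDB file using the fixed-column layout of the PDB format. The function name is hypothetical. The code keeps the first value seen per residue, on the assumption that every atom of a residue carries the same score.

```python
def plddt_per_residue(pdb_text):
    """Return {(chain_id, residue_number): score} from the B-factor column."""
    scores = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]            # column 22
        res_num = int(line[22:26])     # columns 23-26
        b_factor = float(line[60:66])  # columns 61-66
        # Assumption: all atoms of a residue share one score, so keep the first.
        scores.setdefault((chain_id, res_num), b_factor)
    return scores


def read_plddt(path):
    """Convenience wrapper that reads a PDB file from disk."""
    with open(path) as handle:
        return plddt_per_residue(handle.read())
```

For example, `read_plddt("ranked_0.pdb")` would give a per-residue mapping that can be filtered or plotted with any tool.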
Best Practices:
Start with ranked_0.pdb
Be cautious about interpreting low-confidence regions
Important Features to Look For:
AlphaFold generates two types of pickle files that contain detailed prediction data:
result_model_[1-5]_pred_0.pkl:
Contains the raw prediction results:
distogram
: Distance predictions between residue pairs
experimentally_resolved
: Predictions of whether each atom would be resolved in an experimental structure
masked_msa
: Multiple sequence alignment information
predicted_lddt
: Raw predictions for local confidence
structure_module
: Final atom positions and masks
plddt
: Per-residue confidence scores (0-100)
ranking_confidence
: Overall model confidence
features.pkl:
Contains input features used for prediction:
sequence
: The input protein sequence
aatype
: Amino acid type encodings
msa
: Multiple sequence alignment data
template_*
: Information about structural templates
residue_index
: Numbering of residues
domain_name
: Name of the protein domain
Your results show:
Multiple Sequence Alignment (MSA):
More diverse alignments generally improve prediction quality
Structure Module:
To analyze these files, you can use the provided inspect_pkl.py script:
python inspect_pkl.py
This will show:
- Data structure of predictions
- Confidence scores
- Sequence information
- Template usage
- MSA statistics
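For a quick look without the provided script, a pickle file can be summarized with the standard library alone. The sketch below is not inspect_pkl.py. It is a minimal stand-in with hypothetical function names that lists each top-level key together with the type and shape of its value. Unpickling the real files is likely to require NumPy (and possibly JAX) to be installed, and pickle should only be used on files you trust.

```python
import pickle


def describe(value):
    """Return a short, human-readable description of one value."""
    shape = getattr(value, "shape", None)
    if shape is not None:  # array-like objects such as NumPy arrays
        return f"{type(value).__name__} shape={tuple(shape)}"
    if isinstance(value, dict):
        return f"dict with keys {sorted(map(str, value))}"
    if isinstance(value, (list, tuple, str, bytes)):
        return f"{type(value).__name__} len={len(value)}"
    return type(value).__name__


def summarize_pickle(path):
    """Map each top-level key of a pickled dict to a description."""
    with open(path, "rb") as handle:
        data = pickle.load(handle)
    return {key: describe(value) for key, value in data.items()}
```

Calling `summarize_pickle("features.pkl")` or `summarize_pickle("result_model_1_pred_0.pkl")` should show at a glance which of the keys listed above are present.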
AlphaFold generates five models for each prediction for several important reasons:
Helps identify flexible or dynamic regions of the protein
Confidence Assessment:
Regions that vary between models may be:
Model Architecture:
This ensemble approach improves prediction robustness
Ranking System:
ranked_0 represents the most confident prediction
Comparing ranks helps identify the most likely structure
Scientific Best Practice:
When analyzing results, it's important to:
- Start with the highest-ranked model (ranked_0)
- Compare models to identify consistent and variable regions
- Consider all models when the confidence scores are similar
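One simple way to look for variable regions is to compare the per-residue pLDDT tracks of the five models and flag positions where they disagree. The sketch below does only that. It is a confidence-based proxy and does not replace a structural comparison of the coordinates. The function name, the input layout (model name mapped to a list of scores) and the spread cutoff of 10 are assumptions made for this example.

```python
def variable_residues(plddt_by_model, min_spread=10.0):
    """Return 1-based positions where pLDDT differs strongly between models.

    `plddt_by_model` maps a model name to its per-residue pLDDT list.
    A position is flagged when max - min across models reaches `min_spread`.
    """
    tracks = list(plddt_by_model.values())
    if not tracks:
        return []
    length = len(tracks[0])
    if any(len(track) != length for track in tracks):
        raise ValueError("all models must cover the same number of residues")
    flagged = []
    for index in range(length):
        column = [track[index] for track in tracks]
        if max(column) - min(column) >= min_spread:
            flagged.append(index + 1)
    return flagged
```

Positions returned here are candidates for closer inspection in a viewer, especially when the overall confidence scores of the models are similar.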
The timings.json file provides detailed information about:
- MSA generation time
- Feature processing time
- Model prediction time
- Structure relaxation time
This can be useful for:
- Optimizing future runs
- Understanding computational requirements
- Identifying bottlenecks in the prediction pipeline
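To spot bottlenecks, the timing entries can be sorted by duration. The sketch below assumes timings.json is a flat mapping from stage name to seconds, which this document does not confirm. The function names and the stage names in the usage example are hypothetical.

```python
import json


def load_timings(path):
    """Read timings.json into a dict."""
    with open(path) as handle:
        return json.load(handle)


def rank_stages(timings):
    """Return (stage, seconds, fraction_of_total) tuples, slowest first.

    Assumes `timings` is a flat {stage_name: seconds} mapping.
    """
    total = sum(timings.values())
    ranked = sorted(timings.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, seconds, seconds / total if total else 0.0)
        for name, seconds in ranked
    ]
```

For instance, `rank_stages({"msa": 600.0, "predict": 300.0, "relax": 100.0})` puts the MSA stage first with a share of 0.6, which would suggest looking at the alignment step before tuning anything else.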