parse
Convert predictor outputs to the intermediate format.
Synopsis
tsp-maker parse {af2|af3|boltz2} INPUT_DIR OUTPUT_DIR [OPTIONS]
Predictors
| Subcommand | Predictor |
|---|---|
| af2 | AlphaFold2 |
| af3 | AlphaFold3 |
| boltz2 | Boltz2 |
Arguments
| Argument | Description |
|---|---|
| INPUT_DIR | Directory containing predictor outputs |
| OUTPUT_DIR | Directory to write intermediate format |
Options
| Option | Default | Description |
|---|---|---|
| --top-n | 5 | Number of top-ranked models to extract |
| --id-pattern | None | Optional regex to filter/extract IDs from folder names |
| -q, --quiet | | Suppress progress output |
Examples
Basic Usage
# Parse AlphaFold3 outputs
tsp-maker parse af3 /data/af3_predictions /intermediate
# Parse with fewer models
tsp-maker parse af3 /data/af3_predictions /intermediate --top-n 3
Multiple Predictors
To combine predictors, run one parse command per predictor against the same output directory:
tsp-maker parse af2 /data/af2 /intermediate --top-n 5
tsp-maker parse af3 /data/af3 /intermediate --top-n 5
tsp-maker parse boltz2 /data/boltz2 /intermediate --top-n 3
File names carry a predictor tag (_AF2_, _AF3_, _BZ2_), so outputs from different predictors do not collide.
Filtering with ID Pattern
Use --id-pattern to filter folders or extract IDs from complex folder names:
# Only process folders matching a pattern
tsp-maker parse af3 /data/mixed /intermediate \
    --id-pattern "AT[0-9]G[0-9]{5}"
# Extract ID from complex folder names (use capture group)
tsp-maker parse af3 /data/jobs /intermediate \
    --id-pattern "job_(\w+)_model"
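Before a long run it can help to preview what a pattern does to a few folder names. The bash sketch below is only an approximation: this page does not say which regex dialect tsp-maker uses, and bash matches POSIX extended regular expressions, where `\w` may not be supported, so the example spells out character classes instead. The function name `preview_id` is hypothetical, and printing the unchanged folder name when the pattern has no capture group is an assumption.

```shell
# Preview the ID a pattern would yield for one folder name (bash only).
# Returns 1 when the name does not match, i.e. the folder would be filtered out.
preview_id() {
  local pattern=$1 name=$2
  [[ $name =~ $pattern ]] || return 1
  if (( ${#BASH_REMATCH[@]} > 1 )); then
    printf '%s\n' "${BASH_REMATCH[1]}"   # capture group present: the extracted ID
  else
    printf '%s\n' "$name"                # no group: assume the folder name is kept as-is
  fi
}

preview_id 'job_([A-Za-z0-9_-]+)_model' 'job_P12345_model'   # prints P12345
preview_id 'AT[0-9]G[0-9]{5}' 'AT1G01010'                    # prints AT1G01010
```

Treat the result as a sanity check on the pattern, not as a guarantee of what tsp-maker will extract.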
Output Format
The intermediate format contains:
OUTPUT_DIR/
├── structures/
│   ├── P12345_AF3_1.cif
│   ├── P12345_AF3_2.cif
│   ├── Q67890_AF3_1.cif
│   └── ...
├── scores/
│   ├── P12345_AF3.json
│   ├── Q67890_AF3.json
│   └── ...
└── pae/
    ├── P12345_AF3_1.npy
    ├── P12345_AF3_2.npy
    └── ...
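The names under structures/ and pae/ appear to follow the pattern ID_TAG_rank.ext. A minimal shell sketch for splitting such a name into its parts, assuming that pattern holds (it is inferred from the listing above, not from a specification); the function name `split_model_name` is hypothetical:

```shell
# Split a structure or PAE file name into ID, predictor tag, and rank.
# Assumes the <ID>_<TAG>_<rank>.<ext> pattern inferred from the tree above.
split_model_name() {
  base=${1##*/}        # drop any leading directories
  stem=${base%.*}      # drop the extension
  rank=${stem##*_}     # text after the last underscore
  rest=${stem%_*}      # everything before the rank
  tag=${rest##*_}      # predictor tag, e.g. AF3
  id=${rest%_*}        # remaining text; may itself contain underscores
  printf '%s %s %s\n' "$id" "$tag" "$rank"
}

split_model_name structures/P12345_AF3_1.cif    # prints: P12345 AF3 1
```

Splitting from the right keeps IDs that contain underscores intact. Files under scores/ have no rank component, so this helper does not apply to them.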
Score JSON Format
[
  {
    "model_id": "P12345_AF3_1",
    "structure_id": "P12345",
    "rank": 1,
    "predictor": "alphafold3",
    "plddt_mean": 85.2,
    "ptm": 0.82,
    "ranking_score": 0.85
  }
]
Protein ID Rules
Folder Names = Protein IDs
tsp-maker uses folder names directly as protein identifiers. This is a deliberate design decision.
We do not extract IDs from file contents, metadata, or job names. If your folders are named job_001, run_2024_001, etc., those names become your protein IDs and the actual protein names are lost.
Check your folder names before running tsp-maker.
Requirements
| Requirement | Rule |
|---|---|
| Max length | 16 characters |
| Allowed characters | A-Z, a-z, 0-9, _, - |
| Not allowed | Spaces, brackets, path separators, special characters |
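The rules in this table can be checked ahead of time. The following is a rough shell sketch, not tsp-maker's own validation: the function names `is_valid_id` and `report_invalid_folders` are hypothetical, and treating an empty name as invalid is an assumption.

```shell
# Return 0 when a folder name satisfies the rules in the table above.
is_valid_id() {
  candidate=$1
  [ -n "$candidate" ] || return 1               # empty name (assumed invalid)
  [ "${#candidate}" -le 16 ] || return 1        # max length: 16 characters
  case $candidate in
    *[!A-Za-z0-9_-]*) return 1 ;;               # any character outside the allowed set
  esac
  return 0
}

# List the subfolders of a directory that fail the check.
report_invalid_folders() {
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue                   # glob matched nothing
    folder=$(basename "$dir")
    is_valid_id "$folder" || printf 'invalid ID: %s\n' "$folder"
  done
}
```

Running `report_invalid_folders /data/af3_predictions` before parsing would list the folders that break these rules.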
Valid Examples
| Folder Name | Protein ID |
|---|---|
| P12345 | P12345 |
| AT1G01010 | AT1G01010 |
| gene-001 | gene-001 |
| NbD_12345 | NbD_12345 |
Problem Examples
| Folder Name | Problem |
|---|---|
| job_001 | Valid syntax, but the ID will be job_001, not the protein name |
| alphafold_P12345_run1 | Exceeds 16 characters; skipped |
| my protein | Contains a space; skipped |
| results (copy) | Contains brackets and a space; skipped |
Fixing Folder Names
If your folders don't contain protein IDs, rename them before parsing.
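One way to do the renaming is from a small mapping file. This is a sketch under stated assumptions: the two-column, tab-separated map (current folder name, then protein ID) and the function name `rename_from_map` are illustrative and not part of tsp-maker.

```shell
# Rename prediction folders using a two-column, tab-separated map:
#   <current folder name><TAB><protein ID>
# The map file and the function name are illustrative, not part of tsp-maker.
rename_from_map() {
  map_file=$1
  base_dir=$2
  tab=$(printf '\t')
  while IFS=$tab read -r old new || [ -n "$old" ]; do
    [ -n "$old" ] && [ -n "$new" ] || continue      # skip blank or incomplete lines
    if [ -d "$base_dir/$old" ] && [ ! -e "$base_dir/$new" ]; then
      mv -- "$base_dir/$old" "$base_dir/$new"       # never overwrite an existing folder
    else
      printf 'skipped: %s\n' "$old" >&2
    fi
  done < "$map_file"
}
```

For example, `rename_from_map rename_map.tsv /data/predictions` would rename each listed folder in place. Work on a copy of the data if the original folder names matter.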
Alternatively, use --id-pattern with a capture group to extract IDs from complex names:
# Extract protein ID from "job_P12345_model" → "P12345"
tsp-maker parse af3 /data/predictions /output \
    --id-pattern "job_([A-Za-z0-9_-]+)_model"
Complex IDs
For complexes with multiple proteins, join the IDs with an underscore, for example P12345_Q67890.
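A small shell sketch of the joining step, assuming the 16-character limit from the Requirements table also applies to the joined name (folder names are used directly as IDs). The function name `join_ids` is hypothetical.

```shell
# Join protein IDs with underscores and enforce the 16-character limit.
join_ids() {
  joined=$(IFS=_; printf '%s' "$*")     # "$*" joins arguments with the first IFS character
  if [ "${#joined}" -gt 16 ]; then
    printf 'too long (%s characters): %s\n' "${#joined}" "$joined" >&2
    return 1
  fi
  printf '%s\n' "$joined"
}

join_ids P12345 Q67890    # prints P12345_Q67890
```

Under that assumption, three six-character accessions joined this way come to 20 characters and would exceed the limit, so larger complexes may need shorter folder names.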