parse

Convert predictor outputs to the intermediate format.

Synopsis

tsp-maker parse <predictor> INPUT_DIR OUTPUT_DIR [OPTIONS]

Predictors

Subcommand	Predictor
`af2`	AlphaFold2
`af3`	AlphaFold3
`boltz2`	Boltz2

Arguments

Argument	Description
`INPUT_DIR`	Directory containing predictor outputs
`OUTPUT_DIR`	Directory to write intermediate format

Options

Option	Default	Description
`--top-n`	5	Number of top-ranked models to extract
`--id-pattern`	None	Optional regex to filter/extract IDs from folder names
`-q, --quiet`		Suppress progress output

Examples

Basic Usage

# Parse AlphaFold3 outputs
tsp-maker parse af3 /data/af3_predictions /intermediate

# Parse with fewer models
tsp-maker parse af3 /data/af3_predictions /intermediate --top-n 3

Multiple Predictors

To combine predictors, run one parse command per predictor, all writing to the same output directory:

tsp-maker parse af2 /data/af2 /intermediate --top-n 5
tsp-maker parse af3 /data/af3 /intermediate --top-n 5
tsp-maker parse boltz2 /data/boltz2 /intermediate --top-n 3

File names include a predictor tag (AF2, AF3, or BZ2), as in P12345_AF3_1.cif, so outputs from different predictors do not collide.

Filtering with ID Pattern

Use --id-pattern to filter folders or extract IDs from complex folder names:

# Only process folders matching a pattern
tsp-maker parse af3 /data/mixed /intermediate \
    --id-pattern "AT[0-9]G[0-9]{5}"

# Extract the ID from complex folder names (use a capture group)
tsp-maker parse af3 /data/jobs /intermediate \
    --id-pattern "job_(\w+)_model"
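
To preview what a pattern would select before running the parser, the matching can be sketched in Python. This is only a sketch under stated assumptions: it assumes the pattern is applied to each folder name with `re.search`, that the first capture group (if present) becomes the ID, and that the whole folder name is kept when the pattern has no group. The actual matching behaviour of tsp-maker may differ, and `preview_ids` is a hypothetical helper, not part of the tool.

```python
import re


def preview_ids(folder_names, pattern):
    """Map each folder name to the ID a pattern would yield (None = skipped)."""
    regex = re.compile(pattern)
    result = {}
    for name in folder_names:
        match = regex.search(name)
        if match is None:
            # Assumption: folders that do not match are skipped.
            result[name] = None
        elif regex.groups >= 1:
            # Assumption: the first capture group is used as the ID.
            result[name] = match.group(1)
        else:
            # Assumption: without a capture group, the folder name is the ID.
            result[name] = name
    return result


folders = ["job_P12345_model", "AT1G01010", "scratch"]
print(preview_ids(folders, r"job_(\w+)_model"))
print(preview_ids(folders, r"AT[0-9]G[0-9]{5}"))
```

Under these assumptions the first call maps only `job_P12345_model` to `P12345`, and the second keeps only `AT1G01010`.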

Output Format

The intermediate format has the following directory layout:

OUTPUT_DIR/
├── structures/
│   ├── P12345_AF3_1.cif
│   ├── P12345_AF3_2.cif
│   ├── Q67890_AF3_1.cif
│   └── ...
├── scores/
│   ├── P12345_AF3.json
│   ├── Q67890_AF3.json
│   └── ...
└── pae/
    ├── P12345_AF3_1.npy
    ├── P12345_AF3_2.npy
    └── ...
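
Downstream scripts often need to recover the protein ID, predictor tag, and rank from a per-model file name. The sketch below assumes the naming scheme `<ID>_<TAG>_<rank>.<ext>` inferred from the listing above, with tags AF2, AF3, and BZ2. Because IDs may themselves contain underscores, it splits from the right. `split_model_name` is a hypothetical helper, not part of tsp-maker.

```python
from pathlib import PurePosixPath

# Tags inferred from the predictor naming described in this page.
KNOWN_TAGS = {"AF2", "AF3", "BZ2"}


def split_model_name(filename):
    """Split a per-model file name into (protein_id, tag, rank).

    Intended for files under structures/ and pae/; score files such as
    P12345_AF3.json carry no rank and are not handled here.
    """
    stem = PurePosixPath(filename).stem
    # Split from the right so underscores inside the ID are preserved.
    protein_id, tag, rank = stem.rsplit("_", 2)
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown predictor tag in {filename!r}: {tag!r}")
    return protein_id, tag, int(rank)


print(split_model_name("structures/P12345_AF3_1.cif"))
print(split_model_name("pae/P12345_Q67890_AF3_2.npy"))
```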

Score JSON Format

[
  {
    "model_id": "P12345_AF3_1",
    "structure_id": "P12345",
    "rank": 1,
    "predictor": "alphafold3",
    "plddt_mean": 85.2,
    "ptm": 0.82,
    "ranking_score": 0.85
  }
]
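
As an illustration of consuming this format, the following sketch parses a score list and picks the top-ranked model. It assumes each score file holds a list of per-model records shaped like the example above and that rank 1 is the best model. The second record uses made-up values purely for demonstration, and `best_model` is a hypothetical helper.

```python
import json

# First record copied from the example above; the second is illustrative only.
SCORES_TEXT = """
[
  {"model_id": "P12345_AF3_1", "structure_id": "P12345", "rank": 1,
   "predictor": "alphafold3", "plddt_mean": 85.2, "ptm": 0.82,
   "ranking_score": 0.85},
  {"model_id": "P12345_AF3_2", "structure_id": "P12345", "rank": 2,
   "predictor": "alphafold3", "plddt_mean": 80.0, "ptm": 0.75,
   "ranking_score": 0.78}
]
"""


def best_model(records):
    """Return the record with the lowest rank (rank 1 assumed to be best)."""
    return min(records, key=lambda rec: rec["rank"])


records = json.loads(SCORES_TEXT)
top = best_model(records)
print(top["model_id"], top["plddt_mean"])
```

To read a real file, replace `json.loads(SCORES_TEXT)` with `json.load` on an opened file such as `OUTPUT_DIR/scores/P12345_AF3.json`.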

Protein ID Rules

Folder Names = Protein IDs

tsp-maker uses folder names directly as protein identifiers. This is a deliberate design decision.

IDs are not extracted from file contents, metadata, or job names. If your folders are named job_001, run_2024_001, etc., those names become your protein IDs and the actual protein names are lost.

Check your folder names before running tsp-maker.
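
One quick way to do that check is to list the folder names that would be read as IDs. The sketch below assumes that each immediate subdirectory of INPUT_DIR corresponds to one prediction; `list_candidate_ids` is a hypothetical helper, and the demo runs on a throwaway directory rather than real data.

```python
import tempfile
from pathlib import Path


def list_candidate_ids(input_dir):
    """Return the names of immediate subdirectories, sorted alphabetically."""
    return sorted(p.name for p in Path(input_dir).iterdir() if p.is_dir())


# Demo on a temporary directory; point the function at your INPUT_DIR instead.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("P12345", "job_001"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "notes.txt").write_text("not a folder")
    print(list_candidate_ids(tmp))  # ['P12345', 'job_001']
```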

Requirements

Requirement	Rule
Max length	16 characters
Allowed characters	`A-Z`, `a-z`, `0-9`, `_`, `-`
Not allowed	Spaces, brackets, path separators, special characters
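
These rules can be checked ahead of time with a small script. The following is a minimal sketch of the table above (at most 16 characters, only letters, digits, `_`, and `-`), not the validator tsp-maker itself uses; `id_problems` is a hypothetical helper.

```python
import re

MAX_LENGTH = 16
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def id_problems(name):
    """Return a list of rule violations for a folder name (empty = valid)."""
    problems = []
    if not name:
        problems.append("empty name")
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if name and not ALLOWED.fullmatch(name):
        # Collect the distinct characters outside the allowed set.
        bad = sorted(set(re.sub(r"[A-Za-z0-9_-]", "", name)))
        problems.append(f"disallowed characters: {bad}")
    return problems


for folder in ["P12345", "alphafold_P12345_run1", "my protein"]:
    print(folder, "->", id_problems(folder) or "ok")
```

Note that a name such as `job_001` passes this check even though it is not a meaningful protein ID; the script only covers syntax.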

Valid Examples

Folder Name	Protein ID
`P12345`	P12345
`AT1G01010`	AT1G01010
`gene-001`	gene-001
`NbD_12345`	NbD_12345

Problem Examples

Folder Name	Problem
`job_001`	Valid syntax, but the ID will be `job_001`, not the protein name
`alphafold_P12345_run1`	Exceeds 16 characters; skipped
`my protein`	Contains a space; skipped
`results (copy)`	Contains brackets and a space; skipped

Fixing Folder Names

If your folder names don't contain protein IDs, rename the folders before parsing:

# Rename folders to protein IDs
mv /data/job_001 /data/P12345
mv /data/job_002 /data/Q67890

Or use --id-pattern with a capture group to extract IDs from complex names:

# Extract protein ID from "job_P12345_model" → "P12345"
tsp-maker parse af3 /data/predictions /output \
    --id-pattern "job_([A-Za-z0-9_-]+)_model"

Complex IDs

For complexes with multiple proteins, join the IDs with an underscore, for example P12345_Q67890.
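
A joined ID can be built and length-checked in a few lines. This sketch assumes that the 16-character limit from the requirements table also applies to the joined complex ID, which the page does not state explicitly; `complex_id` is a hypothetical helper.

```python
def complex_id(protein_ids, max_length=16):
    """Join protein IDs with underscores and check the combined length."""
    joined = "_".join(protein_ids)
    if len(joined) > max_length:
        # Assumption: the folder-name length limit applies to complex IDs too.
        raise ValueError(
            f"{joined!r} has {len(joined)} characters (limit {max_length})"
        )
    return joined


print(complex_id(["P12345", "Q67890"]))  # P12345_Q67890
```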