parse

Convert predictor outputs to the intermediate format.

Synopsis

tsp-maker parse <predictor> INPUT_DIR OUTPUT_DIR [OPTIONS]

Predictors

| Subcommand | Predictor  |
|------------|------------|
| af2        | AlphaFold2 |
| af3        | AlphaFold3 |
| boltz2     | Boltz2     |

Arguments

| Argument   | Description                            |
|------------|----------------------------------------|
| INPUT_DIR  | Directory containing predictor outputs |
| OUTPUT_DIR | Directory to write intermediate format |

Options

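| Option       | Default | Description                                            |
|--------------|---------|--------------------------------------------------------|
| --top-n      | 5       | Number of top-ranked models to extract                 |
| --id-pattern | None    | Optional regex to filter/extract IDs from folder names |
| -q, --quiet  |         | Suppress progress output                               |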

Examples

Basic Usage

# Parse AlphaFold3 outputs
tsp-maker parse af3 /data/af3_predictions /intermediate

# Parse with fewer models
tsp-maker parse af3 /data/af3_predictions /intermediate --top-n 3

Multiple Predictors

To combine predictors, run one parse command per predictor against the same output directory:

tsp-maker parse af2 /data/af2 /intermediate --top-n 5
tsp-maker parse af3 /data/af3 /intermediate --top-n 5
tsp-maker parse boltz2 /data/boltz2 /intermediate --top-n 3

File names include a predictor tag (_AF2_, _AF3_, _BZ2_), so outputs from different predictors do not collide.

Filtering with ID Pattern

Use --id-pattern to filter folders or extract IDs from complex folder names:

# Only process folders matching a pattern
tsp-maker parse af3 /data/mixed /intermediate \
    --id-pattern "AT[0-9]G[0-9]{5}"

# Extract ID from complex folder names (use capture group)
tsp-maker parse af3 /data/jobs /intermediate \
    --id-pattern "job_(\w+)_model"

Output Format

The intermediate format contains:

OUTPUT_DIR/
├── structures/
│   ├── P12345_AF3_1.cif
│   ├── P12345_AF3_2.cif
│   ├── Q67890_AF3_1.cif
│   └── ...
├── scores/
│   ├── P12345_AF3.json
│   ├── Q67890_AF3.json
│   └── ...
└── pae/
    ├── P12345_AF3_1.npy
    ├── P12345_AF3_2.npy
    └── ...
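
As a quick sanity check after parsing, the layout above can be inspected by counting structure files per predictor tag. The sketch below assumes the `<ID>_<TAG>_<rank>.cif` naming shown in the tree. `count_models` is a hypothetical helper written for illustration, not a tsp-maker command.

```shell
# Hypothetical helper (not part of tsp-maker): count the structure files
# for one predictor tag, assuming the <ID>_<TAG>_<rank>.cif naming above.
count_models() {
    local out_dir="$1" tag="$2" n=0 f
    for f in "$out_dir"/structures/*_"$tag"_*.cif; do
        # An unmatched glob stays literal, so only count paths that exist.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example: report counts for the three tags.
# for tag in AF2 AF3 BZ2; do
#     echo "$tag: $(count_models /intermediate "$tag")"
# done
```

If --top-n was 5 for every run, each count should be at most five times the number of input folders for that predictor.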

Score JSON Format

[
  {
    "model_id": "P12345_AF3_1",
    "structure_id": "P12345",
    "rank": 1,
    "predictor": "alphafold3",
    "plddt_mean": 85.2,
    "ptm": 0.82,
    "ranking_score": 0.85
  }
]

Protein ID Rules

Folder Names = Protein IDs

tsp-maker uses folder names directly as protein identifiers. This is a deliberate design decision.

tsp-maker does not extract IDs from file contents, metadata, or job names. If your folders are named job_001, run_2024_001, and so on, those names become your protein IDs and the actual protein names are lost.

Check your folder names before running tsp-maker.

Requirements

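| Requirement        | Rule                                                  |
|--------------------|-------------------------------------------------------|
| Max length         | 16 characters                                         |
| Allowed characters | A-Z, a-z, 0-9, _, -                                   |
| Not allowed        | Spaces, brackets, path separators, special characters |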
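
The rules above can be checked before running tsp-maker. The following is a minimal sketch of such a pre-flight check. `is_valid_id` and `check_folders` are hypothetical helpers, not part of tsp-maker, and they encode only the two rules listed here (length and allowed characters).

```shell
# Hypothetical pre-flight helpers; not part of tsp-maker.
# is_valid_id NAME: succeeds if NAME satisfies the documented rules.
is_valid_id() {
    local name="$1"
    [ -n "$name" ] || return 1            # an empty name is not a usable ID
    [ "${#name}" -le 16 ] || return 1     # max length: 16 characters
    case "$name" in
        *[!A-Za-z0-9_-]*) return 1 ;;     # a character outside the allowed set
    esac
    return 0
}

# check_folders DIR: print one status line per immediate subfolder of DIR.
check_folders() {
    local dir name
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        name="$(basename "$dir")"
        if is_valid_id "$name"; then
            printf 'ok       %s\n' "$name"
        else
            printf 'invalid  %s\n' "$name"
        fi
    done
}

# Example: check_folders /data/af3_predictions
```

A folder reported as ok is only syntactically valid. A name such as job_001 passes this check but still becomes the protein ID.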

Valid Examples

| Folder Name | Protein ID |
|-------------|------------|
| P12345      | P12345     |
| AT1G01010   | AT1G01010  |
| gene-001    | gene-001   |
| NbD_12345   | NbD_12345  |

Problem Examples

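| Folder Name           | Problem                                                        |
|-----------------------|----------------------------------------------------------------|
| job_001               | Valid syntax, but the ID will be job_001, not the protein name |
| alphafold_P12345_run1 | Exceeds 16 characters (skipped)                                |
| my protein            | Contains a space (skipped)                                     |
| results (copy)        | Contains brackets and a space (skipped)                        |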

Fixing Folder Names

If your folders don't contain protein IDs, rename them before parsing:

# Rename folders to protein IDs
mv /data/job_001 /data/P12345
mv /data/job_002 /data/Q67890

Or use --id-pattern with a capture group to extract IDs from complex names:

# Extract protein ID from "job_P12345_model" → "P12345"
tsp-maker parse af3 /data/predictions /output \
    --id-pattern "job_([A-Za-z0-9_-]+)_model"
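
To see what a capture group would pull out of a folder name before parsing, the pattern can be tried with standard tools. The sketch below uses sed with POSIX extended regular expressions and matches the pattern anywhere in the name. `preview_id` is a hypothetical helper, and tsp-maker's own regex engine and matching rules may differ, so treat the result as an approximation.

```shell
# Hypothetical preview helper; not part of tsp-maker.
# preview_id NAME PATTERN: print the text captured by the first group of
# PATTERN (POSIX extended regex), or nothing if NAME does not match.
# PATTERN must not contain "/" because it is spliced into a sed expression.
preview_id() {
    local name="$1" pattern="$2"
    printf '%s\n' "$name" | sed -n -E "s/.*${pattern}.*/\1/p"
}

# Example: preview the ID for every folder under /data/predictions.
# for dir in /data/predictions/*/; do
#     name="$(basename "$dir")"
#     printf '%s -> %s\n' "$name" "$(preview_id "$name" 'job_([A-Za-z0-9_-]+)_model')"
# done
```

Explicit bracket expressions such as [A-Za-z0-9_-] are used here because shorthand classes like \w are not portable across sed implementations.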

Complex IDs

For complexes with multiple proteins, join the IDs with an underscore, for example P12345_Q67890.