build

Assemble intermediate-format files into a TSP package.

Synopsis

tsp-maker build INPUT_DIR OUTPUT_DIR --name NAME [OPTIONS]

Arguments

| Argument | Description |
|----------|-------------|
| `INPUT_DIR` | Directory containing the `structures/`, `scores/`, and `pae/` subdirectories |
| `OUTPUT_DIR` | Directory where the TSP package is written |

Options

| Option | Required | Default | Description |
|--------|----------|---------|-------------|
| `--name` | Yes | | Dataset name (lowercase, hyphens) |
| `--title` | No | | Human-readable title |
| `--description` | No | | Dataset description |
| `--author` | No | | Author name |
| `--affiliation` | No | | Author affiliation |
| `--version` | No | 1.0.0 | Semantic version |
| `--license` | No | CC-BY-4.0 | License identifier |
| `--batch-size-mb` | No | 100 | Target batch size in MB |
| `-q, --quiet` | No | | Suppress progress output |
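
The `--name` constraint (lowercase, hyphens) can be checked before invoking the command. The sketch below rests on an assumption about what counts as a valid name: groups of lowercase letters or digits joined by single hyphens. `is_valid_dataset_name` is a hypothetical helper, not part of tsp-maker, and the tool's own validation may be stricter or looser.

```python
import re

# Assumed rule: groups of lowercase letters/digits joined by single hyphens.
# tsp-maker's actual validation is not documented here and may differ.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if name matches the assumed lowercase-and-hyphens rule."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["arabidopsis-kinases", "My_Structures", "large-proteome"]:
        print(candidate, is_valid_dataset_name(candidate))
```

Under this assumed rule, names with uppercase letters, underscores, or leading and trailing hyphens are rejected.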

Examples

Minimal

tsp-maker build /intermediate /my-dataset --name my-structures

Full Metadata

tsp-maker build /intermediate /my-dataset \
    --name arabidopsis-kinases \
    --title "Arabidopsis Kinase Structure Predictions" \
    --description "AlphaFold3 and Boltz2 predictions for 500 kinases" \
    --author "Jane Doe" \
    --affiliation "The Sainsbury Laboratory" \
    --version "1.0.0" \
    --license "CC-BY-4.0"

Large Datasets

For large datasets, increase the batch size so the package contains fewer archives:

tsp-maker build /intermediate /my-dataset \
    --name large-proteome \
    --batch-size-mb 2000

Output Structure

OUTPUT_DIR/
├── datapackage.json          # Package manifest
├── metadata.parquet          # Per-structure metadata
├── structures/
│   ├── batch_001.tar.gz      # Structure archives
│   ├── batch_002.tar.gz
│   └── ...
└── predictions/
    ├── scores.parquet        # Prediction scores
    └── pae/
        ├── batch_001.tar.gz  # PAE matrices
        └── ...

datapackage.json

The manifest includes:

{
  "profile": "tsl-structure-package",
  "profile_version": "1.0.0",
  "name": "my-structures",
  "title": "My Structure Dataset",
  "description": "...",
  "version": "1.0.0",
  "created": "2024-12-16T...",
  "licenses": [...],
  "contributors": [...],
  "stats": {
    "structure_count": 150,
    "total_size_bytes": 52428800,
    "structure_formats": ["cif", "pdb"],
    "prediction_sources": ["alphafold3", "boltz2"]
  },
  "resources": [...]
}
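
As an illustration of how the manifest might be consumed, the following sketch loads `datapackage.json` and builds a one-line summary from the fields shown above. `summarize_manifest` is a hypothetical helper, not part of tsp-maker, and it assumes the `name`, `version`, and `stats` keys are present exactly as in the example.

```python
import json
from pathlib import Path


def summarize_manifest(package_dir: str) -> str:
    """Return a one-line summary built from datapackage.json fields."""
    manifest_path = Path(package_dir) / "datapackage.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    stats = manifest["stats"]
    # Report the size in MiB (1 MiB = 1024 * 1024 bytes).
    size_mib = stats["total_size_bytes"] / (1024 * 1024)
    sources = ", ".join(stats["prediction_sources"])
    return (
        f"{manifest['name']} v{manifest['version']}: "
        f"{stats['structure_count']} structures, "
        f"{size_mib:.1f} MiB, sources: {sources}"
    )
```

Calling `summarize_manifest("/my-dataset")` on a package built with the minimal example would read `/my-dataset/datapackage.json`.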

Batching

Structures are grouped into tar.gz archives:

Default batch size: 100 MB (change it with `--batch-size-mb`)
Each batch contains multiple structures
Batches are numbered sequentially: batch_001.tar.gz, batch_002.tar.gz, and so on

Batch Size Trade-off

Smaller batches: faster access to individual files, since less data has to be read to reach one structure
Larger batches: fewer files to manage, which suits large datasets
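
To make the effect of the target size concrete, here is a minimal sketch of one way files could be grouped into sequentially numbered batches. It is not tsp-maker's documented algorithm: the greedy strategy, the `plan_batches` name, and the choice of 1 MB = 1024 * 1024 bytes are all assumptions made for illustration.

```python
def plan_batches(files, batch_size_mb=100):
    """Greedily group (filename, size_in_bytes) pairs into numbered batches.

    Assumption: a batch is closed as soon as adding the next file would
    exceed the target size. A file larger than the target gets its own batch.
    """
    target = batch_size_mb * 1024 * 1024  # assumed meaning of "MB"
    batches = []
    current, current_size = [], 0
    for filename, size in files:
        # Start a new batch if this file would push the current one past the target.
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(filename)
        current_size += size
    if current:
        batches.append(current)
    # Name the batches using the batch_NNN.tar.gz pattern from the output layout.
    return {
        f"batch_{index:03d}.tar.gz": names
        for index, names in enumerate(batches, start=1)
    }
```

With five files of 40 MiB each and the default 100 MB target, this sketch yields three batches; raising the target to 2000 MB places all five files in a single archive.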

Metadata Columns

The metadata.parquet file contains:

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique model ID (e.g., `P12345_AF3_1`) |
| `structure_id` | string | Protein ID |
| `filename` | string | Structure filename |
| `batch` | string | Batch archive name |
| `format` | string | File format (`pdb` or `cif`) |
| `residue_count` | int | Number of residues |
| `chain_count` | int | Number of chains |
| `prediction_source` | string | Predictor name |
| `model_rank` | int | Rank of the model among predictions for the same protein |
| `mean_plddt` | float | Mean pLDDT score |
| `ptm_score` | float | pTM score |
| `iptm_score` | float | ipTM score (multimers) |
| `ranking_score` | float | Overall ranking score |
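
The `batch` and `filename` columns together locate a structure inside the package. The sketch below shows one way to read a single structure using only the Python standard library. `read_structure` is a hypothetical helper, and it assumes that the structure archives live under `structures/` (as in the output layout above) and that the archive member's base name equals the `filename` value.

```python
import tarfile
from pathlib import Path


def read_structure(package_dir: str, batch: str, filename: str) -> bytes:
    """Return the raw bytes of one structure file from its batch archive."""
    archive = Path(package_dir) / "structures" / batch
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            # Compare base names in case members are stored under a subfolder.
            if member.isfile() and Path(member.name).name == filename:
                return tar.extractfile(member).read()
    raise FileNotFoundError(f"{filename} not found in {archive}")
```

The `batch` and `filename` arguments would come from a row of metadata.parquet, for example one loaded with `pandas.read_parquet`. Because a tar.gz archive is read sequentially, lookups in this sketch get slower as batches get larger, which is the trade-off described under Batching.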