build

Assemble intermediate-format files into a TSP package.

Synopsis

tsp-maker build INPUT_DIR OUTPUT_DIR --name NAME [OPTIONS]

Arguments

| Argument | Description |
| --- | --- |
| INPUT_DIR | Directory containing structures/, scores/, pae/ |
| OUTPUT_DIR | Directory to write the TSP package to |

Options

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| --name | Yes | | Dataset name (lowercase, hyphens) |
| --title | No | | Human-readable title |
| --description | No | | Dataset description |
| --author | No | | Author name |
| --affiliation | No | | Author affiliation |
| --version | No | 1.0.0 | Semantic version |
| --license | No | CC-BY-4.0 | License identifier |
| --batch-size-mb | No | 100 | Target batch size in MB |
| -q, --quiet | No | | Suppress progress output |
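
The --name option is described only as "lowercase, hyphens", so it can help to check a candidate name before starting a long build. The sketch below is one possible reading of that rule, not the check tsp-maker itself performs: the pattern (lowercase letters and digits separated by single hyphens) and the helper name looks_like_valid_name are assumptions made for illustration.

```python
import re

# Assumed pattern: the option table only says "lowercase, hyphens".
# Allowing digits and forbidding leading/trailing/double hyphens are
# guesses, not rules confirmed by tsp-maker.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def looks_like_valid_name(name: str) -> bool:
    """Return True if name is lowercase alphanumerics joined by single hyphens."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("my-structures", "My_Structures"):
        print(candidate, looks_like_valid_name(candidate))
```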

Examples

Minimal

tsp-maker build /intermediate /my-dataset --name my-structures

Full Metadata

tsp-maker build /intermediate /my-dataset \
    --name arabidopsis-kinases \
    --title "Arabidopsis Kinase Structure Predictions" \
    --description "AlphaFold3 and Boltz2 predictions for 500 kinases" \
    --author "Jane Doe" \
    --affiliation "The Sainsbury Laboratory" \
    --version "1.0.0" \
    --license "CC-BY-4.0"

Large Datasets

For large datasets, increase batch size:

tsp-maker build /intermediate /my-dataset \
    --name large-proteome \
    --batch-size-mb 2000
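
When builds are launched from a script, the invocations above can be assembled as an argument list rather than a shell string. This is a minimal sketch that uses only the options listed on this page; the helper name build_command is hypothetical and is not part of tsp-maker.

```python
# Metadata options documented in the Options table.
METADATA_OPTIONS = ("title", "description", "author", "affiliation", "version", "license")


def build_command(input_dir, output_dir, name, batch_size_mb=None, quiet=False, **metadata):
    """Return the tsp-maker build invocation as a list of arguments."""
    unknown = set(metadata) - set(METADATA_OPTIONS)
    if unknown:
        raise ValueError(f"unsupported options: {sorted(unknown)}")

    cmd = ["tsp-maker", "build", str(input_dir), str(output_dir), "--name", name]
    # Emit metadata flags in a fixed order so the command is reproducible.
    for key in METADATA_OPTIONS:
        if key in metadata:
            cmd += [f"--{key}", str(metadata[key])]
    if batch_size_mb is not None:
        cmd += ["--batch-size-mb", str(batch_size_mb)]
    if quiet:
        cmd.append("--quiet")
    return cmd


if __name__ == "__main__":
    print(build_command("/intermediate", "/my-dataset", "large-proteome", batch_size_mb=2000))
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems in titles and descriptions that contain spaces.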

Output Structure

OUTPUT_DIR/
├── datapackage.json          # Package manifest
├── metadata.parquet          # Per-structure metadata
├── structures/
│   ├── batch_001.tar.gz      # Structure archives
│   ├── batch_002.tar.gz
│   └── ...
└── predictions/
    ├── scores.parquet        # Prediction scores
    └── pae/
        ├── batch_001.tar.gz  # PAE matrices
        └── ...
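
A quick way to sanity-check a finished build is to confirm that the entries in the tree above exist. The sketch below assumes every entry is always written, which may not hold for inputs without PAE data; the names EXPECTED_ENTRIES and missing_entries are hypothetical.

```python
from pathlib import Path

# Paths taken from the output tree above, relative to OUTPUT_DIR.
# Treating all of them as mandatory is an assumption.
EXPECTED_ENTRIES = (
    "datapackage.json",
    "metadata.parquet",
    "structures",
    "predictions/scores.parquet",
    "predictions/pae",
)


def missing_entries(package_dir):
    """Return the expected entries that are absent from package_dir."""
    root = Path(package_dir)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]


if __name__ == "__main__":
    print(missing_entries("/my-dataset"))
```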

datapackage.json

The manifest includes:

{
  "profile": "tsl-structure-package",
  "profile_version": "1.0.0",
  "name": "my-structures",
  "title": "My Structure Dataset",
  "description": "...",
  "version": "1.0.0",
  "created": "2024-12-16T...",
  "licenses": [...],
  "contributors": [...],
  "stats": {
    "structure_count": 150,
    "total_size_bytes": 52428800,
    "structure_formats": ["cif", "pdb"],
    "prediction_sources": ["alphafold3", "boltz2"]
  },
  "resources": [...]
}
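
Because the manifest is plain JSON, its summary fields can be read with the standard library alone. The following sketch relies only on the keys shown in the example above; the function name summarize_manifest is hypothetical, and the conversion treats 1 MiB as 1024 * 1024 bytes.

```python
import json
from pathlib import Path


def summarize_manifest(package_dir):
    """Read datapackage.json and return a few headline fields."""
    manifest_path = Path(package_dir) / "datapackage.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    stats = manifest.get("stats", {})
    total_bytes = stats.get("total_size_bytes")
    return {
        "name": manifest.get("name"),
        "version": manifest.get("version"),
        "structure_count": stats.get("structure_count"),
        # Convert bytes to MiB; None if the manifest has no size recorded.
        "total_size_mib": None if total_bytes is None else total_bytes / (1024 * 1024),
        "prediction_sources": stats.get("prediction_sources", []),
    }


if __name__ == "__main__":
    print(summarize_manifest("/my-dataset"))
```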

Batching

Structures are grouped into tar.gz archives:

  • Default batch size: 100 MB
  • Each batch contains multiple structures
  • Batches are numbered sequentially: batch_001.tar.gz, batch_002.tar.gz
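
To illustrate how a target batch size turns into numbered archives, here is a minimal greedy sketch. It is not tsp-maker's actual algorithm, which this page does not describe: it assumes files are packed in input order, that a batch is closed as soon as the next file would exceed the target, and that 1 MB means 1024 * 1024 bytes. The function name plan_batches is hypothetical.

```python
def plan_batches(files, batch_size_mb=100):
    """Group (filename, size_in_bytes) pairs into sequentially numbered batches.

    Assumption: 1 MB = 1024 * 1024 bytes; the tool may use 10**6 instead.
    """
    limit = batch_size_mb * 1024 * 1024
    batches = []
    current, current_bytes = [], 0
    for filename, size in files:
        # Close the current batch if adding this file would exceed the target.
        # A single oversized file still gets a batch of its own.
        if current and current_bytes + size > limit:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(filename)
        current_bytes += size
    if current:
        batches.append(current)
    return {f"batch_{i:03d}.tar.gz": names for i, names in enumerate(batches, start=1)}


if __name__ == "__main__":
    mb = 1024 * 1024
    print(plan_batches([("a.cif", 60 * mb), ("b.cif", 60 * mb), ("c.cif", 30 * mb)]))
```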

Batch Size Trade-off

  • Smaller batches: less data to read and decompress to reach an individual structure
  • Larger batches: fewer archive files to manage, which suits large datasets

Metadata Columns

The metadata.parquet file contains:

| Column | Type | Description |
| --- | --- | --- |
| id | string | Unique model ID (e.g., P12345_AF3_1) |
| structure_id | string | Protein ID |
| filename | string | Structure filename |
| batch | string | Batch archive name |
| format | string | File format (pdb/cif) |
| residue_count | int | Number of residues |
| chain_count | int | Number of chains |
| prediction_source | string | Predictor name |
| model_rank | int | Rank within protein |
| mean_plddt | float | Mean pLDDT score |
| ptm_score | float | pTM score |
| ptm_score (multimers): iptm_score | float | ipTM score (multimers) |
| ranking_score | float | Overall ranking score |
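
The batch and filename columns together locate a structure inside the package. The sketch below shows one way to pull a single file out of its archive with the standard library. It assumes that batch holds the archive file name (for example batch_001.tar.gz), that archives live under structures/, and that the member's base name matches filename; none of these details are confirmed by this page, and read_structure is a hypothetical helper.

```python
import tarfile
from pathlib import Path


def read_structure(package_dir, batch, filename):
    """Return the raw bytes of one structure file from its batch archive."""
    archive = Path(package_dir) / "structures" / batch
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            # Compare base names so archives with a leading directory still match.
            if member.isfile() and Path(member.name).name == filename:
                return tar.extractfile(member).read()
    raise FileNotFoundError(f"{filename} not found in {archive}")


if __name__ == "__main__":
    data = read_structure("/my-dataset", "batch_001.tar.gz", "P12345_AF3_1.cif")
    print(len(data), "bytes")
```

Since a tar.gz archive has to be decompressed sequentially up to the requested member, lookups like this are cheaper with smaller batches.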