build
Assemble intermediate-format data into a TSP package.
Synopsis
Arguments
| Argument | Description |
|---|---|
| `INPUT_DIR` | Directory containing structures/, scores/, pae/ |
| `OUTPUT_DIR` | Directory to write TSP package |
Options
| Option | Required | Default | Description |
|---|---|---|---|
| `--name` | Yes | | Dataset name (lowercase, hyphens) |
| `--title` | No | | Human-readable title |
| `--description` | No | | Dataset description |
| `--author` | No | | Author name |
| `--affiliation` | No | | Author affiliation |
| `--version` | No | 1.0.0 | Semantic version |
| `--license` | No | CC-BY-4.0 | License identifier |
| `--batch-size-mb` | No | 100 | Target batch size in MB |
| `-q`, `--quiet` | No | | Suppress progress output |
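The `--name` constraint (lowercase, hyphens) can be checked before invoking the command. The following is a minimal sketch, not the validation tsp-maker itself performs: the exact pattern is an assumption (it also allows digits and forbids leading, trailing, or doubled hyphens), and `is_valid_dataset_name` is a hypothetical helper.

```python
import re

# Assumed pattern: runs of lowercase letters or digits joined by single hyphens.
# The real rule enforced by the tool may be stricter or looser than this.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name matches the assumed lowercase-and-hyphens pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ("arabidopsis-kinases", "My_Dataset", "-kinases"):
    print(candidate, is_valid_dataset_name(candidate))
```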
Examples
Minimal
Full Metadata
tsp-maker build /intermediate /my-dataset \
  --name arabidopsis-kinases \
  --title "Arabidopsis Kinase Structure Predictions" \
  --description "AlphaFold3 and Boltz2 predictions for 500 kinases" \
  --author "Jane Doe" \
  --affiliation "The Sainsbury Laboratory" \
  --version "1.0.0" \
  --license "CC-BY-4.0"
Large Datasets
For large datasets, increase the batch size with --batch-size-mb (default: 100 MB).
Output Structure
OUTPUT_DIR/
├── datapackage.json          # Package manifest
├── metadata.parquet          # Per-structure metadata
├── structures/
│   ├── batch_001.tar.gz      # Structure archives
│   ├── batch_002.tar.gz
│   └── ...
└── predictions/
    ├── scores.parquet        # Prediction scores
    └── pae/
        ├── batch_001.tar.gz  # PAE matrices
        └── ...
datapackage.json
The manifest includes:
{
  "profile": "tsl-structure-package",
  "profile_version": "1.0.0",
  "name": "my-structures",
  "title": "My Structure Dataset",
  "description": "...",
  "version": "1.0.0",
  "created": "2024-12-16T...",
  "licenses": [...],
  "contributors": [...],
  "stats": {
    "structure_count": 150,
    "total_size_bytes": 52428800,
    "structure_formats": ["cif", "pdb"],
    "prediction_sources": ["alphafold3", "boltz2"]
  },
  "resources": [...]
}
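As an illustration of how the manifest might be consumed, the sketch below parses a cut-down copy of the example above and prints a one-line summary. Only the fields shown in full in the example are used; the elided fields (licenses, contributors, resources) are left out, and `summarize_manifest` is a hypothetical helper, not part of tsp-maker. Treating 1 MB as 1024 × 1024 bytes is also an assumption.

```python
import json

# Cut-down copy of the example manifest; elided fields are omitted on purpose.
SAMPLE_MANIFEST = """
{
  "profile": "tsl-structure-package",
  "profile_version": "1.0.0",
  "name": "my-structures",
  "version": "1.0.0",
  "stats": {
    "structure_count": 150,
    "total_size_bytes": 52428800,
    "structure_formats": ["cif", "pdb"],
    "prediction_sources": ["alphafold3", "boltz2"]
  }
}
"""


def summarize_manifest(manifest: dict) -> str:
    """Build a short human-readable summary from the manifest's stats block."""
    stats = manifest["stats"]
    size_mb = stats["total_size_bytes"] / (1024 * 1024)  # assumes 1 MB = 1024 * 1024 bytes
    formats = ",".join(stats["structure_formats"])
    return (
        f"{manifest['name']} v{manifest['version']}: "
        f"{stats['structure_count']} structures, {size_mb:.1f} MB, formats={formats}"
    )


manifest = json.loads(SAMPLE_MANIFEST)
summary = summarize_manifest(manifest)
print(summary)
```

For a real package, the same function could be applied to the result of `json.load` on OUTPUT_DIR/datapackage.json.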
Batching
Structures are grouped into tar.gz archives:
- Default batch size: 100 MB (set with --batch-size-mb)
- Each batch contains multiple structures
- Batches are numbered sequentially: batch_001.tar.gz, batch_002.tar.gz, ...
Batch Size Trade-off
- Smaller batches = faster access to an individual file, since less data has to be read to reach it
- Larger batches = fewer files to manage, better for large datasets
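The grouping described above can be pictured as a greedy fill. The sketch below is one plausible way to do it, not necessarily how tsp-maker batches files: it assumes the target applies to uncompressed file sizes, that 1 MB is 1024 × 1024 bytes, and that a file larger than the target gets a batch of its own. `plan_batches` is a hypothetical helper.

```python
from typing import Iterable


def plan_batches(
    files: Iterable[tuple[str, int]], batch_size_mb: float = 100
) -> dict[str, list[str]]:
    """Greedily assign (filename, size_in_bytes) pairs to sequentially numbered batches."""
    limit = batch_size_mb * 1024 * 1024  # assumption: target measured in uncompressed bytes
    batches: dict[str, list[str]] = {}
    current: list[str] = []
    current_bytes = 0

    for name, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_bytes + size > limit:
            batches[f"batch_{len(batches) + 1:03d}.tar.gz"] = current
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size

    # Flush whatever is left as the final batch.
    if current:
        batches[f"batch_{len(batches) + 1:03d}.tar.gz"] = current
    return batches


MB = 1024 * 1024
example = [(f"model_{i}.cif", 40 * MB) for i in range(1, 6)]
for batch_name, members in plan_batches(example, batch_size_mb=100).items():
    print(batch_name, members)
```

With five 40 MB files and the default 100 MB target, this plan yields three batches of two, two, and one file.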
Metadata Columns
The metadata.parquet file contains:
| Column | Type | Description |
|---|---|---|
| `id` | string | Unique model ID (e.g., P12345_AF3_1) |
| `structure_id` | string | Protein ID |
| `filename` | string | Structure filename |
| `batch` | string | Batch archive name |
| `format` | string | File format (pdb/cif) |
| `residue_count` | int | Number of residues |
| `chain_count` | int | Number of chains |
| `prediction_source` | string | Predictor name |
| `model_rank` | int | Rank within protein |
| `mean_plddt` | float | Mean pLDDT score |
| `ptm_score` | float | pTM score |
| `iptm_score` | float | ipTM score (multimers) |
| `ranking_score` | float | Overall ranking score |
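The batch and filename columns together locate a structure inside the package. The sketch below shows how one metadata row might be used to pull a single file out of its archive with the standard library. It assumes that batch holds the archive file name under structures/ and that filename matches the member name inside that archive; neither is confirmed here. `read_structure` is a hypothetical helper, and the row is a plain dict standing in for a record loaded from metadata.parquet.

```python
import tarfile
from pathlib import Path


def read_structure(package_dir: str, row: dict) -> bytes:
    """Return the raw bytes of the structure described by one metadata row."""
    # Assumption: `batch` is the archive file name inside structures/.
    archive_path = Path(package_dir) / "structures" / row["batch"]
    with tarfile.open(archive_path, "r:gz") as tar:
        # Assumption: `filename` equals the member name in the archive.
        # extractfile raises KeyError if no such member exists.
        member = tar.extractfile(row["filename"])
        if member is None:
            # extractfile returns None for members that are not regular files.
            raise FileNotFoundError(f"{row['filename']} is not a regular file in {archive_path}")
        return member.read()
```

A call such as `read_structure("OUTPUT_DIR", {"batch": "batch_001.tar.gz", "filename": "P12345_AF3_1.cif"})` would then return the file contents, provided the layout matches these assumptions. Because tar.gz archives are read sequentially, fetching one file from a large batch still means decompressing everything stored before it.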