Creating TSP Datasets¶
This guide describes the workflow for converting structure predictions into a TSP package and uploading it to Zenodo using tsp-maker.
Overview¶
%%{init: {'theme': 'base'}}%%
flowchart LR
A[Raw Predictions] --> B[Parse]
B --> C[Intermediate Files]
C --> D[Build TSP]
D --> E[Validate]
E --> F[Upload to Zenodo]
Requirements:
- Python 3.10+ with tsp-maker installed
- R with tslstructures package (for validation)
- Zenodo account with API token
Install tsp-maker:
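The exact command depends on how tsp-maker is distributed; the following assumes a standard PyPI release under the same name:
# Install from PyPI (package name assumed)
pip install tsp-maker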
Important: Folder Names = Protein IDs¶
Critical Design Decision
tsp-maker uses folder names as protein identifiers. The name of each prediction folder becomes the protein ID in your final dataset.
This is deliberate—we do not extract IDs from file contents or metadata. If your folders are named job_001, run_2024_001, etc., those become your protein IDs and the actual protein names will be lost.
Before running tsp-maker, ensure your folder names are the protein identifiers you want.
Folder name requirements:
- Max 16 characters
- Allowed: letters (A-Z, a-z), digits (0-9), underscore (_), hyphen (-)
- Not allowed: spaces, brackets, special characters
If folders need renaming:
# Rename from job IDs to protein IDs
mv predictions/af3/job_001 predictions/af3/P12345
mv predictions/af3/job_002 predictions/af3/Q67890
Or use --id-pattern to extract IDs from complex folder names (see tsp-maker documentation).
Step 1: Organise Predictions¶
Structure raw prediction outputs by protein and predictor. Folder names become protein IDs:
predictions/
├── af2/
│ ├── P12345/ ← Folder name "P12345" becomes protein ID
│ │ ├── ranked_0.pdb
│ │ ├── ranked_1.pdb
│ │ ├── result_model_1_*.pkl
│ │ └── ranking_debug.json
│ └── Q67890/ ← Folder name "Q67890" becomes protein ID
│ └── ...
├── af3/
│ ├── P12345/
│ │ ├── seed-1_sample-0/
│ │ │ ├── model.cif
│ │ │ ├── confidences.json
│ │ │ └── summary_confidences.json
│ │ └── ...
│ └── ...
└── boltz2/
├── P12345/
│ ├── predictions/
│ │ ├── P12345_model_0.pdb
│ │ ├── pae_P12345_model_0.npz
│ │ └── confidence_P12345_model_0.json
│ └── ...
└── ...
For complexes with multiple proteins, join IDs with underscore: P12345_Q67890.
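For example (the source folder name here is hypothetical):
# Rename a complex prediction folder to the joined protein IDs
mv predictions/af3/complex_run_7 predictions/af3/P12345_Q67890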
Step 2: Parse Predictions¶
Use the appropriate parser for each predictor. All parsers output to a common intermediate format.
AlphaFold2¶
Options:
- --top-n N: Keep only top N ranked models per protein (default: 5)
- --id-pattern REGEX: Filter folders or extract IDs from complex folder names
Expected input: ranked_*.pdb files and ranking_debug.json with scores.
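A typical invocation, following the same form as the multi-predictor example below (the af3 and boltz2 parsers take the same arguments):
tsp-maker parse af2 /data/af2 /tmp/parsed_structures --top-n 5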
AlphaFold3¶
Expected input: seed-*_sample-*/model.cif structure files, confidences.json with PAE matrix, summary_confidences.json with scores.
Boltz2¶
Expected input: predictions/*_model_*.pdb structure files, pae_*_model_*.npz PAE matrices, confidence_*_model_*.json scores.
Multiple Predictors¶
For datasets with multiple predictors, run each parser to the same output directory:
OUTPUT=/tmp/parsed_structures
tsp-maker parse af2 /data/af2 $OUTPUT --top-n 5
tsp-maker parse af3 /data/af3 $OUTPUT --top-n 5
tsp-maker parse boltz2 /data/boltz2 $OUTPUT --top-n 5
Parsers use predictor-specific suffixes (_AF2_, _AF3_, _BZ2_) to avoid filename collisions.
Step 3: Verify Intermediate Files¶
After parsing:
/tmp/parsed_structures/
├── structures/
│ ├── P12345_AF2_1.pdb
│ ├── P12345_AF2_2.pdb
│ ├── P12345_AF3_1.cif
│ ├── P12345_BZ2_1.pdb
│ └── ...
├── pae/
│ ├── P12345_AF2_1_pae.json
│ ├── P12345_AF3_1_pae.json
│ └── ...
└── metadata/
├── P12345_AF2_1.json
├── P12345_AF3_1.json
└── ...
Verify:
- Structure count matches expectations
- No parsing errors in output
- Metadata JSON files contain required fields
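A quick spot check from the shell, assuming the intermediate layout shown above (python -m json.tool is used here only for pretty-printing):
# Count parsed structures
ls /tmp/parsed_structures/structures | wc -l
# Pretty-print one metadata record to confirm the required fields are present
python -m json.tool /tmp/parsed_structures/metadata/P12345_AF2_1.json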
Step 4: Build TSP Package¶
tsp-maker build \
/tmp/parsed_structures \
/tmp/my-dataset \
--name "my-structure-dataset" \
--title "My Structure Dataset" \
--description "Predicted structures for..." \
--author "Your Name" \
--affiliation "Your Institution"
Required options:
- --name: Short identifier (lowercase, hyphens)
- --title: Human-readable title
- --description: Dataset description
Optional:
- --batch-size SIZE: Target batch size in MB (default: 100)
- --license LICENSE: License identifier (default: CC-BY-4.0)
Output:
/tmp/my-dataset/
├── datapackage.json
├── metadata.parquet
├── structures/
│ └── batch_001.tar.gz
└── predictions/
├── scores.parquet
└── pae/
└── batch_001.tar.gz
Step 5: Validate¶
Use the R package to validate the TSP:
library(tslstructures)
result <- validate_tsp("/tmp/my-dataset")
if (result$valid) {
cat("TSP is valid\n")
cat("Structures:", result$stats$structure_count, "\n")
} else {
cat("Validation errors:\n")
print(result$errors)
}
Validation checks:
- Required files exist
- datapackage.json is valid JSON with required fields
- Parquet files are readable with expected columns
- All structures referenced in metadata exist in batches
- Scores match structure IDs
Step 6: Upload to Zenodo¶
6.1 Obtain API Token¶
To obtain a Zenodo API token:
1. Log into Zenodo (or sandbox.zenodo.org for testing)
2. Navigate to Account → Applications → Personal access tokens
3. Create token with scopes: deposit:write, deposit:actions
6.2 Test Upload (Sandbox)¶
Test on sandbox before production:
# Upload as draft
tsp-maker upload /tmp/my-dataset --token YOUR_TOKEN --sandbox
# Upload and publish
tsp-maker upload /tmp/my-dataset --token YOUR_TOKEN --sandbox --publish
Sandbox DOIs (10.5072/zenodo.XXXXX) are not permanent and records can be deleted.
6.3 Production Upload¶
For permanent publication:
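# Upload to production Zenodo and publish (same command as the sandbox test, without --sandbox)
tsp-maker upload /tmp/my-dataset --token YOUR_TOKEN --publish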
Production uploads create permanent DOIs.
6.4 After Upload¶
The upload command outputs deposit information:
{
"id": "12345678",
"doi": "10.5281/zenodo.12345678",
"record_url": "https://zenodo.org/records/12345678"
}
Note: The datapackage.json uploaded to Zenodo does not contain the DOI (a chicken-and-egg problem). This is intentional: the tslstructures R package retrieves DOI information from the Zenodo API, so self-referencing isn't needed. You can optionally update your local copy with the DOI for reference, as sketched below.
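A minimal sketch of that optional local update, assuming a top-level doi field and that jq is installed:
# Record the published DOI in the local datapackage.json (field name assumed)
jq '.doi = "10.5281/zenodo.12345678"' /tmp/my-dataset/datapackage.json > /tmp/dp.json \
  && mv /tmp/dp.json /tmp/my-dataset/datapackage.json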
Complete Example¶
# Parse all prediction sources
tsp-maker parse af2 /data/arabidopsis/af2 /tmp/parsed --top-n 5
tsp-maker parse af3 /data/arabidopsis/af3 /tmp/parsed --top-n 5
tsp-maker parse boltz2 /data/arabidopsis/boltz2 /tmp/parsed --top-n 5
# Build TSP
tsp-maker build \
/tmp/parsed \
/tmp/arabidopsis-structures \
--name "arabidopsis-structures" \
--title "Arabidopsis thaliana Predicted Structures" \
--description "Structure predictions for A. thaliana proteome" \
--author "Dan MacLean" \
--affiliation "The Sainsbury Laboratory"
# Validate (in R)
# library(tslstructures)
# validate_tsp("/tmp/arabidopsis-structures")
# Upload
tsp-maker upload /tmp/arabidopsis-structures --token YOUR_TOKEN --publish
Troubleshooting¶
Folder names not meeting requirements¶
Folder names must be max 16 characters, alphanumeric plus _ and - only. Folders that don't meet requirements are skipped with a warning.
To filter or extract IDs from complex folder names, use --id-pattern:
# Extract ID from complex folder names
tsp-maker parse af2 /data/predictions /tmp/out \
--id-pattern "gene_(\w+)"
Batch size adjustment¶
Adjust with --batch-size (in MB):
# Smaller batches for smaller datasets
tsp-maker build input output --name my-dataset --batch-size 50
# Larger batches for large datasets
tsp-maker build input output --name my-dataset --batch-size 500
Upload fails with 413 error¶
Files exceed upload size limit. The upload script handles chunked uploads automatically, but very large batches (>50 GB) may require manual splitting.
Validation reports missing structures¶
Verify all structures referenced in metadata exist in batch archives:
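The R validator performs this cross-check automatically (see Step 5); a rough manual check from the shell, assuming the package layout shown in Step 4:
# List every file packed into the structure batches
for batch in /tmp/my-dataset/structures/batch_*.tar.gz; do
    tar -tzf "$batch"
done | sort > /tmp/in_batches.txt
# Compare this listing against the structure IDs in metadata.parquet,
# e.g. with the tslstructures validator, which reports any mismatches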
Next Steps¶
- Share the Zenodo DOI with collaborators
- Update project documentation with the DOI
- Consider submitting to TSL Structures community (when available)
For consuming datasets: Using Datasets
Reference: tsp-maker documentation | TSP Specification