Creating TSP Datasets¶
This guide describes the workflow for converting structure predictions into a TSP package and uploading it to Zenodo using tsp-maker.
Overview¶
%%{init: {'theme': 'base'}}%%
flowchart LR
A[Raw Predictions] --> B[Parse]
B --> C[Intermediate Files]
C --> D[Build TSP]
D --> E[Validate]
E --> F[Upload to Zenodo]
Requirements:
- Python 3.10+ with tsp-maker installed
- R with tslstructures package (for validation)
- Zenodo account with API token
Install tsp-maker:
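The exact command depends on how tsp-maker is distributed; the following assumes a standard PyPI release under the same name:
# Install from PyPI (package name assumed)
pip install tsp-maker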
Important: Folder Names = Protein IDs¶
Critical Design Decision
tsp-maker uses folder names as protein identifiers. The name of each prediction folder becomes the protein ID in your final dataset.
This is deliberate—we do not extract IDs from file contents or metadata. If your folders are named job_001, run_2024_001, etc., those become your protein IDs and the actual protein names will be lost.
Before running tsp-maker, ensure your folder names are the protein identifiers you want.
Folder name requirements:
- Max 16 characters
- Allowed: letters (A-Z, a-z), digits (0-9), underscore (_), hyphen (-)
- Not allowed: spaces, brackets, special characters
If folders need renaming:
# Rename from job IDs to protein IDs
mv predictions/af3/job_001 predictions/af3/P12345
mv predictions/af3/job_002 predictions/af3/Q67890
Or use --id-pattern to extract IDs from complex folder names (see tsp-maker documentation).
Step 1: Organise Predictions¶
Structure raw prediction outputs by protein and predictor. Folder names become protein IDs:
predictions/
├── af2/
│ ├── P12345/ ← Folder name "P12345" becomes protein ID
│ │ ├── ranked_0.pdb
│ │ ├── ranked_1.pdb
│ │ ├── result_model_1_*.pkl
│ │ └── ranking_debug.json
│ └── Q67890/ ← Folder name "Q67890" becomes protein ID
│ └── ...
├── af3/
│ ├── P12345/
│ │ ├── seed-1_sample-0/
│ │ │ ├── model.cif
│ │ │ ├── confidences.json
│ │ │ └── summary_confidences.json
│ │ └── ...
│ └── ...
└── boltz2/
├── P12345/
│ ├── predictions/
│ │ ├── P12345_model_0.pdb
│ │ ├── pae_P12345_model_0.npz
│ │ └── confidence_P12345_model_0.json
│ └── ...
└── ...
For complexes with multiple proteins, join IDs with underscore: P12345_Q67890.
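For example (the source folder name here is hypothetical):
# Rename a complex prediction folder to the joined protein IDs
mv predictions/af3/complex_run_7 predictions/af3/P12345_Q67890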
Step 2: Parse Predictions¶
Use the appropriate parser for each predictor. All parsers output to a common intermediate format.
AlphaFold2¶
Options:
- --top-n N: Keep only top N ranked models per protein (default: 5)
- --id-pattern REGEX: Filter folders or extract IDs from complex folder names
Expected input: ranked_*.pdb files and ranking_debug.json with scores.
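A typical invocation, following the same form as the multi-predictor example below (the af3 and boltz2 parsers take the same arguments):
tsp-maker parse af2 /data/af2 /tmp/parsed_structures --top-n 5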
AlphaFold3¶
Expected input: seed-*_sample-*/model.cif structure files, confidences.json with PAE matrix, summary_confidences.json with scores.
Boltz2¶
Expected input: predictions/*_model_*.pdb structure files, pae_*_model_*.npz PAE matrices, confidence_*_model_*.json scores.
Multiple Predictors¶
For datasets with multiple predictors, run each parser to the same output directory:
OUTPUT=/tmp/parsed_structures
tsp-maker parse af2 /data/af2 $OUTPUT --top-n 5
tsp-maker parse af3 /data/af3 $OUTPUT --top-n 5
tsp-maker parse boltz2 /data/boltz2 $OUTPUT --top-n 5
Parsers use predictor-specific suffixes (_AF2_, _AF3_, _BZ2_) to avoid filename collisions.
Step 3: Verify Intermediate Files¶
After parsing:
/tmp/parsed_structures/
├── structures/
│ ├── P12345_AF2_1.pdb
│ ├── P12345_AF2_2.pdb
│ ├── P12345_AF3_1.cif
│ ├── P12345_BZ2_1.pdb
│ └── ...
├── pae/
│ ├── P12345_AF2_1_pae.json
│ ├── P12345_AF3_1_pae.json
│ └── ...
└── metadata/
├── P12345_AF2_1.json
├── P12345_AF3_1.json
└── ...
Verify:
- Structure count matches expectations
- No parsing errors in output
- Metadata JSON files contain required fields
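A quick spot check from the shell, assuming the intermediate layout shown above (python -m json.tool is used here only for pretty-printing):
# Count parsed structures
ls /tmp/parsed_structures/structures | wc -l
# Pretty-print one metadata record to confirm the required fields are present
python -m json.tool /tmp/parsed_structures/metadata/P12345_AF2_1.json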
Step 4: Build TSP Package¶
tsp-maker build \
/tmp/parsed_structures \
/tmp/my-dataset \
--name "my-structure-dataset" \
--title "My Structure Dataset" \
--description "Predicted structures for..." \
--author "Your Name" \
--affiliation "Your Institution"
Required options:
- --name: Short identifier (lowercase, hyphens)
- --title: Human-readable title
- --description: Dataset description
Optional:
- --batch-size SIZE: Target batch size in MB (default: 100)
- --license LICENSE: License identifier (default: CC-BY-4.0)
Output:
/tmp/my-dataset/
├── datapackage.json
├── metadata.parquet
├── structures/
│ └── batch_001.tar.gz
└── predictions/
├── scores.parquet
└── pae/
└── batch_001.tar.gz
Step 5: Validate¶
Use the R package to validate the TSP:
library(tslstructures)
result <- validate_tsp("/tmp/my-dataset")
if (result$valid) {
cat("TSP is valid\n")
cat("Structures:", result$stats$structure_count, "\n")
} else {
cat("Validation errors:\n")
print(result$errors)
}
Validation checks:
- Required files exist
- datapackage.json is valid JSON with required fields
- Parquet files are readable with expected columns
- All structures referenced in metadata exist in batches
- Scores match structure IDs
Step 6: Upload to Zenodo¶
6.1 Obtain API Token¶
To obtain a Zenodo API token:
1. Log into Zenodo (or sandbox.zenodo.org for testing)
2. Navigate to Account → Applications → Personal access tokens
3. Create token with scopes: deposit:write, deposit:actions
6.2 Test Upload (Sandbox)¶
Test on sandbox before production:
# Upload as draft
tsp-maker upload /tmp/my-dataset --token YOUR_TOKEN --sandbox
# Upload and publish
tsp-maker upload /tmp/my-dataset --token YOUR_TOKEN --sandbox --publish
Sandbox DOIs (10.5072/zenodo.XXXXX) are not permanent and records can be deleted.
6.3 Production Upload¶
For permanent publication:
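# Upload to production Zenodo and publish (same command as the sandbox test, without --sandbox)
tsp-maker upload /tmp/my-dataset --token YOUR_TOKEN --publish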
Production uploads create permanent DOIs.
6.4 After Upload¶
The upload command outputs deposit information:
{
"id": "12345678",
"doi": "10.5281/zenodo.12345678",
"record_url": "https://zenodo.org/records/12345678"
}
Note: The datapackage.json uploaded to Zenodo does not contain the DOI (a chicken-and-egg problem). This is intentional: the tslstructures R package retrieves DOI information from the Zenodo API, so self-referencing isn't needed. You can optionally update your local copy with the DOI for reference, as sketched below.
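A minimal sketch of that optional local update, assuming a top-level doi field and that jq is installed:
# Record the published DOI in the local datapackage.json (field name assumed)
jq '.doi = "10.5281/zenodo.12345678"' /tmp/my-dataset/datapackage.json > /tmp/dp.json \
  && mv /tmp/dp.json /tmp/my-dataset/datapackage.json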
Complete Example¶
# Parse all prediction sources
tsp-maker parse af2 /data/arabidopsis/af2 /tmp/parsed --top-n 5
tsp-maker parse af3 /data/arabidopsis/af3 /tmp/parsed --top-n 5
tsp-maker parse boltz2 /data/arabidopsis/boltz2 /tmp/parsed --top-n 5
# Build TSP
tsp-maker build \
/tmp/parsed \
/tmp/arabidopsis-structures \
--name "arabidopsis-structures" \
--title "Arabidopsis thaliana Predicted Structures" \
--description "Structure predictions for A. thaliana proteome" \
--author "Dan MacLean" \
--affiliation "The Sainsbury Laboratory"
# Validate (in R)
# library(tslstructures)
# validate_tsp("/tmp/arabidopsis-structures")
# Upload
tsp-maker upload /tmp/arabidopsis-structures --token YOUR_TOKEN --publish
Troubleshooting¶
Folder names not meeting requirements¶
Folder names must be max 16 characters, alphanumeric plus _ and - only. Folders that don't meet requirements are skipped with a warning.
To filter or extract IDs from complex folder names, use --id-pattern:
# Extract ID from complex folder names
tsp-maker parse af2 /data/predictions /tmp/out \
--id-pattern "gene_(\w+)"
Batch size adjustment¶
Adjust with --batch-size (in MB):
# Smaller batches for smaller datasets
tsp-maker build input output --name my-dataset --batch-size 50
# Larger batches for large datasets
tsp-maker build input output --name my-dataset --batch-size 500
Upload fails with 413 error¶
Files exceed upload size limit. The upload script handles chunked uploads automatically, but very large batches (>50 GB) may require manual splitting.
Validation reports missing structures¶
Verify all structures referenced in metadata exist in batch archives:
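The R validator performs this cross-check automatically (see Step 5); a rough manual check from the shell, assuming the package layout shown in Step 4:
# List every file packed into the structure batches
for batch in /tmp/my-dataset/structures/batch_*.tar.gz; do
    tar -tzf "$batch"
done | sort > /tmp/in_batches.txt
# Compare this listing against the structure IDs in metadata.parquet,
# e.g. with the tslstructures validator, which reports any mismatches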
Next Steps¶
- Share the Zenodo DOI with collaborators
- Update project documentation with the DOI
- Consider submitting to TSL Structures community (when available)
For consuming datasets: Using Datasets
Reference: tsp-maker documentation | TSP Specification