TSL Structures: Sharing Protein Structure Predictions at Scale
Summary
Structure prediction now generates data at a scale that existing sharing methods cannot handle. A single proteome-wide run produces thousands of structures, in formats that differ between predictors, with far more raw output than collaborators actually need.
TSP (TSL Structure Package) is a data standard designed to solve this. It captures the essential information from structure predictions: the structures themselves, confidence scores, and predicted aligned error (PAE) matrices. It stores these in a compact, standardised format built on modern columnar data tools. A TSP package is a single portable object that can be easily shared, archived, or distributed. The format reduces storage requirements dramatically while making the data queryable: users can filter by confidence scores or protein identifiers before downloading any structure files.
To make these packages discoverable, we provide a Zenodo community that acts as a central index for structure prediction datasets. Datasets remain under individual ownership with permanent DOIs for citation, but the community provides an umbrella that makes them findable in one place.
We provide two tools to work with TSP packages:
| Tool | Purpose |
|---|---|
| tsp-maker (Python) | Automates parsing and packaging of raw predictor output from AlphaFold2, AlphaFold3, and Boltz2 |
| tslstructures (R) | Enables easy analysis and extraction from large structure datasets |
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
    subgraph Predictions["Raw Predictions"]
        AF2[AlphaFold2]
        AF3[AlphaFold3]
        BZ2[Boltz2]
    end
    subgraph Tools["tsp-maker"]
        Parse[Parse & normalise]
        Build[Build TSP]
    end
    subgraph Package["TSP Package"]
        Meta[Queryable metadata]
        Struct[Batched structures]
    end
    subgraph Distribution["Zenodo"]
        DOI[(Your dataset<br/>with DOI)]
        Community[TSL Structures<br/>Community]
    end
    subgraph Access["tslstructures"]
        Query[Query & filter]
        Download[Download selected]
    end
    AF2 --> Parse
    AF3 --> Parse
    BZ2 --> Parse
    Parse --> Build
    Build --> Meta
    Build --> Struct
    Meta --> DOI
    Struct --> DOI
    DOI --> Community
    Community --> Query
    DOI --> Query
    Query --> Download
```
Producers: Use tsp-maker to parse predictions → build TSP → upload to Zenodo → add to community
Consumers: Browse community for datasets → use tslstructures to query metadata → download only structures of interest
Background
A single research group can now generate tens of thousands of predicted structures in weeks. At The Sainsbury Laboratory, we routinely predict structures for complete proteomes (20,000+ proteins per species), multiple plant species and pathogens, protein complexes, and protein-ligand interactions.
This volume creates practical problems:
Data volume. A proteome-wide prediction run generates a substantial volume of raw output. With multiple predictors and multiple species, storage requirements quickly become unmanageable.
Format heterogeneity. Each predictor outputs different formats: AlphaFold2 produces pickle files, PDB structures, and ranking JSON; AlphaFold3 produces mmCIF structures and JSON confidence matrices; Boltz2 produces NPZ arrays, PDB structures, and JSON metrics.
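As a rough illustration of why a normalising step is needed, the sketch below guesses which predictor produced a set of output files, using only the file types listed above. The function name, the example file names, and the suffix rules are assumptions made for illustration; this is not how tsp-maker detects formats.

```python
from pathlib import PurePath


def guess_predictor(filenames):
    """Guess the predictor from file suffixes (illustrative heuristic only)."""
    suffixes = {PurePath(name).suffix.lower() for name in filenames}
    if ".pkl" in suffixes:
        return "AlphaFold2"  # pickle files alongside PDB and ranking JSON
    if ".npz" in suffixes:
        return "Boltz2"  # NPZ arrays alongside PDB and JSON metrics
    if ".cif" in suffixes and ".json" in suffixes:
        return "AlphaFold3"  # mmCIF structures plus JSON confidence matrices
    return "unknown"


# Hypothetical file names, chosen only to exercise each branch.
print(guess_predictor(["model.pdb", "result.pkl", "ranking.json"]))
print(guess_predictor(["model.cif", "confidences.json"]))
print(guess_predictor(["model.pdb", "pae.npz", "metrics.json"]))
```

Each branch would need its own parser for confidence scores and PAE matrices, which is the work a common format removes for downstream users.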
Distribution limitations. Standard file sharing methods do not scale to thousands of structures. There is no established way to version, cite, or systematically query these datasets.
Duplicated effort. Without accessible shared datasets, research groups independently re-predict the same proteins.
TSP: TSL Structure Package
TSP is a data standard for packaging, sharing, and accessing protein structure predictions.
| TSP package file | Contents |
|---|---|
| datapackage.json | Manifest and metadata |
| metadata.parquet | Per-structure index (queryable) |
| structures/batch_*.tar.gz | Structure files in batches |
| predictions/scores.parquet | Confidence scores (queryable) |
| predictions/pae/*.tar.gz | PAE matrices in batches |
Design Principles
Metadata-first access. With 20,000 structures, users rarely need all of them. TSP uses Parquet files for metadata—a columnar format that supports efficient filtering. Users identify structures of interest without downloading structure files.
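The selection logic behind metadata-first access can be sketched in plain Python. The rows below stand in for records read from metadata.parquet; the column names (`protein_id`, `mean_plddt`, `batch`) and the values are hypothetical and are not taken from the TSP specification.

```python
# Hypothetical rows standing in for metadata.parquet; column names are assumed.
metadata = [
    {"protein_id": "P1", "mean_plddt": 91.2, "batch": "batch_001.tar.gz"},
    {"protein_id": "P2", "mean_plddt": 48.7, "batch": "batch_001.tar.gz"},
    {"protein_id": "P3", "mean_plddt": 85.0, "batch": "batch_002.tar.gz"},
    {"protein_id": "P4", "mean_plddt": 62.3, "batch": "batch_003.tar.gz"},
]


def select_structures(rows, min_plddt):
    """Return the rows whose mean confidence meets the threshold."""
    return [row for row in rows if row["mean_plddt"] >= min_plddt]


def batches_needed(rows):
    """Return the sorted, de-duplicated batch archives holding the given rows."""
    return sorted({row["batch"] for row in rows})


confident = select_structures(metadata, min_plddt=80.0)
print([row["protein_id"] for row in confident])
print(batches_needed(confident))
```

Only the archives returned by `batches_needed` would have to be fetched; in this toy example the third batch is never downloaded.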
Batched distribution. Structures are grouped into tar.gz archives (~100 MB each). This enables downloading specific batches, resuming interrupted transfers, and efficient compression.
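One simple way to form such batches is a greedy pass over the files in order, closing a batch once the next file would push it past the target size. This is a sketch under stated assumptions: `plan_batches` is a hypothetical helper, it works on uncompressed sizes, and it is not necessarily the strategy tsp-maker uses.

```python
def plan_batches(file_sizes, target_bytes=100 * 1024 * 1024):
    """Group (name, size) pairs into consecutive batches of roughly target_bytes."""
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Toy sizes in bytes with a toy target, to keep the example readable.
sizes = [("a.pdb", 60), ("b.pdb", 50), ("c.pdb", 30), ("d.pdb", 200)]
print(plan_batches(sizes, target_bytes=100))
```

A file larger than the target still gets a batch of its own rather than being dropped. Each planned batch could then be written with the standard library's `tarfile.open(path, "w:gz")`.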
Self-describing format. Each TSP includes a datapackage.json manifest describing contents, provenance, and schema. The format extends the Frictionless Data Package standard.
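A minimal sketch of what such a manifest might look like, written as a Python dictionary and serialised to JSON. The `name`, `resources`, `path`, and `format` keys follow the Frictionless Data Package descriptor; the dataset name and the `tsp` block are hypothetical, and the real TSP field names may differ.

```python
import json

manifest = {
    "name": "example-proteome-predictions",  # hypothetical dataset name
    "resources": [
        {"name": "metadata", "path": "metadata.parquet", "format": "parquet"},
        {"name": "scores", "path": "predictions/scores.parquet", "format": "parquet"},
    ],
    # Hypothetical extension block; not taken from the TSP specification.
    "tsp": {"predictor": "AlphaFold3"},
}


def check_manifest(doc):
    """Check the minimal shape: at least one resource, each with a name and a path."""
    resources = doc.get("resources", [])
    return bool(resources) and all("name" in r and "path" in r for r in resources)


print(json.dumps(manifest, indent=2))
print(check_manifest(manifest))
```

Because the manifest is plain JSON, a consumer can inspect a package's contents and provenance without opening any Parquet or archive files.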
Version control and citation. TSP packages are designed for Zenodo, which provides DOIs for citation, versioning for updates, and long-term archival.
Workflow
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TB
    subgraph Prediction["Structure Prediction"]
        AF2[AlphaFold2]
        AF3[AlphaFold3]
        BZ2[Boltz2]
    end
    subgraph Producer["tsp-maker (Python)"]
        Parse["Parse & normalise"]
        Build["Build TSP"]
        Upload["Upload to Zenodo"]
    end
    subgraph Zenodo["Zenodo"]
        Dataset[(Your dataset<br/>with DOI)]
        Community["TSL Structures<br/>Community"]
    end
    subgraph Consumer["tslstructures (R)"]
        Discover["Discover datasets"]
        Query["Query metadata"]
        Download["Download structures"]
    end
    AF2 --> Parse
    AF3 --> Parse
    BZ2 --> Parse
    Parse --> Build
    Build --> Upload
    Upload --> Dataset
    Dataset --> Community
    Community --> Discover
    Discover --> Query
    Query --> Download
```
For Data Producers
Use tsp-maker (Python CLI) to:
- Parse predictor outputs into a common format (handles AF2, AF3, Boltz2 differences)
- Build a TSP package with queryable metadata
- Upload to Zenodo (automatic DOI assignment)
- Add to TSL Structures community for discoverability
→ Creating Datasets Tutorial | tsp-maker Documentation
For Data Consumers
Use tslstructures (R package) to:
- Discover datasets via the TSL Structures community
- Query metadata to filter by confidence scores, protein IDs, predictors
- Download only the structures you need (smart caching avoids re-downloads)
- Analyse with standard R/Bioconductor workflows
→ Using Datasets Tutorial | tslstructures Documentation
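The caching idea can be sketched as a check-before-download wrapper. The example is in Python for consistency with the other sketches on this page, not in R; `fetch_cached` and `fake_download` are hypothetical names, and this is not the tslstructures implementation.

```python
import tempfile
from pathlib import Path


def fetch_cached(name, cache_dir, download):
    """Return the cached path for name, calling download(name) only on a cache miss."""
    target = Path(cache_dir) / name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download(name))
    return target


calls = []


def fake_download(name):
    """Stand-in for a real HTTP download; records each call it receives."""
    calls.append(name)
    return b"archive bytes"


with tempfile.TemporaryDirectory() as cache:
    fetch_cached("batch_001.tar.gz", cache, fake_download)
    fetch_cached("batch_001.tar.gz", cache, fake_download)  # served from the cache
    print(len(calls))
```

A production cache would typically also verify a checksum, so that a partially written or outdated file is not mistaken for a valid hit.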
Scale
TSP is designed to handle datasets ranging from individual proteins to complete proteomes and multi-species collections. The batched, compressed format substantially reduces storage requirements compared to raw predictor output, while the Parquet metadata files remain small enough for efficient querying regardless of dataset size.
Distribution
Local Use
TSP packages reduce storage requirements independently of external distribution. The batched, compressed format is practical for local HPC storage, internal sharing, or archival before Zenodo upload.
Zenodo
TSP is designed for distribution via Zenodo, which provides:
- DOIs for permanent citation
- Versioning for dataset updates
- Long-term archival (CERN infrastructure)
- Large dataset support (up to 50 GB per record by default)
Ownership and Communities
Zenodo datasets remain under the uploader's ownership. The TSL Structures community provides a curated index for discovery, but:
- Datasets are uploaded to your own Zenodo account
- You retain full control (editing, versioning, access settings)
- Adding to a community does not transfer ownership
- Datasets can belong to multiple communities simultaneously
This allows datasets to be discoverable via TSL Structures while also appearing in other relevant communities (e.g., species-specific, project-specific, or institutional collections).
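Community membership is also what makes programmatic discovery possible. The sketch below builds a search URL for Zenodo's records API, restricted to one community; it only constructs the URL and makes no network request. The community identifier `tsl-structures` is a placeholder, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def community_query_url(community, query="", size=25):
    """Build a Zenodo records search URL limited to one community."""
    params = {"communities": community, "size": size}
    if query:
        params["q"] = query  # free-text search within the community
    return f"{ZENODO_RECORDS_API}?{urlencode(params)}"


# "tsl-structures" is a placeholder identifier, not a confirmed community slug.
print(community_query_url("tsl-structures", query="Arabidopsis"))
```

The same record can match queries against several communities, which is consistent with datasets belonging to more than one collection.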
Tools
| Package | Language | What it does |
|---|---|---|
| tsp-maker | Python | Parses AF2/AF3/Boltz2 outputs, builds TSP packages, uploads to Zenodo. Handles format differences automatically. |
| tslstructures | R | Discovers datasets via community, queries metadata, downloads selected structures. Smart caching avoids re-downloads. |
Documentation
Getting Started
Tutorials
- Creating Datasets — for data producers
- Using Datasets — for data consumers
Reference
Status
| Component | Status |
|---|---|
| TSP Specification | v1.0.0 |
| tsp-maker | Complete |
| tslstructures | Complete |
| Zenodo Community | Pending |