TSL Structures: Sharing Protein Structure Predictions at Scale

Summary

Structure prediction now generates data at a scale that existing sharing methods cannot handle. A single proteome-wide run produces thousands of structures, in formats that differ between predictors and with far more raw output than collaborators actually need.

TSP (TSL Structure Package) is a data standard designed to solve this. It captures the essential information from structure predictions (the structures themselves, confidence scores, and predicted aligned error (PAE) matrices) in a compact, standardised format built on modern columnar data tools. A TSP package is a single portable object that can be easily shared, archived, or distributed. The format substantially reduces storage requirements while making the data queryable: users can filter by confidence scores or protein identifiers before downloading any structure files.

To make these packages discoverable, we provide a Zenodo community that acts as a central index for structure prediction datasets. Datasets remain under individual ownership with permanent DOIs for citation, but the community provides an umbrella that makes them findable in one place.

We provide two tools to work with TSP packages:

Tool	Purpose
tsp-maker (Python)	Automates parsing and packaging of raw predictor output from AlphaFold2, AlphaFold3, and Boltz2
tslstructures (R)	Enables easy analysis and extraction from large structure datasets

%%{init: {'theme': 'base'}}%%
flowchart LR
    subgraph Predictions["Raw Predictions"]
        AF2[AlphaFold2]
        AF3[AlphaFold3]
        BZ2[Boltz2]
    end

    subgraph Tools["tsp-maker"]
        Parse[Parse & normalise]
        Build[Build TSP]
    end

    subgraph Package["TSP Package"]
        Meta[Queryable metadata]
        Struct[Batched structures]
    end

    subgraph Distribution["Zenodo"]
        DOI[(Your dataset<br/>with DOI)]
        Community[TSL Structures<br/>Community]
    end

    subgraph Access["tslstructures"]
        Query[Query & filter]
        Download[Download selected]
    end

    AF2 --> Parse
    AF3 --> Parse
    BZ2 --> Parse
    Parse --> Build
    Build --> Meta
    Build --> Struct
    Meta --> DOI
    Struct --> DOI
    DOI --> Community
    Community --> Query
    DOI --> Query
    Query --> Download

Producers: Use tsp-maker to parse predictions → build TSP → upload to Zenodo → add to community

Consumers: Browse community for datasets → use tslstructures to query metadata → download only structures of interest

Background

A single research group can now generate tens of thousands of predicted structures in weeks. At The Sainsbury Laboratory, we routinely predict structures for complete proteomes (20,000+ proteins per species) across multiple plant species and pathogens, as well as for protein complexes and protein-ligand interactions.

This volume creates practical problems:

Data volume. A proteome-wide prediction run generates a substantial volume of raw output. With multiple predictors and multiple species, storage requirements quickly become unmanageable.

Format heterogeneity. Each predictor outputs different formats: AlphaFold2 produces pickle files, PDB structures, and ranking JSON; AlphaFold3 produces mmCIF structures and JSON confidence matrices; Boltz2 produces NPZ arrays, structure files (mmCIF by default, optionally PDB), and JSON metrics.
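To illustrate what normalising these outputs might involve, the following is a minimal sketch that maps already-loaded JSON payloads from each predictor onto one common record. The `ConfidenceRecord` class and its fields are hypothetical and are not the TSP schema, and the JSON key names (`plddts`, `order`, `ptm`, `complex_plddt`) are assumptions about each predictor's output that should be checked against real files.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class ConfidenceRecord:
    """Hypothetical common record; not the actual TSP schema."""
    protein_id: str
    predictor: str
    mean_plddt: Optional[float]  # 0-100 scale, None when not available here
    ptm: Optional[float]


def from_alphafold2(protein_id: str, ranking: Dict[str, Any]) -> ConfidenceRecord:
    # Assumed shape of the ranking JSON: {"plddts": {model: score}, "order": [best, ...]}
    best_model = ranking["order"][0]
    return ConfidenceRecord(
        protein_id=protein_id,
        predictor="alphafold2",
        mean_plddt=float(ranking["plddts"][best_model]),
        ptm=None,
    )


def from_alphafold3(protein_id: str, summary: Dict[str, Any]) -> ConfidenceRecord:
    # Assumed: the summary JSON carries "ptm"; pLDDT is stored elsewhere.
    return ConfidenceRecord(
        protein_id=protein_id,
        predictor="alphafold3",
        mean_plddt=None,
        ptm=float(summary["ptm"]),
    )


def from_boltz2(protein_id: str, metrics: Dict[str, Any]) -> ConfidenceRecord:
    # Assumed: "complex_plddt" is on a 0-1 scale, so rescale it to 0-100.
    return ConfidenceRecord(
        protein_id=protein_id,
        predictor="boltz2",
        mean_plddt=float(metrics["complex_plddt"]) * 100.0,
        ptm=float(metrics["ptm"]),
    )


PARSERS: Dict[str, Callable[[str, Dict[str, Any]], ConfidenceRecord]] = {
    "alphafold2": from_alphafold2,
    "alphafold3": from_alphafold3,
    "boltz2": from_boltz2,
}


def normalise(predictor: str, protein_id: str, payload: Dict[str, Any]) -> ConfidenceRecord:
    """Dispatch to the parser for the named predictor."""
    if predictor not in PARSERS:
        raise ValueError(f"unsupported predictor: {predictor}")
    return PARSERS[predictor](protein_id, payload)
```

Keeping one small parser per predictor means a new predictor only needs one new function and one new dictionary entry.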

Distribution limitations. Standard file sharing methods do not scale to thousands of structures. There is no established way to version, cite, or systematically query these datasets.

Duplicated effort. Without accessible shared datasets, research groups independently re-predict the same proteins.

TSP: TSL Structure Package

TSP is a data standard for packaging, sharing, and accessing protein structure predictions.

┌─────────────────────────────────────────────────────────────────┐
│                         TSP Package                             │
├─────────────────────────────────────────────────────────────────┤
│  datapackage.json          │  Manifest and metadata             │
│  metadata.parquet          │  Per-structure index (queryable)   │
│  structures/batch_*.tar.gz │  Structure files in batches        │
│  predictions/scores.parquet│  Confidence scores (queryable)     │
│  predictions/pae/*.tar.gz  │  PAE matrices in batches           │
└─────────────────────────────────────────────────────────────────┘
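As an illustration of the layout above, a minimal sketch of a check that a local directory contains each listed component. Treating every entry as required is an assumption (the specification may make some optional), and `missing_components` is a hypothetical helper, not part of tsp-maker.

```python
from pathlib import Path
from typing import List

# Paths taken from the layout above; treating all of them as required is an assumption.
REQUIRED_FILES = (
    "datapackage.json",
    "metadata.parquet",
    "predictions/scores.parquet",
)
REQUIRED_GLOBS = (
    "structures/batch_*.tar.gz",
    "predictions/pae/*.tar.gz",
)


def missing_components(package_dir: Path) -> List[str]:
    """Return the layout entries that are absent from package_dir."""
    missing = [rel for rel in REQUIRED_FILES if not (package_dir / rel).is_file()]
    # A glob entry counts as present if at least one file matches it.
    missing += [pattern for pattern in REQUIRED_GLOBS if not any(package_dir.glob(pattern))]
    return missing
```

An empty list means the directory matches the layout shown; anything else names the entries to investigate.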

Design Principles

Metadata-first access. With 20,000 structures, users rarely need all of them. TSP stores metadata in Parquet, a columnar format that supports efficient filtering, so users can identify structures of interest without downloading any structure files.
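The filter-before-download pattern can be sketched as follows. Plain dictionaries stand in for rows that would in practice be read from metadata.parquet, and the column names (`protein_id`, `mean_plddt`, `batch`) are hypothetical rather than taken from the TSP schema.

```python
from typing import Dict, Iterable, List, Optional, Set

Row = Dict[str, object]


def select_rows(
    rows: Iterable[Row],
    min_plddt: float,
    protein_ids: Optional[Set[str]] = None,
) -> List[Row]:
    """Keep rows at or above a confidence threshold, optionally limited to given IDs."""
    selected = []
    for row in rows:
        if float(row["mean_plddt"]) < min_plddt:
            continue
        if protein_ids is not None and row["protein_id"] not in protein_ids:
            continue
        selected.append(row)
    return selected


def batches_needed(rows: Iterable[Row]) -> Set[str]:
    """Return the set of batch archives that contain the selected structures."""
    return {str(row["batch"]) for row in rows}
```

Only the archives returned by `batches_needed` would then have to be fetched, which is the point of keeping the index separate from the structure files.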

Batched distribution. Structures are grouped into tar.gz archives (~100 MB each). This enables downloading specific batches, resuming interrupted transfers, and efficient compression.
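One way such batches could be planned is a simple greedy grouping by size, sketched below. This is an illustration under stated assumptions and not tsp-maker's actual algorithm: it uses uncompressed file sizes as a proxy for archive size, and `plan_batches` is a hypothetical name.

```python
from typing import Dict, List

BATCH_LIMIT_BYTES = 100 * 1024 * 1024  # mirrors the ~100 MB target mentioned above


def plan_batches(file_sizes: Dict[str, int], limit: int = BATCH_LIMIT_BYTES) -> List[List[str]]:
    """Group file names into batches whose summed size stays within limit.

    A file larger than the limit is placed in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name in sorted(file_sizes):  # sorted for a reproducible plan
        size = file_sizes[name]
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each planned batch could then be written with the standard library, for example via `tarfile.open(path, "w:gz")`.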

Self-describing format. Each TSP includes a datapackage.json manifest describing contents, provenance, and schema. The format extends the Frictionless Data Package standard.
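A minimal sketch of what building such a manifest might look like. The `name` and `resources` entries (with `name`, `path`, and `format`) follow the general shape of a Frictionless Data Package descriptor; the `tsp` block and its `predictor` key are hypothetical placeholders, since the TSP-specific fields are not described here.

```python
import json
from pathlib import PurePosixPath
from typing import Dict, List


def build_manifest(name: str, predictor: str, resource_paths: List[str]) -> Dict[str, object]:
    """Assemble a Data Package style descriptor for the given resource paths."""
    resources = []
    for path in resource_paths:
        suffixes = "".join(PurePosixPath(path).suffixes)  # e.g. ".tar.gz"
        resources.append(
            {
                # Resource name derived from the path without its extension.
                "name": path[: len(path) - len(suffixes)].replace("/", "-").lower(),
                "path": path,
                "format": suffixes.lstrip("."),
            }
        )
    return {
        "name": name,
        "resources": resources,
        "tsp": {"predictor": predictor},  # hypothetical extension block
    }


if __name__ == "__main__":
    manifest = build_manifest(
        "demo-proteome",
        "alphafold3",
        ["metadata.parquet", "structures/batch_0001.tar.gz"],
    )
    print(json.dumps(manifest, indent=2))
```

Because the manifest is plain JSON, any tool can read the package contents without needing TSP-specific software.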

Version control and citation. TSP packages are designed for Zenodo, which provides DOIs for citation, versioning for updates, and long-term archival.

Workflow

%%{init: {'theme': 'base'}}%%
flowchart TB
    subgraph Prediction["Structure Prediction"]
        AF2[AlphaFold2]
        AF3[AlphaFold3]
        BZ2[Boltz2]
    end

    subgraph Producer["tsp-maker (Python)"]
        Parse["Parse & normalise"]
        Build["Build TSP"]
        Upload["Upload to Zenodo"]
    end

    subgraph Zenodo["Zenodo"]
        Dataset[(Your dataset<br/>with DOI)]
        Community["TSL Structures<br/>Community"]
    end

    subgraph Consumer["tslstructures (R)"]
        Discover["Discover datasets"]
        Query["Query metadata"]
        Download["Download structures"]
    end

    AF2 --> Parse
    AF3 --> Parse
    BZ2 --> Parse
    Parse --> Build
    Build --> Upload
    Upload --> Dataset
    Dataset --> Community
    Community --> Discover
    Discover --> Query
    Query --> Download

For Data Producers

Use tsp-maker (Python CLI) to:

Parse predictor outputs into a common format (handles AF2, AF3, Boltz2 differences)
Build a TSP package with queryable metadata
Upload to Zenodo (automatic DOI assignment)
Add to TSL Structures community for discoverability

→ Creating Datasets Tutorial | tsp-maker Documentation

For Data Consumers

Use tslstructures (R package) to:

Discover datasets via the TSL Structures community
Query metadata to filter by confidence scores, protein IDs, or predictors
Download only the structures you need (smart caching avoids re-downloads)
Analyse with standard R/Bioconductor workflows
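The idea behind caching downloads can be sketched in a few lines. This is a Python illustration of the general pattern only, not how the tslstructures R package implements it; `fetch_cached` is a hypothetical helper, and the downloader is passed in as a function so that the sketch makes no network calls itself.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fetch_cached(url: str, cache_dir: Path, download: Callable[[str], bytes]) -> Path:
    """Return a local copy of url, calling download only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on a hash of the URL so any URL maps to a safe file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / key
    if not target.exists():
        # Write to a temporary name first so an interrupted download
        # never leaves a partial file under the final name.
        partial = target.with_suffix(".part")
        partial.write_bytes(download(url))
        partial.replace(target)
    return target
```

A second request for the same URL returns the existing file without calling the downloader again.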

→ Using Datasets Tutorial | tslstructures Documentation

Scale

TSP is designed to handle datasets ranging from individual proteins to complete proteomes and multi-species collections. The batched, compressed format substantially reduces storage requirements compared to raw predictor output, while the Parquet metadata files remain small enough for efficient querying regardless of dataset size.

Distribution

Local Use

TSP packages reduce storage requirements independently of external distribution. The batched, compressed format is practical for local HPC storage, internal sharing, or archival before Zenodo upload.

Zenodo

TSP is designed for distribution via Zenodo, which provides:

DOIs for permanent citation
Versioning for dataset updates
Long-term archival (CERN infrastructure)
Large dataset support (up to 50 GB per record by default)
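Because Zenodo DOIs take the form `10.5281/zenodo.<record id>`, a dataset's record page can be derived from its DOI. The sketch below assumes that form and the `https://zenodo.org/records/<id>` URL pattern; the DOI in the example is a placeholder, not a real TSP dataset.

```python
import re

# Zenodo DOIs use the 10.5281 prefix followed by "zenodo.<numeric record id>".
ZENODO_DOI = re.compile(r"^10\.5281/zenodo\.(\d+)$")


def zenodo_record_id(doi: str) -> int:
    """Extract the numeric record ID from a Zenodo DOI."""
    match = ZENODO_DOI.match(doi.strip())
    if match is None:
        raise ValueError(f"not a Zenodo DOI: {doi!r}")
    return int(match.group(1))


def zenodo_record_url(doi: str) -> str:
    """Build the record page URL for a Zenodo DOI."""
    return f"https://zenodo.org/records/{zenodo_record_id(doi)}"


if __name__ == "__main__":
    # Placeholder DOI for illustration only.
    print(zenodo_record_url("10.5281/zenodo.1234567"))
```

Each new version of a dataset receives its own DOI, so a citation made this way points at one specific version.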

Ownership and Communities

Zenodo datasets remain under the uploader's ownership. The TSL Structures community provides a curated index for discovery, but:

Datasets are uploaded to your own Zenodo account
You retain full control (editing, versioning, access settings)
Adding to a community does not transfer ownership
Datasets can belong to multiple communities simultaneously

This allows datasets to be discoverable via TSL Structures while also appearing in other relevant communities (e.g., species-specific, project-specific, or institutional collections).

→ Zenodo Community Setup

Tools

Package	Language	What it does
tsp-maker	Python	Parses AF2/AF3/Boltz2 outputs, builds TSP packages, uploads to Zenodo. Handles format differences automatically.
tslstructures	R	Discovers datasets via community, queries metadata, downloads selected structures. Smart caching avoids re-downloads.

Documentation

Getting Started

Tutorials

Creating Datasets — for data producers
Using Datasets — for data consumers

Reference

Status

Component	Status
TSP Specification	v1.0.0
tsp-maker	Complete
tslstructures	Complete
Zenodo Community	Pending