What is TSP?

TSP (TSL Structure Package) is a data standard for packaging and distributing protein structure predictions at scale.

The Problem

Structure prediction tools (AlphaFold2, AlphaFold3, Boltz2) generate large amounts of data, each in its own format. A proteome-scale run produces a volume of raw output that is difficult to manage, yet collaborators typically need only the structures, confidence scores, and PAE (predicted aligned error) matrices, not the full predictor output. Without a standard way to package and share this essential information, data ends up siloed on local storage, hard to find, and hard to query.

The Solution

TSP captures the essential information from structure predictions in a compact, portable format. A TSP package is a single object containing structures, confidence scores, and PAE matrices in a standardised schema that works across predictors. Metadata is stored in Parquet, a columnar file format, so users can query it (for example, filtering by confidence score or protein identifier) before downloading any structure files.

To create TSP packages, we provide tsp-maker, a Python tool that automates parsing and packaging of raw predictor output from AlphaFold2, AlphaFold3, and Boltz2.

To work with TSP packages, we provide tslstructures, an R package that enables easy analysis and extraction from large structure datasets.

To make packages discoverable, we provide a Zenodo community that acts as a central index. Datasets remain under individual ownership with permanent DOIs for citation, but the community provides an umbrella that makes them findable in one place.

Package Structure

my-structures/
├── datapackage.json       # Package manifest
├── metadata.parquet       # Queryable structure index
├── structures/            # Structure files (batched)
│   └── batch_001.tar.gz
└── predictions/           # Confidence metrics
    ├── scores.parquet
    └── pae/
        └── batch_001.tar.gz
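
As a rough illustration of the layout above, the following sketch checks a local directory for the entries shown in the tree. The helper name `missing_entries` is hypothetical and is not part of tsp-maker or tslstructures. Treating every entry as required is an assumption for illustration, not a statement of the TSP specification.

```python
from pathlib import Path

# Entries taken from the layout shown above. Treating all of them as
# required is an assumption made for this sketch.
EXPECTED_ENTRIES = {
    "datapackage.json": "file",
    "metadata.parquet": "file",
    "structures": "dir",
    "predictions/scores.parquet": "file",
    "predictions/pae": "dir",
}


def missing_entries(package_root):
    """Return the expected paths that are absent or of the wrong kind, sorted."""
    root = Path(package_root)
    missing = []
    for relative_path, kind in EXPECTED_ENTRIES.items():
        path = root / relative_path
        present = path.is_file() if kind == "file" else path.is_dir()
        if not present:
            missing.append(relative_path)
    return sorted(missing)
```

Calling `missing_entries("my-structures")` on a directory that matches the tree returns an empty list.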

Why These Design Choices?

Metadata-first access: With 20,000 structures, you rarely need all of them. Parquet files let you filter by confidence score, protein ID, or predictor before downloading structure files.
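
The metadata-first idea can be sketched in a few lines: filter the index rows first, then work out which batches are worth downloading. The column names used here (`protein_id`, `predictor`, `confidence`, `batch`) and the function `select_batches` are hypothetical and may not match the actual TSP schema. The rows are illustrative stand-ins for what a Parquet reader would return from `metadata.parquet`.

```python
def select_batches(rows, min_confidence, predictor=None):
    """Return (protein IDs, batch names) for rows that pass the filters.

    `rows` is an iterable of dicts, one per structure. The keys used
    below are assumed names, not the documented TSP schema.
    """
    selected_ids = []
    batches = set()
    for row in rows:
        if row["confidence"] < min_confidence:
            continue
        if predictor is not None and row["predictor"] != predictor:
            continue
        selected_ids.append(row["protein_id"])
        batches.add(row["batch"])
    return selected_ids, sorted(batches)


# Illustrative rows only; real values would come from metadata.parquet.
example_rows = [
    {"protein_id": "protein_a", "predictor": "AlphaFold2",
     "confidence": 91.0, "batch": "batch_001.tar.gz"},
    {"protein_id": "protein_b", "predictor": "Boltz2",
     "confidence": 55.0, "batch": "batch_001.tar.gz"},
    {"protein_id": "protein_c", "predictor": "AlphaFold2",
     "confidence": 88.0, "batch": "batch_002.tar.gz"},
]

ids, needed = select_batches(example_rows, min_confidence=80.0)
```

Only the batches in `needed` would then have to be fetched, which is the point of keeping the index separate from the structure archives.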

Batched distribution: Structures are grouped into ~100 MB archives. This enables downloading specific batches, resuming interrupted transfers, and efficient compression.
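
One simple way to form such batches is a greedy pass over the files, sketched below. This is an assumption about how batching could work, not a description of what tsp-maker does. The function `plan_batches` is hypothetical, and it groups by uncompressed file size, whereas the ~100 MB target above may refer to the size of the compressed archive.

```python
def plan_batches(files, target_bytes=100 * 1024 * 1024):
    """Greedily group (name, size_in_bytes) pairs into batches.

    A new batch starts when adding the next file would push the current
    batch past target_bytes. A single file larger than the target is
    placed in a batch of its own.
    """
    batches = []
    current, current_size = [], 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Illustrative names and sizes, with a small target to keep the example short.
example_files = [
    ("structure_1", 60),
    ("structure_2", 50),
    ("structure_3", 40),
    ("structure_4", 200),
]
plan = plan_batches(example_files, target_bytes=100)
```

Each inner list of `plan` would then be written to its own archive, for example with the standard-library `tarfile` module.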

Consistent schema: AlphaFold2, AlphaFold3, and Boltz2 all output different formats. TSP normalises these into a common schema, making cross-predictor comparisons straightforward.

Community discovery: The Zenodo community provides a central index. Before running a prediction for a protein, check whether someone has already shared one.

Permanent citation: Each dataset gets a DOI. Cite the exact version used in your analysis.
