What is TSP?
TSP (TSL Structure Package) is a data standard for packaging and distributing protein structure predictions at scale.
The Problem
Structure prediction tools (AlphaFold2, AlphaFold3, Boltz2) generate large volumes of data in different formats. A proteome-scale run produces more raw output than is practical to manage or share, but collaborators typically need only the structures, confidence scores, and predicted aligned error (PAE) matrices, not the full predictor output. Without a standard way to package and share this essential information, data ends up siloed on local storage, where it is hard to find and hard to query.
The Solution
TSP captures the essential information from structure predictions in a compact, portable format. A TSP package is a single object containing structures, confidence scores, and PAE matrices in a standardised schema that works across predictors. Metadata is stored in a columnar format (Parquet), so users can query it, for example filtering by confidence score or protein identifier, before downloading any structure files.
To create TSP packages, we provide tsp-maker, a Python tool that automates parsing and packaging of raw predictor output from AlphaFold2, AlphaFold3, and Boltz2.
To work with TSP packages, we provide tslstructures, an R package that enables easy analysis and extraction from large structure datasets.
To make packages discoverable, we provide a Zenodo community that acts as a central index. Datasets remain under individual ownership with permanent DOIs for citation, but the community provides an umbrella that makes them findable in one place.
Package Structure
my-structures/
├── datapackage.json          # Package manifest
├── metadata.parquet          # Queryable structure index
├── structures/               # Structure files (batched)
│   └── batch_001.tar.gz
└── predictions/              # Confidence metrics
    ├── scores.parquet
    └── pae/
        └── batch_001.tar.gz
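As a rough illustration of the layout above, the following sketch checks that a directory contains the listed entries. It is not part of tsp-maker or tslstructures; the function name `missing_entries` is hypothetical, and treating every listed entry as required is an assumption rather than something the specification states here.

```python
from pathlib import Path

# Entries taken from the layout shown above. Whether each one is strictly
# required by the TSP specification is an assumption made for this sketch.
EXPECTED_ENTRIES = (
    "datapackage.json",
    "metadata.parquet",
    "structures",
    "predictions/scores.parquet",
    "predictions/pae",
)


def missing_entries(package_dir):
    """Return the expected relative paths that do not exist under package_dir."""
    root = Path(package_dir)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]
```

Calling `missing_entries("my-structures")` returns an empty list when every entry is present, and otherwise the relative paths that still need to be created.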
Why These Design Choices?
Metadata-first access: With 20,000 structures, you rarely need all of them. Parquet files let you filter by confidence score, protein ID, or predictor before downloading structure files.
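The idea can be sketched in plain Python: filter the metadata rows first, then work out which batch archives actually need downloading. The rows stand in for what a Parquet reader (for example `pandas.read_parquet`) might return from metadata.parquet; the column names `protein_id`, `predictor`, `mean_plddt`, and `batch` are hypothetical and not taken from the TSP schema.

```python
# Stand-in for rows loaded from metadata.parquet.
# All column names and values here are hypothetical examples.
metadata = [
    {"protein_id": "P001", "predictor": "alphafold2", "mean_plddt": 91.2, "batch": "batch_001.tar.gz"},
    {"protein_id": "P002", "predictor": "alphafold3", "mean_plddt": 62.5, "batch": "batch_001.tar.gz"},
    {"protein_id": "P003", "predictor": "boltz2", "mean_plddt": 84.0, "batch": "batch_002.tar.gz"},
    {"protein_id": "P004", "predictor": "alphafold2", "mean_plddt": 55.1, "batch": "batch_003.tar.gz"},
]


def select_structures(rows, min_plddt, predictor=None):
    """Keep rows at or above min_plddt, optionally restricted to one predictor."""
    return [
        row
        for row in rows
        if row["mean_plddt"] >= min_plddt
        and (predictor is None or row["predictor"] == predictor)
    ]


def batches_needed(rows):
    """Return the sorted, de-duplicated batch archives that hold the given rows."""
    return sorted({row["batch"] for row in rows})


selected = select_structures(metadata, min_plddt=80.0)
print([row["protein_id"] for row in selected])  # ['P001', 'P003']
print(batches_needed(selected))  # ['batch_001.tar.gz', 'batch_002.tar.gz']
```

In this toy example the third archive is never fetched, because none of the structures it holds pass the filter.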
Batched distribution: Structures are grouped into ~100 MB archives. This lets you download specific batches, resume interrupted transfers, and compress files efficiently.
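One simple way to form such batches is to fill each one in order until the next file would exceed the target size. The sketch below shows that greedy approach; it is not how tsp-maker is known to work, the function names are hypothetical, and it measures input file sizes, so it assumes the ~100 MB target applies before compression.

```python
TARGET_BATCH_BYTES = 100 * 1024 * 1024  # ~100 MB target, as described in the text


def plan_batches(file_sizes, target_bytes=TARGET_BATCH_BYTES):
    """Group (name, size_in_bytes) pairs into consecutive batches.

    A batch is closed as soon as adding the next file would push it past
    target_bytes. A single file larger than the target gets its own batch.
    """
    batches = []
    current = []
    current_bytes = 0
    for name, size in file_sizes:
        if current and current_bytes + size > target_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def batch_name(index):
    """Format a 1-based batch index in the style used in the layout, e.g. batch_001.tar.gz."""
    return f"batch_{index:03d}.tar.gz"
```

Keeping files in their original order makes the plan deterministic, at the cost of some batches ending up well under the target size.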
Consistent schema: AlphaFold2, AlphaFold3, and Boltz2 each write output in a different format. TSP normalises these into a common schema, making cross-predictor comparisons straightforward.
Community discovery: The Zenodo community provides a central index. Before predicting a protein's structure, check whether someone has already shared a prediction for it.
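A minimal sketch of what normalisation can look like: one small adapter per predictor converts its raw values into a shared record type. Everything here is illustrative. The record fields, the raw keys `plddt` and `plddt_fraction`, and the assignment of confidence scales to predictors are assumptions for the example, not the real predictor outputs or the TSP schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionRecord:
    """Common record shape (hypothetical; not the actual TSP schema)."""

    protein_id: str
    predictor: str
    mean_plddt: float  # always stored on a 0-100 scale


def _from_percent(raw):
    # Assumed input: confidence already on a 0-100 scale under "plddt".
    return float(raw["plddt"])


def _from_fraction(raw):
    # Assumed input: confidence on a 0-1 scale under "plddt_fraction".
    return float(raw["plddt_fraction"]) * 100.0


# Which predictor uses which scale is an assumption made for illustration.
ADAPTERS = {
    "alphafold2": _from_percent,
    "alphafold3": _from_percent,
    "boltz2": _from_fraction,
}


def normalise(protein_id, predictor, raw):
    """Convert one predictor-specific record into the common record shape."""
    try:
        adapter = ADAPTERS[predictor]
    except KeyError:
        raise ValueError(f"unsupported predictor: {predictor}") from None
    return PredictionRecord(protein_id, predictor, adapter(raw))
```

Once every record has the same fields and units, results from different predictors can be sorted, filtered, and compared with the same code.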
Permanent citation: Each dataset gets a DOI. Cite the exact version used in your analysis.
Intended Users
Data Producers share predictions with the community:
- Run structure predictions (AlphaFold2, AlphaFold3, Boltz2)
- Use tsp-maker to parse the outputs and build a TSP package
- Upload the package to Zenodo and add it to the TSL Structures community
- Share the DOI for citation
Data Consumers access shared datasets efficiently:
- Browse the TSL Structures community to find datasets
- Use tslstructures to query and filter metadata
- Download only the structures you need
- Analyse in R with smart caching
Further Reading
- Ecosystem Overview — tools and components
- Installation — setup instructions
- Creating Datasets — producer workflow
- Using Datasets — consumer workflow