TSP Specification
TSL Structure Package Format v1.0
Overview
TSP (TSL Structure Package) is a data format for distributing protein structure datasets. It extends the Frictionless Data Package specification with structural biology-specific resources.
A TSP dataset contains:
- Metadata describing each structure (parquet format)
- Structure files in batched archives (PDB/mmCIF)
- Prediction scores from tools like AlphaFold2/3, Boltz2
- PAE matrices (Predicted Aligned Error)
- Optionally, a pre-computed Foldseek database
Directory Structure
<dataset_name>/
├── datapackage.json          # Required: manifest
├── metadata.parquet          # Required: per-structure metadata
├── structures/               # Required: structure files
│   ├── batch_001.tar.gz
│   ├── batch_002.tar.gz
│   └── ...
├── predictions/              # Required: prediction outputs
│   ├── scores.parquet        # Summary scores
│   └── pae/                  # Full PAE matrices
│       ├── batch_001.tar.gz
│       └── ...
├── foldseek/                 # Optional: pre-computed database
│   └── ...
└── schema/                   # Optional: field definitions
    └── metadata_fields.json
datapackage.json
The manifest file describes the dataset and its resources. It follows Frictionless Data Package format with TSP-specific extensions.
Required Fields
| Field | Type | Description |
|---|---|---|
| $schema | string | Schema URL (see below) |
| profile | string | Must be "tsl-structure-package" |
| profile_version | string | TSP spec version (e.g., "1.0.0") |
| name | string | Dataset identifier (lowercase, hyphens OK) |
| version | string | Semantic version of this dataset |
| resources | array | List of resources (see below) |
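The required fields above can be checked with a few lines of R. This is a minimal sketch, assuming the jsonlite package is installed. The helper check_manifest_fields() is hypothetical and is not part of tslstructures, and the name pattern is one reading of "lowercase, hyphens OK" that also admits digits.

```r
library(jsonlite)

# Hypothetical helper (not part of tslstructures): returns a character
# vector of problems found in a parsed manifest; empty means none found.
check_manifest_fields <- function(manifest) {
  required <- c("$schema", "profile", "profile_version",
                "name", "version", "resources")
  problems <- character(0)

  missing_fields <- setdiff(required, names(manifest))
  if (length(missing_fields) > 0) {
    problems <- c(problems, paste("missing field:", missing_fields))
  }
  # [[ ]] is used instead of $ to avoid partial name matching.
  if (!identical(manifest[["profile"]], "tsl-structure-package")) {
    problems <- c(problems, "profile must be \"tsl-structure-package\"")
  }
  # Assumed pattern: lowercase letters, digits, and single hyphens.
  name <- manifest[["name"]]
  if (!is.null(name) && !grepl("^[a-z0-9]+(-[a-z0-9]+)*$", name)) {
    problems <- c(problems, "name must be lowercase with hyphens")
  }
  problems
}

# In practice: manifest <- fromJSON("datapackage.json", simplifyVector = FALSE)
manifest <- fromJSON('{
  "$schema": "https://raw.githubusercontent.com/TeamMacLean/tslstructures/main/inst/schema/tsp-v1.json",
  "profile": "tsl-structure-package",
  "profile_version": "1.0.0",
  "name": "designed-proteins-2024",
  "version": "1.0.0",
  "resources": []
}', simplifyVector = FALSE)

check_manifest_fields(manifest)
```

The sketch only covers field presence and two value rules; it does not validate the manifest against the JSON schema referenced by $schema.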
Recommended Fields
| Field | Type | Description |
|---|---|---|
| title | string | Human-readable title |
| description | string | Dataset description |
| created | string | ISO 8601 timestamp |
| licenses | array | License information |
| contributors | array | Authors and maintainers |
| stats | object | Summary statistics |
| zenodo | object | Zenodo DOI and record info |
Example datapackage.json
{
  "$schema": "https://raw.githubusercontent.com/TeamMacLean/tslstructures/main/inst/schema/tsp-v1.json",
  "profile": "tsl-structure-package",
  "profile_version": "1.0.0",
  "name": "designed-proteins-2024",
  "title": "TSL Designed Protein Structures 2024",
  "description": "20,000 computationally designed protein structures",
  "version": "1.0.0",
  "created": "2024-12-01T00:00:00Z",
  "licenses": [
    {"name": "CC-BY-4.0", "path": "https://creativecommons.org/licenses/by/4.0/"}
  ],
  "contributors": [
    {"title": "The Sainsbury Laboratory", "role": "publisher"}
  ],
  "stats": {
    "structure_count": 20000,
    "total_size_bytes": 4500000000,
    "structure_formats": ["pdb", "cif"],
    "prediction_sources": ["alphafold2", "boltz2"]
  },
  "resources": [
    {
      "name": "metadata",
      "path": "metadata.parquet",
      "format": "parquet",
      "description": "Per-structure statistics and annotations",
      "schema": "schema/metadata_fields.json",
      "bytes": 15000000
    },
    {
      "name": "structures",
      "path": "structures/",
      "format": "pdb-archive",
      "description": "Structure files in batched archives",
      "batches": [
        {"path": "batch_001.tar.gz", "sha256": "abc123...", "bytes": 2000000000, "count": 4500},
        {"path": "batch_002.tar.gz", "sha256": "def456...", "bytes": 2000000000, "count": 4500}
      ]
    },
    {
      "name": "prediction_scores",
      "path": "predictions/scores.parquet",
      "format": "parquet",
      "description": "Summary prediction scores"
    },
    {
      "name": "pae_matrices",
      "path": "predictions/pae/",
      "format": "json-archive",
      "description": "Full PAE matrices",
      "batches": [
        {"path": "batch_001.tar.gz", "sha256": "...", "bytes": 2000000000}
      ]
    },
    {
      "name": "foldseek_db",
      "path": "foldseek/",
      "format": "foldseek-db",
      "optional": true,
      "foldseek_version": "9.427df8a"
    }
  ],
  "zenodo": {
    "record_id": "12345678",
    "doi": "10.5281/zenodo.12345678",
    "concept_doi": "10.5281/zenodo.12345670",
    "community": "tsl-structures"
  }
}
Resources
metadata (Required)
Parquet file containing per-structure metadata. Must be readable with arrow::read_parquet().
Required columns:
| Column | Type | Description |
|---|---|---|
| id | string | Unique structure identifier |
| filename | string | Structure filename within archive |
| batch | string | Which batch archive contains this structure |
| format | string | File format: "pdb" or "cif" |
Recommended columns:
| Column | Type | Description |
|---|---|---|
| residue_count | integer | Number of residues |
| chain_count | integer | Number of chains |
| prediction_source | string | "alphafold2", "alphafold3", "boltz2", "experimental" |
| mean_plddt | float | Mean pLDDT score (0-100) |
| ptm_score | float | Predicted TM score |
| iptm_score | float | Interface pTM (for complexes) |
Additional columns are permitted. Document custom columns in schema/metadata_fields.json.
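The column rules above can be sketched as a check on the data frame that arrow::read_parquet() returns. The helper check_metadata() is hypothetical, not part of tslstructures, and the sample rows (including the form of the batch values) are made up so that the example runs without a dataset.

```r
# Hypothetical helper: stop with a message if a required column is
# absent, ids are duplicated, or format holds an unexpected value.
check_metadata <- function(metadata) {
  required <- c("id", "filename", "batch", "format")
  missing_cols <- setdiff(required, names(metadata))
  if (length(missing_cols) > 0) {
    stop("missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(metadata[["id"]]) > 0) {
    stop("id values must be unique")
  }
  bad_format <- setdiff(unique(metadata[["format"]]), c("pdb", "cif"))
  if (length(bad_format) > 0) {
    stop("unexpected format values: ", paste(bad_format, collapse = ", "))
  }
  invisible(TRUE)
}

# In practice: metadata <- arrow::read_parquet("metadata.parquet")
# Illustrative rows; the batch values are an assumed naming scheme.
metadata <- data.frame(
  id = c("structure_00001", "structure_00002"),
  filename = c("structure_00001.pdb", "structure_00002.pdb"),
  batch = c("batch_001.tar.gz", "batch_001.tar.gz"),
  format = c("pdb", "pdb"),
  stringsAsFactors = FALSE
)
check_metadata(metadata)
```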
structures (Required)
Batched tar.gz archives containing structure files.
Batch requirements:
- Target size: ~2 GB per batch (for streaming downloads)
- Files within batch: PDB (.pdb) or mmCIF (.cif)
- Filenames must match the filename column in metadata
- Each batch listed in datapackage.json with checksum
Archive structure:
batch_001.tar.gz
├── structure_00001.pdb
├── structure_00002.pdb
├── structure_00003.cif
└── ...
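One way to confirm that archive members match the filename column is to compare the two sets directly. This is a sketch only: compare_batch() is a hypothetical helper, the sample data are invented, and it assumes the batch column stores the archive file name.

```r
# Hypothetical helper: compare the members of one batch archive with the
# filenames that metadata assigns to that batch.
compare_batch <- function(archive_members, metadata, batch_name) {
  in_batch <- metadata[["batch"]] == batch_name
  expected <- metadata[["filename"]][in_batch]
  # basename() drops any leading "./" or directory prefix from tar listings.
  found <- basename(archive_members)
  list(
    missing_from_archive = setdiff(expected, found),
    not_in_metadata = setdiff(found, expected)
  )
}

# In practice:
#   members <- utils::untar("structures/batch_001.tar.gz", list = TRUE)
members <- c("structure_00001.pdb", "structure_00003.cif")

metadata <- data.frame(
  filename = c("structure_00001.pdb", "structure_00002.pdb",
               "structure_00004.pdb"),
  batch = c("batch_001.tar.gz", "batch_001.tar.gz", "batch_002.tar.gz"),
  stringsAsFactors = FALSE
)

result <- compare_batch(members, metadata, "batch_001.tar.gz")
result$missing_from_archive  # files listed in metadata but absent
result$not_in_metadata       # files in the archive with no metadata row
```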
prediction_scores (Required)
Parquet file containing prediction outputs from structure prediction tools.
Required columns:
| Column | Type | Description |
|---|---|---|
| id | string | Structure ID (joins to metadata) |
| prediction_source | string | Tool that generated prediction |
| source_version | string | Version of prediction tool |
| mean_plddt | float | Mean pLDDT score |
Tool-specific columns:
Store additional tool-specific outputs in a source_specific JSON column or as additional typed columns.
Common fields:
| Column | Type | Tools | Description |
|---|---|---|---|
| ptm | float | AF2, AF3, Boltz2 | Predicted TM score |
| iptm | float | AF3, Boltz2 | Interface pTM |
| model_rank | integer | All | Which ranked model (0 = best) |
| plddt_above_70_pct | float | All | % residues with pLDDT > 70 |
| plddt_above_90_pct | float | All | % residues with pLDDT > 90 |
| pae_file | string | All | Path to PAE JSON in archive |
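Because id joins the scores to the metadata, the two tables can be combined with a plain merge. The sketch below uses small invented data frames in place of the Parquet files; the ids and score values are illustrative only.

```r
# In practice both tables would come from arrow::read_parquet().
metadata <- data.frame(
  id = c("s1", "s2", "s3"),
  residue_count = c(150L, 220L, 98L),
  stringsAsFactors = FALSE
)
scores <- data.frame(
  id = c("s1", "s2"),
  prediction_source = c("alphafold2", "alphafold2"),
  source_version = c("2.3.2", "2.3.2"),
  mean_plddt = c(91.2, 64.5),
  stringsAsFactors = FALSE
)

# Left join: keep every structure, with NA scores where none exist.
joined <- merge(metadata, scores, by = "id", all.x = TRUE)

# Structures that have metadata but no prediction scores.
unscored <- joined$id[is.na(joined$mean_plddt)]

# Structures whose mean pLDDT is above 70.
confident <- joined$id[!is.na(joined$mean_plddt) & joined$mean_plddt > 70]
```

A left join is used so that structures without scores surface as NA rows instead of silently dropping out of the result.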
pae_matrices (Required)
Batched archives of PAE (Predicted Aligned Error) matrices and per-residue pLDDT scores.
JSON format for each structure:
{
  "id": "structure_00001",
  "prediction_source": "alphafold2",
  "residue_count": 150,
  "plddt": [92.1, 89.3, 87.6, ...],
  "pae": [
    [0.5, 1.2, 2.3, ...],
    [1.1, 0.4, 1.8, ...],
    ...
  ]
}
Archive structure:
predictions/pae/batch_001.tar.gz
├── structure_00001.json
├── structure_00002.json
└── ...
Batch numbering should match structure batches where possible.
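A PAE record in the JSON layout above can be read and sanity-checked as follows. This is a sketch assuming the jsonlite package is installed, and assuming that plddt has one value per residue and that pae is a square matrix with residue_count rows; the specification does not state these constraints explicitly. The three-residue record is invented so that the example is complete.

```r
library(jsonlite)

# A complete three-residue record with made-up values; real files hold
# one pLDDT value and one PAE row per residue.
record <- fromJSON('{
  "id": "structure_00001",
  "prediction_source": "alphafold2",
  "residue_count": 3,
  "plddt": [92.1, 89.3, 67.6],
  "pae": [[0.5, 1.2, 2.3], [1.1, 0.4, 1.8], [2.0, 1.7, 0.6]]
}')

n <- record$residue_count

# fromJSON() simplifies equal-length nested arrays into a numeric matrix.
stopifnot(length(record$plddt) == n)
stopifnot(is.matrix(record$pae), all(dim(record$pae) == c(n, n)))

# Summary values of the kind stored in scores.parquet.
mean_plddt <- mean(record$plddt)
plddt_above_70_pct <- 100 * mean(record$plddt > 70)
```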
foldseek_db (Optional)
Pre-computed Foldseek database for similarity searching.
Contents:
Standard Foldseek database files as produced by foldseek createdb:
foldseek/
├── db
├── db.dbtype
├── db.index
├── db.lookup
├── db.source
├── db_ca
├── db_ca.dbtype
├── db_ca.index
├── db_h
├── db_h.dbtype
├── db_h.index
├── db_ss
├── db_ss.dbtype
└── db_ss.index
The foldseek_version field in datapackage.json indicates which Foldseek version created the database.
Versioning
Datasets follow semantic versioning:
- Major: Breaking changes to structure/schema
- Minor: New structures added, fields added
- Patch: Corrections, metadata fixes
Each version gets a separate Zenodo DOI. The concept_doi links all versions.
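The version rules above can be illustrated in base R. The helper change_level() is hypothetical and handles only plain MAJOR.MINOR.PATCH strings, not pre-release or build suffixes.

```r
# Hypothetical helper: name the highest-level component that differs
# between two MAJOR.MINOR.PATCH version strings.
change_level <- function(old, new) {
  parts <- function(v) as.integer(strsplit(v, ".", fixed = TRUE)[[1]])
  o <- parts(old)
  n <- parts(new)
  stopifnot(length(o) == 3, length(n) == 3)
  levels <- c("major", "minor", "patch")
  changed <- which(o != n)
  if (length(changed) == 0) "none" else levels[changed[1]]
}

change_level("1.0.0", "1.1.0")  # new structures or fields added
change_level("1.1.0", "2.0.0")  # breaking change to structure/schema

# numeric_version() compares components numerically, so "1.10.0" sorts
# after "1.9.0", which a plain string comparison would get wrong.
is_newer <- numeric_version("1.10.0") > numeric_version("1.9.0")
```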
Validation
Use tslstructures::validate_tsp() to check conformance:
library(tslstructures)
validate_tsp("path/to/dataset")
Validation checks:
- datapackage.json exists and is valid JSON
- Required fields present with correct types
- All declared resources exist
- Parquet files readable with expected columns
- Batch checksums match (optional, slow)
Creating a TSP Dataset
See the Creating TSP Datasets guide for packaging your own structures in TSP format.