TSP Specification

Overview

TSP (TSL Structure Package) is a data format for distributing protein structure datasets. It extends the Frictionless Data Package specification with resources specific to structural biology.

A TSP dataset contains:

Metadata describing each structure (parquet format)
Structure files in batched archives (PDB/mmCIF)
Prediction scores from tools like AlphaFold2/3, Boltz2
Predicted Aligned Error (PAE) matrices
Optionally, a pre-computed Foldseek database

Directory Structure

<dataset_name>/
├── datapackage.json          # Required: manifest
├── metadata.parquet          # Required: per-structure metadata
├── structures/               # Required: structure files
│   ├── batch_001.tar.gz
│   ├── batch_002.tar.gz
│   └── ...
├── predictions/              # Required: prediction outputs
│   ├── scores.parquet        # Summary scores
│   └── pae/                  # Full PAE matrices
│       ├── batch_001.tar.gz
│       └── ...
├── foldseek/                 # Optional: pre-computed database
│   └── ...
└── schema/                   # Optional: field definitions
   └── metadata_fields.json

datapackage.json

The manifest file describes the dataset and its resources. It follows the Frictionless Data Package format, with TSP-specific extensions.

Required Fields

Field	Type	Description
`$schema`	string	Schema URL (see below)
`profile`	string	Must be `"tsl-structure-package"`
`profile_version`	string	TSP spec version (e.g., `"1.0.0"`)
`name`	string	Dataset identifier (lowercase; hyphens allowed)
`version`	string	Semantic version of this dataset
`resources`	array	List of resources (see below)
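
The required fields above can be checked mechanically once the manifest has been parsed. The following is a minimal sketch in Python using only the standard library. The helper name `manifest_problems` is hypothetical and is not part of the tslstructures package, and the check covers JSON types and the fixed `profile` value only.

```python
# Required top-level manifest fields and their expected JSON types,
# taken from the Required Fields table.
REQUIRED_FIELDS = {
    "$schema": str,
    "profile": str,
    "profile_version": str,
    "name": str,
    "version": str,
    "resources": list,
}


def manifest_problems(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the fields look fine."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in manifest:
            problems.append(f"missing required field: {field}")
        elif not isinstance(manifest[field], expected):
            problems.append(f"{field} should be of type {expected.__name__}")
    # The profile value is fixed by the specification.
    if manifest.get("profile") not in (None, "tsl-structure-package"):
        problems.append('profile must be "tsl-structure-package"')
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report everything that is wrong with a manifest in a single pass.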

Recommended Fields

Field	Type	Description
`title`	string	Human-readable title
`description`	string	Dataset description
`created`	string	ISO 8601 timestamp
`licenses`	array	License information
`contributors`	array	Authors and maintainers
`stats`	object	Summary statistics
`zenodo`	object	Zenodo DOI and record info

Example datapackage.json

{
 "$schema": "https://raw.githubusercontent.com/TeamMacLean/tslstructures/main/inst/schema/tsp-v1.json",
 "profile": "tsl-structure-package",
 "profile_version": "1.0.0",

 "name": "designed-proteins-2024",
 "title": "TSL Designed Protein Structures 2024",
 "description": "20,000 computationally designed protein structures",
 "version": "1.0.0",
 "created": "2024-12-01T00:00:00Z",

 "licenses": [
   {"name": "CC-BY-4.0", "path": "https://creativecommons.org/licenses/by/4.0/"}
 ],

 "contributors": [
   {"title": "The Sainsbury Laboratory", "role": "publisher"}
 ],

 "stats": {
   "structure_count": 20000,
   "total_size_bytes": 4500000000,
   "structure_formats": ["pdb", "cif"],
   "prediction_sources": ["alphafold2", "boltz2"]
 },

 "resources": [
   {
     "name": "metadata",
     "path": "metadata.parquet",
     "format": "parquet",
     "description": "Per-structure statistics and annotations",
     "schema": "schema/metadata_fields.json",
     "bytes": 15000000
   },
   {
     "name": "structures",
     "path": "structures/",
     "format": "pdb-archive",
     "description": "Structure files in batched archives",
     "batches": [
       {"path": "batch_001.tar.gz", "sha256": "abc123...", "bytes": 2000000000, "count": 4500},
       {"path": "batch_002.tar.gz", "sha256": "def456...", "bytes": 2000000000, "count": 4500}
     ]
   },
   {
     "name": "prediction_scores",
     "path": "predictions/scores.parquet",
     "format": "parquet",
     "description": "Summary prediction scores"
   },
   {
     "name": "pae_matrices",
     "path": "predictions/pae/",
     "format": "json-archive",
     "description": "Full PAE matrices",
     "batches": [
       {"path": "batch_001.tar.gz", "sha256": "...", "bytes": 2000000000}
     ]
   },
   {
     "name": "foldseek_db",
     "path": "foldseek/",
     "format": "foldseek-db",
     "optional": true,
     "foldseek_version": "9.427df8a"
   }
 ],

 "zenodo": {
   "record_id": "12345678",
   "doi": "10.5281/zenodo.12345678",
   "concept_doi": "10.5281/zenodo.12345670",
   "community": "tsl-structures"
 }
}
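
Reading the batch lists out of a manifest like the one above can be sketched as follows. This assumes each batch `path` is relative to its resource's `path`, which the example and the directory layout suggest but the specification does not state outright. `iter_batches` is a hypothetical helper, and the inline manifest is a trimmed copy of the example.

```python
import posixpath


def iter_batches(manifest: dict):
    """Yield (resource name, archive path relative to the dataset root, batch entry)."""
    for resource in manifest.get("resources", []):
        # Only batched resources (structures, pae_matrices) carry a "batches" list.
        for batch in resource.get("batches", []):
            # Assumption: batch paths are relative to the resource path.
            full_path = posixpath.join(resource["path"], batch["path"])
            yield resource["name"], full_path, batch


manifest = {
    "resources": [
        {"name": "metadata", "path": "metadata.parquet"},
        {
            "name": "structures",
            "path": "structures/",
            "batches": [
                {"path": "batch_001.tar.gz", "bytes": 2000000000, "count": 4500},
                {"path": "batch_002.tar.gz", "bytes": 2000000000, "count": 4500},
            ],
        },
    ]
}

batches = list(iter_batches(manifest))
for name, path, entry in batches:
    print(name, path, entry["bytes"])
```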

Resources

metadata (Required)

Parquet file containing per-structure metadata. The file must be readable with arrow::read_parquet().

Required columns:

Column	Type	Description
`id`	string	Unique structure identifier
`filename`	string	Structure filename within archive
`batch`	string	Which batch archive contains this structure
`format`	string	File format: `"pdb"` or `"cif"`

Recommended columns:

Column	Type	Description
`residue_count`	integer	Number of residues
`chain_count`	integer	Number of chains
`prediction_source`	string	`"alphafold2"`, `"alphafold3"`, `"boltz2"`, `"experimental"`
`mean_plddt`	float	Mean pLDDT score (0-100)
`ptm_score`	float	Predicted TM score
`iptm_score`	float	Interface pTM (for complexes)

Additional columns are permitted. Document custom columns in schema/metadata_fields.json.
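
Once a reader has loaded the Parquet schema, the required-column rule reduces to a comparison on column names. The sketch below is written in Python and works on any list of column names (for example the names returned by `pyarrow.parquet.read_schema(path).names`, if pyarrow is available). The helper names are hypothetical.

```python
# Required columns and allowed format values, from the tables above.
REQUIRED_METADATA_COLUMNS = ("id", "filename", "batch", "format")
ALLOWED_FORMATS = {"pdb", "cif"}


def missing_columns(column_names, required=REQUIRED_METADATA_COLUMNS):
    """Return required columns absent from column_names, in table order."""
    present = set(column_names)
    return [name for name in required if name not in present]


def invalid_formats(format_values):
    """Return the distinct values of the format column that are not allowed."""
    return sorted(set(format_values) - ALLOWED_FORMATS)
```

Extra columns are ignored by design, since the specification permits additional columns.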

structures (Required)

Batched tar.gz archives containing structure files.

Batch requirements:

Target size: approximately 2 GB per batch (for streaming downloads)
Files within a batch: PDB (.pdb) or mmCIF (.cif)
Filenames must match the filename column in metadata.parquet
Each batch must be listed in datapackage.json with its SHA-256 checksum

Archive structure:

batch_001.tar.gz
├── structure_00001.pdb
├── structure_00002.pdb
├── structure_00003.cif
└── ...
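
The batch requirements above (checksum listed in the manifest, filenames matching the metadata) can be checked with the Python standard library. This is a sketch under two assumptions: that `sha256` is the hex digest of the compressed archive file itself, and that structure files sit at the archive root as drawn above. All function names are hypothetical.

```python
import hashlib
import tarfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a large batch never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_filenames(path):
    """Return the names of regular files stored in a tar.gz batch."""
    with tarfile.open(path, "r:gz") as archive:
        return {member.name for member in archive.getmembers() if member.isfile()}


def check_batch(path, expected_sha256, expected_filenames):
    """Return a list of problems for one batch; empty means it matches."""
    problems = []
    if sha256_of_file(path) != expected_sha256.lower():
        problems.append("checksum mismatch")
    missing = set(expected_filenames) - archive_filenames(path)
    if missing:
        problems.append(f"missing from archive: {sorted(missing)}")
    return problems
```

Here `expected_filenames` would be the `filename` values of the metadata rows whose `batch` column names this archive.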

prediction_scores (Required)

Parquet file containing prediction outputs from structure prediction tools.

Required columns:

Column	Type	Description
`id`	string	Structure ID (joins to metadata)
`prediction_source`	string	Tool that generated prediction
`source_version`	string	Version of prediction tool
`mean_plddt`	float	Mean pLDDT score

Tool-specific columns:

Store additional tool-specific outputs in a source_specific JSON column or as additional typed columns. Common fields:

Column	Type	Tools	Description
`ptm`	float	AF2, AF3, Boltz2	Predicted TM score
`iptm`	float	AF3, Boltz2	Interface pTM
`model_rank`	integer	All	Rank of the model (0 = best)
`plddt_above_70_pct`	float	All	Percentage of residues with pLDDT > 70
`plddt_above_90_pct`	float	All	Percentage of residues with pLDDT > 90
`pae_file`	string	All	Path to PAE JSON in archive
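
The pLDDT summary columns can be derived from a per-residue pLDDT list. The sketch below is in Python and assumes the percentage columns are expressed on a 0-100 scale and use a strict "greater than" comparison, as the column descriptions suggest. `plddt_summary` is a hypothetical helper.

```python
def plddt_summary(plddt):
    """Compute mean pLDDT and the percentage of residues above 70 and 90."""
    if not plddt:
        raise ValueError("plddt must contain at least one residue")
    n = len(plddt)
    return {
        "mean_plddt": sum(plddt) / n,
        # Strict comparison: a residue at exactly 70 or 90 is not counted.
        "plddt_above_70_pct": 100.0 * sum(1 for v in plddt if v > 70) / n,
        "plddt_above_90_pct": 100.0 * sum(1 for v in plddt if v > 90) / n,
    }


summary = plddt_summary([92.1, 89.3, 87.6, 65.0])
print(summary)
```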

pae_matrices (Required)

Batched archives of PAE (Predicted Aligned Error) matrices and per-residue pLDDT scores.

JSON format for each structure:

{
 "id": "structure_00001",
 "prediction_source": "alphafold2",
 "residue_count": 150,
 "plddt": [92.1, 89.3, 87.6, ...],
 "pae": [
   [0.5, 1.2, 2.3, ...],
   [1.1, 0.4, 1.8, ...],
   ...
 ]
}

Archive structure:

predictions/pae/batch_001.tar.gz
├── structure_00001.json
├── structure_00002.json
└── ...

Batch numbering should match structure batches where possible.
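
A consistency check for a single PAE record can be sketched in Python as follows. It assumes one pLDDT value per residue and a square PAE matrix with one row and one column per residue. The specification shows this shape in its example but does not state it as a rule, so treat the check as illustrative. `pae_record_problems` is a hypothetical helper.

```python
import json


def pae_record_problems(record: dict) -> list[str]:
    """Check that plddt and pae dimensions agree with residue_count."""
    problems = []
    n = record.get("residue_count")
    if not isinstance(n, int) or n <= 0:
        return ["residue_count must be a positive integer"]
    if len(record.get("plddt", [])) != n:
        problems.append("plddt length does not match residue_count")
    pae = record.get("pae", [])
    # Assumption: the PAE matrix is square, residue_count x residue_count.
    if len(pae) != n or any(len(row) != n for row in pae):
        problems.append("pae is not a residue_count x residue_count matrix")
    return problems


# A tiny three-residue record in the layout shown above.
record = json.loads("""
{
  "id": "structure_00001",
  "prediction_source": "alphafold2",
  "residue_count": 3,
  "plddt": [92.1, 89.3, 87.6],
  "pae": [[0.5, 1.2, 2.3], [1.1, 0.4, 1.8], [2.0, 1.7, 0.3]]
}
""")
problems = pae_record_problems(record)
```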

foldseek_db (Optional)

Pre-computed Foldseek database for similarity searching.

Contents:

Standard Foldseek database files as produced by foldseek createdb:

foldseek/
├── db
├── db.dbtype
├── db.index
├── db.lookup
├── db.source
├── db_ca
├── db_ca.dbtype
├── db_ca.index
├── db_h
├── db_h.dbtype
├── db_h.index
├── db_ss
├── db_ss.dbtype
└── db_ss.index

The foldseek_version field in datapackage.json indicates which Foldseek version created the database.

Schema Files

schema/metadata_fields.json

Optional file documenting metadata columns:

{
 "fields": [
   {
     "name": "id",
     "type": "string",
     "required": true,
     "description": "Unique structure identifier"
   },
   {
     "name": "custom_score",
     "type": "number",
     "required": false,
     "description": "Dataset-specific quality score"
   }
 ],
 "custom_fields_allowed": true
}

Versioning

Datasets follow semantic versioning:

Major: Breaking changes to structure/schema
Minor: New structures or fields added
Patch: Corrections, metadata fixes

Each version gets a separate Zenodo DOI. The concept_doi links all versions.
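
The bump rules above can be expressed as a small helper. This Python sketch assumes plain MAJOR.MINOR.PATCH version strings with no pre-release or build suffix. `bump_version` is a hypothetical name.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next dataset version for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # breaking changes to structure/schema
        return f"{major + 1}.0.0"
    if change == "minor":   # new structures or fields added
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # corrections, metadata fixes
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```

For example, adding structures to version 1.0.0 would give 1.1.0, while a schema-breaking change would give 2.0.0.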

Validation

Use tslstructures::validate_tsp() to check conformance:

# Load the package and check a dataset directory against this specification
library(tslstructures)
validate_tsp("path/to/dataset")

Validation checks:

datapackage.json exists and is valid JSON
Required fields present with correct types
All declared resources exist
Parquet files readable with expected columns
Batch checksums match (optional, slow)
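
For readers working outside R, two of the checks listed above (the manifest exists and parses, and all declared resources exist) can be approximated with the Python standard library. This is an illustrative sketch, not the implementation behind validate_tsp(). It assumes that batch paths are relative to the resource path and that resources marked `"optional": true` may be absent. `basic_checks` is a hypothetical name.

```python
import json
from pathlib import Path


def basic_checks(dataset_dir):
    """Return a list of problems found by two of the checks listed above."""
    root = Path(dataset_dir)
    manifest_path = root / "datapackage.json"
    if not manifest_path.is_file():
        return ["datapackage.json not found"]
    try:
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as error:
        return [f"datapackage.json is not valid JSON: {error}"]
    if not isinstance(manifest, dict):
        return ["datapackage.json must contain a JSON object"]

    problems = []
    for resource in manifest.get("resources", []):
        target = root / resource["path"]
        if not target.exists():
            # Assumption: resources flagged "optional": true may be absent.
            if not resource.get("optional", False):
                problems.append(f"missing resource: {resource['path']}")
            continue
        # Assumption: batch paths are relative to the resource path.
        for batch in resource.get("batches", []):
            archive = target / batch["path"]
            if not archive.is_file():
                relative = archive.relative_to(root).as_posix()
                problems.append(f"missing batch: {relative}")
    return problems
```

The slower checks (Parquet columns and batch checksums) are deliberately left out so this function stays cheap to run.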

Creating a TSP Dataset

See the Creating TSP Datasets guide for packaging your own structures in TSP format.

Changelog

v1.0.0

Initial specification

TSL Structure Package Format v1.0