Overview

TSP (TSL Structure Package) is a data format for distributing protein structure datasets. It extends the Frictionless Data Package specification with structural biology-specific resources.

A TSP dataset contains:

  • Metadata describing each structure (parquet format)
  • Structure files in batched archives (PDB/mmCIF)
  • Prediction scores from tools such as AlphaFold2, AlphaFold3, and Boltz2
  • PAE matrices (Predicted Aligned Error)
  • Optionally, a pre-computed Foldseek database

Directory Structure

<dataset_name>/
├── datapackage.json          # Required: manifest
├── metadata.parquet          # Required: per-structure metadata
├── structures/               # Required: structure files
│   ├── batch_001.tar.gz
│   ├── batch_002.tar.gz
│   └── ...
├── predictions/              # Required: prediction outputs
│   ├── scores.parquet        # Summary scores
│   └── pae/                  # Full PAE matrices
│       ├── batch_001.tar.gz
│       └── ...
├── foldseek/                 # Optional: pre-computed database
│   └── ...
└── schema/                   # Optional: field definitions
    └── metadata_fields.json

datapackage.json

The manifest file describes the dataset and its resources. It follows the Frictionless Data Package format with TSP-specific extensions.

Required Fields

Field            Type    Description
$schema          string  Schema URL (see below)
profile          string  Must be "tsl-structure-package"
profile_version  string  TSP spec version (e.g., "1.0.0")
name             string  Dataset identifier (lowercase, hyphens OK)
version          string  Semantic version of this dataset
resources        array   List of resources (see below)

Optional Fields

Field         Type    Description
title         string  Human-readable title
description   string  Dataset description
created       string  ISO 8601 timestamp
licenses      array   License information
contributors  array   Authors and maintainers
stats         object  Summary statistics
zenodo        object  Zenodo DOI and record info
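
The required fields listed above lend themselves to a mechanical check. The following is a minimal Python sketch, not the logic used by tslstructures::validate_tsp(). The function name check_manifest and the name pattern are assumptions made for illustration; in particular, "lowercase, hyphens OK" is read here as lowercase letters, digits, and hyphens.

```python
import re

# Required top-level fields and the Python type each JSON value should map to,
# taken from the Required Fields table.
REQUIRED_FIELDS = {
    "$schema": str,
    "profile": str,
    "profile_version": str,
    "name": str,
    "version": str,
    "resources": list,
}

# Assumption: "lowercase, hyphens OK" means lowercase letters, digits, hyphens.
NAME_PATTERN = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def check_manifest(manifest):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in manifest:
            problems.append(f"missing required field: {field}")
        elif not isinstance(manifest[field], expected):
            problems.append(f"{field} should be of type {expected.__name__}")
    if manifest.get("profile") not in (None, "tsl-structure-package"):
        problems.append('profile must be "tsl-structure-package"')
    name = manifest.get("name")
    if isinstance(name, str) and not NAME_PATTERN.fullmatch(name):
        problems.append("name should be lowercase with optional hyphens")
    return problems
```

The manifest argument is the dictionary obtained by parsing datapackage.json, for example with json.load().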

Example datapackage.json

{
 "$schema": "https://raw.githubusercontent.com/TeamMacLean/tslstructures/main/inst/schema/tsp-v1.json",
 "profile": "tsl-structure-package",
 "profile_version": "1.0.0",

 "name": "designed-proteins-2024",
 "title": "TSL Designed Protein Structures 2024",
 "description": "20,000 computationally designed protein structures",
 "version": "1.0.0",
 "created": "2024-12-01T00:00:00Z",

 "licenses": [
   {"name": "CC-BY-4.0", "path": "https://creativecommons.org/licenses/by/4.0/"}
 ],

 "contributors": [
   {"title": "The Sainsbury Laboratory", "role": "publisher"}
 ],

 "stats": {
   "structure_count": 20000,
   "total_size_bytes": 4500000000,
   "structure_formats": ["pdb", "cif"],
   "prediction_sources": ["alphafold2", "boltz2"]
 },

 "resources": [
   {
     "name": "metadata",
     "path": "metadata.parquet",
     "format": "parquet",
     "description": "Per-structure statistics and annotations",
     "schema": "schema/metadata_fields.json",
     "bytes": 15000000
   },
   {
     "name": "structures",
     "path": "structures/",
     "format": "pdb-archive",
     "description": "Structure files in batched archives",
     "batches": [
       {"path": "batch_001.tar.gz", "sha256": "abc123...", "bytes": 2000000000, "count": 4500},
       {"path": "batch_002.tar.gz", "sha256": "def456...", "bytes": 2000000000, "count": 4500}
     ]
   },
   {
     "name": "prediction_scores",
     "path": "predictions/scores.parquet",
     "format": "parquet",
     "description": "Summary prediction scores"
   },
   {
     "name": "pae_matrices",
     "path": "predictions/pae/",
     "format": "json-archive",
     "description": "Full PAE matrices",
     "batches": [
       {"path": "batch_001.tar.gz", "sha256": "...", "bytes": 2000000000}
     ]
   },
   {
     "name": "foldseek_db",
     "path": "foldseek/",
     "format": "foldseek-db",
     "optional": true,
     "foldseek_version": "9.427df8a"
   }
 ],

 "zenodo": {
   "record_id": "12345678",
   "doi": "10.5281/zenodo.12345678",
   "concept_doi": "10.5281/zenodo.12345670",
   "community": "tsl-structures"
 }
}
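
To show how the batches arrays in the example might be consumed, the sketch below flattens every declared batch into one list. It assumes that each batch path is relative to the path of its enclosing resource, which the example suggests but does not state; list_batches is a hypothetical helper name.

```python
import posixpath


def list_batches(manifest):
    """Return (resource_name, path, sha256, bytes) for every declared batch."""
    rows = []
    for resource in manifest.get("resources", []):
        # Resources without a "batches" array (e.g. parquet files) are skipped.
        for batch in resource.get("batches", []):
            # Assumption: batch paths are relative to the resource's "path".
            full_path = posixpath.join(resource["path"], batch["path"])
            rows.append(
                (resource["name"], full_path, batch.get("sha256"), batch.get("bytes"))
            )
    return rows
```

Applied to the example manifest, the first entry would pair the structures resource with the path structures/batch_001.tar.gz.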

Resources

metadata (Required)

Parquet file containing per-structure metadata. Must be readable with arrow::read_parquet().

Required columns:

Column    Type    Description
id        string  Unique structure identifier
filename  string  Structure filename within archive
batch     string  Which batch archive contains this structure
format    string  File format: "pdb" or "cif"

Recommended columns:

Column             Type     Description
residue_count      integer  Number of residues
chain_count        integer  Number of chains
prediction_source  string   "alphafold2", "alphafold3", "boltz2", "experimental"
mean_plddt         float    Mean pLDDT score (0-100)
ptm_score          float    Predicted TM score
iptm_score         float    Interface pTM (for complexes)

Additional columns are permitted. Document custom columns in schema/metadata_fields.json.
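
Because every metadata row names both its file and its batch, the required columns are enough to plan which archives to open. The sketch below groups filenames by batch. It assumes the rows have already been read from metadata.parquet into plain dictionaries; files_by_batch is a hypothetical helper, and the exact form of the batch values (for example "batch_001") is not fixed by this specification.

```python
from collections import defaultdict

# Required metadata columns, as listed in the table above.
REQUIRED_COLUMNS = ("id", "filename", "batch", "format")


def files_by_batch(rows):
    """Group structure filenames by the batch archive that contains them.

    `rows` is an iterable of dicts, one per metadata row.
    """
    grouped = defaultdict(list)
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise KeyError(f"metadata row is missing columns: {missing}")
        if row["format"] not in ("pdb", "cif"):
            raise ValueError(f"unexpected format for {row['id']}: {row['format']}")
        grouped[row["batch"]].append(row["filename"])
    return dict(grouped)
```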

structures (Required)

Batched tar.gz archives containing structure files.

Batch requirements:

  • Target size: ~2 GB per batch (for streaming downloads)
  • Files within a batch: PDB (.pdb) or mmCIF (.cif)
  • Filenames must match the filename column in metadata
  • Each batch must be listed in datapackage.json with its SHA-256 checksum

Archive structure:

batch_001.tar.gz
├── structure_00001.pdb
├── structure_00002.pdb
├── structure_00003.cif
└── ...
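
A downloaded batch can be checked against its declared checksum and inspected with the Python standard library alone. This is a sketch under the assumption that the sha256 value in datapackage.json is the hex digest of the whole .tar.gz file; the helper names are illustrative.

```python
import hashlib
import tarfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a ~2 GB batch is never held in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def structure_members(path):
    """Return the sorted names of .pdb and .cif files inside a batch archive."""
    with tarfile.open(path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith((".pdb", ".cif"))
        )
```

Comparing sha256_of_file(path) with the manifest value confirms the download, and the list from structure_members(path) can then be compared with the filename column in metadata.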

prediction_scores (Required)

Parquet file containing prediction outputs from structure prediction tools.

Required columns:

Column             Type    Description
id                 string  Structure ID (joins to metadata)
prediction_source  string  Tool that generated prediction
source_version     string  Version of prediction tool
mean_plddt         float   Mean pLDDT score

Tool-specific columns:

Store additional tool-specific outputs in a source_specific JSON column or as additional typed columns. Common fields:

Column              Type     Tools             Description
ptm                 float    AF2, AF3, Boltz2  Predicted TM score
iptm                float    AF3, Boltz2       Interface pTM
model_rank          integer  All               Which ranked model (0 = best)
plddt_above_70_pct  float    All               % residues with pLDDT > 70
plddt_above_90_pct  float    All               % residues with pLDDT > 90
pae_file            string   All               Path to PAE JSON in archive
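
The pLDDT summary columns can be derived from the per-residue scores. The sketch below shows one plausible calculation, assuming the percentages are expressed on a 0-100 scale and that the thresholds are strict (greater than, as written in the table). plddt_summary is a hypothetical helper name.

```python
def plddt_summary(plddt):
    """Compute mean pLDDT and the percentage of residues above 70 and above 90."""
    if not plddt:
        raise ValueError("plddt must contain at least one residue score")
    n = len(plddt)
    return {
        "mean_plddt": sum(plddt) / n,
        # Strict comparison: a residue scoring exactly 70 is not counted.
        "plddt_above_70_pct": 100.0 * sum(score > 70 for score in plddt) / n,
        "plddt_above_90_pct": 100.0 * sum(score > 90 for score in plddt) / n,
    }
```

For four residues scoring 95, 85, 65, and 55, this gives a mean of 75, with 50% of residues above 70 and 25% above 90.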

pae_matrices (Required)

Batched archives of PAE (Predicted Aligned Error) matrices and per-residue pLDDT scores.

JSON format for each structure:

{
 "id": "structure_00001",
 "prediction_source": "alphafold2",
 "residue_count": 150,
 "plddt": [92.1, 89.3, 87.6, ...],
 "pae": [
   [0.5, 1.2, 2.3, ...],
   [1.1, 0.4, 1.8, ...],
   ...
 ]
}

Archive structure:

predictions/pae/batch_001.tar.gz
├── structure_00001.json
├── structure_00002.json
└── ...

Batch numbering should match structure batches where possible.
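
A parsed PAE record can be sanity-checked for internal consistency. The sketch below assumes that plddt holds one value per residue and that pae is a square matrix with residue_count rows and columns; the example above suggests this, but the specification does not state it outright. check_pae_record is a hypothetical helper name.

```python
def check_pae_record(record):
    """Return a list of shape problems in one parsed PAE JSON record."""
    problems = []
    n = record.get("residue_count")
    if not isinstance(n, int):
        return ["residue_count is missing or not an integer"]
    plddt = record.get("plddt", [])
    pae = record.get("pae", [])
    # Assumption: one pLDDT value per residue.
    if len(plddt) != n:
        problems.append(f"plddt has {len(plddt)} values, expected {n}")
    # Assumption: PAE is an n x n matrix stored as a list of rows.
    if len(pae) != n or any(len(row) != n for row in pae):
        problems.append(f"pae is not a {n} x {n} matrix")
    return problems
```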

foldseek_db (Optional)

Pre-computed Foldseek database for similarity searching.

Contents:

Standard Foldseek database files as produced by foldseek createdb:

foldseek/
├── db
├── db.dbtype
├── db.index
├── db.lookup
├── db.source
├── db_ca
├── db_ca.dbtype
├── db_ca.index
├── db_h
├── db_h.dbtype
├── db_h.index
├── db_ss
├── db_ss.dbtype
└── db_ss.index

The foldseek_version field in datapackage.json indicates which Foldseek version created the database.

Schema Files

schema/metadata_fields.json

Optional file documenting metadata columns:

{
 "fields": [
   {
     "name": "id",
     "type": "string",
     "required": true,
     "description": "Unique structure identifier"
   },
   {
     "name": "custom_score",
     "type": "number",
     "required": false,
     "description": "Dataset-specific quality score"
   }
 ],
 "custom_fields_allowed": true
}
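
One way to use this file is to compare its field list with the columns actually present in metadata.parquet. The sketch below is an assumption about how custom_fields_allowed might be interpreted (undocumented columns are reported only when it is false); the specification does not define its semantics, and check_columns is a hypothetical helper name.

```python
def check_columns(schema, columns):
    """Compare the column names of a table with a metadata_fields.json schema."""
    fields = schema.get("fields", [])
    documented = {field["name"] for field in fields}
    required = {field["name"] for field in fields if field.get("required")}
    present = set(columns)
    problems = [
        f"missing required column: {name}" for name in sorted(required - present)
    ]
    # Assumption: undocumented columns are a problem only when custom fields
    # are explicitly disallowed.
    if not schema.get("custom_fields_allowed", True):
        problems += [
            f"undocumented column: {name}" for name in sorted(present - documented)
        ]
    return problems
```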

Versioning

Datasets follow semantic versioning:

  • Major: Breaking changes to structure/schema
  • Minor: New structures added, fields added
  • Patch: Corrections, metadata fixes

Each version gets a separate Zenodo DOI. The concept_doi links all versions.
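
The level of a release can be read off by comparing two version strings. This sketch assumes plain MAJOR.MINOR.PATCH versions without pre-release or build suffixes; change_level is a hypothetical helper name.

```python
def parse_version(version):
    """Split "MAJOR.MINOR.PATCH" into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def change_level(old, new):
    """Return "major", "minor", or "patch" for an upgrade from old to new."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    if new_parts <= old_parts:
        raise ValueError(f"{new} is not newer than {old}")
    if new_parts[0] != old_parts[0]:
        return "major"
    if new_parts[1] != old_parts[1]:
        return "minor"
    return "patch"
```

For example, going from 1.0.0 to 1.1.0 is a minor release, which under the rules above would correspond to new structures or fields being added.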

Validation

Use tslstructures::validate_tsp() to check conformance:

library(tslstructures)
validate_tsp("path/to/dataset")

Validation checks:

  1. datapackage.json exists and is valid JSON
  2. Required fields present with correct types
  3. All declared resources exist
  4. Parquet files readable with expected columns
  5. Batch checksums match (optional, slow)

Creating a TSP Dataset

See the Creating TSP Datasets guide for packaging your own structures in TSP format.

Changelog

v1.0.0

  • Initial specification