TSP Package Structure

This document describes the internal structure of a TSP (TSL Structure Package).


Package Layout

dataset/
├── datapackage.json           # Manifest and metadata
├── metadata.parquet           # Per-structure index
├── structures/
│   ├── batch_001.tar.gz       # Structure files (batched)
│   ├── batch_002.tar.gz
│   └── ...
└── predictions/
    ├── scores.parquet         # Confidence metrics
    └── pae/
        ├── batch_001.tar.gz   # PAE matrices (batched)
        └── ...

Components

datapackage.json

The manifest file describing package contents. Based on the Frictionless Data Package standard with TSP-specific extensions.

{
  "name": "tsl-test-structures",
  "title": "TSL Test Structure Dataset",
  "version": "1.0.0",
  "tsp_version": "1.0.0",

  "contributors": [
    {"title": "Dan MacLean", "role": "author"}
  ],

  "licenses": [
    {"name": "CC-BY-4.0", "path": "https://creativecommons.org/licenses/by/4.0/"}
  ],

  "stats": {
    "structure_count": 62,
    "protein_count": 7,
    "prediction_sources": ["alphafold2", "alphafold3", "boltz2"]
  },

  "resources": [
    {"name": "metadata", "path": "metadata.parquet", "format": "parquet"},
    {"name": "scores", "path": "predictions/scores.parquet", "format": "parquet"},
    {"name": "structures", "path": "structures/", "format": "tar.gz"},
    {"name": "pae", "path": "predictions/pae/", "format": "tar.gz"}
  ]
}

Key fields:

- tsp_version: Format version for compatibility
- stats: Summary statistics
- resources: List of package components with paths and formats
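As an illustration, the manifest can be loaded and sanity-checked with a few lines of Python. The check_manifest helper and the major-version compatibility rule below are assumptions for the sketch, not part of the specification:

```python
import json

# Assumed compatibility rule for this sketch: accept any manifest
# whose tsp_version shares our supported major version.
SUPPORTED_MAJOR = 1

def check_manifest(text: str) -> dict:
    manifest = json.loads(text)
    major = int(manifest["tsp_version"].split(".")[0])
    if major != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported tsp_version {manifest['tsp_version']}")
    return manifest

# A trimmed-down datapackage.json, inlined for the example.
example = '''{
  "name": "tsl-test-structures",
  "version": "1.0.0",
  "tsp_version": "1.0.0",
  "stats": {"structure_count": 62},
  "resources": [{"name": "metadata", "path": "metadata.parquet", "format": "parquet"}]
}'''

m = check_manifest(example)
print(m["name"], m["stats"]["structure_count"])
```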


metadata.parquet

A queryable index of all structures. Each row represents one structure file.

structure_id          protein_id  source  model_rank  batch
P12345_AF2_1          P12345      af2     1           batch_001
P12345_AF2_2          P12345      af2     2           batch_001
P12345_AF3_1          P12345      af3     1           batch_001
P24704_Q39613_BZ2_1   P24704      boltz2  1           batch_001

Standard columns:

- structure_id: Unique identifier for each structure file
- protein_id: UniProt accession(s)
- source: Prediction tool (af2, af3, boltz2)
- model_rank: Ranking among models from the same prediction run
- batch: Archive containing this structure
- filename: Filename within the archive

This file enables filtering without downloading structures:

# Find rank-1 predictions for a specific protein
dataset |>
  filter(protein_id == "P12345", model_rank == 1)
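The same query can be sketched in Python over rows loaded from metadata.parquet (the rows are inlined here as plain dicts; in practice a parquet reader such as pyarrow would supply them). The point of the sketch is that filtering also yields the minimal set of batches to download:

```python
# Rows as they would come out of metadata.parquet (inlined for illustration).
rows = [
    {"structure_id": "P12345_AF2_1", "protein_id": "P12345", "source": "af2", "model_rank": 1, "batch": "batch_001"},
    {"structure_id": "P12345_AF2_2", "protein_id": "P12345", "source": "af2", "model_rank": 2, "batch": "batch_001"},
    {"structure_id": "P12345_AF3_1", "protein_id": "P12345", "source": "af3", "model_rank": 1, "batch": "batch_001"},
]

# Filter without touching any structure archive...
hits = [r for r in rows if r["protein_id"] == "P12345" and r["model_rank"] == 1]

# ...then derive the minimal set of batches to fetch.
batches = sorted({r["batch"] for r in hits})
print([r["structure_id"] for r in hits], batches)
```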

structures/batch_*.tar.gz

Structure files grouped into compressed archives. Each archive contains PDB or mmCIF files:

batch_001.tar.gz
├── P12345_AF2_1.pdb
├── P12345_AF2_2.pdb
├── P12345_AF3_1.cif
├── P24704_BZ2_1.pdb
└── ...

Batching rationale:

- Reduces HTTP overhead (one request per batch rather than per file)
- Enables parallel and resumable downloads
- Improves compression (similar structures compress well together)
- Avoids file-listing overhead for large datasets

Default batch size is approximately 100 MB compressed. Small datasets may have a single batch; large datasets may have hundreds.
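Working with a batch archive is plain tar handling. The sketch below builds a tiny in-memory stand-in for batch_001.tar.gz (the member names and contents are invented), then lists its members and extracts a single file without unpacking the rest:

```python
import io
import tarfile

# Build a tiny in-memory archive standing in for batch_001.tar.gz.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, body in [("P12345_AF2_1.pdb", b"ATOM ...\n"),
                       ("P12345_AF2_2.pdb", b"ATOM ...\n")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(body)
        tar.addfile(info, io.BytesIO(body))

# List members, then pull out one file by name.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
    data = tar.extractfile("P12345_AF2_1.pdb").read()

print(names)
```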


predictions/scores.parquet

Confidence scores for each structure, queryable without loading structure files.

structure_id          mean_plddt  ptm_score  iptm_score  ranking_score
P12345_AF2_1          92.3        0.89       NULL        0.92
P12345_AF2_2          88.1        0.85       NULL        0.88
P24704_Q39613_BZ2_1   78.5        0.72       0.68        0.70

Standard scores:

- mean_plddt: Average per-residue confidence (0–100)
- ptm_score: Predicted TM-score (0–1)
- iptm_score: Interface pTM for complexes (0–1)
- ranking_score: Overall ranking metric from the predictor

This enables confidence-based filtering:

# Find high-confidence structures
dataset |>
  filter(mean_plddt > 90, ptm_score > 0.8)
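Because scores and metadata share the structure_id column, a confidence filter can be joined back to the batches that hold the surviving files. A stdlib sketch (rows inlined, with None standing for NULL; a real workflow would read both parquet files):

```python
# Scores as they would come out of scores.parquet (None stands for NULL).
scores = [
    {"structure_id": "P12345_AF2_1", "mean_plddt": 92.3, "ptm_score": 0.89, "iptm_score": None},
    {"structure_id": "P12345_AF2_2", "mean_plddt": 88.1, "ptm_score": 0.85, "iptm_score": None},
    {"structure_id": "P24704_Q39613_BZ2_1", "mean_plddt": 78.5, "ptm_score": 0.72, "iptm_score": 0.68},
]

# Metadata subset keyed by structure_id, for the join.
batch_of = {"P12345_AF2_1": "batch_001",
            "P12345_AF2_2": "batch_001",
            "P24704_Q39613_BZ2_1": "batch_001"}

# High-confidence filter, then map back to the batches holding those files.
confident = [s["structure_id"] for s in scores
             if s["mean_plddt"] > 90 and s["ptm_score"] > 0.8]
needed = sorted({batch_of[sid] for sid in confident})
print(confident, needed)
```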

predictions/pae/batch_*.tar.gz

Predicted Aligned Error (PAE) matrices, stored separately due to size.

pae_batch_001.tar.gz
├── P12345_AF2_1_pae.json
├── P12345_AF3_1_pae.json
└── ...

PAE matrices are N×N where N is sequence length. A 500-residue protein produces 250,000 values per structure. Storing PAE separately allows users to access structures without downloading this data, and to fetch specific matrices when needed.

Format: JSON files with the PAE matrix stored as a 2D array under the pae key. Values are predicted distance errors in ångströms.
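Loading a PAE file is then a matter of parsing the JSON and reading the array under the pae key. The toy 3-residue payload below is invented for the sketch; the sanity checks (square matrix, near-zero diagonal) are conventional for PAE data:

```python
import json

# A toy 3-residue PAE payload in the layout described above
# (2D array under the pae key, values in ångströms).
payload = json.dumps({"pae": [[0.5, 4.2, 9.1],
                              [4.0, 0.4, 5.5],
                              [8.8, 5.3, 0.6]]})

matrix = json.loads(payload)["pae"]
n = len(matrix)

# Sanity checks: the matrix must be N×N.
assert all(len(row) == n for row in matrix)
max_error = max(v for row in matrix for v in row)
print(n, max_error)
```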


Data Flow

%%{init: {'theme': 'base'}}%%
flowchart LR
    subgraph Query["1. Query"]
        M[metadata.parquet]
        S[scores.parquet]
    end

    subgraph Download["2. Download"]
        B1[batch_001.tar.gz]
        B2[batch_002.tar.gz]
        P1[pae_batch_001.tar.gz]
    end

    subgraph Analysis["3. Analysis"]
        PDB[Structure files]
        PAE[PAE matrices]
    end

    M -->|filter| B1
    M -->|filter| B2
    S -->|filter| B1
    B1 -->|extract| PDB
    B2 -->|extract| PDB
    P1 -->|extract| PAE

1. Query: Read the parquet files to identify structures of interest
2. Download: Fetch only the batches containing those structures
3. Analysis: Extract and process the specific files

Format Rationale

Component   Format          Rationale
Manifest    JSON            Human-readable, universal support
Metadata    Parquet         Columnar format, efficient filtering, compact
Structures  tar.gz          Streaming access, good compression
Scores      Parquet         Queryable, joins with metadata
PAE         JSON in tar.gz  Standard format, separate for size

Parquet is a columnar format that enables reading only needed columns, provides compression (typically 10× smaller than CSV), preserves types, and works with Arrow, DuckDB, and dplyr.

tar.gz is a standard format with universal support. Contents can be listed without full download, individual files can be extracted (streaming), and compression is effective for text-based PDB/CIF files.
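The streaming access mentioned above can be sketched with Python's tarfile in pipe mode ("r|gz"), which walks the archive strictly forward, member by member, without random access — the pattern used when reading from a network stream. The archive contents are invented for the example:

```python
import io
import tarfile

# Create a small gzipped tar to stand in for a remote batch archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    body = b"data_P12345\n"
    info = tarfile.TarInfo(name="P12345_AF3_1.cif")
    info.size = len(body)
    tar.addfile(info, io.BytesIO(body))

# "r|gz" reads sequentially, as from a stream: members can be
# listed (and selectively kept) without seeking in the archive.
buf.seek(0)
seen = []
with tarfile.open(fileobj=buf, mode="r|gz") as tar:
    for member in tar:
        seen.append(member.name)

print(seen)
```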


Versioning

TSP packages support versioned releases:

tsl-arabidopsis-structures/
├── v1.0.0/    # Initial release
├── v1.1.0/    # Added predictions
└── v2.0.0/    # Major update

Zenodo provides:

- Record DOI: Points to a specific version (e.g., 10.5281/zenodo.12345678)
- Concept DOI: Points to the latest version (e.g., 10.5281/zenodo.12345670)

The datapackage.json includes version information:

{
  "version": "1.0.0",
  "zenodo": {
    "record_id": "12345678",
    "doi": "10.5281/zenodo.12345678",
    "concept_doi": "10.5281/zenodo.12345670"
  }
}
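Given the versioned directory layout above, picking the latest release is a numeric sort over the version components, not a string sort (which would put "v10.0.0" before "v2.0.0"). A minimal sketch, assuming plain MAJOR.MINOR.PATCH names with no pre-release tags:

```python
# Version directory names as in the layout above.
versions = ["v1.0.0", "v1.1.0", "v2.0.0"]

def version_key(v: str) -> tuple:
    # "v1.1.0" -> (1, 1, 0): compare components numerically.
    return tuple(int(part) for part in v.lstrip("v").split("."))

latest = max(versions, key=version_key)
print(latest)
```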

Size Characteristics

The batched, compressed format substantially reduces storage requirements compared to raw predictor output. Parquet metadata files remain small relative to total package size regardless of dataset scale, enabling efficient querying even for large collections.


Next: Creating TSP Datasets →