TSP Package Structure

This document describes the internal structure of a TSP (TSL Structure Package).


Package Layout

dataset/
├── datapackage.json           # Manifest and metadata
├── metadata.parquet           # Per-structure index
├── structures/
│   ├── batch_001.tar.gz       # Structure files (batched)
│   ├── batch_002.tar.gz
│   └── ...
└── predictions/
    ├── scores.parquet         # Confidence metrics
    └── pae/
        ├── batch_001.tar.gz   # PAE matrices (batched)
        └── ...

Components

datapackage.json

The manifest file describing package contents. Based on the Frictionless Data Package standard with TSP-specific extensions.

{
  "name": "tsl-test-structures",
  "title": "TSL Test Structure Dataset",
  "version": "1.0.0",
  "tsp_version": "1.0.0",

  "contributors": [
    {"title": "Dan MacLean", "role": "author"}
  ],

  "licenses": [
    {"name": "CC-BY-4.0", "path": "https://creativecommons.org/licenses/by/4.0/"}
  ],

  "stats": {
    "structure_count": 62,
    "protein_count": 7,
    "prediction_sources": ["alphafold2", "alphafold3", "boltz2"]
  },

  "resources": [
    {"name": "metadata", "path": "metadata.parquet", "format": "parquet"},
    {"name": "scores", "path": "predictions/scores.parquet", "format": "parquet"},
    {"name": "structures", "path": "structures/", "format": "tar.gz"},
    {"name": "pae", "path": "predictions/pae/", "format": "tar.gz"}
  ]
}

Key fields:

- tsp_version: Format version for compatibility
- stats: Summary statistics
- resources: List of package components with paths and formats
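As an illustration, the manifest can be loaded and sanity-checked with a few lines of Python. The check_manifest helper and the major-version compatibility rule below are assumptions for the sketch, not part of the specification:

```python
import json

# Assumed compatibility rule for this sketch: accept any manifest
# whose tsp_version shares our supported major version.
SUPPORTED_MAJOR = 1

def check_manifest(text: str) -> dict:
    manifest = json.loads(text)
    major = int(manifest["tsp_version"].split(".")[0])
    if major != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported tsp_version {manifest['tsp_version']}")
    return manifest

# A trimmed-down datapackage.json, inlined for the example.
example = '''{
  "name": "tsl-test-structures",
  "version": "1.0.0",
  "tsp_version": "1.0.0",
  "stats": {"structure_count": 62},
  "resources": [{"name": "metadata", "path": "metadata.parquet", "format": "parquet"}]
}'''

m = check_manifest(example)
print(m["name"], m["stats"]["structure_count"])
```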


metadata.parquet

A queryable index of all structures. Each row represents one structure file.

structure_id          protein_id  source  model_rank  batch
P12345_AF2_1          P12345      af2     1           batch_001
P12345_AF2_2          P12345      af2     2           batch_001
P12345_AF3_1          P12345      af3     1           batch_001
P24704_Q39613_BZ2_1   P24704      boltz2  1           batch_001

Standard columns:

- structure_id: Unique identifier for each structure file
- protein_id: UniProt accession(s)
- source: Prediction tool (af2, af3, boltz2)
- model_rank: Ranking among models from the same prediction run
- batch: Archive containing this structure
- filename: Filename within the archive

This file enables filtering without downloading structures:

# Find rank-1 predictions for a specific protein
dataset |>
  filter(protein_id == "P12345", model_rank == 1)
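The same query can be sketched in Python over rows loaded from metadata.parquet (the rows are inlined here as plain dicts; in practice a parquet reader such as pyarrow would supply them). The point of the sketch is that filtering also yields the minimal set of batches to download:

```python
# Rows as they would come out of metadata.parquet (inlined for illustration).
rows = [
    {"structure_id": "P12345_AF2_1", "protein_id": "P12345", "source": "af2", "model_rank": 1, "batch": "batch_001"},
    {"structure_id": "P12345_AF2_2", "protein_id": "P12345", "source": "af2", "model_rank": 2, "batch": "batch_001"},
    {"structure_id": "P12345_AF3_1", "protein_id": "P12345", "source": "af3", "model_rank": 1, "batch": "batch_001"},
]

# Filter without touching any structure archive...
hits = [r for r in rows if r["protein_id"] == "P12345" and r["model_rank"] == 1]

# ...then derive the minimal set of batches to fetch.
batches = sorted({r["batch"] for r in hits})
print([r["structure_id"] for r in hits], batches)
```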

structures/batch_*.tar.gz

Structure files grouped into compressed archives. Each archive contains PDB or mmCIF files:

batch_001.tar.gz
├── P12345_AF2_1.pdb
├── P12345_AF2_2.pdb
├── P12345_AF3_1.cif
├── P24704_BZ2_1.pdb
└── ...

Batching rationale:

- Reduces HTTP overhead (one request per batch rather than per file)
- Enables parallel and resumable downloads
- Improves compression (similar structures compress well together)
- Avoids file-listing overhead for large datasets

Default batch size is approximately 100 MB compressed. Small datasets may have a single batch; large datasets may have hundreds.
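Working with a batch archive is plain tar handling. The sketch below builds a tiny in-memory stand-in for batch_001.tar.gz (the member names and contents are invented), then lists its members and extracts a single file without unpacking the rest:

```python
import io
import tarfile

# Build a tiny in-memory archive standing in for batch_001.tar.gz.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, body in [("P12345_AF2_1.pdb", b"ATOM ...\n"),
                       ("P12345_AF2_2.pdb", b"ATOM ...\n")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(body)
        tar.addfile(info, io.BytesIO(body))

# List members, then pull out one file by name.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
    data = tar.extractfile("P12345_AF2_1.pdb").read()

print(names)
```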


predictions/scores.parquet

Confidence scores for each structure, queryable without loading structure files.

structure_id          mean_plddt  ptm_score  iptm_score  ranking_score
P12345_AF2_1          92.3        0.89       NULL        0.92
P12345_AF2_2          88.1        0.85       NULL        0.88
P24704_Q39613_BZ2_1   78.5        0.72       0.68        0.70

Standard scores:

- mean_plddt: Average per-residue confidence (0–100)
- ptm_score: Predicted TM-score (0–1)
- iptm_score: Interface pTM for complexes (0–1)
- ranking_score: Overall ranking metric from the predictor

This enables confidence-based filtering:

# Find high-confidence structures
dataset |>
  filter(mean_plddt > 90, ptm_score > 0.8)
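Because scores and metadata share the structure_id column, a confidence filter can be joined back to the batches that hold the surviving files. A stdlib sketch (rows inlined, with None standing for NULL; a real workflow would read both parquet files):

```python
# Scores as they would come out of scores.parquet (None stands for NULL).
scores = [
    {"structure_id": "P12345_AF2_1", "mean_plddt": 92.3, "ptm_score": 0.89, "iptm_score": None},
    {"structure_id": "P12345_AF2_2", "mean_plddt": 88.1, "ptm_score": 0.85, "iptm_score": None},
    {"structure_id": "P24704_Q39613_BZ2_1", "mean_plddt": 78.5, "ptm_score": 0.72, "iptm_score": 0.68},
]

# Metadata subset keyed by structure_id, for the join.
batch_of = {"P12345_AF2_1": "batch_001",
            "P12345_AF2_2": "batch_001",
            "P24704_Q39613_BZ2_1": "batch_001"}

# High-confidence filter, then map back to the batches holding those files.
confident = [s["structure_id"] for s in scores
             if s["mean_plddt"] > 90 and s["ptm_score"] > 0.8]
needed = sorted({batch_of[sid] for sid in confident})
print(confident, needed)
```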

predictions/pae/batch_*.tar.gz

Predicted Aligned Error (PAE) matrices, stored separately due to size.

pae_batch_001.tar.gz
├── P12345_AF2_1_pae.json
├── P12345_AF3_1_pae.json
└── ...

PAE matrices are N×N where N is sequence length. A 500-residue protein produces 250,000 values per structure. Storing PAE separately allows users to access structures without downloading this data, and to fetch specific matrices when needed.

Format: JSON files with the PAE matrix stored as a 2D array under the pae key. Values are predicted distance errors in ångströms.
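Loading a PAE file is then a matter of parsing the JSON and reading the array under the pae key. The toy 3-residue payload below is invented for the sketch; the sanity checks (square matrix, near-zero diagonal) are conventional for PAE data:

```python
import json

# A toy 3-residue PAE payload in the layout described above
# (2D array under the pae key, values in ångströms).
payload = json.dumps({"pae": [[0.5, 4.2, 9.1],
                              [4.0, 0.4, 5.5],
                              [8.8, 5.3, 0.6]]})

matrix = json.loads(payload)["pae"]
n = len(matrix)

# Sanity checks: the matrix must be N×N.
assert all(len(row) == n for row in matrix)
max_error = max(v for row in matrix for v in row)
print(n, max_error)
```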


Data Flow

%%{init: {'theme': 'base'}}%%
flowchart LR
    subgraph Query["1. Query"]
        M[metadata.parquet]
        S[scores.parquet]
    end

    subgraph Download["2. Download"]
        B1[batch_001.tar.gz]
        B2[batch_002.tar.gz]
        P1[pae_batch_001.tar.gz]
    end

    subgraph Analysis["3. Analysis"]
        PDB[Structure files]
        PAE[PAE matrices]
    end

    M -->|filter| B1
    M -->|filter| B2
    S -->|filter| B1
    B1 -->|extract| PDB
    B2 -->|extract| PDB
    P1 -->|extract| PAE

1. Query: Read the parquet files to identify structures of interest
2. Download: Fetch only the batches containing those structures
3. Analysis: Extract and process the specific files

Format Rationale

Component   Format          Rationale
Manifest    JSON            Human-readable, universal support
Metadata    Parquet         Columnar format, efficient filtering, compact
Structures  tar.gz          Streaming access, good compression
Scores      Parquet         Queryable, joins with metadata
PAE         JSON in tar.gz  Standard format, separate for size

Parquet is a columnar format that enables reading only needed columns, provides compression (typically 10× smaller than CSV), preserves types, and works with Arrow, DuckDB, and dplyr.

tar.gz is a standard format with universal support. Contents can be listed without full download, individual files can be extracted (streaming), and compression is effective for text-based PDB/CIF files.
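The streaming access mentioned above can be sketched with Python's tarfile in pipe mode ("r|gz"), which walks the archive strictly forward, member by member, without random access — the pattern used when reading from a network stream. The archive contents are invented for the example:

```python
import io
import tarfile

# Create a small gzipped tar to stand in for a remote batch archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    body = b"data_P12345\n"
    info = tarfile.TarInfo(name="P12345_AF3_1.cif")
    info.size = len(body)
    tar.addfile(info, io.BytesIO(body))

# "r|gz" reads sequentially, as from a stream: members can be
# listed (and selectively kept) without seeking in the archive.
buf.seek(0)
seen = []
with tarfile.open(fileobj=buf, mode="r|gz") as tar:
    for member in tar:
        seen.append(member.name)

print(seen)
```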


Versioning

TSP packages support versioned releases:

tsl-arabidopsis-structures/
├── v1.0.0/    # Initial release
├── v1.1.0/    # Added predictions
└── v2.0.0/    # Major update

Zenodo provides:

- Record DOI: Points to a specific version (e.g., 10.5281/zenodo.12345678)
- Concept DOI: Points to the latest version (e.g., 10.5281/zenodo.12345670)

The datapackage.json includes version information:

{
  "version": "1.0.0",
  "zenodo": {
    "record_id": "12345678",
    "doi": "10.5281/zenodo.12345678",
    "concept_doi": "10.5281/zenodo.12345670"
  }
}
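Given the versioned directory layout above, picking the latest release is a numeric sort over the version components, not a string sort (which would put "v10.0.0" before "v2.0.0"). A minimal sketch, assuming plain MAJOR.MINOR.PATCH names with no pre-release tags:

```python
# Version directory names as in the layout above.
versions = ["v1.0.0", "v1.1.0", "v2.0.0"]

def version_key(v: str) -> tuple:
    # "v1.1.0" -> (1, 1, 0): compare components numerically.
    return tuple(int(part) for part in v.lstrip("v").split("."))

latest = max(versions, key=version_key)
print(latest)
```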

Size Characteristics

The batched, compressed format substantially reduces storage requirements compared to raw predictor output. Parquet metadata files remain small relative to total package size regardless of dataset scale, enabling efficient querying even for large collections.


Next: Creating TSP Datasets →