TSP Package Structure
This document describes the internal structure of a TSP (TSL Structure Package).
Package Layout
dataset/
├── datapackage.json          # Manifest and metadata
├── metadata.parquet          # Per-structure index
├── structures/
│   ├── batch_001.tar.gz      # Structure files (batched)
│   ├── batch_002.tar.gz
│   └── ...
└── predictions/
    ├── scores.parquet        # Confidence metrics
    └── pae/
        ├── batch_001.tar.gz  # PAE matrices (batched)
        └── ...
Components
datapackage.json
The manifest file describing package contents. Based on the Frictionless Data Package standard with TSP-specific extensions.
{
  "name": "tsl-test-structures",
  "title": "TSL Test Structure Dataset",
  "version": "1.0.0",
  "tsp_version": "1.0.0",
  "contributors": [
    {"title": "Dan MacLean", "role": "author"}
  ],
  "licenses": [
    {"name": "CC-BY-4.0", "path": "https://creativecommons.org/licenses/by/4.0/"}
  ],
  "stats": {
    "structure_count": 62,
    "protein_count": 7,
    "prediction_sources": ["alphafold2", "alphafold3", "boltz2"]
  },
  "resources": [
    {"name": "metadata", "path": "metadata.parquet", "format": "parquet"},
    {"name": "scores", "path": "predictions/scores.parquet", "format": "parquet"},
    {"name": "structures", "path": "structures/", "format": "tar.gz"},
    {"name": "pae", "path": "predictions/pae/", "format": "tar.gz"}
  ]
}
Key fields:
- tsp_version: Format version for compatibility
- stats: Summary statistics
- resources: List of package components with paths and formats
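As a minimal sketch of how a client might read these fields, the Python below parses a trimmed copy of the example manifest and maps each resource name to its path. The `is_compatible` helper and its rule (accept only a matching major version) are assumptions made for illustration, not part of the TSP specification.

```python
import json

# Trimmed copy of the example manifest, inlined so the sketch runs on its own.
MANIFEST_TEXT = """
{
  "name": "tsl-test-structures",
  "version": "1.0.0",
  "tsp_version": "1.0.0",
  "resources": [
    {"name": "metadata", "path": "metadata.parquet", "format": "parquet"},
    {"name": "scores", "path": "predictions/scores.parquet", "format": "parquet"},
    {"name": "structures", "path": "structures/", "format": "tar.gz"},
    {"name": "pae", "path": "predictions/pae/", "format": "tar.gz"}
  ]
}
"""

SUPPORTED_MAJOR = 1  # assumption: this hypothetical reader understands TSP 1.x only


def resource_paths(manifest: dict) -> dict:
    """Map each resource name to its path within the package."""
    return {r["name"]: r["path"] for r in manifest["resources"]}


def is_compatible(manifest: dict, supported_major: int = SUPPORTED_MAJOR) -> bool:
    """Hypothetical rule: accept packages whose tsp_version major number matches."""
    major = int(manifest["tsp_version"].split(".")[0])
    return major == supported_major


manifest = json.loads(MANIFEST_TEXT)
paths = resource_paths(manifest)
print(is_compatible(manifest), paths["scores"])
```

Checking `tsp_version` before touching any other resource lets a reader fail early on a format it does not understand.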
metadata.parquet
A queryable index of all structures. Each row represents one structure file. The example below omits the filename column for brevity.
| structure_id | protein_id | source | model_rank | batch |
|---|---|---|---|---|
| P12345_AF2_1 | P12345 | af2 | 1 | batch_001 |
| P12345_AF2_2 | P12345 | af2 | 2 | batch_001 |
| P12345_AF3_1 | P12345 | af3 | 1 | batch_001 |
| P24704_Q39613_BZ2_1 | P24704 | boltz2 | 1 | batch_001 |
Standard columns:
- structure_id: Unique identifier for each structure file
- protein_id: UniProt accession(s)
- source: Prediction tool (af2, af3, boltz2)
- model_rank: Ranking among models from same prediction
- batch: Archive containing this structure
- filename: Filename within the archive
This file enables filtering without downloading structures:
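library(dplyr)

# Find rank-1 predictions for a specific protein.
# `dataset` is assumed to hold the metadata table (e.g. read with arrow::read_parquet()).
dataset |>
  filter(protein_id == "P12345", model_rank == 1)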
structures/batch_*.tar.gz
Structure files grouped into compressed archives. Each archive contains PDB or mmCIF files:
batch_001.tar.gz
├── P12345_AF2_1.pdb
├── P12345_AF2_2.pdb
├── P12345_AF3_1.cif
├── P24704_Q39613_BZ2_1.pdb
└── ...
Batching rationale:
- Reduces HTTP overhead (one request per batch rather than per file)
- Enables parallel and resumable downloads
- Improves compression (similar structures compress well together)
- Avoids file listing overhead for large datasets
Default batch size is approximately 100 MB compressed. Small datasets may have a single batch; large datasets may have hundreds.
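Once a batch has been fetched, a single structure can be pulled out by name. The sketch below uses Python's standard `tarfile` module. The `make_demo_batch` helper and the placeholder file contents are invented so the example runs without real data.

```python
import io
import tarfile


def read_member(archive_bytes: bytes, filename: str) -> str:
    """Return the text of one file stored in a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        member = tar.extractfile(filename)  # raises KeyError if the name is absent
        if member is None:
            raise ValueError(f"{filename} is not a regular file")
        return member.read().decode("utf-8")


def make_demo_batch(files: dict) -> bytes:
    """Build a small in-memory archive (demo helper, not part of TSP)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Placeholder contents; real batches hold full PDB or mmCIF files.
demo = make_demo_batch({
    "P12345_AF2_1.pdb": "HEADER    PLACEHOLDER STRUCTURE\nEND\n",
    "P12345_AF2_2.pdb": "HEADER    ANOTHER PLACEHOLDER\nEND\n",
})
text = read_member(demo, "P12345_AF2_2.pdb")
print(text.splitlines()[0])
```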
predictions/scores.parquet
Confidence scores for each structure, queryable without loading structure files.
| structure_id | mean_plddt | ptm_score | iptm_score | ranking_score |
|---|---|---|---|---|
| P12345_AF2_1 | 92.3 | 0.89 | NULL | 0.92 |
| P12345_AF2_2 | 88.1 | 0.85 | NULL | 0.88 |
| P24704_Q39613_BZ2_1 | 78.5 | 0.72 | 0.68 | 0.70 |
Standard scores:
- mean_plddt: Average per-residue confidence (0–100)
- ptm_score: Predicted TM-score (0–1)
- iptm_score: Interface pTM for complexes (0–1)
- ranking_score: Overall ranking metric from predictor
This enables confidence-based filtering, for example selecting only structures whose mean_plddt exceeds a chosen threshold.
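A filter of this kind can be sketched in plain Python. The rows below mirror the example table, with `None` standing in for NULL. They are assumed to have already been loaded from scores.parquet by a Parquet reader, which is not shown. The thresholds and the `confident` function are illustrative choices, not recommended cut-offs.

```python
# Rows copied from the example table; None represents NULL.
scores = [
    {"structure_id": "P12345_AF2_1", "mean_plddt": 92.3,
     "ptm_score": 0.89, "iptm_score": None, "ranking_score": 0.92},
    {"structure_id": "P12345_AF2_2", "mean_plddt": 88.1,
     "ptm_score": 0.85, "iptm_score": None, "ranking_score": 0.88},
    {"structure_id": "P24704_Q39613_BZ2_1", "mean_plddt": 78.5,
     "ptm_score": 0.72, "iptm_score": 0.68, "ranking_score": 0.70},
]


def confident(rows, min_plddt=85.0, min_iptm=None):
    """Return structure IDs passing the thresholds (hypothetical helper).

    When min_iptm is given, rows without an iptm_score (monomers) are excluded.
    """
    selected = []
    for row in rows:
        if row["mean_plddt"] < min_plddt:
            continue
        if min_iptm is not None:
            iptm = row["iptm_score"]
            if iptm is None or iptm < min_iptm:
                continue
        selected.append(row["structure_id"])
    return selected


high_confidence = confident(scores)
good_interfaces = confident(scores, min_plddt=70.0, min_iptm=0.6)
print(high_confidence, good_interfaces)
```

Treating a missing iptm_score as a failed interface check is one possible convention. A reader that wants monomers and complexes together would skip the iptm test for NULL rows instead.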
predictions/pae/batch_*.tar.gz
Predicted Aligned Error (PAE) matrices, stored separately due to size.
PAE matrices are N×N where N is sequence length. A 500-residue protein produces 250,000 values per structure. Storing PAE separately allows users to access structures without downloading this data, and to fetch specific matrices when needed.
Format: JSON files containing a 2D array under the pae key. Values are predicted position errors in ångströms (Å).
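Reading one PAE file can be sketched as follows, assuming only what is stated above: a JSON object with a square 2D array under the `pae` key. The 3×3 matrix is made-up demo data, and `load_pae` and `mean_pae` are hypothetical helper names.

```python
import json


def load_pae(text: str) -> list:
    """Parse a PAE JSON document and check that the matrix is N×N."""
    matrix = json.loads(text)["pae"]
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("PAE matrix is not square")
    return matrix


def mean_pae(matrix: list) -> float:
    """Average predicted error over all residue pairs, in ångströms."""
    n = len(matrix)
    return sum(sum(row) for row in matrix) / (n * n)


# Tiny invented matrix for a 3-residue sequence.
demo_text = json.dumps({"pae": [[0.5, 4.0, 9.0],
                                [4.0, 0.5, 6.0],
                                [9.0, 6.0, 0.5]]})
pae = load_pae(demo_text)
print(len(pae) ** 2, "values; mean PAE", round(mean_pae(pae), 2))

# The size argument from the text: N residues give N * N values.
print(500 * 500)
```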
Data Flow
%%{init: {'theme': 'base'}}%%
flowchart LR
    subgraph Query["1. Query"]
        M[metadata.parquet]
        S[scores.parquet]
    end
    subgraph Download["2. Download"]
        B1[batch_001.tar.gz]
        B2[batch_002.tar.gz]
        P1["pae/batch_001.tar.gz"]
    end
    subgraph Analysis["3. Analysis"]
        PDB[Structure files]
        PAE[PAE matrices]
    end
    M -->|filter| B1
    M -->|filter| B2
    S -->|filter| B1
    B1 -->|extract| PDB
    B2 -->|extract| PDB
    P1 -->|extract| PAE
- Query: Read parquet files to identify structures of interest
- Download: Fetch only the batches containing those structures
- Analysis: Extract and process specific files
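The query step can be sketched as a join between metadata and scores that yields the batches worth downloading. The rows are assumed to be already loaded from the two Parquet files. The `X00001_*` entries and their `batch_002` assignment are invented for illustration, as is the `batches_to_download` function.

```python
# Metadata rows: the first two come from the example table, the rest are invented.
metadata = [
    {"structure_id": "P12345_AF2_1", "protein_id": "P12345", "batch": "batch_001"},
    {"structure_id": "P12345_AF2_2", "protein_id": "P12345", "batch": "batch_001"},
    {"structure_id": "X00001_AF2_1", "protein_id": "X00001", "batch": "batch_002"},
    {"structure_id": "X00001_AF2_2", "protein_id": "X00001", "batch": "batch_002"},
]
scores = [
    {"structure_id": "P12345_AF2_1", "mean_plddt": 92.3},
    {"structure_id": "P12345_AF2_2", "mean_plddt": 88.1},
    {"structure_id": "X00001_AF2_1", "mean_plddt": 95.0},  # invented value
    {"structure_id": "X00001_AF2_2", "mean_plddt": 61.0},  # invented value
]


def batches_to_download(metadata, scores, min_plddt):
    """Group the structure IDs that pass the threshold by the batch holding them."""
    plddt = {row["structure_id"]: row["mean_plddt"] for row in scores}
    wanted = {}
    for row in metadata:
        # Structures with no score row are skipped rather than guessed at.
        value = plddt.get(row["structure_id"])
        if value is None or value < min_plddt:
            continue
        wanted.setdefault(row["batch"], []).append(row["structure_id"])
    return wanted


plan = batches_to_download(metadata, scores, min_plddt=90.0)
for batch, ids in plan.items():
    print(batch, ids)
```

Only the archives named in `plan` need to be fetched, and only the listed members need to be extracted from each one.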
Format Rationale
| Component | Format | Rationale |
|---|---|---|
| Manifest | JSON | Human-readable, universal support |
| Metadata | Parquet | Columnar format, efficient filtering, compact |
| Structures | tar.gz | Streaming access, good compression |
| Scores | Parquet | Queryable, joins with metadata |
| PAE | JSON in tar.gz | Standard format, separate for size |
Parquet is a columnar format that enables reading only the needed columns, provides compression (often around 10× smaller than CSV, depending on the data), preserves column types, and works with Arrow, DuckDB, and dplyr.
tar.gz is a standard format with universal support. Archives can be read as a sequential stream, so contents can be listed and individual files extracted without unpacking the whole archive to disk. Compression is effective for text-based PDB/CIF files.
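Sequential reading of an archive can be sketched with Python's `tarfile` module in stream mode (`"r|gz"`), which consumes the data front to back and never seeks. The in-memory archive and its tiny placeholder files are demo assumptions. In practice the stream could be any file-like object, such as an HTTP response body.

```python
import io
import tarfile


def list_members(stream) -> list:
    """List (name, size) pairs while reading the archive strictly in order."""
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:
        return [(member.name, member.size) for member in tar if member.isfile()]


def make_demo_archive(files: dict) -> bytes:
    """Build a small gzip-compressed tar in memory (demo helper only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


archive = make_demo_archive({
    "P12345_AF2_1.pdb": "END\n",
    "P12345_AF3_1.cif": "data_demo\n",
})
listing = list_members(io.BytesIO(archive))
print(listing)
```

Because a gzip-compressed tar has no central index, listing every member still means reading through the whole stream. The saving is in disk space and in not having to extract files that are not needed.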
Versioning
TSP packages support versioned releases:
tsl-arabidopsis-structures/
├── v1.0.0/ # Initial release
├── v1.1.0/ # Added predictions
└── v2.0.0/ # Major update
Zenodo provides:
- Record DOI: Points to specific version (e.g., 10.5281/zenodo.12345678)
- Concept DOI: Points to latest version (e.g., 10.5281/zenodo.12345670)
The datapackage.json includes version information:
{
  "version": "1.0.0",
  "zenodo": {
    "record_id": "12345678",
    "doi": "10.5281/zenodo.12345678",
    "concept_doi": "10.5281/zenodo.12345670"
  }
}
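A minimal sketch of working with this version information: release tags are compared numerically rather than as strings, and each DOI is turned into a resolvable link through doi.org. The function names are hypothetical, and the DOIs are the placeholder values from the example above.

```python
import json


def parse_version(tag: str) -> tuple:
    """Turn 'v1.10.0' or '1.10.0' into (1, 10, 0) for numeric comparison."""
    text = tag[1:] if tag.startswith("v") else tag
    return tuple(int(part) for part in text.split("."))


def latest(tags: list) -> str:
    """Pick the highest release tag."""
    return max(tags, key=parse_version)


def doi_url(doi: str) -> str:
    """Build a resolver link for a DOI."""
    return f"https://doi.org/{doi}"


record = json.loads("""
{
  "version": "1.0.0",
  "zenodo": {
    "record_id": "12345678",
    "doi": "10.5281/zenodo.12345678",
    "concept_doi": "10.5281/zenodo.12345670"
  }
}
""")

print(latest(["v1.0.0", "v2.0.0", "v1.1.0"]))
print(doi_url(record["zenodo"]["doi"]))          # this specific version
print(doi_url(record["zenodo"]["concept_doi"]))  # whichever version is newest
```

Citing the record DOI pins an analysis to the exact data it used, while the concept DOI suits links that should follow later releases.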
Size Characteristics
The batched, compressed format substantially reduces storage requirements compared to raw predictor output. Parquet metadata files remain small relative to total package size regardless of dataset scale, enabling efficient querying even for large collections.