Cluster proteins across assemblies — cluster

Groups the proteins in a protein_collection into orthogroups based on sequence similarity. Three clustering methods are available: DIAMOND reciprocal best hits (the default), OrthoFinder, and MMseqs2.

Usage

cluster_proteins(
  proteins,
  method = "diamond_rbh",
  mode = "fast",
  min_identity = 30,
  min_coverage = 50,
  evalue = 1e-05,
  threads = NULL,
  tool_path = NULL,
  conda_prefix = NULL,
  keep_temp = FALSE
)

Arguments

proteins: A protein_collection object
method: Clustering method: "diamond_rbh" (default), "orthofinder", or "mmseqs2"
mode: Speed/sensitivity trade-off: "fast" (default) or "thorough"
min_identity: Minimum percent identity threshold (default 30)
min_coverage: Minimum percent query coverage threshold (default 50)
evalue: E-value threshold (default 1e-5)
threads: Number of CPU threads. If NULL (default), the thread count is auto-detected.
tool_path: Optional explicit path to the clustering tool binary
conda_prefix: Optional direct path to a conda/mamba environment containing the tool (e.g., "./this_project_env"). The tool is expected at <prefix>/bin/<tool>.
keep_temp: Keep temporary files for debugging (default FALSE)
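
To illustrate how the min_identity, min_coverage, and evalue thresholds might interact, here is a minimal base R sketch that filters a made-up table of pairwise hits. The hit table, its column names (pident, qcov, evalue), and the rule that a hit must pass all three thresholds with inclusive bounds are assumptions for illustration only; the filtering that cluster_proteins() actually performs may differ by method.

```r
# Hypothetical pairwise hits; the column names are illustrative, not the package's
hits <- data.frame(
  query   = c("a_001", "a_002", "a_003", "a_004"),
  subject = c("b_017", "b_003", "b_020", "b_031"),
  pident  = c(85.2, 42.0, 27.5, 64.1),    # percent identity
  qcov    = c(95, 48, 80, 72),            # percent query coverage
  evalue  = c(1e-50, 1e-12, 1e-8, 1e-3),
  stringsAsFactors = FALSE
)

# Thresholds set to the documented defaults
min_identity <- 30
min_coverage <- 50
evalue       <- 1e-05

# Assumption: a hit is retained only if it passes all three thresholds
keep <- hits$pident >= min_identity &
  hits$qcov >= min_coverage &
  hits$evalue <= evalue

retained <- hits[keep, c("query", "subject")]
print(retained)
```

In this toy table only the a_001/b_017 pair survives: the other three hits each fail exactly one threshold (coverage, identity, and E-value, respectively).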

Value

An orthogroup_result object containing:

orthogroups: tibble with columns orthogroup_id, assembly, and protein_id
method: the clustering method used
parameters: list of parameters used
singletons: proteins not assigned to any orthogroup
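
As a rough sketch of how the orthogroups table could be summarized downstream, the base R snippet below works on a small mock data frame with the three documented columns. The identifiers and the notion of a "core" orthogroup (one present in every assembly) are illustrative assumptions, not output or terminology from the package.

```r
# Mock stand-in for result$orthogroups (same three columns, made-up values)
orthogroups <- data.frame(
  orthogroup_id = c("OG0001", "OG0001", "OG0001", "OG0002", "OG0002"),
  assembly      = c("asmA", "asmB", "asmC", "asmA", "asmB"),
  protein_id    = c("a_001", "b_017", "c_042", "a_002", "b_003"),
  stringsAsFactors = FALSE
)

# Number of proteins in each orthogroup
sizes <- table(orthogroups$orthogroup_id)

# Number of distinct assemblies represented in each orthogroup
n_assemblies <- tapply(
  orthogroups$assembly,
  orthogroups$orthogroup_id,
  function(x) length(unique(x))
)

# Orthogroups found in every assembly (here called "core", as an assumption)
total_assemblies <- length(unique(orthogroups$assembly))
core <- names(n_assemblies)[n_assemblies == total_assemblies]

print(sizes)
print(core)
```

The same summaries apply to a real result by replacing the mock data frame with the orthogroups element of the returned object.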

Examples

if (FALSE) { # \dontrun{
# Load proteins
proteins <- load_proteins(fasta_dir = "assemblies/")

# Cluster with default settings (DIAMOND RBH)
result <- cluster_proteins(proteins)

# Use thorough mode with stricter thresholds
result <- cluster_proteins(
  proteins,
  mode = "thorough",
  min_identity = 70,
  min_coverage = 80
)

# Use a project-local conda environment
result <- cluster_proteins(
  proteins,
  conda_prefix = "./this_project_env"
)
} # }
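
The conda_prefix example relies on the tool binary being located at <prefix>/bin/<tool>. The sketch below shows one way such a lookup could work. resolve_tool() is a hypothetical helper, not a function of this package, and the precedence it uses (tool_path first, then conda_prefix, then the system PATH) is an assumption.

```r
# Hypothetical helper illustrating the <prefix>/bin/<tool> convention
resolve_tool <- function(tool, tool_path = NULL, conda_prefix = NULL) {
  # Assumption: an explicit binary path takes precedence
  if (!is.null(tool_path)) {
    return(tool_path)
  }
  # Documented convention: the tool is expected at <prefix>/bin/<tool>
  if (!is.null(conda_prefix)) {
    return(file.path(conda_prefix, "bin", tool))
  }
  # Assumption: otherwise fall back to the PATH ("" if the tool is not found)
  unname(Sys.which(tool))
}

resolve_tool("diamond", conda_prefix = "./this_project_env")
# "./this_project_env/bin/diamond"
```

Checking file.exists() on the resolved path before clustering is a cheap way to catch a missing or misnamed environment early.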