Groups the proteins in a protein_collection into orthogroups based on sequence similarity. Three clustering methods are available: DIAMOND reciprocal best hits (the default), OrthoFinder, and MMseqs2.

Usage

cluster_proteins(
  proteins,
  method = "diamond_rbh",
  mode = "fast",
  min_identity = 30,
  min_coverage = 50,
  evalue = 1e-05,
  threads = NULL,
  tool_path = NULL,
  conda_prefix = NULL,
  keep_temp = FALSE
)

Arguments

proteins

A protein_collection object

method

Clustering method: "diamond_rbh" (default), "orthofinder", or "mmseqs2"

mode

Speed/sensitivity trade-off: "fast" (default) or "thorough"

min_identity

Minimum percent identity threshold (default 30)

min_coverage

Minimum percent query coverage threshold (default 50)

evalue

E-value threshold (default 1e-5)

threads

Number of CPU threads to use (default NULL, which auto-detects)

tool_path

Optional explicit path to the clustering tool binary

conda_prefix

Optional direct path to a conda/mamba environment that contains the tool (e.g., "./this_project_env"). The tool is expected at <prefix>/bin/<tool>.

keep_temp

Keep temporary files for debugging (default FALSE)
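
To make the three thresholds concrete, the sketch below applies them to a small, made-up table of pairwise hits. It assumes each threshold acts as an independent per-hit filter with inclusive bounds, which this page does not confirm. The column names (pident, qcov, evalue) and the helper filter_hits() are hypothetical and are not part of the package.

```r
# Hypothetical pairwise hits; data and column names are illustrative only.
hits <- data.frame(
  query   = c("p1", "p2", "p3", "p4"),
  subject = c("q1", "q2", "q3", "q4"),
  pident  = c(85, 25, 40, 60),            # percent identity
  qcov    = c(90, 95, 45, 70),            # percent of the query covered
  evalue  = c(1e-50, 1e-20, 1e-10, 1e-3),
  stringsAsFactors = FALSE
)

# Assumed semantics: a hit is kept only if it passes all three thresholds.
# filter_hits() is a hypothetical helper, not a function of the package.
filter_hits <- function(hits, min_identity = 30, min_coverage = 50,
                        evalue = 1e-5) {
  keep <- hits$pident >= min_identity &
    hits$qcov >= min_coverage &
    hits$evalue <= evalue
  hits[keep, , drop = FALSE]
}

# With the documented defaults, only the first hit passes all three filters.
kept <- filter_hits(hits)
print(kept$query)
```

Under these assumptions, raising min_identity and min_coverage can only shrink the set of retained hits, which is why stricter thresholds tend to produce smaller orthogroups and more singletons.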

Value

An orthogroup_result object containing:

  • orthogroups: tibble with orthogroup_id, assembly, protein_id

  • method: the clustering method used

  • parameters: list of parameters used

  • singletons: proteins not assigned to any orthogroup
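
As a rough illustration of working with the orthogroups table, the sketch below builds a mock data frame with the three documented columns and summarises it with base R. The data are invented, and a plain data frame stands in for the tibble. Reaching the real table as result$orthogroups is an assumption about how the object is structured.

```r
# Mock stand-in for the orthogroups table (invented data).
orthogroups <- data.frame(
  orthogroup_id = c("OG1", "OG1", "OG1", "OG2", "OG2"),
  assembly      = c("asmA", "asmB", "asmC", "asmA", "asmA"),
  protein_id    = c("a1", "b1", "c1", "a2", "a3"),
  stringsAsFactors = FALSE
)

# Number of proteins in each orthogroup.
n_proteins <- table(orthogroups$orthogroup_id)

# Number of distinct assemblies represented in each orthogroup.
n_assemblies <- tapply(
  orthogroups$assembly,
  orthogroups$orthogroup_id,
  function(x) length(unique(x))
)

# Orthogroups with at least one member in every assembly.
all_assemblies <- length(unique(orthogroups$assembly))
core <- names(n_assemblies)[n_assemblies == all_assemblies]

print(n_proteins)
print(core)
```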

Examples

if (FALSE) { # \dontrun{
# Load proteins
proteins <- load_proteins(fasta_dir = "assemblies/")

# Cluster with default settings (DIAMOND RBH)
result <- cluster_proteins(proteins)

# Use thorough mode with stricter thresholds
result <- cluster_proteins(
  proteins,
  mode = "thorough",
  min_identity = 70,
  min_coverage = 80
)

# Use a project-local conda environment
result <- cluster_proteins(
  proteins,
  conda_prefix = "./this_project_env"
)
} # }
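
Before passing conda_prefix, it can help to check that the binary sits where the <prefix>/bin/<tool> convention described under Arguments says it should. The sketch below assumes the DIAMOND binary is named diamond. How the package actually resolves the tool name is not stated on this page.

```r
# Build the path implied by the <prefix>/bin/<tool> convention.
conda_prefix <- "./this_project_env"
tool <- "diamond"  # assumed binary name for the default method

expected_path <- file.path(conda_prefix, "bin", tool)
print(expected_path)

# Report whether the binary is present before running the clustering.
if (!file.exists(expected_path)) {
  message("Tool not found at ", expected_path)
}
```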