Species trees
The Compara pipelines use two main species trees.
- The NCBI taxonomy (i.e. topology only) is typically used for non-vertebrate divisions in:
  - the Protein-trees pipeline
  - the CAFE (Gene Gain/Loss tree) pipelines
- A tree with branch-lengths computed in-house (available for download here) is used for Vertebrates pipelines, including:
  - the Protein-trees and ncRNA-trees pipelines
  - the Multiple-alignment pipelines †
  - the Constrained Elements / Conservation Scores pipelines
  - the CAFE (Gene Gain/Loss tree) pipelines
The tree with branch lengths is built by the species-tree pipeline, which proceeds as follows:
- The pipeline begins with a lightweight annotation of the input genomes: a BUSCO [1] protein set is aligned to the genomic sequences using miniprot [2] (version 0.12-r237), and the corresponding coding sequences (CDS) are extracted from these alignments.
- The CDS for each gene are aligned across all species to generate codon alignments. This multiple sequence alignment is primarily performed using PAGAN [3] (version v.0.61), with PRANK [3] (version v.170427) used as a fallback if PAGAN fails to produce an alignment.
- From the translations of these individual codon alignments, a phylogenetic tree is inferred for each gene family using IQ-TREE [4] (version 2.2.0.3 COVID-edition). These individual gene trees are then combined by Astral [5] (version 5.7.1) into a consensus species tree.
- This consensus species tree is used as the starting topology for a refinement process. A supermatrix is created by concatenating the protein alignments of all genes, and is then trimmed using trimAl [6] (version v1.4.rev15).
- IQ-TREE is used to perform a new, full tree search on this supermatrix to find the optimal species tree topology, using the Astral tree as the starting topology. This step refines both the topology and branch lengths of the final tree.
- Branch lengths for this definitive topology are also computed from the fourfold degenerate sites of the concatenated codon alignment. These branch lengths are the ones used in Cactus [7] multiple genome alignments.
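The concatenation and fourfold-degenerate-site steps above can be illustrated with a small, self-contained Python sketch. It is not the pipeline's code: the function names (`concatenate`, `fourfold_sites`), the gap-filling of species that lack a gene, and the site-selection rule (every species must carry the same fourfold-degenerate codon family at a column, under the standard genetic code) are assumptions made for illustration, since the exact criteria used by the pipeline are not stated here.

```python
from typing import Dict, List

# Codon families of the standard genetic code in which the third position is
# fourfold degenerate (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN, Ala GCN,
# Arg CGN, Gly GGN): any base at position 3 encodes the same amino acid.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def concatenate(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Concatenate per-gene alignments (species -> aligned sequence).

    Assumption: a species missing from a gene is padded with gaps so that
    every row of the resulting supermatrix has the same length.
    """
    species = sorted({sp for aln in alignments for sp in aln})
    parts: Dict[str, List[str]] = {sp: [] for sp in species}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for sp in species:
            parts[sp].append(aln.get(sp, "-" * length))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


def fourfold_sites(codon_aln: Dict[str, str]) -> Dict[str, str]:
    """Keep the third base of codon columns that are fourfold degenerate.

    Assumption (illustrative rule): a column is kept only if all species
    share the same fourfold-degenerate codon prefix and every third base
    is an unambiguous nucleotide.
    """
    length = len(next(iter(codon_aln.values())))
    kept: Dict[str, List[str]] = {sp: [] for sp in codon_aln}
    for start in range(0, length - length % 3, 3):
        codons = {sp: seq[start:start + 3].upper() for sp, seq in codon_aln.items()}
        prefixes = {codon[:2] for codon in codons.values()}
        third_ok = all(codon[2] in "ACGT" for codon in codons.values())
        if len(prefixes) == 1 and prefixes <= FOURFOLD_PREFIXES and third_ok:
            for sp, codon in codons.items():
                kept[sp].append(codon[2])
    return {sp: "".join(bases) for sp, bases in kept.items()}
```

In this sketch, a species with no sequence for a gene contributes only gaps for that gene, so none of that gene's codon columns pass the filter. A real pipeline might instead tolerate missing data per column; that choice is not specified in the text above.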
Notes and References
[1] Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. 2021 Oct;38(10):4647-4654.
[2] Li H. Protein-to-genome alignment with miniprot. Bioinformatics. 2023 Jan;39(1):btad014.
[3] Löytynoja A, Vilella AJ, Goldman N. Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics. 2012 Jul;28(13):1684-1691.
[4] Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020 May;37(5):1530-1534.
[5] Mirarab S, Warnow T. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015 Jun;31(12):i44-52.
[6] Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009 Aug;25(15):1972-1973.
[7] Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, Fang Q, Xie D, Feng S, Stiller J, Genereux D, Johnson J, Marinescu VD, Alföldi J, Harris RS, Lindblad-Toh K, Haussler D, Karlsson E, Jarvis ED, Zhang G, Paten B. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature. 2020 Nov;587(7833):246-251.
† Please note that the EPO and EPO-Extended multiple alignments produced before release 110 used species trees built with a different approach. In that approach, we passed the unmasked whole-genome sequences to Mash to compute pairwise distances, and then generated the species tree using distance-based Neighbour-Joining guided by the taxonomy.
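To give a feel for the distance step of the older approach described in the note above, here is a minimal Python sketch of the Mash distance formula, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of the two k-mer sets. It is an illustration only: it computes the exact Jaccard index on full k-mer sets rather than the MinHash sketch estimate that Mash uses, it does not canonicalise k-mers across strands, and the function names and the default `k` are assumptions (the k-mer size used in the pipeline is not stated here).

```python
import math


def kmers(seq: str, k: int) -> set:
    """Return the set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(seq_a: str, seq_b: str, k: int = 21) -> float:
    """Mash distance computed from the exact Jaccard index of two k-mer sets.

    Assumption: exact set arithmetic stands in for Mash's MinHash sketch.
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("both sequences are shorter than k")
    jaccard = len(a & b) / len(union)
    if jaccard == 0.0:
        return 1.0  # no shared k-mers: the distance is capped at 1
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)
```

A matrix of such pairwise distances is the kind of input a Neighbour-Joining step would consume. How the taxonomy guided that step is not described in the note, so it is not sketched here.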