Applications such as reconstructing cell lineage trees (represented as phylogenetic trees) from single-cell sequencing data require reconstructing a [Formula: see text]-matrix that has many errors and missing entries. We introduce Sempervirens, a very fast matrix reconstruction algorithm for noisy and incomplete matrix representations of phylogenetic trees. Sempervirens uses an iterative maximum-likelihood approach to determine the topology tree represented by the corrupted data. We show that Sempervirens is at least three orders of magnitude faster than other methods on thousand by thousand matrices, with the speed gap widening with larger matrices. We also show that Sempervirens matches state-of-the-art methods in reconstruction accuracy. The speed of Sempervirens enables it to be tractably applied to reconstructing much larger matrices than those that other methods can reconstruct. In addition to experimental results, we justify the algorithm with a mathematical treatment of its subprocedures.
History: Accepted by J. Paul Brooks, Area Editor for Applications in Biology, Medicine, and Healthcare.
Funding: This work was supported by the Office of Science of the Department of Energy [Contract DE-AC02-05CH11231].
Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information ( https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2023.0373 ) as well as from the IJOC GitHub software repository ( https://github.com/INFORMSJoC/2023.0373 ). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/ .