Historical biogeography has a diversity of methods for inferring ancestral geographic ranges on phylogenies, but many of the methods have conflicting assumptions, and there is no common statistical framework by which to judge which models are preferable. Probabilistic modeling of geographic range evolution, pioneered by Ree and Smith (2008, Systematic Biology) in their program LAGRANGE, could provide such a framework, but this potential has not been realized until now.
I have created an R package, "BioGeoBEARS," described in chapter 1 of the dissertation, that implements in a likelihood framework several commonly used models, such as the LAGRANGE Dispersal-Extinction-Cladogenesis (DEC) model and the Dispersal-Vicariance Analysis (DIVA, Ronquist 1997, Systematic Biology) model. Standard DEC is a model with two free parameters specifying the rate of "dispersal" (range expansion) and "extinction" (range contraction). However, while dispersal and extinction rates are free parameters, the cladogenesis model is fixed, such that the geographic range of the ancestral lineage is inherited by the two daughter lineages through a variety of scenarios fixed to have equal probability. This fixed nature of the cladogenesis model means that it has been indiscriminately applied in all DEC analyses, and has not been subjected to any inference or formal model testing.
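To make the two-parameter structure concrete, the following is a minimal Python sketch of the anagenetic part of a DEC-style model for just two areas, A and B. It is an illustration written for this summary, not code from BioGeoBEARS or LAGRANGE: the function name `dec_rate_matrix` and the state ordering are hypothetical, and it assumes the simplest case of a single dispersal rate d and a single extinction rate e shared by all areas.

```python
def dec_rate_matrix(d, e):
    """Build an illustrative anagenetic rate matrix for two areas, A and B.

    Hypothetical helper, not part of any package. Rows are "from" ranges,
    columns are "to" ranges, in the order: null, A, B, AB.
    """
    states = ["null", "A", "B", "AB"]
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    Q = [[0.0] * n for _ in range(n)]

    # "Dispersal" (range expansion): a single-area range gains the other area.
    Q[idx["A"]][idx["AB"]] = d
    Q[idx["B"]][idx["AB"]] = d

    # "Extinction" (range contraction): a range loses one occupied area.
    Q[idx["AB"]][idx["A"]] = e   # lose area B
    Q[idx["AB"]][idx["B"]] = e   # lose area A
    Q[idx["A"]][idx["null"]] = e
    Q[idx["B"]][idx["null"]] = e

    # Diagonal entries make every row sum to zero; the null range is absorbing.
    for i in range(n):
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return states, Q


# Example with arbitrary rates chosen only for illustration.
states, Q = dec_rate_matrix(d=0.1, e=0.02)
for name, row in zip(states, Q):
    print(name, row)
```

Only d and e appear in this matrix; how a range is divided between daughter lineages at a speciation event is handled separately by the cladogenesis model, which is the part that standard DEC holds fixed.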
The process of founder-event speciation, thought to be crucial especially in island systems, is completely left out of the DEC and DIVA models, but it is implemented as an option in BioGeoBEARS, enabling the creation of models such as DEC+J, DIVA+J, etc. The models in BioGeoBEARS are fully parameterized, so that users can easily create new models of their own devising (e.g., vicariance only, founder-event speciation only, any combination of these, etc.) by setting parameters to 0 or 1. Alternatively, parameters controlling various processes can be set to be free parameters, and estimated from the data. Implementation of all models in a common framework allows use of standard statistical model choice procedures such as the Likelihood Ratio Test (LRT) or Akaike Information Criterion (AIC) to objectively compare models and hypotheses about the biogeographical processes operating in different clades and regions.
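As a rough illustration of how such a comparison works for a nested pair like DEC (two free parameters) and DEC+J (three), the Python sketch below computes the LRT statistic, its p-value under a chi-square distribution with one degree of freedom, and the AIC of each model. The function names are hypothetical and the log-likelihoods in the example are invented for illustration; they are not results from the dissertation, and none of this is the BioGeoBEARS interface.

```python
import math


def lrt_one_extra_parameter(lnL_null, lnL_alt):
    """Likelihood ratio test for nested models differing by one free parameter.

    Returns the test statistic and its p-value. For one degree of freedom the
    chi-square survival function equals erfc(sqrt(x / 2)), so no external
    library is needed.
    """
    stat = max(2.0 * (lnL_alt - lnL_null), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def aic(lnL, n_params):
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * lnL


# Invented log-likelihoods, for illustration only.
lnL_dec, lnL_decj = -35.0, -21.0

stat, p_value = lrt_one_extra_parameter(lnL_dec, lnL_decj)
print("LRT statistic:", stat, "p-value:", p_value)
print("AIC (2 parameters):", aic(lnL_dec, 2))
print("AIC (3 parameters):", aic(lnL_decj, 3))
```

The LRT applies only to nested models, whereas AIC can also rank models that are not nested, which is why both are useful when several cladogenesis models are compared.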
BioGeoBEARS also adds a number of features not previously available in most historical biogeography software, such as distance-based dispersal, a model of imperfect detection, and the ability to include fossils either as ancestors or tips on a time-calibrated tree.
In Chapter 2, I validate BioGeoBEARS by showing that it exactly reproduces the log-likelihoods and parameter inferences made by the LAGRANGE DEC model on the LAGRANGE test dataset of the Hawaiian Psychotria clade. I further validate the method by taking the Psychotria phylogeny and simulating geographic range evolution under the DEC and DEC+J models, and then conducting inference under the two models. Model choice using LRT is highly accurate, with false positive and false negative rates of approximately 5%, indicating that the test has the desired frequentist properties, and also indicating that DEC and DEC+J are easy to distinguish from the data, even on a small phylogeny. The simulation results also indicate that when DEC+J is the true model, DEC+J has 87% accuracy in inferring ancestral states, while DEC has only 57% accuracy.
The DEC and DEC+J models are then applied to 13 island clades, most of them classic Hawaiian study systems (Drosophila, silverswords, etc.), under a variety of dispersal constraint scenarios. Standard statistical model comparisons show that DEC+J is vastly superior to standard DEC for all clades, for the first time verifying the importance of founder-event speciation in island clades via statistical model choice, and falsifying vicariance-dominated models of island biogeography. The case of Psychotria is typical: the DEC+J model is about 300,000 times more probable than the DEC model in an unconstrained analysis, according to AIC weights. Furthermore, the inferred maximum likelihood (ML) estimates of parameters often differ radically under the DEC+J model, with the "DE" part of the model sometimes playing no role (i.e., the parameters d and e, controlling anagenetic range expansion and range contraction, are inferred to be 0). Moreover, under DEC+J, ancestral nodes are usually estimated to have ranges occupying only one island, rather than the widespread ancestors often favored by DEC.
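For readers unfamiliar with AIC weights, the Python sketch below shows how a statement such as "about 300,000 times more probable" relates to an AIC difference. The weight of each model is proportional to exp(-ΔAIC/2), so the ratio of two weights depends only on the difference between the two AIC values. The function names are hypothetical, and the AIC difference used here is back-calculated from the stated ratio rather than taken from the dissertation's results.

```python
import math


def akaike_weights(aic_values):
    """Convert a list of AIC values into Akaike weights that sum to 1."""
    best = min(aic_values)
    rel_likelihoods = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel_likelihoods)
    return [r / total for r in rel_likelihoods]


def evidence_ratio(aic_a, aic_b):
    """Ratio of the Akaike weight of model a to that of model b."""
    return math.exp(0.5 * (aic_b - aic_a))


# AIC difference implied by a weight ratio of 300,000 (back-calculated).
delta_aic = 2.0 * math.log(300000.0)
print("Implied AIC difference:", round(delta_aic, 1))

# Weights for a better model (AIC set to 0) and a worse one (AIC = delta_aic).
weights = akaike_weights([0.0, delta_aic])
print("Akaike weights:", weights)
print("Weight ratio:", evidence_ratio(0.0, delta_aic))
```

A ratio of this size thus corresponds to an AIC difference of roughly 25 units, far beyond the differences of a few units that are usually treated as meaningful.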
Chapter 3 expands this analysis to compare the cladogenesis models used by the programs LAGRANGE, DIVA, and BayArea (Landis et al. 2013, Systematic Biology). (The BayArea program actually ignores cladogenesis, which is identical to assuming that the ancestral range is copied, unmodified, to both daughter lineages at each cladogenesis event.) These models, along with +J versions, are run on samples of island clades and non-island (continental and oceanic) clades. Almost all analyses, including continental clades, strongly favored the "+J" models over the models without founder-event speciation. However, founder-event speciation was measurably less frequent in non-island analyses, being 2-4 times weaker than in analyses of island clades. Only one clade ("Taygetis clade" butterflies from the Neotropics) was found to favor the DEC model over all others.
Chapter 4 addresses the problem of including fossils in the inference of geographic range evolution on phylogenies. This is done by taking into account the fact that detection of presence and absence in regions will often be imperfect for fossil taxa. A hierarchical model is used to link a probabilistic model of imperfect detection with the traditional likelihood calculations of geographic range evolution. The NEOMAP database is used to provide occurrence data through time for two example clades with good fossil records, namely, North American Canidae and Equinae. The database is also used to provide counts of occurrences of taphonomic control groups that are used to measure relative sampling effort in each region and time bin. The two clades are found to prefer different models for cladogenesis: equids favor DEC, but canids favor BAYAREA+J. This result is found both with and without usage of the imperfect detection model. Ironically, in test data chosen because of their high-quality fossil record, the record was so good that the model for imperfect detection had little impact. However, modeling imperfect detection is likely to be extremely useful in situations with poorer data, or with subsampled data.
Several important conclusions may be drawn from this research. First, formal model selection procedures can be applied in phylogenetic inferences of historical biogeography, and the relative importance of different processes can be measured. These techniques have great potential for strengthening quantitative inference in historical biogeography. No longer are biogeographers forced to simply assume, consciously or not, that some processes (such as vicariance or dispersal) are important and others are not; instead, this can be inferred from the data. Second, founder-event speciation appears to be a crucial explanatory process in most clades, the only exception being some intracontinental taxa showing a large degree of sympatry across widespread ranges. This is not the same thing as claiming that founder-event speciation is the only important process; founder-event speciation as the only important process is inferred in only one case (Microlophus lava lizards from the Galapagos). The importance of founder-event speciation will not be surprising to most island biogeographers. However, the results are important nonetheless, as there are still some vocal advocates of vicariance-dominated approaches to biogeography, such as Heads (2012, Molecular Panbiogeography of the Tropics), who allows vicariance and range-expansion to play a role in his historical inferences, but explicitly excludes founder-event speciation a priori. The commonly used LAGRANGE DEC and DIVA programs actually make assumptions very similar to those of Heads, even though many users of these programs likely consider themselves dispersalists or pluralists. Finally, the inclusion of fossils and imperfect detection within the same likelihood and model-choice framework clears the path for integrating paleobiogeography and neontological biogeography, strengthening inference in both.
Model choice is now standard practice in phylogenetic analysis of DNA sequences: a program such as ModelTest is used to compare models such as Jukes-Cantor, HKY, GTR+I+G, and to select the best model before inferring phylogenies or ancestral states. It is clear that the same should now happen in phylogenetic biogeography. BioGeoBEARS enables this procedure. Perhaps more important, however, is the potential for users to create and test new models. Probabilistic modeling of geographic range evolution on phylogenies is still in its infancy, and undoubtedly there are better models out there, waiting to be discovered. It is also undoubtedly true that different clades and different regions will favor different processes, and that further improvements will be had by linking the evolution of organismal traits (e.g., loss of flight) with the evolution of geographic range, within a common inference framework. In a world of rapid climate change and habitat loss, biogeographical methods must maximize both flexibility and statistical rigor if they are to play a role. This research takes several steps in that direction.
BioGeoBEARS is open-source and is freely available at the Comprehensive R Archive Network (http://cran.r-project.org/web/packages/BioGeoBEARS/index.html). A step-by-step tutorial, using the Psychotria dataset, is available at PhyloWiki (http://phylo.wikidot.com/biogeobears).