Frontiers of Biogeography
The Earth Life Consortium API: an extensible, open-source service for accessing fossil data and taxonomies from multiple community paleodata resources

Paleobiologists and paleoecologists interested in studying biodiversity dynamics over broad spatial and temporal scales have built multiple community-curated data resources, each emphasizing a particular spatial domain, timescale, or taxonomic group(s). This multiplicity of data resources is understandable, given the enormous diversity of life across Earth's history, but it creates a barrier to achieving a truly global understanding of the diversity and distribution of life across time. Here we present the Earth Life Consortium Application Programming Interface (ELC API), a lightweight data service designed to search and retrieve fossil occurrence and taxonomic information from across multiple paleobiological resources. Key endpoints include Occurrences (returns spatiotemporal locations of fossils for selected taxa), Locales (returns information about sites with fossil data), References (returns bibliographic information), and Taxonomy (returns names of subtaxa associated with selected taxa). Data objects are returned in JSON or CSV format. The ELC API supports tectonic-driven shifts in geographic position back to 580 Ma using services from Macrostrat and GPlates. The ELC API has been implemented first for the Paleobiology Database and Neotoma Paleoecology Database, with a test extension to the Strategic Environmental Archaeology Database. The ELC API is designed to be readily extensible to other paleobiological data resources, with all endpoints fully documented and following open-source standards (e.g., Swagger, OGC). The broader goal is to help build an interlinked and federated ecosystem of paleobiological and paleoenvironmental data resources, which together provide paleobiologists, macroecologists, biogeographers, and other interested scientists with full coverage of the diversity and distribution of life across time.


Introduction
Study of the patterns and processes governing the diversity of life on earth at long timescales and broad spatial scales requires the assembly of many individual fossil occurrences into larger, open, community-curated data resources (CCDRs; Williams et al. 2018a) such as the Paleobiology Database (PBDB), the Neotoma Paleoecology Database, and others (Uhen et al. 2013).
In an era of global change, when stewarding biodiversity is urgent (IPBES 2019), conservation biologists, global change ecologists, paleontologists, and other Earth system scientists use the geological record to study biodiversity dynamics during large and rapid transitions (National Research Council 2005, Willis and Birks 2006, Dietl and Flessa 2011, Willis and MacDonald 2011, Fritz et al. 2013, Kidwell 2015, Fordham et al. 2020). For example, large paleodata syntheses are used to understand how contemporary ecological systems are shaped by historical legacies of slow-acting processes (e.g., Whittaker et al. 2001, Jablonski 2008), test the ecological forecasting models used to project and prepare for the impacts of 21st-century climate change (e.g., Veloz et al. 2012, Blois et al. 2013), assess the patterns and causes of abrupt ecological and environmental change (Williams et al. 2010, Shanahan et al. 2015), constrain phylogenetic models of species divergence and rates of evolution (Muller & Reisz 2005), assess the novelty of contemporary ecosystems relative to historic or deeper-time baselines (Jackson & Williams 2004, Radeloff 2015), and understand the fundamental processes that generate, maintain, and rebuild biodiversity (Crame 2001, Jablonski et al. 2013).
These open paleodata resources also make paleobiological data accessible to scientists from allied disciplines, powering the next generation of convergent research. For example, the fossil record is used by sedimentologists and economic geologists studying facies relationships and employing biostratigraphic controls for correlating rock strata (Metcalfe & Nicoll 2007), structural geologists and geophysicists seeking biogeographic constraints on reconstructions of former tectonic plate positions (Chaloner & Creber 1988, Wright et al. 2013), paleoclimatologists building proxy-based reconstructions of past climates (Bartlein et al. 2011, Marsicek et al. 2018, Routson et al. 2019), and archaeologists seeking to understand how past societies shaped and were shaped by their environment (e.g., O'Regan et al. 2011, Kohler et al. 2018). In response, many paleoecological and paleobiological data resources have emerged over the years, of varying size and scope, some begun and maintained by individual investigators and others maturing into publicly available, community-curated data resources (Williams et al. 2018b), with data contributed and curated by a broad cross-section of the paleobiological community (Uhen et al. 2013, Williams et al. 2018a). The PBDB, launched in 1998 to study global biodiversity dynamics across the history of life, is a global-scale data resource with holdings across the Phanerozoic to present and a temporal grain on the order of 10^6 years. The Neotoma Paleoecology Database, a coalition of constituent databases that use a common database platform (Grimm et al. 2018), emphasizes records from the Late Neogene to present and temporal grains of 10^1 to 10^3 years and has multiple origins, often linked to efforts to reconstruct past climates, test climate models, and map species responses to environmental change (e.g., COHMAP 1988, FAUNMAP 1996, Harrison et al. 2013, Grimm et al. 2018).
The PBDB and Neotoma together have been cited over 50,000 times, with H-factors of 94 and 79, respectively. Many other paleobiological data resources exist, of varying size and scope, including the New and Old World Database (NOW), the Strategic Environmental Archaeology Database (SEAD), Neptune, and others (Uhen et al. 2013).
The next stage of evolution is to consolidate or federate paleodata resources. Consolidation, in which data from one data resource are added to another, is a good solution for data resources with simpler data models or that are unlikely to persist on their own, e.g., if the lead investigator(s) retire or move to other projects. Both the PBDB and Neotoma have grown in part through consolidation. PBDB has incorporated data from the Evolution of Terrestrial Ecosystems (ETE) project, and several research projects have housed their data in the PBDB from the start instead of creating standalone databases. In addition, PBDB has incorporated several large datasets that had been stored in various offline formats and made them available to all. Many of these are now downloadable as PBDB Data Archives (https://www.paleobiodb.org/classic/app/archive/list). Constituent Databases within Neotoma include FAUNMAP (Graham et al. 1996), the European Pollen Database (Fyfe et al. 2009), the Neotoma Ostracode Database (Curry et al. 2013), the Diatom Database of the Academy of Natural Sciences (Sullivan & Charles, 1994), the Neotoma Testate Amoebae Database (Amesbury et al. 2018), and others (Williams et al. 2018b).
However, for some data resources, consolidation may not be feasible if data models have reached a level of complexity that precludes simple merging of semantic and ontological schema, if sustained funding requires maintaining a standalone identity (e.g., for national-scale data resources), or if consolidation would disrupt the linkage between a CCDR and the community that it supports. The last consideration is perhaps the most critical; the ultimate guarantor of sustainability of all CCDRs is close engagement with and support of their networks of data contributors and stewards.
This persistent multiplicity of paleodata resources, although understandable, presents a challenge for macroecologists, biogeographers, and other scientists seeking broad-scale, integrative understanding of the diversity and distribution of life. Simply discovering all pertinent paleodata resources is a challenge, and each has its own data schema, which hinders integration and understanding. Because paleodata resources often focus on particular spatiotemporal domains and have been assembled by different networks of researchers, fossil occurrences for given taxonomic groups may be distributed across paleodata resources, with poorly characterized gaps and overlaps.
Frontiers of Biogeography 2021, 13.2, e50711 © the authors, CC-BY 4.0 license
Here we describe a new resource, developed in partnership by the PBDB and Neotoma and open to all, called the Earth-Life Consortium Application Programming Interface (ELC API). The API is designed to be a common lightweight data standard and associated web services for discovering and obtaining fossil occurrence data from across multiple paleodata resources. A series of API endpoints enable retrieval of different kinds of data. The project is completely open source and ELC API code is designed to be readily extensible to other paleoecological and paleobiological resources.

The ELC API: Overview and Design Process
The ELC API is a composite API that generates and dispatches queries to multiple paleobiological data resources, via subqueries directed to the native API for each resource. Basic operation of the ELC API is illustrated in Fig. 1. The resulting data returns are processed and reformatted by the ELC API to provide the end user with comparable data objects. The ELC API is intentionally designed to be lightweight, with a fairly small number of endpoints and expected parameters. This design supports the goal of searching and returning data from multiple paleodata resources, each with its own particular data model, semantics, and ontology. Results from the participating resources are aggregated into a single data return.
Figure 1. Basic operation of the ELC API. In the diagram, data queries are shown in blue, while data returns are shown in red. The ELC API takes a single query from an end user and sends it to constituent database APIs. Then, it takes the data returns and standardizes them into a single data return to the end user.
Design and development of the ELC API followed a user-centered, "API first" development process that emphasizes careful consideration of how to robustly represent and access information before application development. This approach consists of the following steps:
1) Developers and paleobiologists from the Neotoma and PBDB teams met to review the data models of the existing data resources and native APIs and to identify semantic commonalities and points of divergence. Common research queries were identified by paleobiologists and translated by the developers into sketches of API endpoints, each consisting of parameters to be passed to the endpoint and the structure of data objects to be returned by the endpoint.
3) Server backend code was developed using Python and Flask for each generated endpoint.
This API-first process enabled the developers and scientists to stay closely engaged throughout development and allowed scientific users to quickly test and suggest modifications to the API. Changes to the API were often made in the schema during development and then pushed down through the frontend to the backend code.
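The composite pattern described in this section (one user query fanned out to several resources, with the returns merged into one data return) can be sketched in a few lines of Python. This is a minimal illustration, not the ELC codebase: the functions `query_pbdb` and `query_neotoma`, the `HANDLERS` registry, and all field names are hypothetical, and the subqueries return canned records instead of calling any native API.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical stand-ins for per-resource subqueries. A real handler would
# call the resource's native API; here each one returns a canned record.
def query_pbdb(taxon: str) -> List[dict]:
    return [{"occ_id": "pbdb-1", "taxon": taxon}]

def query_neotoma(taxon: str) -> List[dict]:
    return [{"occ_id": "neotoma-1", "taxon": taxon}]

# Registry of participating resources (names are assumptions).
HANDLERS: Dict[str, Callable[[str], List[dict]]] = {
    "pbdb": query_pbdb,
    "neotoma": query_neotoma,
}

def composite_query(taxon: str, run: Optional[Iterable[str]] = None) -> List[dict]:
    """Send one query to each selected resource and merge the returns."""
    selected = list(run) if run is not None else list(HANDLERS)
    merged: List[dict] = []
    for name in selected:
        for record in HANDLERS[name](taxon):
            # Tag every record with its source so the merged return
            # remains traceable to the originating resource.
            merged.append({"source": name, **record})
    return merged
```

Calling `composite_query("Canidae")` queries every registered resource, while `composite_query("Canidae", run=["neotoma"])` restricts the query to a subset, loosely mirroring the role that the article assigns to the run parameter.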

ELC API Endpoints
Overview
Each ELC API endpoint returns a specific suite of formatted data from all participating data resources, with users having the option to choose subsets of participating resources. API versioning is supported through the use of 'api_v1' in the directory path; all descriptions here are for Version 1.0 of the API. All API endpoints and parameters are fully documented at http://earthlifeconsortium.org/docs/api-docs.html, and an interactive 'sandbox' is available for testing and designing API queries (http://earthlifeconsortium.org/api_v1/ui/). Here we briefly summarize the features of each endpoint likely to be of particular interest to biogeographers and macroecologists.

Occurrences (occ)
Base Path: http://earthlifeconsortium.org/api_v1/occ?
Parameters: taxon (name or comma-separated list of names), bbox (wkt polygon), agerange
Notes: Occurrences are the individual instances of fossils in time and space. Occurrences of taxa can be specified at any taxonomic level, bounded by any units of time and spatial delimitation. Taxon names can use wild cards (%). Geographic search parameters are described using the well-known text (WKT) standard for describing polygons developed by the Open Geospatial Consortium. The WKT implementation is compliant with ISO 19125-1 and 19125-2 standards. The Open Street Map Playground (https://clydedacruz.github.io/openstreetmap-wkt-playground) graphical interface can be used to create and define search polygons. For all ELC APIs, geographic coordinates are expected to be decimal degrees, ranging from -90 (S) to 90 (N) and -180 (W) to 180 (E).
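As an illustration of the WKT polygons used for geographic searches, the following Python sketch converts a simple bounding box in decimal degrees into a WKT POLYGON string. The function name `bbox_to_wkt` is hypothetical, and the longitude-latitude ordering follows the general WKT convention (x before y); whether the ELC API accepts exactly this spelling should be checked against its documentation.

```python
def bbox_to_wkt(west: float, south: float, east: float, north: float) -> str:
    """Return a WKT POLYGON for a box given in decimal degrees."""
    # Enforce the coordinate ranges stated for the ELC API.
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitude must lie between -180 (W) and 180 (E)")
    if not (-90 <= south <= 90 and -90 <= north <= 90):
        raise ValueError("latitude must lie between -90 (S) and 90 (N)")
    if west >= east or south >= north:
        raise ValueError("expected west < east and south < north")
    # WKT rings list x (longitude) before y (latitude) and must close
    # on the starting vertex; vertices run counter-clockwise.
    corners = [(west, south), (east, south), (east, north),
               (west, north), (west, south)]
    ring = ", ".join(f"{lon:g} {lat:g}" for lon, lat in corners)
    return f"POLYGON(({ring}))"
```

For example, `bbox_to_wkt(-110, 35, -100, 45)` yields `POLYGON((-110 35, -100 35, -100 45, -110 45, -110 35))`. Boxes that cross the 180° meridian are deliberately rejected by this sketch.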
The ELC API supports spatial searches by both modern spatial coordinates and paleogeographic coordinates, using the coordtype parameter. Temporal searches can employ named geologic ages using definitions from the International Commission on Stratigraphy (ICS, Gradstein et al. 2012) or minimum/maximum age ranges, using units of ybp, ka, or Ma (years before present, thousands of years (kiloanna) before present, or millions of years (megaanna) before present). For example, a user could request all fossil occurrences of the Holocene, all occurrences from the Eocene to Pliocene, all occurrences from 12 to 15 ka, or from 1 Ma through the Miocene. Other parameters include run (to choose which data resources are queried), limit and offset (to limit the number of data objects returned and enable serial queries), and show (to allow full dataset returns, ID numbers only, or summary statistics only).
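Because age ranges may be given in ybp, ka, or Ma, client code often needs to bring them onto a common scale before comparing results. The sketch below shows one way to do this; the helper names `to_ybp` and `normalize_agerange` are hypothetical and are not part of the ELC API.

```python
# Years represented by one unit of each age scale named in the text.
_YEARS_PER_UNIT = {"ybp": 1, "ka": 1_000, "ma": 1_000_000}

def to_ybp(value: float, unit: str) -> float:
    """Convert an age in ybp, ka, or Ma to years before present."""
    try:
        factor = _YEARS_PER_UNIT[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown age unit: {unit!r}") from None
    return value * factor

def normalize_agerange(age_a: float, age_b: float, unit: str) -> tuple:
    """Return (older, younger) in years before present, in that order."""
    a, b = to_ybp(age_a, unit), to_ybp(age_b, unit)
    return (max(a, b), min(a, b))
```

With these helpers, the "12 to 15 ka" request mentioned above becomes `normalize_agerange(12, 15, "ka")`, which evaluates to `(15000, 12000)`. Named geologic ages are not handled here, since resolving them requires the ICS definitions.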
Note that the seemingly simple concept of 'occurrence' is a foundational point of semantic divergence between the PBDB and Neotoma that required special handling when building data returns. In the PBDB, unique identifiers are assigned to individual fossil occurrences, because the PBDB was originally designed as a store of species occurrences in the stratigraphic record, extracted from the literature. In Neotoma, unique identifiers are assigned to samples, based on a data model in which samples are collected from cores and stratigraphic sections (Williams et al. 2018b). Each sample can consist of one fossil specimen (e.g., a single canid femur) or multiple taxa, e.g., separate counts of individual micropaleontological taxa. The ELC API resolves this difference by returning the unique occurrence ID for each PBDB fossil occurrence and a composite unique identifier for each Neotoma occurrence that combines the sample identifier and taxon identifier.
Example:
http://www.earthlifeconsortium.org/api_v1/occ?taxon=pinus&agerange=15000%2C10000&ageunits=ybp&includelower=false&limit=10 (retrieve the first 10 instances of pine between 15,000 and 10,000 years ago)
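Occurrence queries of this kind can also be assembled programmatically. The following sketch uses only the Python standard library to build a query URL from the parameters named in this section (taxon, agerange, ageunits, includelower, limit); the function `occ_url` is a hypothetical helper, and no network request is made.

```python
from urllib.parse import urlencode

# Base path of the Occurrences endpoint, as given in the text.
OCC_BASE = "http://www.earthlifeconsortium.org/api_v1/occ"

def occ_url(taxon, agerange, ageunits="ybp", includelower=False, limit=None):
    """Build an Occurrences query URL; agerange is an (older, younger) pair."""
    params = {
        "taxon": taxon,
        # The two ages are sent as one comma-separated value;
        # urlencode percent-encodes the comma as %2C.
        "agerange": f"{agerange[0]},{agerange[1]}",
        "ageunits": ageunits,
        # Booleans are written in lowercase, as in the article's example.
        "includelower": str(includelower).lower(),
    }
    if limit is not None:
        params["limit"] = limit
    return f"{OCC_BASE}?{urlencode(params)}"
```

Calling `occ_url("pinus", (15000, 10000), limit=10)` produces the query string `taxon=pinus&agerange=15000%2C10000&ageunits=ybp&includelower=false&limit=10` appended to the base path. The resulting URL can be passed to any HTTP client.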

Locales (loc)
Base Path: http://earthlifeconsortium.org/api_v1/loc?
Parameters: idlist, bbox, agerange, ageunits, coordtype, limit, offset, output, show, run
Notes: The Locales endpoint returns information about sites or locations that contain fossil samples. Locales can be searched for using polygons or age ranges, as described for occurrences. The idlist parameter also allows locales to be found for lists of collection IDs (PBDB) or dataset IDs (Neotoma), using the format [database]:[datatype]:id_number, …. All other parameters follow the format for the occurrences endpoint.
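The idlist format can be generated from structured identifiers with a small helper. In the sketch below, the function name `make_idlist` is hypothetical, and the database and datatype codes used in the usage note ("pbdb", "col", "neotoma", "dst") are placeholders chosen for illustration only; the codes actually accepted by the ELC API should be taken from its documentation.

```python
def make_idlist(entries):
    """Join (database, datatype, id_number) triples into an idlist string."""
    parts = []
    for database, datatype, id_number in entries:
        number = int(id_number)  # raises ValueError for non-numeric IDs
        if number < 0:
            raise ValueError("ID numbers must be non-negative")
        # Each entry follows the [database]:[datatype]:id_number pattern.
        parts.append(f"{database}:{datatype}:{number}")
    # Entries are separated by commas.
    return ",".join(parts)
```

For instance, `make_idlist([("pbdb", "col", 1234), ("neotoma", "dst", 567)])` returns `pbdb:col:1234,neotoma:dst:567`, again with placeholder codes.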

References (ref)
Notes: References returns bibliographic information about the publications stored in paleobiological data resources that are linked to fossil records. References can be returned in JSON, CSV, BIBJSON, or RIS format (specified in output parameter) with BIBJSON. The idlist parameter format follows that for the locales endpoint. Because the idlist parameter requires knowledge of database-specific ID numbers for publications, the references API is usually called programmatically, given knowledge of publication IDs provided by the occurrence data return.

Taxonomy
Parameters: taxon (name or comma-separated list of names), idlist, includelower (true, false), output (json, csv), show (all, poll, idx), run (all or list of database names)
Notes: The Taxonomy endpoint reveals the taxonomic names and hierarchies stored inside paleobiological data resources. Taxonomic name requests can be sent either as a list of taxon IDs (using idlist) or as one or more taxonomic names (taxon). If includelower is set to true (default is false), Taxonomy will return all species or sub-species names within a named taxon (PBDB) or will run wildcard searches (Neotoma). If the taxonomic name does not occur in a given data resource, the ELC API will not return any data from that resource. Taxonomic concepts may differ among included data resources.
The PBDB and Neotoma handle taxonomies and taxon names differently. The PBDB, which draws taxonomic names directly from the published literature, allows multiple taxonomic names to be stored for the same taxon and is dynamically updated by data authorizers (Peters & McClennen, 2016). The most recent name entered for a given taxon is used as the current taxon name. The PBDB also employs a rank-ordering system with taxonomic names assigned to levels of species, genus, etc. Neotoma uses defined vocabularies of taxonomic names in which Data Stewards can propose the addition or modification of taxonomic names and designated Taxonomic Experts approve these additions and modifications to taxonomic names (Williams et al. 2018b). Taxonomic names in Neotoma can include information about fossil morphology or taxonomic uncertainty, e.g., Poaceae (<50μm), Odocoileus cf. O. virginianus, or Ambrosia-type. Because most taxa in Neotoma are still extant, Neotoma attempts when possible to link to current taxonomic authorities, e.g., using phylogenetic-based classification for plants (e.g., Cantino et al. 2007).

Miscellaneous API Endpoints
The ELC API also supports a number of miscellaneous utility endpoints. These are all located within the misc pathway.
Return the oldest and youngest ages spanning the specified range. Age range requests can be passed as a single geologic age, a pair of geologic ages, or numeric values. Geologic ages are resolved according to ICS definitions (Gradstein et al. 2012).

Subtaxa
Base Path: http://earthlifeconsortium.org/api_v1/misc/subtaxa?
Parameters: taxon (name or comma-separated list of names), synonyms (true, false)
Notes: Return a list of all taxonomic names hierarchically below the specified taxon, optionally including synonyms. As with occurrences, a single name or a list of taxon names can be passed in via taxon, with the % wildcard also allowed. Subtaxa defaults to returning synonyms.

Mobile
Base Path: http://earthlifeconsortium.org/api_v1/misc/mobile?
Parameters: taxon (name or comma-separated list of names), bbox
Notes: This is a custom lightweight endpoint designed for use with Flyover Country (Loeffler 2018, Myrbo et al. 2018) and other mobile apps. Mobile requires only two parameters (taxon name(s) and geographic polygon) and returns a combination of occurrence data with associated taxonomic and select environmental details. As above, wildcard operators are permitted. The response is nested JSON with a highly compact vocabulary.

Use Case Examples
The best use cases available to demonstrate the utility of the ELC API are those where both databases have significant numbers of occurrences of the same fossil taxon. For Neotoma and the PBDB, the area of greatest shared holdings is terrestrial vertebrates. Given that Neotoma has a heavy emphasis on the Late Neogene (and particularly the Quaternary), and PBDB covers all of deep time, the Pleistocene is the area of greatest temporal overlap. As other CCDRs join the ELC API, additional taxonomic, temporal, and spatial parameters will be key to producing data sets blended from several resources.
The sea otter Enhydra lutris was used by Uhen et al. (2018) to demonstrate contributions from both Neotoma and PBDB. Here we show a similar return for the polar bear, Ursus maritimus, using the ELC API. Fig. 2 shows a map of modern U. maritimus distribution from OBIS compared to the distribution from Neotoma and PBDB downloaded with the ELC API. Notice that the fossil distribution, primarily from the Pleistocene, shows U. maritimus much farther to the south, particularly in Europe and Asia, where they are unknown today. Also note that neither Neotoma nor PBDB has full coverage of the fossil range of polar bears, but together they provide a much fuller and clearer picture of the broad distribution of this currently threatened species. Another use case that demonstrates the complementarity of Neotoma and PBDB data in geologic time is shown in Fig. 3, which shows the distribution of occurrences of the Family Canidae. Neotoma is strong in shallow time, while PBDB is strong in deep time. To understand the full distribution of this taxon, both sets of data are necessary, and the ELC API gives access to both.

Benefits to scientists and other end users
Paleobiological data are hard-won, requiring substantial field and lab time and deep taxonomic expertise in the identification of fossil specimens. The paleobiological data and knowledge gathered by CCDRs such as Neotoma and the PBDB represent decades to centuries of accumulated effort and hundreds of millions of dollars of scientific investment. Hence, these data resources are foundational infrastructure for the paleobiosciences.
The ELC API improves and expands the interoperability of cyberinfrastructure (CI) within the paleobiosciences. It also promotes the sharing and use of paleobiological data within and outside the discipline, especially within closely allied geoscience and bioscience disciplines (Uhen et al. 2018, Williams et al. 2018a). This multiplies the usefulness of hard-won fossil occurrence data that have been accumulated by paleontologists for decades. The ELC API also builds interoperability between this paleobioscience CI and current and emergent CI in the biosciences, particularly with respect to networks of biodiversity and ecological databases. Finally, it helps to establish a 4D framework for life and its physical environments at all scales of time.

Extending the ELC API to Other Data Resources
The ELC API is designed to be readily extensible to other data resources, and its strength grows with the addition of other paleoecological and biological databases. In turn, joining the ELC API helps increase the discovery and citation of participating paleodata resources. All ELC API code makes use of open-source standards (e.g., Swagger, OGC) and is available on GitHub. Research groups and developers are encouraged to fork the GitHub ELC repository (https://github.com/EarthLifeConsortium/elc_api) and follow the steps described below to add new paleodata resources to the ELC API.
The ELC API is an intentionally lightweight data service, because keeping the number of endpoints and codebase small facilitates adoption by other paleobiological data resources. However, this design philosophy means that some kinds of searches may not be able to be performed by the ELC API or some kinds of data returns may not be available. In these cases, users desiring more detailed data returns can use the native Neotoma and PBDB API endpoints, R packages, and other associated resources.
Other data resources can be linked to the ELC API by adding a customized "handler" file (https://github.com/EarthLifeConsortium/elc_api/tree/master/swagger_server/handlers) that translates the output of the data resource's own API and returns results in the format expected by the ELC API for each individual endpoint. The use of handlers means that onboarding new resources can happen relatively easily, without changing the underlying framework.
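The translation role of a handler can be illustrated with a short Python sketch. This is not the interface used in the ELC repository: the function `translate`, the mapping `EXAMPLE_FIELD_MAP`, and every field name shown are hypothetical, chosen only to show how a resource's native records might be renamed into a shared vocabulary.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from one resource's native field names (keys)
# to a shared vocabulary (values). Real handlers define their own.
EXAMPLE_FIELD_MAP: Dict[str, str] = {
    "taxon_name": "taxon",
    "lat": "lat",
    "lng": "lon",
    "max_ma": "max_age",
    "min_ma": "min_age",
}

def translate(native_records: Iterable[dict],
              field_map: Dict[str, str],
              source: str) -> List[dict]:
    """Rename native fields to the shared vocabulary and tag the source."""
    translated = []
    for record in native_records:
        row = {"source": source}
        for native_key, common_key in field_map.items():
            # Fields missing from a native record become None, so every
            # translated row exposes the same set of keys.
            row[common_key] = record.get(native_key)
        translated.append(row)
    return translated
```

Under this design, adding a resource amounts to supplying a new field mapping and a function that fetches native records, while the code that merges and serves results stays unchanged.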
As an initial test of the extensibility of the ELC API, the Occurrence endpoint has been extended to retrieve data from the Strategic Environmental Archaeology Database (SEAD, Buckland et al. 2018).
The major version of the ELC API is denoted within the base path itself, and no sub-version of the API will break parameter, route, or response compatibility. If, in the future, the API is expanded or modified in such a way that backwards compatibility is untenable (which is not anticipated at this time), a new major version will be launched. Earlier versions will remain available.
Extension of the ELC API to other paleobiological data resources requires that they have their own native APIs for accessing internal data holdings. Not all do yet. In these cases, the ELC API may offer a useful design template and starting schema for developers of other paleobiological data resources.
We plan to add access to museum collection data from iDigBio via the ePANDDA API in the near future. This will allow users to query both published and unpublished museum data sets with single queries.
The ELC API does not attempt to automatically test whether duplicate fossil occurrences are returned by data resources. At present, this is not a major concern because the PBDB and Neotoma have largely non-overlapping data holdings, with the possible exception of some overlap in Miocene and Pleistocene terrestrial vertebrates (e.g., Fig. 2). One of the likely uses of the ELC API, however, is as a starting point for identifying and resolving potential duplicate data holdings.
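As one possible starting point for such duplicate screening, the sketch below groups merged occurrence records by taxon name and rounded coordinates and reports groups that contain records from more than one resource. It is a heuristic illustration only: the function `candidate_duplicates` and the record keys (source, taxon, lat, lon) are assumptions, not part of the ELC API, and flagged groups would still need expert review.

```python
from collections import defaultdict
from typing import Iterable, List

def candidate_duplicates(records: Iterable[dict], decimals: int = 1) -> List[List[dict]]:
    """Group records by taxon and rounded location; keep cross-source groups."""
    groups = defaultdict(list)
    for rec in records:
        key = (
            rec["taxon"].strip().lower(),   # case-insensitive name match
            round(rec["lat"], decimals),    # coarse spatial bin
            round(rec["lon"], decimals),
        )
        groups[key].append(rec)
    # Only groups drawing on two or more resources are potential duplicates.
    return [grp for grp in groups.values()
            if len({r["source"] for r in grp}) > 1]
```

Rounding creates fixed spatial bins, so two nearby records that fall on either side of a bin edge will not be grouped; comparing ages and references as well would make the screen more selective.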