The Digital Index of North American Archaeology: networking government data to navigate an uncertain future for the past

Abstract
The 'Digital Index of North American Archaeology' (DINAA) project demonstrates how the aggregation and publication of government-held archaeological data can help to document human activity over millennia and at a continental scale. These data can provide a valuable link between specific categories of information available from publications, museum collections and online databases. Integration improves the discovery and retrieval of records of archaeological research currently held by multiple institutions within different information systems. It also aids in the preservation of those data and makes efforts to archive these research results more resilient to political turmoil. While DINAA focuses on North America, its methods have global applicability.


Introduction
Government policies and bureaucracies shape the practice of archaeology in much of the world, including in the USA. Laws, regulations and government agencies oversee heritage management, influencing where archaeologists work, how intensely they record sites and the manner in which archaeological sites and their documentation experience long-term curation. Despite this importance, the impact and outcomes of US government policies in archaeology remain largely opaque to research communities and other public stakeholders. Furthermore, the nation's investment in archaeology and historic preservation has produced a literature that, while vast, is often inaccessible. The 'Digital Index of North American Archaeology' project (DINAA; ux.opencontext.org/archaeology-site-data/) builds critically needed infrastructure to make information about archaeology across North America accessible and useful to scholars and the public alike (Wells et al. 2014). DINAA provides the most comprehensive and detailed database documenting human settlement in North America currently available by aggregating archaeological and historical data from state and tribal governmental authorities that manage US cultural resources. To date, the public can download nearly 500 000 site file records (with precise geographic location and other sensitive data redacted) free of charge, and free of intellectual property restrictions. These records, available via Open Context (opencontext.org), an open-access data-publishing service (Figure 1), cross-reference an ever-growing number of reports, museum collections, bibliographic references and other online datasets. This provides researchers, museums, libraries, government offices, tribal officials and members of the public with a powerful gazetteer of all documented historical and archaeological sites in the USA, and this integrated content can facilitate cross-disciplinary research on a continental scale.
Altschul and Patterson (2010) conservatively estimate that US taxpayers invest over US $500 million per year to comply with federally mandated historical and archaeological protection measures. This level of public investment nearly matches the total combined budgets of the Institute of Museum and Library Services (approximately US $240 million in 2015), the National Endowment for the Humanities (approximately US $140 million in 2015) and the National Endowment for the Arts (approximately US $150 million in 2015), and is 30 times the annual National Science Foundation budget (approximately US $16.5 million in 2013) for archaeology and archaeometry (Rocks-MacQueen 2014). These surprising figures demonstrate archaeology's relative importance in public cultural heritage investments. Unfortunately, much of this work goes unnoticed: decades of investment in managing and protecting America's archaeological heritage have led to few publicly accessible outcomes. Cultural resource management (CRM) largely takes place within relatively opaque bureaucratic processes that regulate construction and development. CRM work has resulted in an estimated 350 000 reports nationwide as of 2004 (NADB 2016), but because of poor access and cataloguing, irreplaceable cultural heritage documentation in these 'grey literature' reports is under-appreciated or ignored. Thus, CRM output typically sees little external reuse in research or publication venues recognised by professionals. Similarly, this literature has very little accessibility to other groups, including Native American descendant communities.

Figure 1. DINAA partnerships as of December 2016 with dot density plot showing the distribution of cultural resources at low resolution within states whose data have been received thus far. Dots do not refer to exact site locations, but to groups of five sites whose position has been randomly distributed within 20 × 20km grid cells.

Making the most of public investment in archaeology
'Open-government' reform efforts attempt to promote greater public access to information resulting from public investments, such as CRM archaeology. These initiatives complement so-called 'open-science' and other 'open-data' efforts with broad goals of promoting greater 'democratic accountability' to research and public policy through broader public access to information (Kansa 2012; Lake 2012). Appropriate investments in data management can help this information achieve its potential by making it accessible and usable (Kansa & Bissell 2010; Kansa & Kansa 2011, 2013; Kansa et al. 2014; Kintigh et al. 2014a & b; Raviele 2014; Wells et al. 2014; Anderson 2018). Indeed, as is discussed later in this article, recent political upheavals in the USA further highlight the importance of public policy in shaping access to scientific data.
DINAA has open-science and open-government goals and motivations. It leverages the tremendous public investment in cultural heritage for wider public impacts by building a comprehensive, free and open-access inventory of US archaeological and historical resources. DINAA compiles, cleans and publishes site file data aggregated from state and other agencies that enforce US historical protection laws. Site files include information about periods of use, associated artefact collections and documents, preservation condition, eligibility for the National Register of Historic Places, and a host of other variables useful for research and management purposes. In 1993, the last time primary site file data were compiled nationally through the National Park Service's National Archaeological Database (NADB), just under one million archaeological sites had been recorded (NADB Maps 1993). This total has grown over the past two decades (see also Anderson & Horak 1995;Anderson & Sassaman 2012: 32), and we now estimate the number to be nearly 2.5 million sites.
DINAA has integrated archaeological site data from nearly 500 000 sites from more than a dozen states. These data encompass the rich chronological, cultural and anthropological metadata used by government compliance officials and the research community ( Figure 2). DINAA will continue this work over the next two years, with support from the Institute of Museum and Library Services and the National Science Foundation. This work will encompass the remainder of the USA, adding an estimated one to two million archaeological sites.
As already mentioned, contemporary Native American communities largely lack access to CRM reports, data and other information documenting their past. Before requesting any data from state offices, we contacted every federally recognised tribe with an ancestral homeland in the eastern USA (DINAA's current area of coverage). In the planning and the execution of the new DINAA-related grant projects, we expanded outreach and consultation efforts to Native American nations in the remainder of the USA. We staffed a booth at the recent meeting of the National Association of Tribal Historic Preservation Officers (NATHPO, a professional organisation for tribal government heritage officials) and met with the Native American Advisory Council of the Phoebe A. Hearst Museum in Berkeley. Two of our team members also have current or past heritage management employment with tribal governments. From such consultation and work on behalf of tribal governments, DINAA can play an important role in supporting the work of tribal government heritage officials (THPOs). Colonial displacements and forced migrations have meant that ancestral homelands can span multiple state boundaries. By making certain low-sensitivity data available, DINAA can provide tribal officials with information needed to manage ancestral territories better.

[…] cultural resources at low resolution within states whose data have been received thus far or are being delivered (n = 21). Dots do not refer to exact site locations, but to groups of five sites whose position has been randomly distributed within 20 × 20km grid cells. Ohio data are at county-level resolution.
DINAA records associate every site with its Smithsonian trinomial identifier (an administrative code assigned by the state) and with mapping, chronology and other metadata. To eliminate the risk of accidental or malicious disclosure of sensitive data, DINAA only stores and releases spatial coordinates at a reduced level of geographic precision. To prevent looting and vandalism, the Archaeological Resources Protection Act (ARPA, a US federal law) requires the protection of site location data. In discussion with State Historic Preservation Offices (SHPOs) and agency personnel, we arrived at a generally acceptable consensus on levels of spatial precision for public disclosure. In most cases, DINAA allocates sites to a 15-20km grid cell (Figure 3). This consensus grid applies to all states currently in DINAA, with the exception of Ohio, which required DINAA to reference sites to county centroids, rather than the general DINAA grid. Our negotiations over spatial data highlight how the understanding of risk among government officials plays a key role in shaping public records. We hope DINAA will provide more experience in ways to balance information security needs with information usability needs. As described in the next section, DINAA's current approach to spatial data still permits important research programmes and applications.
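The coordinate generalisation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DINAA's actual procedure: the function name `generalise`, the 0.2-degree cell size (roughly 20km at mid-latitudes) and the sample coordinates are all hypothetical.

```python
import math

CELL_DEG = 0.2  # assumed cell size in degrees; not DINAA's actual grid definition


def generalise(lat: float, lon: float, cell: float = CELL_DEG) -> tuple[float, float]:
    """Return the centre of the grid cell that contains (lat, lon).

    Every site falling in the same cell receives the same public coordinate,
    so the precise location cannot be recovered from the published record.
    """
    row = math.floor(lat / cell)
    col = math.floor(lon / cell)
    return (round((row + 0.5) * cell, 6), round((col + 0.5) * cell, 6))


# Two hypothetical sites several kilometres apart collapse to one public point.
site_a = generalise(35.9612, -83.9207)
site_b = generalise(35.9033, -83.8345)
print(site_a, site_b, site_a == site_b)
```

A county-centroid scheme, as required for Ohio, would replace the grid arithmetic with a lookup from county to a fixed centroid coordinate.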

Research applications of DINAA
In the face of rapid and far-reaching global changes, the concept of the 'Anthropocene', although contested, has intrigued scientists, cultural heritage professionals, policy makers and public communities. While variably defined, the Anthropocene refers to the period when human agency began measurably to alter climate and biota at regional, continental and global scales (Lane 2015;Lightfoot & Cuthrell 2015;Zalasiewicz et al. 2015). Whether or not the term has much explanatory value, public awareness of the Anthropocene helps to motivate the mobilisation of data. Part of the justification for requesting mass data dumps from state officials comes from research questions around large-scale human impacts. For instance, documenting the threat of sea-level rise to tens of thousands of sites known through DINAA demonstrates how unprecedented data access can reveal the vast scope of cultural preservation challenges resulting from accelerating 'Anthropocene' climate change ( Figure 4, Table 1; see Anderson et al. 2017).
From an information-management perspective, DINAA's greatest value centres on linked open-data applications; that is, the ability to participate in the growing body of related data shared openly via the web (Figure 5). The next section provides examples of how DINAA data are already being combined and visualised with other web-based data sources. In short, because DINAA publishes sites through Open Context, it benefits from Linked Open Data technologies that Open Context employs and continually develops. Open Context emphasises the use of stable web Uniform Resource Identifiers (URIs, i.e. stable URLs that serve as universally unique 'primary key' identifiers) to identify concepts and other entities so that they can be easily and precisely referenced and related across different data collections on the web. DINAA uses Open Context and the EZID service to mint persistent URIs for each site file record. In archaeology and historical geography, the 'site' is a key organisational entity. Minting stable web URIs and offering rich temporal, geographic and cultural metadata about sites will create significant linked open-data resources essential for broadly integrating museum, library and scientific datasets.

Archaeological linked data and DINAA
The DINAA project is not alone in developing information systems to integrate large-scale cultural heritage data. MEGA-Jordan (http://www.megajordan.org/), an early and ongoing effort, provided the basis for continued software development with Arches, an open-source heritage data management project (http://www.getty.edu/conservation/our_projects/field_projects/arches/). Arches mainly serves the needs of government administrators, and uses the CIDOC-CRM (International Council of Museums, International Committee for Documentation, Conceptual Reference Model; http://www.cidoc-crm.org/) as a standard ontology to organise data. As such, it is an excellent tool for creating standards-compliant cultural heritage data.
In contrast, DINAA publishes legacy data created in each state by government personnel (typically site file managers working under SHPO/state archaeologist/state professional archaeological council direction), and without reference to the CIDOC-CRM. Loading each dataset into DINAA involves what data managers describe as an 'ETL' (extract, transform, load) process. ETL processes migrate data from one database to another, often involving transformations in formats and data organisation. We have yet to encounter a state site file database with any sort of API (application program interface) that we could use to automate requests for data. For DINAA, the ETL process involves obtaining tabular data 'dumps' manually generated by the database managers of each state database. DINAA then redacts sensitive data and cleans inconsistencies (e.g. misspellings in controlled vocabularies and non-numeric text in numeric fields), and also converts geographic coordinates (reduced to a low-level of spatial resolution as discussed above) to the WGS-84 standard-as is common to web mapping services. The DINAA team also creates additional metadata, especially date ranges for different periods defined by a dataset. Finally, prior to publication online, the data from each state require modelling according to Open Context's general and highly abstracted database schema. While schema mappings can be reused, most of the ETL process needs to be repeated to add new records to DINAA, as state site file databases expand over time.
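The redaction and cleaning steps of such an ETL workflow might look like the following minimal sketch. The field names, the misspelling table and the sample rows are all hypothetical and do not reflect any real state site file schema; coordinate conversion to WGS-84 is omitted because it would require a projection library.

```python
# Hypothetical misspellings of controlled-vocabulary terms (illustrative only).
PERIOD_FIXES = {
    "woodlnd": "Woodland",
    "archiac": "Archaic",
    "missisippian": "Mississippian",
}
# Assumed names of sensitive fields that must never be published.
SENSITIVE_FIELDS = {"easting", "northing", "owner_name"}


def clean_record(raw: dict) -> dict:
    """Redact sensitive fields and normalise one row of a tabular data dump."""
    rec = {k: v for k, v in raw.items() if k not in SENSITIVE_FIELDS}

    # Repair known misspellings in a controlled-vocabulary field.
    period = rec.get("period", "").strip()
    rec["period"] = PERIOD_FIXES.get(period.lower(), period)

    # Coerce a numeric field; non-numeric text becomes None rather than a guess.
    try:
        rec["elevation_m"] = float(rec.get("elevation_m", ""))
    except (TypeError, ValueError):
        rec["elevation_m"] = None
    return rec


rows = [
    {"trinomial": "40DV12", "period": "Woodlnd ", "elevation_m": "152", "easting": "512345"},
    {"trinomial": "9CE4", "period": "Archaic", "elevation_m": "unknown", "northing": "3990000"},
]
cleaned = [clean_record(r) for r in rows]
for rec in cleaned:
    print(rec)
```

Recording unparseable values as `None`, rather than dropping the row, keeps the site in the index while flagging the field for later review.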
Open Context uses a very general and abstracted database schema in order to preserve the wide variety of attributes, relationships and vocabularies used in source datasets. Use of a more specific data organisational standard, such as the CIDOC-CRM, would require additional investments of time, effort and expertise in the ETL workflow. The DINAA project explored mapping data to the CIDOC-CRM, but we decided against it for practical and conceptual reasons (see discussion below). Our ETL workflow mints URIs for each entity, controlled vocabulary concept and descriptive attribute. For DINAA, this means Open Context mints a stable URI for each site file record. Those URIs can be used for Linked Data applications that cross-reference archaeological site information reported in disparate sources, such as the Federal Register, JSTOR publications, the National Register of Historic Places, the Canadian Archaeological Radiocarbon Database (CARD) and other databases. The following examples illustrate how DINAA's approach to linking and integrating archaeological datasets can facilitate research.

Example 1: mapping government heritage administration
The current DINAA dataset is cross-linked with the Federal Register, a US government outlet that provides notifications of decisions and other news relating to the administration of laws and regulations (http://ux.opencontext.org/2016/12/02/dinaa-and-the-federal-register/). Regulatory processes greatly impact and shape the practice of archaeology in the USA, and the Federal Register offers a key information resource for understanding governance of the archaeological past. DINAA allows archaeologically meaningful context to be added to Federal Register notifications via linked open-data methods. Archaeological sites in Federal Register notifications are typically listed with Smithsonian Trinomials. By themselves, these trinomials are just strings of letters and numbers with very little meaning. As DINAA curates Smithsonian Trinomial identifiers along with geospatial, chronological and other metadata, however, documents listing Smithsonian Trinomials can be matched with DINAA, thus adding rich spatial, chronological and other metadata to government documents. This added context helps make the Federal Register a more meaningful window onto how public agencies manage archaeological resources.
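Matching trinomials in a document against an index of site metadata can be illustrated with a short sketch. The regular expression is a deliberate simplification (state number, two- or three-letter county code, site number, optional hyphens), and the index entries and the notice text are invented for illustration; real trinomial usage varies more widely than this pattern allows.

```python
import re

# Simplified, assumed pattern for a Smithsonian Trinomial such as "40DV12".
TRINOMIAL = re.compile(r"\b(\d{1,2})-?([A-Za-z]{2,3})-?(\d{1,5})\b")

# Hypothetical index of site metadata keyed by normalised trinomial.
SITE_INDEX = {
    "40DV12": {"state": "Tennessee", "periods": ["Mississippian"]},
    "9CE4": {"state": "Georgia", "periods": ["Archaic", "Woodland"]},
}


def find_sites(text: str) -> list[tuple[str, dict]]:
    """Return (trinomial, metadata) pairs for every indexed site named in text."""
    hits = []
    for state, county, number in TRINOMIAL.findall(text):
        # Normalise case, hyphens and leading zeros before the lookup.
        key = f"{int(state)}{county.upper()}{int(number)}"
        if key in SITE_INDEX:
            hits.append((key, SITE_INDEX[key]))
    return hits


notice = "Items were removed from sites 40DV12 and 9-Ce-4; see also 15XX999."
for trinomial, meta in find_sites(notice):
    print(trinomial, meta["state"], meta["periods"])
```

Restricting results to identifiers present in the index filters out strings that merely resemble trinomials.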

Example 2: mapping publications
Since the 1960s, many researchers have published scholarly papers and books identifying historical and archaeological sites with Smithsonian Trinomial identifiers. The DINAA team recently text-mined the literature in JSTOR to find trinomials and associate them with DINAA records. Open Context describes links between DINAA sites and JSTOR articles that reference those sites with concepts from Dublin Core Terms, a widely used digital library standard. This linking drives map-based search-and-browse interfaces to discover site-related scholarly literature. DINAA can display heat-maps of this content to show where academic scholarship has focused, thus helping to illustrate the history of research. This highlights how text-mining and entity identification can enhance discovery and analysis of archaeology's publication record (see Jeffrey et al. 2009;Kintigh 2015). This exercise will guide future entity identification efforts beyond JSTOR, to encompass the HathiTrust (digitised books), reports in tDAR (a digital repository for North American archaeology) and other document archives that contain poorly catalogued grey literature reports.

Example 3: cross-referencing DINAA with external data sources
By matching Smithsonian Trinomials, we have established links between DINAA site file records and metadata records in other datasets and repositories. These include the Paleoindian Database of the Americas (PIDBA), the Eastern Woodlands Household Archaeological Data Project, the Federal Register and tDAR (Anderson et al. 2010, 2015; McManamon & Kintigh 2010; White 2014; Anderson & Miller 2017). Figure 6 shows all the sites in DINAA that are referenced by records or web resources held in all four of these external data repositories. This helps people to discover potentially useful research content in any online collection with related and linked material. Essentially, DINAA acts as a vast series of pegs on which to hang external content (Sheehan 2015). As data curators hang more information resources on each peg, the collective value and impact of networking data with DINAA will grow.
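Finding the sites referenced by every external repository amounts to a set intersection over shared identifiers. The sketch below assumes each repository's references have already been reduced to sets of trinomials; the identifiers and the abbreviation EWHADP are invented for illustration.

```python
from collections import Counter

# Hypothetical trinomials referenced by each external repository.
references = {
    "PIDBA": {"40DV12", "9CE4", "15JF10"},
    "EWHADP": {"40DV12", "15JF10", "33FR8"},
    "Federal Register": {"40DV12", "15JF10", "9CE4"},
    "tDAR": {"15JF10", "40DV12"},
}

# Sites referenced by all four repositories.
in_all = set.intersection(*references.values())

# How many repositories hang content on each 'peg'.
peg_counts = Counter(t for sites in references.values() for t in sites)

print(sorted(in_all))
print(peg_counts.most_common())
```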

Example 4: contributions to the PeriodO temporal gazetteer
As noted above, the CARD team has agreed to cross-reference with DINAA. This will help to integrate North America's settlement history more successfully with our understanding of absolute chronologies. Nevertheless, relative chronologies remain important in archaeology, and because of their methodological significance, relative chronologies need appropriate data modelling. The PeriodO project (http://perio.do/), an initiative to develop a gazetteer of scholarly definitions of temporal periods, represents another aspect of DINAA's contributions to linked open data. PeriodO provides web URIs and a common schema to model temporally and geographically scoped period concepts (Rabinowitz 2014; Shaw et al. 2016). ARIADNE (http://ariadne-infrastructure.eu/), a major European archaeological data integration effort, uses PeriodO as a framework for temporal interoperability. DINAA contributed over 700 period entities used to describe the chronologies of individual states to the initial PeriodO database. Thus, through PeriodO, DINAA's North American collections have some common modelling and cross-referencing with key European collections. As PeriodO develops aggregation features, Open Context will eventually use PeriodO to enhance temporal indexing of DINAA and other datasets.
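The idea of temporally and geographically scoped period concepts can be illustrated with a small data structure. This sketch is not PeriodO's actual schema: the class, its field names and the date ranges are assumptions chosen only to show how two authorities can define the same label differently and still be compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    label: str          # name used by the defining authority
    authority: str      # who asserted this definition
    spatial_scope: str  # where the definition applies
    start: int          # calendar year; negative values are BCE
    stop: int


def overlaps(a: Period, b: Period) -> bool:
    """True when the two date ranges share at least one year."""
    return a.start <= b.stop and b.start <= a.stop


# Invented definitions: the same label, scoped differently by two authorities.
state_a = Period("Late Woodland", "State A site files", "State A", 500, 1000)
state_b = Period("Late Woodland", "State B site files", "State B", 700, 1200)
archaic = Period("Early Archaic", "State A site files", "State A", -8000, -6000)

print(overlaps(state_a, state_b))
print(overlaps(state_a, archaic))
```

Keeping the authority and spatial scope on each definition lets differing chronologies coexist without forcing them into a single agreed scheme.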
Importantly, PeriodO does not demand agreement where agreement does not exist. As PeriodO uses a common schema to model periods, it can serve as a basis for interoperability between different chronological schemes defined by different institutions and scholars. Representatives of different communities can use PeriodO to author and model new and alternative periodisation schemes. In a colonial context such as North America, archaeological chronologies used by government agencies often do not reflect the needs or perspectives of Native American communities. Rich traditions from oral histories and other indigenous ways of knowing can more broadly contextualise a public dataset such as DINAA to make it more meaningful and valued among members of contemporary descendant communities (Cochran et al. 2008;Nicholas 2008). PeriodO can help open the door for indigenous community perspectives to reorganise the histories represented in DINAA site files. Participatory research in archaeology has allowed indigenous knowledge to be applied not only to tribal land and sites, but also to the existing archaeological record across the landscape of the USA. Known sites are increasingly visited and used by contemporary descendant groups. As memoranda of agreement and understanding are developed for site maintenance and historic resource management, the intersection of traditional knowledge and scientific archaeological data deepens. A unique opportunity exists for referential terms familiar to traditional communities-including those describing origin stories and/or historic events curated through oral history-to be linked to temporal or cultural terms already in use by state and federal agencies. Bridging these disparate conventions of naming can open new pathways for archaeological and indigenous knowledge systems to work in concert. Indigenous communities can reference and reclaim sites through the use of their own vocabularies.

Discussion: interpretation and integration
The examples above highlight how DINAA's approach to linked open data can facilitate resource discovery as well as some research applications. We recognise, however, that using the CIDOC-CRM to achieve more comprehensive data harmonisation would enable more sophisticated querying and analysis of aggregated data (Binding et al. 2008; May et al. 2015). But greater semantic harmonisation involves difficult challenges and levels of effort beyond the DINAA project's current scope. The legacy data sources aggregated by DINAA use informally defined concepts and schemas sometimes inconsistently (for United Kingdom and European archaeological data, see also May et al. 2015). Greater comprehensive semantic harmonisation would require much more effort to investigate and resolve the ambiguities and inconsistencies in the legacy data. Without such care, mapping to the CIDOC-CRM would only make data seem more harmonised and comparable than they really are. While we regard CIDOC-CRM mapping to be an important long-term goal, DINAA forgoes full semantic harmonisation for the immediate future. As argued by Isaksen et al. (2014), the Pelagios project demonstrates how linked open-data programmes can provide useful levels of data integration, even without reference to a complex ontology, such as the CIDOC-CRM (Kansa 2015). Leveraging the 'charm of weak semantics' (Baker & Sutton 2015) can still support efficient publication and the indexing of structured data. This approach has enabled Open Context to leverage bioinformatics ontologies and datasets (chiefly the Uberon ontology and the Encyclopedia of Life) to support research outcomes based on the integration of archaeozoological datasets (Kansa et al. 2014).

Networking data for sustainability in turbulent times
We have noted how public policy, including recent 'open-government' and 'open-science' reform movements, largely shapes the creation, accessibility and content of a dataset such as DINAA. Similarly, public policy will play an important role in shaping the future of these data. The sustainability of digital data management involves a host of practical, political, financial and institutional challenges (e.g. Kintigh & Altschul 2010). Charging for access to data created by public agencies and funded by taxpayers poses ethical problems and would undermine the public benefit of this project. As much government-created content carries no copyright restrictions, DINAA makes all data open access under a Creative Commons Zero (CC0) public domain dedication. We employ various complementary approaches to promote the sustainability of DINAA beyond the lifespan of this proposed project, including income from Open Context's research data management services, consulting and training services, and philanthropic donations. Beyond common digital preservation methods (e.g. open file formats, widely understood documentation metadata and support of digital repositories), we recognise that sustainability needs to extend beyond preservation of data created by this project. As a comprehensive map of US archaeological sites, and in its key role in linking diverse museum, archive and library collections, DINAA will play a central role in the stewardship of cultural heritage across North America. Once completed, DINAA will require ongoing maintenance, curation and updates as SHPOs register new sites, as well as organisational support and a governing body.
In recent years, federal agencies responsible for managing public lands have come under increasing political and financial pressure. The rise of the Trump administration highlights a long-term need for archaeologists (and other research communities) to prepare for risks to funding, the enforcement of conservation and historical protection laws and the integrity of public data. Archaeology, generally mandated by regulatory requirements, will suffer if enforcement of those requirements ceases. Granting bodies, such as the National Science Foundation and the National Endowment for the Humanities, as well as memory institutions (e.g. museums, libraries and digital archives) also face a more hostile political and financial climate. Government agencies maintain key information resources of archaeological significance, and these datastores can disappear with loss of funding. Beyond financing, the treatment of climate science and scientists by the current administration and its allies provides clear indications that it will seek to undermine freedom of inquiry and scholarly independence.
The political turmoil now experienced in the USA challenges existing assumptions about digital preservation and archiving. The Archaeology Data Service, a national archive for the United Kingdom, has provided a powerful and successful model to similar such efforts, including Digital Antiquity in the USA. The model of a centralised 'national repository', however, assumes that the nation will not fail nor become hostile to research and the preservation of research. The rise of far-right and neo-fascist (see Mammone 2016; Butler 2017) movements highlights under-recognised risks in conventional (i.e. centralised as opposed to distributed) approaches to digital preservation (Allen et al. 2017). The rapid changes to the US political system since the last Presidential election have already motivated another key digital repository, the Internet Archive, to start building a backup in Canada. The Internet Archive preserves broadcast news, websites and government information, including climate science data. Much of this content has become, or may become, highly politicised. Duplicating this information in Canada highlights how distributing research data across international borders can help guard against national political risks.
The Internet Archive's recent efforts show how decentralised and globally distributed strategies for digital preservation may better endure the rise of hostile national governments (see also Findlay 2015). With respect to DINAA, and Open Context more generally, we are taking steps to archive content in multiple institutions globally. In addition to archiving with the California Digital Library, we now have a fully mirrored instance of Open Context and DINAA hosted by the German Archaeological Institute. We are also archiving data with the Internet Archive so that it can be globally replicated along with other climate science and US government information. These initial steps towards distributed preservation, however, can be taken much further. Blockchains (see Miller et al. 2014;Findlay 2015), implemented by decentralised systems such as the Interplanetary Database (https://ipdb.foundation/), have the potential to help support archival and data management services across globally distributed networks. This content is relatively easy to preserve in distributed networks because the data are small (in a technical sense, although large in a thematic sense), encoded in open formats and have no security or intellectual property restrictions. These data can and should be used to experiment with new decentralised institutional and technical means for digital preservation. As discussed, linked open data provide pathways for different datasets, curated by different communities, to enrich and more broadly contextualise each other. We should explore ways to build better data preservation approaches into linked open data's globally distributed collaboration. Globally networked professional societies, library and museum 'memory institutions', and even enthusiast communities can use linked open data as a basis to discover and document relevant content to backup and secure. 
In this way, efforts such as DINAA can survive national political turbulence, and by networking related information together, may also facilitate the preservation of digital cultural information more broadly.

Conclusion
DINAA offers a case study in migrating key archaeological information from centralised government-controlled repositories to more distributed civil society networks. In making these data easily referenced on the web and accessible to different public stakeholders, DINAA provides a basis for integrating a host of other information in various government, museum and library contexts. These linked data efforts help explicitly to define relationships between disparate parts of the published and digitally curated archaeological record. Simultaneously, in building bridges with tribal heritage officials and in aggregating and internationally disseminating the data created by administrative offices, DINAA helps make these data more accessible to wider segments of civil society, and more resilient to local and national political upheavals. In doing so, DINAA highlights how 'bottom-up' and globally networked groups can help play an important role in safeguarding digital aspects of archaeology through turbulent times.