Frontiers of Biogeography Variance partitioning and spatial eigenvector analyses with large macroecological datasets

Macroecological data are usually structured in space, so taking into account spatial autocorrelation in regression and correlation analyses is essential for a better understanding of patterns and processes. Many methods are available to deal with spatial autocorrelation, but there are some difficulties when one is dealing with huge geographical extents and fine-scale data. So, we propose a relatively simple and fast computer-intensive approach to deal with Principal Coordinate of Neighbor Matrices (PNCM)/ Moran’s Eigenvector Mapping (MEM) analyses for large datasets, using global richness pattern of sharks as a model. We performed a variance partitioning approach by regressing species richness against environmental variables and spatial eigenvectors derived from PCNM. Due to the large number of ocean grid cells (> 9000), we ran the analyses 1000 timesby randomly subsampling each time 50 to 4500 cells and compared the distribution of the variance partitioning components, as well as the slopes of the environmental variables. We also estimated Moran’s I coefficients for regression residuals to check if spatial eigenvectors took into account spatial autocorrelation. Comparing statistics of analyses with different sample sizes, we note that although the environmental component increases linearly, other components (unique space and shared) of the most important variables stabilize with about 1000 cells, whereas all other smaller effects tend to stabilize between 2500 and 3000 cells. Besides that, PCNM eigenvectors were able to control spatial autocorrelation very well. We showed that shark richness patterns are strongly and positively correlated with temperature range, according to the well-known pattern of distribution for the taxon, and strong negatively correlated with oxygen supplies, which are higher in polar zones where ice acts as a barrier to sharks. Our approach clearly shows that it is possible to perform a robust evaluation of global patterns of diversity using eigenvector approaches based on a resampling strategy and allows effective computation of the variance partitioning even when dealing with large datasets.

• We found the smallest necessary sample size to represent the total database of shark richness, simplifying eigenvector selection for spatial analyses.
• We performed variance partitioning analyses using a large dataset of shark diversity across more than 9000 grid cells worldwide, using a common Notebook computer and within a feasible running time.
• By running 1000 replicates of the analyses for different sample sizes (50 to 4000 cells) and comparing the resulting distribution of estimated parameters, we showed that the patterns stabilize at around 1000 cells in the system analysed, corresponding to about 10% of the total dataset.
• Our new approach allows researchers to perform spatial analyses of large datasets in feasible time and with reduced computational efforts, achieving an accurate approximation of effects, testing, and searching for the stabilization of parameters of interest.

Abstract
Macroecological data are usually structured in space, so taking into account spatial autocorrelation in regression and correlation analyses is essential for a better understanding of patterns and processes. Many methods are available to deal with spatial autocorrelation, but there are some difficulties when one is dealing with huge geographical extents and fine-scale data. So, we propose a relatively simple and fast computer-intensive approach to deal with Principal Coordinate of Neighbor Matrices (PNCM)/ Moran's Eigenvector Mapping (MEM) analyses for large datasets, using global richness pattern of sharks as a model. We performed a variance partitioning approach by regressing species richness against environmental variables and spatial eigenvectors derived from PCNM. Due to the large number of ocean grid cells (> 9000), we ran the analyses 1000 timesby randomly subsampling each time 50 to 4500 cells and compared the distribution of the variance partitioning components, as well as the slopes of the environmental variables. We also estimated Moran's I coefficients for regression residuals to check if spatial eigenvectors took into account spatial autocorrelation. Comparing statistics of analyses with different sample sizes, we note that although the environmental component increases linearly, other components (unique space and shared) of the most important variables stabilize with about 1000 cells, whereas all other smaller effects tend to stabilize between 2500 and 3000 cells. Besides that, PCNM eigenvectors were able to control spatial autocorrelation very well. We showed that shark richness patterns are strongly and positively correlated with temperature range, according to the well-known pattern of distribution for the taxon, and strong negatively correlated with oxygen supplies, which are higher in polar zones where ice acts as a barrier to sharks. Our approach clearly shows that it is possible to perform a robust evaluation of global patterns of diversity using eigenvector approaches based on a resampling strategy and allows effective computation of the variance partitioning even when dealing with large datasets.

Introduction
For some twenty years or so, macroecologists and biogeographers have increasingly recognized that it is important to take into account spatial autocorrelation when inferring ecological and evolutionary processes based on statistical analyses of geographical patterns in species richness and other similar patterns, such as those generated under ecogeographical rules (Lennon 2000, Diniz-Filho et al. 2003. In principle, spatial autocorrelation affects the Type I errors of regression and correlation analyses (Legendre 1993, Lichstein et al. 2002. Still, many discussions have focused on how coefficients shift when applying spatial statistics to macroecological data due to collinearity, scale, and hierarchy problems (Lennon 2000, Diniz-Filho et al. 2003, Beale et al. 2007, Kuhn 2007, Dormann 2007, Hawkins et al. 2007). Moreover, as processes are structured in space, quantifying and describing spatial autocorrelation in macroecology and ecology can be viewed as a new general approach to understanding such processes and not necessarily just a problem to be removed (Legendre 1993). Many methods are now available to deal with spatial autocorrelation (i.e., Dormann et al. 2007, Bini et al. 2009), which can be implemented in specialized software such as SAM (Rangel et al. 2006(Rangel et al. , 2010 and, more recently, in many R-packages (e.g., spdep, Bivand and Wong 2018;gstat, Pebesma 2004;nlme, Pinheiro et al. 2020;adespatial, Dray et al. 2020).
Yet, with the development of new GIS facilities and the possibility to quickly generate both fine grain species and environmental data at large geographical extents, spatial analyses may become a challenge. It is notoriously difficult to deal with sample sizes (n) of thousands of spatial units, especially if it involves (as it frequently does) eigenanalyses of spatial connectivity or distance matrices (W matrices with n x n dimensions;see Haining 2002, Griffith 2003. Although it is possible to use parallel processing and intensive computing techniques to deal with some of these problems, this solution is not always available, and, in any case, it may still take a long time to run (which is a constraint if, for example, distinct W matrices should be tested; Kissling andCarl 2008, Bauman et al. 2018b). With some techniques such as Principal Coordinate of Neighbor Matrices (PNCMs; Legendre 2002, Dray et al. 2006) and subsequently developed but related approaches such as Moran's Eigenvector Mapping (MEMs; Dray et al. 2012), large matrices impose another additional challenge related to eigenvector selection to be used in the modeling (see Diniz-Filho and Bini 2005, Griffith and Peres-Neto 2006, Dray et al. 2012, Bauman et al. 2018a. Both techniques involve reconstructing spatial patterns using distance matrices (Euclidian distances for PCNMs or other types of weighting matrices for MEMs) by creating spatial eigenvectors that are newly composed variables that describe the spatial relationship among localities or sampling units. Thus, if these new variables are incorporated into the modeling process, they will take into account any spatial autocorrelation in the analyses and will be sufficient, if appropriately fitted, to correct for inflated Type I errors in the other parameters of the model (see Legendre 2002, Dray et al. 2006).
Here, we propose and evaluate a relatively fast and straightforward computer-intensive approach to deal with PCNM/MEM analyses for large datasets. The idea is simply to randomly sample the matrix thousands of times and evaluate the statistical distribution of parameters to be estimated. Although spatiallystructured resampling techniques have been used to deal with spatial autocorrelation by generating samples a given distance apart (eliminating the pseudoreplication;e.g., Hawkins et al. 2007, Oliveira et al. 2014, our goal here is a bit different. The problem to be solved is the difficulty of analyzing, using PCNM or MEM, a global data set with thousands of grid cells to estimate spatial and environmental components of richness patterns, and not to eliminate neighbor cells to avoid autocorrelation. As these methods work with eigenvectors from pairwise distance or connectivity matrices, the time and capacity for computation will increase more than exponentially with sample size (both because matrices will increase in size in a square scale and because of complexity in extracting eigenvalues iteratively). We tested our approach by analyzing global richness patterns of sharks and correlate them with environmental data.

Data
Primary data for our paper consist of range maps obtained from IUCN for a total of 469 species of sharks distributed worldwide (IUCN 2019), representing about 78% of the species of Neoselachii. Taxonomy was based on the GBIF Backbone Taxonomy 1 . The ranges were polygons, which were overlaid using a global grid with a total of 9212 cells with oceanic centroids (coastal cells with land centroids were removed from the dataset, so the proportion of land area in each cell depends on the shape of the coastal line), with approximately 2 o of latitude/longitude, using the R-package LetsR (Vilela and Villalobos 2015).
Eight environmental variables were obtained from the Bio-Oracle (Tyberghein et al. 2011) platform in the form of mean values and one variable in the form of range for the period between 2000 and 2014 and were also overlaid on the global grid. The variables used were mean temperature ( o C), temperature range ( o C), chlorophyll (mg.m -3 ), dissolved molecular oxygen (mol.m -3 ), sea ice concentration (fraction), pH, carbon phytoplankton biomass (umol.m -3 ), primary productivity (g.m -3 .day -1 ), and salinity (PSS).

Statistical analyses
We analyzed the data using a variance partitioning approach (Borcard et al. 1992, Legendre andLegendre 2013) by regressing species richness against environmental variables and spatial eigenvectors derived from PCNM. However, due to the large number of grid cells (> 9000), we performed the analyses 1000 times, by random sampling each time a given number of cells.
Then we compared the distribution of the variance partitioning components, as well as the effect sizes of the environmental variables. We computed the median effect size (i.e., t-value for each regression slope) and the proportion of simulations in which a significant (P < 0.05) regression slope for each sample size was found.
As richness is strongly skewed, we obtained the R 2 for variance partitioning with the glm of R-package, with a Poisson link function, but the approach proposed here can be used with any modeling technique (indeed, a more standard partition using an OLS based on logtransformed richness provide qualitatively similar results, although with a higher fraction for the environmental component). The particular type of model that better fits the data, in general, can be tested with a few preliminary tests (as we did with our dataset) or some diagnosis statistics (such as AIC-based comparisons between classes of models or statistics for residuals distributions) incorporated into the replications for "a posteriori" decisions.
Spatial eigenvectors derived from PCNMs were obtained using the pcnm function in the vegan R package (Oksanen et al. 2007) and selected, for computational simplicity, based on the correlation with the response variable (which is roughly equivalent to stepwise functions, as eigenvectors are independent;see Bauman et al. 2018a,b). In any case, we estimated Moran's I coefficients (see Legendre and Legendre 2013) for regression residuals to check if spatial eigenvectors were enough to account for spatial autocorrelation and thus allow correct evaluation of Type I error of the slopes.
For the record, we analyzed the data in a Notebook computer with Intel I-5 processor and 8GB RAM.

Results
The richness pattern of sharks (Fig. 1) tracks the well-known pattern for the taxon, with high richness concentrated in the coastal zones and middle latitudes of both hemispheres. Besides that, the Indo-Pacific islands and tropical Oceania exhibit high richness as well (Tittensor et al. 2010, Lucifora et al. 2011, Worm and Tittensor 2018. We provide detailed results for sample sizes equal to 100, 500, 1000, 2500, and 4500 cells, but detailed Tables with all results can be found in the Supplementary Material (Appendices S1 and S2). The variance partitioning for increasing sample sizes ( Table 1) shows that the environmental effect (a), which starts with a relatively high median value of about 0.22 when only 100 cells are sampled, goes up to a median of 0.06 when 4500 cells were sampled. This reduction in the unique environmental component is not correlated with an increase in both shared (b) and unique geographical components (c) that tend to stabilize with smaller sample sizes ( Fig. 2A).
However, although the unique environmental component in variance partitioning tends to increase linearly with sample size, the effect size of the most important environmental predictors in the full model (i.e., incorporating the selected spatial eigenvectors from PCNMs) was more or less stable across the sample sizes, above about 500 cells. As sample sizes increase, the median t-value of some variables tends to fluctuate, and the frequency of slopes with P < 0.05 tends to slightly increase, stabilizing when n = 2500 but not reaching values much higher than 80% of significant coefficients across simulations even with larger samples sizes. For the two variables with more important effects (i.e., temperature range and dissolved O 2 ), on the other hand, even with sample sizes smaller than 1000, more than 95% of the coefficients are significant (Fig. 2B). So, we understand that an increase in the unique environmental component is most likely due to an accumulation of minor effects that do not necessarily help to explain global patterns.
Just as a general reference, if we consider the stabilization of the main drivers and of the unique spatial component and its overlap with environmental variables, it is possible to see that, on average, the full model explains ca. 70% of the variation in species richness (CI 95% limits equal to 67.1% and 76.5%; Fig. 3A) and that environmental variables alone explains about 10% of this variation (CI 95% limits equal to 6.5% and 14.8%; Fig. 3B), without spatial  Table 2). The number of spatial eigenvectors from PCNMs selected for analyses increased approximately Table 1. Median values of the components a, b, and c for variance partitioning, modeling shark richness with a GLM as a function of environmental variables and spatial eigenvectors, and their 95% lower and upper limits, based on 1000 samples with variable number of cells (n), ranging from 100 to 4500. The three components refer to environmental, shared, and spatial effects, respectively. The analyses were based on data for 469 species and a full dataset global grid comprising 9212 cells.    exponentially with increasing sample sizes, to account for more complex spatial patterns involved (Appendix S3). Even so, Moran's I values suggest that, in all cases, spatial eigenvectors were enough to take autocorrelation into account, with upper 95% limits of the confidence intervals never reaching values higher than 0.1 (and this upper limit also decreases with increasing sample size). The analyses took from a few seconds (when n = 50) to about three days (n = 4500) to run (Appendix S3) using the script available in the supplementary material (Appendix S4).

Discussion
Our approach clearly shows that it is possible to perform a consistent evaluation of global patterns of diversity using eigenvector approaches based on a resampling strategy. All statistics tend to shift with increasing sample sizes (i.e., number of cells), but in our example, the main effects associated with temperature range and dissolved O2 tend to stabilize between 500 and 1000 cells. Our simple approach allows a straightforward computation of the variance partitioning and would be easily repeated several times, if necessary (i.e., testing different W-matrices or performing richness deconstruction). The analyses with 2500 cells took approximately twelve hours in a notebook with Intel I-5 processor, which is quite feasible (although the time for processing all 9,000 cells would increase more than exponentially and would in fact not be possible on this same machine). Although using more powerful computers would result in faster computation and allow us to deal with large sample sizes (and even with the full dataset), showing that sampling the full dataset would provide similar results is still interesting. It can be very useful to researchers with limited computational resources at their disposal in many places across the world. In addition, it allows more complex analyses and many more tests, including the complex process of eigenvector selection for multiple W-matrices, in feasible computer times.
In practice, it is important to highlight in advance that we are discussing the stability of PCNM and variance partitioning for a particular dataset regarding patterns of shark diversity worldwide. We are not stating, for example, that results of a global analysis with about 9,000 cells will stabilize when sampling 5% of the dataset. This will actually depend on the relative magnitude of the drivers. Thus, when applying the approach proposed here to a new dataset, it is important to perform some iterative and quick tests searching for the stabilization of the results, which will also depend on the computer capabilities available to each researcher. For instance, a nice alternative would be to check for mean effect sizes using a small number of replications for increasingly samples sizes and when a stable effect size is achieved, increase the replications to better evaluate the frequency of significant results.
Also, contrary to other studies that undertook spatially-structured sampling (i.e., considering only cells that are situated a given geographic distance apart) to heuristically understand autocorrelation effects on the analyses (i.e., Hawkins et al. 2007) or take them into account (in species distribution modeling; Oliveira et al. 2014), our goal is to reduce computational load. As in any variance partitioning approach, we also want to evaluate the magnitude of spatial effects on the response variable, both shared or independent of environmental variables.
In the context of the framework proposed here,, it is possible to assess the uncertainty of environmental variables due to spatial structure in data, so for example, with 500 cells, we can see that the 95% confidence interval ranges from 7.4% to 18.8%, clearly different from zero. Moreover, the effect sizes for individual predictors were also relatively stable, and the ones with the highest effects in our model are the temperature range and dissolved O2 ( Table 2). The median of the effect size (i.e., t-value) and the proportion of significant slopes seem to be a suitable way to synthesize the results across the replications. We understand that the median of the parameters would estimate these effect sizes adequately.
Of course, other strategies to reduce sample size and allow a more straightforward computation of spatial and environmental effects would be possible, such as separating data for biogeographical region or any other characteristic (i.e., coastal versus deep-sea cells). Also, another simple idea is to increase cell size so analysis of the overall dataset is viable, but this involves a more complex discussion (so we're assuming that cell size was adequately defined "a priori" given data type and resolution both in diversity and environmental predictors). Anyway, assuming that processes are similar across regions, a pooled estimate of factors would allow a similar interpretation in terms of global diversity patterns. Even so, in this case, it would be necessary to decide first if a global set of eigenvectors would be extracted from a very large global W-matrix (a potential problem, especially if different W matrices should be considered) and then incorporated into "regional" variance partitioning. It would be possible to extract the eigenvectors for each region, but in this case, it may be difficult in principle to directly compare the spatial component of these partitions (although if autocorrelation is correctly taken into account by PCNM, as evaluated by residuals Moran's I, coefficients of the explanatory variables would be comparable). This seems to be in some way analogous to the problem of selecting eigenvectors in PCNMs and MEMs independently or not of the environmental variables (i.e., Safi and Pettorelli 2010).
We recognize that if patterns in different regions are driven by different processes, regional analyses may not be entirely informative about global spatial patterns. In this case, rather than doing separate PCNMs for each region it would be more interesting to use Geographically Weighted Regression (GWR; Fotheringham et al. 2002)  solve the problem of regionalization by exploring the effects of environmental variation in a continuous way, mapping the coefficients of regressions or path analysis for all cells in the grid, but for each one there is a vector of weights based on the spatial distances from the focal cell to all other cells. Although these analyses are undoubtedly interesting and reveal more complex patterns, the goals of such approaches would be different, focused on local or regional effects, and not in a global analysis of richness patterns. Also, such analyses would not allow the evaluation of the relative impact of spatial structure and its overlap with environmental effects, which may potentially reveal other spatially-structured unknown variables driving patterns in species richness (although in principle the coefficients of the environmental variables would be, on average, accurate in respect to spatial autocorrelation, despite that they will vary in space). Although our goal in this paper is just to show the application of a simple approach to undertake variance partitioning in a large dataset, it is opportune to realize that it reveals that richness patterns in sharks at a global scale are explained by variables with well-known effects, such as temperature range and dissolved O 2 . Higher temperature ranges are found mostly in temperate zones of oceans associated with convergence richer areas in resources attracting predators like sharks (Hyrenbach et al. 2000). Although top predators with big body size such as sharks need high levels of oxygen for metabolism and have been associated with areas of higher oxygen supplies (as pointed out by Blank et al. 2002), our results showed a strong negative relation between richness and dissolved oxygen because the biggest supply of dissolved oxygen in oceans are found in polar zones, where the ice acts as a physical and even physiological barrier.
In summary, our approach allows feasible spatial analyses with large datasets without taking into account biases due to autocorrelation but rather to incorporate spatial structure in the data (as in the variance partitioning approach), with the advantage of allowing a better evaluation of the distribution of parameters, their correlation structure, and uncertainty. In our study case, as sample sizes around 1000 cells provide an accurate approximation of effects, it is feasible to use the approach proposed here to explore more complex spatial structures or different deconstruction of richness patterns.