Assessing the uncertainties of model estimates of primary productivity in the tropical Paci ﬁ c Ocean

Depth-integrated primary productivity (PP) estimates obtained from satellite ocean color-based models (SatPPMs) and those generated from biogeochemical ocean general circulation models (BOGCMs) represent a key resource for biogeochemical and ecological studies at global as well as regional scales. Calibration and validation of these PP models are not straightforward, however, andcomparativestudiesshowlargedifferencesbetweenmodelestimates.Thegoalofthispaperis to compare PP estimates obtained from 30 different models (21 SatPPMs and 9 BOGCMs) to a tropical Paci ﬁ c PP database consisting of ∼ 1000 14 C measurements spanning more than a decade (1983 – 1996).Primary was not a function of model complexity or type (i.e. SatPPM vs. BOGCM); nearly all models underestimatedtheobservedvarianceofPP,speci callyyieldingtoofewlowPP( values;morethanhalfofthetotalroot-mean-squaredmodel – datadifferencesassociatedwiththe satellite-basedPPmodelsmightbeaccountedforbyuncertaintiesintheinputvariablesand/orthe PP data; and the tropical Paci ﬁ c database captures a broad scale shift from low biomass-normalized productivity in the 1980s to higher biomass-normalized productivity in the 1990s, which was not successfully captured by any of the models. This latter result suggests that interdecadal and global changes will be a signi ﬁ cant challenge for both SatPPMs and BOGCMs. Finally, average root-mean-squared differences between in situ PP data on the equator at 140°W and PP estimates from the satellite-based productivity models were 58% lower than analogous values computed in a previous PP model comparison 6 years ago. The success of these types of comparison exercises is illustrated by the continual modi ﬁ cation and improvement of the participating models and the resulting increase in model skill.


Introduction
Marine primary productivity is a large and highly variable component of the global carbon cycle and drives ocean biogeochemical cycles of major chemical elements such as silicon, nitrogen and phosphorus. From an ecological standpoint, primary productivity provides the upper bound for production at higher trophic levels and defines ecosystem carrying capacity, a key factor for the design of marine protected areas. Awareness of bottom-up forcing even to understand the dynamics of the upper trophic levels that form the most important commercial fisheries has increased (Pauly and Christensen, 1995;Ware and Thomson, 2005). Calculating accurate primary productivity (PP) estimates over large areas is thus a primary step for ecosystem models charged with the task of assessing trophic dynamics. Reliable estimates of PP are also necessary for multiple other applications, including quantifying the flux of carbon dioxide (e.g. Bianchi et al., 2005), assessing export production (e.g. Boyd and Trull, 2007) and estimating production of climate-active gases such as dimethyl sulfide (e.g. Larsen, 2005). Estimating accurate PP on global scales is also essential to understanding the consequences of climate change on phytoplankton growth .
Although the broad spatial and temporal patterns of productivity were elucidated on the basis of compilations of in situ measurements (Koblentz-Mishke et al., 1970;Berger, 1989), global dynamics cannot be quantified without the quasi-synoptic view afforded by satellites. Consequently, there have been many efforts to develop models that use satellite-derived information (e.g. surface chlorophyll, chl 0 ; sea surface temperature, SST; and photosynthetically available radiation, PAR) to estimate PP (SatPPMs hereafter; these and other acronyms are defined in Table 1) (Platt and Sathyendranath, 1993;Behrenfeld and Falkowski, 1997a).
Although there is considerable understanding of the photosynthetic process and knowledge of the ocean optics that determine ocean color signals, SatPPMs often have limited success reproducing the observed variability of PP data (Siegel et al., 2001;McClain et al., 2002).
Another approach to quantify global patterns of photosynthesis is to use coupled biogeochemical ocean general circulation models (BOGCMs), which, as a result of increased computational resources, can now be run globally at adequate horizontal and vertical resolution. While these models parameterize photosynthesis similarly to SatPPMs, BOGCMs additionally have explicit compartments for different nutrients, detritus, and one or more functional or size groups of phytoplankton and zooplankton and incorporate mechanistic knowledge of nutrient uptake and physical transport of biomass. Whereas SatPPMs require surface chlorophyll and temperature as input variables, BOGCMs explicitly compute these fields. Although surface chlorophyll fields can be assimilated into BOGCMs (Friedrichs, 2002;Tjiputra et al., 2007;Gregg et al., 2009-this volume), the computational cost of doing so is high, and can entail reductions in horizontal and vertical grid resolution. While diverse approaches for estimating productivity from ocean color and with coupled ecosystem general circulation models are desirable as this field grows and develops, it is extremely important to quantify the performance of these various methods relative to observations and to elucidate the reasons underlying the similarities/differences in model output. To this end, the Primary Productivity Algorithm Round Robin (PPARR) series has provided a context in which the performance of primary productivity models can be quantified. In addition, the PPARR exercises continue to help model developers and users understand the conditions under which each productivity model is most applicable (a summary of past PPARRs is given in Section 2). For any model, a vital element of model skill is the ability to reproduce in situ observations; in the case of PP models, measurements of PP. If observations are representative and the data have undergone careful quality control, firm conclusions can be reached regarding the environmental conditions that challenge model skill. These challenging conditions, in turn, can be taken into account by model developers and end-users to improve model formulation and/or application.
This paper compares output of models of productivity with a large quality-controlled database (ClimPP) spanning more than a decade in the tropical Pacific Ocean. Since tropical and oligotrophic regions cover a large percentage of ocean surface area, it is critical that our ability to model them is improved. The size, temporal and spatial range, and consistent quality of this database help us meet the data requirements mentioned above for data-model intercomparisons. Although it is not possible to make conclusive statements concerning global model performance based on a single regional comparison, improving PP estimates in the tropical Pacific will increase the skill of global models because (1) this region represents a large fraction of the global ocean, and (2) this region represents one of the greatest current challenges to PP modelers (Campbell et al., 2002;McClain et al., 2002).
The goal of this effort is thus to compare the skill of various models (including both SatPPMs and BOGCMS) in estimating PP within the tropical Pacific Ocean. Assessing the overall skill of the BOGCMs is far beyond the scope of this paper; here, only the skill of the BOGCMs in estimating PP is assessed. Additionally, we note up front that the identification of a single best satellite-based PP model, or consensus model, is not the goal of the PPARR exercises. Here we implement a variety of different skill assessment methodologies, and identify which models perform well according to each criterion. These comparative results are relevant for those who wish to choose a single PP model to implement for a given study. In addition, we highlight specific problems that tend to characterize the current generation of PP models. This information is of use to PP model developers, as they continue to adjust and improve their model formulations.
The background of the PPARR exercises is described in the next section (Section 2), and is followed by an introduction to the observational dataset and the methodologies employed to assess model performance (Section 3). Skill assessment results are presented using Taylor and target diagrams as well as cumulative distribution functions. The impact of uncertainties in the input variables and in situ PP measurements are also quantified with error perturbation and principal component analyses (Section 4). Correlations between model errors and environmental variables are then discussed (Section 5), and we conclude with a summary in Section 6.

PPARR/PPARR2
For over a decade, NASA has supported research aiming to improve our ability to quantify marine photosynthesis from satellites in the form of a series of round robin experiments for evaluation and comparison. The first two Primary Productivity Algorithm Round Robin (PPARR) exercises used in situ measurements of PP to quantify the ability of participating models to predict PP based on information accessible via remote sensing. While PPARR1 tested the approach with data from 25 stations, PPARR2 compared the output of twelve models using data from 89 stations with wide geographic coverage (Campbell et al., 2002). The models that performed best were within a factor of 2.4 (based on one standard deviation in log-difference errors) of the in situ 14 C measurements. Of the eight regions represented in the comparison, the most serious biases were found in the equatorial Pacific, where all algorithms underestimated in situ measurements by more than a factor of 2. If biases, which in all cases contributed significantly to absolute model-data misfit, could be corrected, then ten of the twelve models were within a factor of two of the in situ data. Model-data misfit was lowest in regions that have historically contributed the most data for parameterization, i.e. the Atlantic, whereas misfit was high in both the equatorial Pacific and the Southern Ocean.

PPARR3, phases 1 and 2
Phases 1 and 2 of the third primary production algorithm round robin consisted of an intercomparison and sensitivity study of global primary productivity fields computed from 24 SatPPMs and 7 BOGCMs, but included no comparison with in situ PP measurements . The participating models, which are not the same suite as those participating in the current comparison, were provided with common input fields of PAR, SST, chl 0 , and mixed layer depth (MLD), and produced global average PP fields for 8 months of 1998 and 1999. Maximum PP within the Atlantic and Pacific Oceans occurred in the equatorial band (10°S-10°N; N0.5 g C m − 2 d − 1 ). The simulated global average PP varied by a factor of two between models, with model results diverging most at low surface temperatures (b10°C), at high chlorophyll concentrations (N1 mg chl m − 3 ), and within HNLC regions. The models were grouped based on pair-wise correlations and the model clustering was independent of model complexity and primarily depended on the form in which temperature was parameterized within the model.

Current PPARR methodology
In the earlier phases of PPARR3, global productivity fields computed from participating PP models were compared among themselves and also with a 'mean-model' productivity field . This third and final phase of PPARR3 differs significantly from the Carr et al. (2006) study in that here simulated productivity fields were compared with shipboard measurements of 14 C uptake, as in PPARR1 and PPARR2. This effort expands significantly on PPARR2, in which only a small number of models participated (ten versus thirty in this study), only a small dataset was used (b90 stations versus N900 stations in this study), participation was almost exclusively from the USA, BOGCMs were not invited to participate, and only root mean square error was presented. We note that the present comparison exercise does not attempt to assess the overall skill of the participating biogeochemical models (Matsumoto et al., 2004;Doney et al., 2004;Najjar et al., 2007;Friedrichs et al., 2006Friedrichs et al., , 2007. Instead, our goal is to compare the skill of the BOGCMs in estimating PP, allowing relative comparison with that of the SatPPMs. To our knowledge, a PP comparison on the scale of this current exercise (21 SatPPMs and 9 BOGCMs) has not heretofore been conducted.

ClimPP dataset
The PP dataset used here (ClimPP; available as an electronic supplement to this paper) was prepared by Barber and collaborators. The dataset was not made available to the model participants, and only a small subset of the ClimPP data were publicly available, e.g. the data from the Joint Global Ocean Flux Study equatorial Pacific Process study (Barber et al., 1996) which makes up only a small subset of the ClimPP database.
The ClimPP database comprises ∼1000 stations ( Fig. 1) in the tropical Pacific between 15°S and 15°N Barber and Chavez,1991;Le Borgne et al., 2002). Collected between 1983 and 1996, these data consist of chlorophyll and productivity profiles acquired from 31 U.S. and international research cruises. Productivity values were integrated to the 1% light depth. Integrations to the 0.1% light depth were on average only ∼3.5% higher than those integrated to the 1% light depth, and the portion of the photosynthetic profile from the 1% light depth to the 0.1% light depth showed very little spatial or temporal variation (Barber et al., 2001).
On-deck PP incubations were carried out in seawater-cooled Plexiglas incubator boxes. Light in the incubators was attenu-ated to varying percentages of the surface light field, using neutral density screening and blue Plexiglas. Trace-metal clean procedures (Sanderson et al., 1995) were used in all data collected. Specifically, the black rubber closing springs of the Niskin or Go-Flo bottles that were found to be a major source of trace metal inhibition were replaced with Teflon coated metal springs. These retrofitted bottles were mounted on a Teflon coated rosette Sanderson et al., 1995;Barber et al., 1996).
The largest source of variation in estimating PP is the process of determining the attenuation of light and of assigning a depth to each percent light level . Barber et al. (1997) have outlined a procedure that uses the Morel (1988) model to estimate the PAR attenuation coefficient and assigns light depths using observed chlorophyll profiles. This method removes important sources of variation due to project, ship or investigator, and has proven to be an efficient and accurate way to reprocess historical on deck incubation data to make basinwide and/or interannual and decadal comparisons (Le Borgne et al., 2002).
Both oligotrophic and mesotrophic areas are represented within this dataset. Productivities integrated to the 1% light level range from ∼0.05 g C m − 2 d − 1 in the western Pacific warm pool, to nearly 1.80 g C m − 2 d − 1 in the HNLC region (Fig. 2a). Thirty-six percent of the data are from the 1990s, with the remainder having been collected in the 1980s. During each of these decades, data are available from all portions of the tropical Pacific domain: 15°N to 15°S and 125°E to 95°W (Section 5). These data thus provide more than a decade of comparable observations for a region of the ocean where interannual and interdecadal variability is relatively well studied (McPhaden and Zhang,1999).

Participating models
The 30 models used to generate PP estimates for comparison (Table 2; Appendix A) include 21 SatPPMs and 9 BOGCMs. Of the SatPPMs, the first twelve (Models 1-12) represent examples of depth-integrated, wavelength-integrated (DI/WI) models. Five of these models (#8-12) are variants of the Vertically Generalized Productivity Algorithm (VGPM; Behrenfeld and Falkowski, 1997b). The next three models (#13-15) are depth-resolved but wavelength-integrated (DR/WI), and the remaining six (Models 16-21) are the most complex, resolving both depth and wavelength (DR/WR).
The simplest model (#1) estimates integrated primary productivity (in units of g C m − 2 d − 1 ) as the square root of surface chlorophyll (in units of mg chl m − 3 ). This purely empirical relationship was suggested by Eppley et al. (1985) and serves as a reference model to illustrate the role of chlorophyll in PP estimation. In PPARR3 Phase 1, PP estimated by this model had far lower temporal variability than the other models, but time averages were not that different from those of other models, except in areas of extreme light or temperature conditions . All 21 SatPPMs require chl 0 fields, Models 2-21 use PAR in their calculation, 15 of the 21 models use SST, and only seven models use MLD ( Table 2).
The nine BOGCMs participating in this intercomparison vary widely, though each has between two and four  Chai BOGCM Chai et al. (2003) a DI = depth-integrated, DR = depth-resolved, WI = wavelength-integrated, WR = wavelength-resolved. phytoplankton functional groups, and each model contains either three or four nutrients. All BOGCMs except Model 29 provide interannual, rather than climatological, PP distributions. Although one BOGCM (#27) assimilates in situ National Ocean Data Center chlorophyll data, none of the BOGCMs utilize any of the provided input fields (Table 2). All 30 models provided estimates of productivity integrated to the 1% light level, for each of the ∼1000 ClimPP data points. Details regarding the structure of all 30 participating models are beyond the scope of this paper, but key features of these models are provided in Table 2 and Appendix A. In addition, a forthcoming paper concentrates on identifying specific similarities and differences between model parameterizations, which may be affecting model performance (Saba et al., in prep.).

Input data variables for SatPPMs
Although none of the models were provided with the PP data ( Fig. 2a), the SatPPMs were provided with four types of input variables: daily mean chl 0 , PAR, SST, and MLD ( Fig. 2b-e). These data, along with the integrated productivities from the ClimPP dataset are included in this paper as an electronic supplement. As in PPARR2, chl 0 was obtained from in situ data, rather than remotely sensed data as in PPARR3 Phase 1. Although chlorophyll profiles were available for each station, only the surface values were provided to the modelers. The profiles were used to estimate the depth of maximum chlorophyll (Section 5), but not provided to the participants. NCEP/NCAR Reanalysis 1 data for daily mean downward solar radiation flux data were downloaded from the National Oceanic & Atmospheric Administration website for the Physical Sciences Division of their Earth System Research Laboratory. These NCEP/NCAR reanalysis data extend from 1948 until the present and are obtained from a state-of-theart data assimilative analysis/forecast system. A factor of 0.43 was used to convert these daily shortwave fluxes into PAR and produced values that were in relatively good agreement (±10 mol quanta m − 2 d − 1 ) with PAR measurements obtained on the Joint Global Ocean Flux Study equatorial Pacific Process Study cruises. SST was obtained from the Advanced Very High Resolution Radiometer. MLD was obtained from a reducedgravity, primitive equation tropical Pacific Ocean model (Murtugudde et al., 1996;Murtugudde and Busalacchi, 1998) with a variable depth mixed layer overlying 19 sigma layers. In this model, mixed layer thickness is determined using a "hybrid" mixed layer model (Chen et al., 1994) that considers both wind stirring and shear instability, and produces interannual MLD estimates that are generally within ±20 m of those derived from measurements collected on the Joint Global Ocean Flux Study equatorial Pacific Process Study cruises.
Two models (e.g. Models 7 and 21) require information on particulate backscatter at 443 nm (bbp), which is estimated from ocean color remote sensing data (Maritorena et al., 2002). Since these data were not available for this pre-SeaWiFS era ClimPP dataset, monthly 9 km climatological bbp values (Sep. 1997-Sep. 2007) were used for each given sample location.

Skill assessment strategies
The skill of the 30 participating models was assessed using several strategies outlined below (also see Stow et al., 2009-this volume). Campbell et al. (2002) concluded in PPARR2 that the total root mean square difference (RMSD) summed over the N data points provides a valuable comparison of PP models Here model-data misfit in log 10 space (Δ) is defined as: where PP m (i) is modeled PP, obtained either via SatPPMs or BOGCMs, and PP d (i) represents the ClimPP data. RMSD is composed of two components, the bias (B) representing the difference between the means of the two fields, and the centered pattern RMSD (RMSD CP ; sometimes referred to as the unbiased RMSD) representing the differences in the variability of the two fields: The bias and RMSD CP (Tables 3 and 4) provide measures of how well the mean and variability are modeled, respectively: Because the units of the above quantities are in decades of log 10 and not easily translated, a non-dimensional inverse transformed value for bias is presented: where F med (Campbell et al., 2002) is the median value of the ratio PPm i ð Þ PP d i ð Þ ¼ 10 Δ i (Tables 3 and 4). Thus, if F med = 2.0, the median value of PP m (i) is a factor of two larger than the median value of PP d (i); if F med = 0.5, the median value of PP m (i) is a factor of two smaller than PP d (i).
The novel target diagram (Jolliff et al., 2009-this volume) is used to visualize bias, RMSD CP , and total RMSD for the 30 models on a single plot. On the target diagram, these quantities are normalized by the standard deviation of logPP d (σ d = 0.279), i.e. normalized bias (B⁎) is defined as: Although RMSD CP is inherently a positive quantity, normalized RMSD CP (denoted as RMSD CP ⁎) is defined as: where σ m is the standard deviation of logPP m , and thus RMSD CP ⁎ can be either positive (indicating that the model is overestimating the variance of the data) or negative (indicating that the model is underestimating the variance of the data). It is important to note here that when the total RMSD statistic is computed, models that underestimate the observed variance tend to result in lower total RMSD scores than those that overestimate the variance (Jolliff et al., 2009-this volume). For a given value of RMSD CP , a portion of the model-data misfit will result from phase differences between the simulated and observed fields, and a portion will result from differences between the amplitudes of the variations. Thus in addition to RMSD CP values, correlation coefficients and variances are also computed (Tables 3 and 4) to understand the similarities and differences in model-data fit for each of the models participating in this comparison. In Section 4.2 Taylor diagrams (Taylor, 2001) are used to represent RMSD CP , correlation, and standard deviation on a single plot.
The cumulative distribution function represents an alternative way to visualize model bias. Although B provides a succinct measure of the magnitude and sign of model bias, from this statistic alone it is not possible to determine whether positive biases result from overestimating high values, or low values, or both. A comparison of model and data cumulative distribution functions clearly reveals where in the spectrum of values the biases occur, and is an excellent method for visualizing median, maximum and minimum values.
In a comparison exercise such as this, it is critical to acknowledge that uncertainties exist in the input variables provided to the SatPPMs, which may affect the ability of the models to adequately reproduce the 14 C PP measurements. Furthermore, the ClimPP data themselves are associated with a level of uncertainty and need to be regarded as ranges rather than as exact values. To examine the effect of uncertainties in the input variables on modeled PP, an error perturbation analysis was conducted in which PP from each participating SatPPM was recalculated with individual and simultaneous perturbations of the four input variables. Each perturbation represented addition or subtraction of the expected level of observational uncertainty associated with that variable. Comparison between SST, PAR and MLD measured on the Joint Global Ocean Flux Study equatorial Pacific Process Study cruises (a subset of the ClimPP database) and the input fields described above (Section 3.3) revealed that the following uncertainties were reasonable for these quantities: SST  Fmed and correlation are non-dimensional; remaining statistics are in units of decades of log 10 . Standard deviation of ClimPP dataset is 0.28. In addition, the absolute means (± one standard error) of these statistics are reported for various model subsets. uncertainty = ±1°C, PAR uncertainty = ±10 mol quanta m − 2 d − 1 , and MLD uncertainty = ± 20 m. Uncertainty for chl 0 was estimated (Barber, pers. comm.) to range between ±50% for the minimum chlorophyll value, and ±15% for the maximum chlorophyll value as a linear function of log(chl 0 ). This resulted in chl 0 uncertainties ranging between ±0.01 and 0.11 mg chl m − 3 . Because the four input variables each have three possible values (original value, original value +uncertainty, original value − uncertainty) there are a total of 3 4 = 81 perturbations. The error perturbation analysis was conducted on these 81 perturbations of each of the ∼1000 data points. The single perturbation yielding the lowest RMSD for each data point was then selected for an additional 'best-case' scenario assessment. This 'best-case' scenario RMSD was computed using a range of PP values based on the uncertainty in the ClimPP observations, which was assumed to vary as a linear function of log(PP), between 30% of PP for the minimum observed productivities, and 10% of PP for PP greater than 0.7 g C m − 2 d − 1 . This corresponds to uncertainties ranging from ±0.02 to ±0.07 g C m − 2 d − 1 , with a mean uncertainty of ±0.05 g C m − 2 d − 1 (the mean of logPP-log[PPuncertainty]=0.073). These uncertainty levels, which likely underestimate actual measurement errors, are only used in the perturbation analysis to identify the 'best-case' scenario total RMSD (Section 4.4), but not in the statistics described in Tables 3 and 4.
In order to further quantify how the observational uncertainties described above affect our assessment of model skill, a principal component analysis (PCA) was performed. The PCA was conducted on the reductions in RMSD obtained from the 21 SatPPMs after (1) considering the uncertainties associated with the ClimPP data, (2) perturbing each input variable individually according to its respective uncertainty and (3) considering the uncertainties in the ClimPP data as well as all the input data fields simultaneously. This PCA ordination reveals how each model responds to the accuracy criteria that reflect the uncertainty in field data due to sampling and analytical errors.

Model skill comparison results
Bias, F med , RMSD CP , total RMSD, correlation and standard deviation were computed for each of the 30 participating models and are reported in Table 3. Mean values ± one standard error are included for all 30 models, as well as for each model type (DI/WI, DR/WI, DR/WR, SatPPM, and BOGCM). These results are discussed below and presented in a variety of graphical representations.

Target diagram
Following Jolliff et al. (2009-this volume) the bias and RMSD CP of the 30 participating models were normalized by the standard deviation of the log-transformed ClimPP data (σ d ), and plotted on a target diagram (Fig. 3), where, as defined by Eq. (2), the distance from the origin to each symbol represents the total normalized RMSD: On this plot the inner dashed circle denotes the normalized observational PP uncertainty (RMSD⁎ = 0.073/σ d = 0.26). Thus the skill of any models yielding points falling within this inner circle (the bulls-eye) would be indistinguishable from each other. This, however, is a difficult goal to achieve for patchy quantities such as ocean productivity; in this analysis no points are located within this circle.
Another useful way to characterize the results of multiple models on the target diagram (Jolliff et al., 2009-this volume) is to examine whether or not the standard deviation of the observations exceeds the total RMSD for each model. If it does not, i.e. if RMSD⁎ N 1, then the model is showing less skill (in terms of total RMSD) than does the mean of the observations.  6)) and RMSD CP ⁎ (Eq. (7)) for the 30 participating models relative to the ClimPP database. Concentric circles represent RMSD⁎ isolines: the inner dashed circle represents the normalized observational PP uncertainty, and the outer solid circle represents the normalized deviation of the productivity data. Blue symbols = DI/WI models (darker blue are subset of VGPM variants), red = DR/WI, green = DR/WR, and pink = BOGCMs. Model results falling inside the outer circle, i.e. those with total RMSD values that are less than the standard deviation of the observations, tend to provide a better instantaneous estimate of productivity than the mean of the observations. In this analysis, eleven of the 30 models (37%) fell within this outer circle (Fig. 3). These include 42% of the DI/WI models, 67% of the DR/WI models, 33% of the DR/WR models and 22% of the BOGCMs. The model with the lowest normalized total RMSD (RMSD⁎ = 0.82) is Model 18, a DR/WR model. Four models have nearly equally low RMSD: two VGPM variants (Models 8 and 11), a depth-resolved model (Model 13) and the simplistic sqrt(chl 0 ) relationship (Model 1). The performance of the latter model, which does not use PAR, SST or MLD data, illustrates the importance of chl 0 in the estimation of integrated productivity.
The target diagram (Fig. 3) also illustrates that most (23 out of 30) models overestimated observed productivity (B⁎ N 0), whereas only 23% (7 out of 30) models underestimated productivity. Three models (one SatPPM #19, and two BOGCMs #25, 28) had positive median biases greater than two (F med N 2.0; B N 0.3; Table 3), i.e. in these cases the median modeled value was more than twice the median observed value. Nearly half of the BOGCMs (four of nine) were associated with absolute biases greater than a factor of 1.7 (F med N 1.7 or F med b 0.59), whereas only two of the 21 SatPPMs, both DR/WR models, were associated with biases of this magnitude. Interestingly, the overall magnitude of the mean bias was lower for the DI/WI models (0.10 ± 0.02) than for the DR/WR models (0.17 ± 0.04) and the BOGCMs (0.19 ± 0.02).
The centered pattern RMSD (RMSD CP ) provides a measure of how well the variability of a certain field is being modeled. Whereas bias for these models varied substantially, the magnitude of RMSD CP diverged much less among models (Table 3), ranging only from |RMSD CP ⁎| = 0.84 to 1.13 (Fig. 3). The RMSD CP was slightly, but not significantly, higher for the BOGCMs (j¯RMSD ⁎ CP j = 0.95 ± 0.04) than for the SatPPMs (j¯RMSD ⁎ CP j = 0.90 ± 0.02). There was also little significant difference in j¯RMSD ⁎ CP j for the different categories of SatPPMs. All but four models (three BOGCMS and one VGPM variant) underestimated the variance of observed productivity (RMSD CP ⁎ b 0).

Taylor diagram
Taylor diagrams (Taylor, 2001) complement the target diagrams described above by representing the RMSD CP , the correlation, and the standard deviation of each of the 30 PP models with respect to the ClimPP dataset (Fig. 4). Unlike the target diagram, the Taylor diagram presents no information on bias, however it has the advantage of being able to illustrate more detailed information pertaining to the difference in variability associated with the modeled and observed fields (Section 3.4).
In the Taylor diagram (Fig. 4), the distance between the black data symbol on the x-axis and each colored model symbol represents RMSD CP . This RMSD CP is composed of a component relating to the variance of the model estimates (the distance from the origin is the standard deviation) and a component relating to the correlation between the observations and model estimates (the azimuth angle). Although the RMSD CP for each model has already been shown on the target diagram (Fig. 3), it is not possible to determine from that diagram whether a given RMSD CP results from getting the correlation wrong or from getting the magnitude of the variability wrong.
For the 30 participating models, the correlation coefficients (azimuth angles) are very similar for many of the models: all but six models are correlated with the observations with a coefficient between 0.5 and 0.6 ( Fig. 4; Table 3). Of the four models with correlation coefficients less than 0.4, two are BOGCMs, one is a DI/WI model and one is a DR/WR model. Model 18, a DR/WR model, yields the highest correlation.
The distance from the origin to each model point is the standard deviation of the modeled productivities; points closer to the origin than the dotted line underestimate the variance of the data, and those outside the dotted line overestimate this variance. As was evident from the target diagram (Fig. 3) most models underestimate the observed variance, however the Taylor diagram (Fig. 4) illustrates that the amount by which these models underestimate the observed variance varies considerably from model to model, and independently of correlation or RMSD. In particular, the simplistic Model 1, which was shown to produce a very low RMSD value on the target diagram, is shown here to substantially underestimate the observed variance of the data (but less so than Models 19, 28 and 30 which are relatively complex). BOGCMs, with the exception of Models 24, 28 and 30, typically do slightly better than SatPPMs at reproducing the magnitude of the observed variance (i.e. are closer to the dotted line in Fig. 4), although the difference is not statistically significant.

Cumulative distribution functions
The cumulative distribution of ∼ 1000 test points allows a comparison between the range and median of the PP observations and those obtained from the 30 participating PP models. Most of the DI/WI models (Fig. 5a,b) underestimate the range of the 14 C PP data, which extends roughly from 0.1 to 1.0 g C m − 2 d − 1 . Whereas the lower PP values are substantially overestimated by some of the DI/WI models (e.g. #1, 2, 4, 6, 9), other models tend to underestimate the higher PP values (e.g. #1, 5, 8,10). Only Model 12 overestimates the range of PP values, and it does so both by underestimating the minimum PP values and overestimating the maximum PP values. Not surprisingly, without any SST, PAR or MLD dependence, Model 1 (i.e. PP ¼ ffiffiffiffiffiffiffi chl p ) produces a range of PP values that is much too small; however, this model reproduces the median of the PP data better than any other DI/WI model, with the exception of Model 12 (which interestingly has the largest range). For this DI/ WI subset of models, #11, a VGPM variant with an alternate definition of the 1% light level, provides the best overall fit to the cumulative fractional distribution of the test points.
In general, the depth-resolved models (DR/WI and DR/ WR) do not reproduce the observed range and median of the 14 C PP data (Fig. 5c,d) any better than the depth-integrated models (Fig. 5a,b). Eight of the nine DR models overestimate the median PP: #14 by more than 50% and #19 by nearly 100%. This mismatch occurs because these models generally produce very few PP values less than 0.2 g C m − 2 d − 1 , whereas nearly 20% of the PP observations are less than this value. The WR models (#16-21) also do not reproduce the observed range of PP any better than their WI counterparts (#1-15).
The BOGCM results (Fig. 5e) are more variable than those computed from the SatPPMs, which is to be expected since the BOGCMs do not use the chl 0 input fields provided to the participants and used by the SatPPMs. Moreover, the BOGCMs have different vertical/horizontal resolutions and use different mixing parameterizations, which may further contribute to these deviations. Even so, as was the case for the SatPPMS, most of the BOGCMs produce too few low PP values. The minimum PP for Models 24, 25, 28 and 30 (0.22, 0.32, 0.35 and 0.32 g C m − 2 d − 1 , respectively) are only slightly lower than the median of the observations (0.37 g C m − 2 d − 1 ). This strong positive bias evident for many of the BOGCMs may result from the tendency of coarse resolution general circulation models to upwell excessive amounts of nutrients in the eastern tropical Pacific (Gnanadesikan et al., 2002;Doney et al., 2004), and thus overestimate these lower productivity observations. The excessive nutrient flux greatly expands the moderately productive HNLC area. On the other hand, the BOGCMs are less biased for the high productivity observations than the low PP observations, because in most of these models excessive nutrient flux will result in a community shift towards diatoms, which will increase export more strongly than primary production.
Not all of the BOGCMs are associated with positive biases. Two BOGCMs over-predict the number of low PP values: Model 22 yields 50% of its results lower than 0.2, whereas only 20% of the observations are below this value PP. Other BOGCMs do quite well: the cumulative distributions of Models 23 and 29 agree with the observations better than many of the SatPPMs. The skill of Model 29 in this respect is particularly interesting, as this model uses climatological rather than interannual forcing Fig. 6. Total RMSD between each of the satellite-based PP models and the ClimPP database. The portion of the total RMSD that may be attributable to uncertainty in the input variables is black; colors defined in Fig. 3. fields. Overall, the performance of the BOGCMs varies substantially, and as a group of models they do not match the range and median of the observed PP any better or worse than the SatPPMs.

Error perturbation analysis
Measurements of the input variables (SST, MLD, PAR, chl 0 ) used by the SatPPMs are not error-free, and the uncertainty in their values affects the ability of the models to optimally reproduce the 14 C PP measurements, which are also associated with some degree of uncertainty. To examine the effect of these uncertainties, an error perturbation analysis (Section 3.4) was conducted for the SatPPMs (Fig. 6). The maximum reduction in RMSD resulting from perturbations of each of the four input variables individually, PP, and all five quantities simultaneously (the best-case scenario perturbation) is also reported in Table 5.
When the input variables were independently perturbed, not surprisingly most (two-thirds) of the SatPPMs showed larger reductions in model-data misfit with perturbations of chl 0 , than with PAR, MLD or SST (Table 5). Although chl 0 was perturbed by less than 50%, which may be less than the uncertainty in satellite chlorophyll algorithms, the decrease in RMSD averaged 25%. This is consistent with the results of Campbell et al. (2002), which demonstrated that errors in surface chlorophyll produced less than proportionate errors in integrated productivity. Model 12 demonstrated the greatest sensitivity to perturbations in chl 0 , probably because this VGPM variant computes P opt B as a function of chl 0 alone, with no input from SST. On the contrary, Model 10, a VGPM variant that computes P opt B as a function of both SST and chl 0 , showed one of the smallest sensitivities to perturbations in chl 0 , demonstrating only 13% improvement in RMSD. With this type of error analysis, it is possible to estimate the maximum fraction of the total RMSD that may be attributed to uncertainties in the input variables and in the PP data themselves. For example, as noted earlier, the model with the lowest total RMSD is Model 18 (RMSD = 0.23; Fig. 6). However, when uncertainties in the input variables and PP data are taken into account, the best-case scenario RMSD for this model drops by 60% to the very small RMSD = 0.09.
Certain models are inherently more sensitive to changes in input variables than others (Fig. 6), and therefore their PP estimates are more affected by uncertainties in these variables. Since Model 1 uses only one of the provided input fields (chl 0 ), it is affected only by uncertainties in that field and PP, and not surprisingly, less of its total RMSD can be attributed to these uncertainties than most other models, although even in this case the fraction is quite large (39%). The relatively low original RMSD may partially be explained by the fact that, by only using one of the input fields provided, it is unaffected by uncertainties in the other fields. These data uncertainties inherently have a greater effect on the skill of the more comprehensive models that incorporate information from all the provided fields: for example 78% of the total RMSD of Model 3 can be attributed to uncertainties in the data. On average, uncertainties in the input variables and the PP data can account for more than half (mean = 58%; range = 39-79%) of the total RMSD computed for the 21 SatPPMs. If the uncertainties associated with measurements of chl 0 , SST, PAR and MLD are reduced in the future as observations of these quantities become more accurate, the skill of Models 2-21 would be expected to increase relative to that of the simplistic Model 1.

Principal component analysis of reductions in RMSD
An R-mode principal component analysis (PCA) was performed to further examine how individual and simultaneous perturbations of the input variables and ClimPP observations reduce the total RMSD between observed and predicted PP for each SatPPM. The PCA was computed on the reductions in RMSD that resulted from individual perturbations of chl 0 , SST, PAR, MLD, PP and simultaneous perturbations of all five quantities (Table 5). Eighty-one percent of the variance of reduction in RMSD is explained by the first two principal components arising from such an analysis, with 60% of the variance explained by PC1 and 21% explained by PC2. In Fig. 7 the original total RMSD associated with each model, assuming the provided input variables and ClimPP data have no associated uncertainty (Table 3), is displayed as the symbol size (better fitting models are represented by smaller circles); vectors represent variables and they point towards the largest decrease in total RMSD associated with uncertainties in that specific variable. Thus models whose projection onto a specific vector fall near or beyond the head of that vector tend to be those that are most affected by uncertainties in that variable. As several variables are simultaneously considered, the order of the models' projections onto vectors cannot exactly match the reductions in RMSD, but it can be regarded as the best overall approximation.
The simultaneous perturbation of 'all' variables ( Fig. 7) is strongly correlated with the first principal component (PC1), as is illustrated by the fact that the angle between the 'all' vector and PC1 is very small. Therefore, models with the   (54) This data matrix is the basis for the principal component analysis (Fig. 7).
highest values of PC1 (e.g. #3, 4, 7, 21) are those for which the performance is most improved when uncertainties in all the input and output variables are considered, i.e. these models are generally the most sensitive to sampling and analytical errors. On the contrary, models that are located in the far left hand quadrants (negative values of PC1; e.g. #1, 8, 10) tend to be less affected by the data perturbation and are also generally the best fitting models (smallest original RMSD assuming no data perturbations). The vectors describing perturbations of MLD and PAR are the longest vectors, indicating that perturbations in these input variables produce the most diverse responses among the various models. Perturbations in PAR and MLD yield substantial reductions in RMSD for some models, and very little such reductions for other models. For example, the skill of Models 3 and 7 are strongly affected by inaccuracies in MLD, whereas the skills of Models 2, 6, 14, and 21 are more strongly affected by inaccuracies in PAR and chl 0 . It is interesting to note that the vectors that represent PAR and MLD are orthogonal, i.e. independent of each other in the PCA. This implies that improvements in RMSD due to PAR data perturbations tend to be independent of those obtained by perturbing MLD data. Both MLD and PAR strongly contribute (by combining their effects in a way that is similar to the combination of two orthogonal forces) to the reduction in RMSD obtained by perturbing all variables.
Vectors describing perturbations of chl 0 and SST are much shorter, reflecting that the effects of their perturbation on the differences between models as shown in the ordination are minor, or in other words these variables affect most models in a similar way. Similarly, PP data perturbation has a virtually negligible role in discriminating models according to their sensitivity. All vectors, except the one representing SST, point towards the positive end of PC1. This implies that their effects, although not necessarily correlated, contribute to the reduction of the RMSD of all the models that are located in the right part of the ordination. SST, on the contrary, points towards the left part of the ordination, nearly in opposition to PAR. This means that PAR and SST perturbations tend to have opposite effects on the reduction of RMSD: models that are improved by perturbing SST are generally insensitive to perturbations in PAR, and models that are improved by perturbing PAR are relatively insensitive to perturbations in SST. However, the effect of SST perturbations on RMSD reduction is considerably smaller than that of the PAR perturbations (Table 5).

Correlations of model-data misfit with environmental variables
Although documentation of the relative skill of these 30 productivity models in the tropical Pacific is useful, model improvement requires not only an assessment of model skill, but also an understanding of the particular environmental conditions that affect performance. To address this issue, a correlation analysis was undertaken to better characterize model performance in the context of environmental forcing. Specifically, the correlation was estimated between modeldata misfit (Δ) and ten variables (Fig. 8), including the input variables (SST, chl 0 , PAR and MLD) as well as longitude, degrees from the equator, year day, year, the depth of the chlorophyll maximum, and biomass-normalized PP integrated over the euphotic zone (P i B ).
As expected, correlations with surface chlorophyll are relatively high for the 21 SatPPMs. Model-data misfits in all but one of these models are positively correlated with chl 0 , indicating that modeled PP overestimates measured PP when surface chlorophyll concentration is high, and modeled PP underestimates measured PP when surface chlorophyll is low. Because the SatPPMs by definition only utilize chlorophyll data from the surface layer, one might also expect an equally high negative correlation with the depth of the chlorophyll maximum, i.e. that the models would underestimate productivity  when the maximum chlorophyll concentration occurred below one optical depth. Because large parts of this region are characterized by deep chlorophyll maxima (Barber et al., 1996;Le Borgne et al., 2002), the ClimPP dataset provides a good test of the impact of chlorophyll vertical structure on model misfit. However, Fig. 8 illustrates that the correlation between misfit and the depth of the chlorophyll maximum is generally less than 0.1. Only Model 12 produced misfits that were strongly correlated to the depth of the chlorophyll maximum, overestimating the ClimPP data when the maximum chlorophyll concentration was at the surface and underestimating the ClimPP data when the maximum chlorophyll concentration was deep. This latter behavior is what would be expected from models that use only a surface value, and has been a concern for the use of ocean color. The low correlation between misfit and depth of the chlorophyll maximum for the bulk of the models suggests that model assumptions typically used to characterize the depth profile of chlorophyll are fairly sound, and/or that the variations of PP in the deep chlorophyll maximum do not contribute strongly to vertically integrated PP.
Somewhat surprisingly, for all of the participating models but especially the SatPPMs, model-data misfit shows a strong inverse correlation with year, i.e. the models overestimated the ClimPP data in the earlier years. A comparison of the RMSD statistics computed for the 1990-1996 time period (N = 340; Table 4) versus the entire time period (1983)(1984)(1985)(1986)(1987)(1988)(1989)(1990)(1991)(1992)(1993)(1994)(1995)(1996); N =948; Table 3) reveals that although the bias is similar for these two time periods, the RMSD CP is substantially less for the SatPPMs in the 1990s. Specifically, the correlation between the SatPPM estimates of PP and the ClimPP data ( Fig. 9) is higher for the 1990s data (mean for all SatPPMs = 0.76 ± 0.01) than it is for the entire dataset (0.52 ± 0.01). This is true even for the simplistic Model 1, implying that the correlation between surface chlorophyll and integrated productivity was stronger in the 1990s than it was in the 1980s.
Not only is total RMSD reduced for the SatPPMs (0.29 ± 0.01 to 0.22 ± 0.01), but the average fraction of this total RMSD that may be related to uncertainties in input variables and PP data also increased from 58% to 68% after 1990 (Fig. 10). For seven SatPPMs, in the best-case scenario up to 80% of the RMSD for the 1990s may be attributed to uncertainties in the input variables and ClimPP data. The improved performance of the models since 1990 could reflect the fact that some of the participating SatPPMs and many of the BOGCMs were developed using data acquired after 1990. However, this was not true for many of the SatPPMs; the VGPM variants, for example, were based primarily on in situ data collected prior to the 1990s (Behrenfeld and Falkowski, 1997b).
Model-data misfit was not only highly correlated with year, but was similarly highly correlated with observations of P i B (Fig. 8). To further investigate these correlations, modeldata misfit is plotted as a function of P i B data and time (Fig. 11).
Consistent with the correlation results, all thirty models yield a clear trend of overestimating PP for low P i B values and underestimating PP when P i B is high; absolute misfit (|Δ|) typically is greatest for the lowest P i B values. Furthermore, the oldest data (from the early and mid-1980s) are generally characterized by relatively low P i B values (P i B b 2.5 mmol C mg chl − 1 d − 1 ), whereas P i B measured from 1995 and 1996 exceed 2.5 mmol C mg chl − 1 d − 1 (Fig. 12a). This change in P i B holds over the entire ClimPP domain, not just the equatorial zone (Fig. 12b,c). Thus, the surprisingly high correlations between year and misfit occur because: (1) all models, both the SatPPMs and the BOGCMs, perform poorly when P i B is low (b1 mmol C mg chl − 1 d − 1 ), and (2) 95% of these low P i B values Fig. 10. As in Fig. 6, except RMSD is computed for the 1990s subset of the ClimPP data. occur in the 1980s (although there are more data points poleward of 10°in the 1980s than the 1990s (Fig. 12c), even when only points between 10°N and 10°S are included, 95% of the P i B b 1 mmol C mg chl − 1 d − 1 values still occur in the 1980s portion of the ClimPP dataset). It is possible that the improved performance of the models since 1990 is a reflection of inherent problems associated with modeling low photosynthesis rates for a given biomass. The interdecadal variability in productivity per unit biomass directly results from an increase in measured  Table 2. productivity: mean productivity increased by as much as 62% between these two decades. Biomass-normalized productivity, on the other hand, increased by a smaller percentage (47%) since mean integrated chlorophyll for the 1990s was also slightly higher (8%) than that observed in the 1980s. Interestingly, although average integrated chlorophyll was lower in the 1980s, mean surface chlorophyll, which is what SatPPMs use to estimate PP, was slightly (10%) higher in the 1980s than it was in the 1990s. As a result, nearly all the SatPPMs overestimated PP in the 1980s. As an aside we note that over the entire ClimPP domain there was no significant difference in SST between pre-1990 and post-1990; however, in the eastern equatorial Pacific, the 1990s time period was associated with colder temperatures. These trends and interdecadal differences do not reflect any change in observational methodology. As discussed above, the entire ClimPP database was collected using the same tracemetal clean procedures Sanderson et al., 1995). In no instances did the Go-Flo bottles contain black rubber closing springs, which were found to be the major source of trace metal contamination (Fitzwater et al., 1982). In addition, the entire database was reprocessed using the same model for estimating PAR attenuation (Morel, 1988) and the same method for normalizing light extinction estimates in all of the ClimPP observations.
The interdecadal changes documented above are consistent with findings from other recent studies of interdecadal change in this region. The Pacific Decadal Oscillation regime shift between 1988 and 1992 has been associated with major physical, chemical and biological changes in this ocean basin (Hare and Mantua, 2000). For example, Chavez et al (2003) proposed multidecadal regime shifts in the eastern tropical Pacific Ocean, consisting of a warm phase with high landings of sardines followed by a cool phase with high landings of anchovies. Chavez et al. (2003) hypothesized that the regime shift occurred around 1990 when the landings of sardines began to decrease and a cooling of SST was observed.
In another study, Takahashi et al. (2003) present equatorial (5°N-5°S) observations that demonstrate a decrease in pCO 2 from 1979 to 1990 followed by an increase in seawater pCO 2 between 1990 and 2001. Similarly Feely et al. (2006) document a decrease in SST in the central equatorial Pacific and a simultaneous increase in the rate of the 28°C fCO 2sw during the 1990s, again suggestive of an increasing influx of colder and CO 2 -rich waters into this region. They note that these trends could result from either increased lateral flow of colder subtropical waters driven by equatorward winds or increased upwelling of colder deep waters. The increase in P i B documented here is more consistent with the latter scenario, since an increased rate of upwelling would bring high-nutrient iron-rich water to the surface, possibly resulting in a change in community structure and phytoplankton species composition, and generating the higher rates of biomass-normalized productivity documented here. The absence of lower mean SST after 1990 is attributable to the large areal extent over which the data extend (30°of latitude and nearly 150°of longitude) and thus is not inconsistent with an increased upwelling scenario. Because the interdecadal change in productivity was not associated with a significant and coincident change in any of the input variables (chl 0 , SST, PAR or MLD), it is not surprising that the SatPPMs failed to reflect these changes in PP. This reflects an important characteristic of SatPPMs in general. Regardless of complexity, i.e. depth/wavelength resolved or integrated, these models all incorporate empirical information regarding the relationship of physiological state with environmental conditions. If changes in the rate of upwelling at depth alter deep nutrient concentrations without effecting a coincident change in SST or MLD, a scenario which is likely to occur for regions of shallow MLD as is frequently the case in the tropical Pacific, the skill of SatPPMs is likely to be less. Although BOGCMs in principle have the capacity to model such interdecadal changes in upwelling and the resultant effects on nutrient concentrations at depth, the models submitted to this study have not demonstrated the ability to do this in this case. It is possible that complexities in community structure missing in these BOGCMs may be required in order to understand the relative biomass invariance. These results demonstrate that temporal variability in the form of interdecadal regime shifts pose a significant challenge for both SatPPMs and BOGCMs alike, and further suggest that other changes associated with scenarios of global warming are likely to be equally problematic.

Discussion and summary
Thirty models, including both SatPPMs and BOGCMs, participated in this PPARR comparison exercise. PP model skill was assessed by computing root mean square differences (RMSD) and by using a variety of skill assessment tools to determine how well each model reproduced log 10 PP from an extensive in situ tropical Pacific database (ClimPP). This region of the ocean was chosen specifically because earlier PP modeling efforts demonstrated that the tropical Pacific represents a significant challenge for satellite-based PP models. A primary result of this model intercomparison effort is that although skill varied substantially among the participating models, model skill was not associated with model complexity or model type. Overall, there was little significant difference in the performance of the three different classes of SatPPMs represented here (based on depth and wavelength resolution). Also, neither SatPPMs nor BOGCMs performed significantly better based on the full suite of skill assessment methodologies, revealing that using output from SatPPMs to evaluate BOGCM simulations of PP is not an adequate test of BOGCM skill.
Most models substantially underestimated PP variability, often by more than a factor of two. Specifically many of the models, including all 9 depth-resolved satellite-based models (though Model 18 to a much lesser degree), produced too few low PP values (b0.2 g C m − 2 d − 1 ). PP model developers need to be aware of this model shortcoming, and should note that an improvement in the ability of productivity models to estimate PP at low rates of productivity will have a significant impact on PP model skill in the tropical Pacific and other oligotrophic regions across the globe.
Whereas nearly all models underestimated PP variability, model performance differed substantially in terms of how well the models reproduced the mean and median PP. Several models overestimated the median productivity by a factor of two, others underestimated productivity by nearly a factor of two, and others presented almost no bias. Thus, the models with the greatest skill were generally those with the lowest bias. PPARR2 concluded that removing bias was an immediate goal to improve model performance. The present study confirms that several participating models have accomplished this goal. Although, as noted above, in general there was little difference in the performance of different classes of models, one exception is in terms of bias: the simplest models were characterized by significantly less bias than those that resolved wavelength and depth.
In a best-case scenario, more than half of the total RMSD associated with the SatPPMs could be attributed to uncertainties in input variables and the PP data. Whereas the skill of certain models could increase dramatically if the input variables were known more accurately (e.g. many of the models that use the MLD fields), the skill of other models, including the simple square root of chlorophyll relationship that uses only the chl 0 input field, would not undergo the same level of improvement. In general, the subset of models based on the Vertically Generalized Productivity Model (Behrenfeld and Falkowski, 1997b) show a relatively small improvement in skill when uncertainties in input variables and PP data are considered, primarily because of their inherent insensitivity to PAR. On the contrary, the skills of PP models that utilize MLD in their computations of productivity are severely limited by uncertainties in MLD estimates, suggesting that further development of models that require MLD input fields is not advised, especially since high quality MLD fields are not readily available.
The specific models that demonstrate the greatest skill vary depending on which skill assessment methods are used. For instance, if total root mean square difference is used as the single criteria for skill, Model 18 , which uses biogeographical provinces to define the necessary model parameters, performs best. Model 1, the simplistic sqrt(chl 0 ) model, performs nearly as well according to this criterion, however this is partially due to the fact that models that underestimate variance result in lower total RMSD scores (Jolliff et al., 2009-this volume). Ten of the 30 models are associated with biases that are smaller than the levels of PP uncertainty we assume here. One of the BOGCMs (Model 29: Tjiputra et al., 2007) produces the lowest bias, however one of the VGPMs (Model 12) and the depth-resolved Model 13 (Armstrong, 2006) also have negligible biases. In terms of reproducing the observed PP variability, two depth and wavelength-resolved models outperform the others (Model 16: Smyth et al., 2005 andModel 18: Mélin, 2003). Model 18, as well as Model 27 (Gregg, 2003) produce PP fields that are most highly correlated with the data, whereas Model 29 (Tjiputra et al., 2007) produces PP fields that have nearly the same variance as the ClimPP data. Cumulative fractional distributions indicate that Model 11 (Morel and Maritorena, 2001) produces PP distributions that most closely reproduce the observations. It is clearly not possible to identify one single model that is most skilled according to all the criteria listed above, however, the specific models highlighted here are examples of models that perform particularly well in these analyses.
In order to better understand why certain models have greater skill than others and what environmental conditions control performance, correlations between model-data misfit and a number of relevant variables were computed. Surprisingly, model-data misfit was highly correlated to year for almost all of the SatPPMs as well as for some of the BOGCMs. Both primary productivity and biomass-normalized productivity in the ClimPP database were substantially higher in the 1990s than in the 1980s. This interdecadal variability may be associated with a change in upwelling rate within the tropical Pacific, a hypothesis that is consistent with previous studies (Takahashi et al., 2003;Chavez et al., 2003;Feely et al., 2006). The observed interdecadal change in integrated productivity per unit biomass within the ClimPP database was not associated with a coincident change in the surface expression of chlorophyll, temperature, or PAR, probably because of the typically shallow MLD in the tropical Pacific. As a result, the SatPPMs were unable to correctly estimate this interdecadal variability in productivity. Although BOGCMs should theoretically have the ability to replicate the increased upwelling of nutrients and the effect of this on productivity, this study has shown that this remains a challenge for the current generation of BOGCMs. These results suggest that interdecadal and global changes will be a significant challenge for both SatPPMs and BOGCMs alike.
The success of these types of comparison exercises is illustrated by the continual modification and improvement of the participating models and the resulting improvement in model skill. The previous PPARR2 exercise (Campbell et al., 2002) specifically highlighted bias as a major contributor to total RMSD. As demonstrated here, bias has been effectively reduced in several participating models. PPARR2 used the Joint Global Ocean Flux Study equatorial Pacific process study data collected on and off the equator at 140°W in 1992 to compute the average total RMSD of log 10 PP for their twelve participating models. Using the identical dataset in an analogous computation, the average total RMSD is computed from 20 of the SatPPMs participating in this exercise (Models 2-21). The resulting RMSDs (on equator: 0.21; off equator: 0.23) are 58% and 35% lower, respectively, then those computed 6 years ago in PPARR2 (on equator: 0.50; off equator: 0.31). If this rate of model improvement continues, within a little more than a decade total RMSD can be expected to fall to within the level of observational uncertainty (0.07). (Eppley et al., 1985). It ignores any external forcing or changes in physiological state. While other models incorporate information regarding geography or forcing fields, this model assumes that the standing stock is sole determinant of photosynthetic rate. All biomass performs identically. This simplicity is inherently elegant because biomass is, for most of the ocean, an excellent indicator of nutrient supply and presence of light.
Model 2: This is a variant of the original Howard, Yoder, Ryan model (HYR, Howard and Yoder, 1997; Model 3) which integrates photosynthesis to the euphotic depth as defined in Behrenfeld and Falkowski (1997b) rather than to the MLD (Carr, 2002).
Model 3: This is the original HYR model, which for many years was a standard MODIS algorithm. Maximum growth rate is parameterized as a function of SST according to Eppley (1972). Primary productivity is integrated to the MLD rather than to the euphotic depth.
Model 4: This HYR variant uses MLD information to quantify the photoadaptive variables within the euphotic zone, as well as to address water column partitioning of primary productivity relative to euphotic depth.
Model 5: This model uses an artificial neural network to perform a generalized nonlinear regression of PP on several predictive variables, including latitude, longitude, day length, MLD, SST, P opt B (computed according to Behrenfeld and Falkowski (1997b)), PAR, and chl 0 (Scardi, 2000;Scardi, 2001). Since there are insufficient data to calibrate the neural network throughout the tropical Pacific, PP values from other models (VGPM, HYR, and the MOD-27 formulation (Esaias, 1996)) were considered measurements where there were none.
Model 6: This model is based on the formulation obtained through dimensional analysis by Platt and Sathyendranath (1993). The photosynthetic parameter (P max B ) is assigned by combining a temperature-dependent relationship for the maximum growth rate (Eppley, 1972) with a variable carbon to chlorophyll ratio following the statistical relationship of Cloern et al. (1995). Primary production is integrated to the 1% light level.
Model 7: The Carbon-based Production Model (CbPM) represents a new approach to NPP assessment that utilizes satellite information on both surface chlorophyll and particulate backscatter (bbp) at 443 nm, which is converted into phytoplankton carbon biomass (C) (Behrenfeld et al., 2005). This expanded information set allows the ratio of chl:C to be calculated and related to phytoplankton growth rates (m), such that NPP is then calculated as the product of C and mu. In the current application of the CbPM, no field data were provided on bbp, so phytoplankton C was estimated using 'typical values' derived from SeaWiFS data for a given sample location. This lack of coincident field bbp data compromises the performance of the CbPM in the current model-field data comparison. Vertical structure in this version of the CbPM is as assumed in Model 8.
Model 8: The Vertically Generalized Production Model (VGPM) developed by Behrenfeld and Falkowski (1997b) is one of the most widely known and used WIDI PP models and is one of two standard MODIS algorithms. The maximum observed photosynthetic rate within the water column, P opt B , is obtained as a 7th-order polynomial of SST.
Model 9: This model only differs from Model 8 in that P opt B is estimated as an exponential function of temperature following Eppley (1972). Model 10: This VGPM variant formulates P opt B as a function of SST and chl 0 (Yamada et al., 2005;Kameda and Ishizaka, 2005). The model is based on the assumptions that changes in chlorophyll concentration depend on the abundance of large phytoplankton and that chlorophyll-specific productivity is inversely proportional to phytoplankton size. Model 11: This VGPM variant uses the continuous function of Morel and Berthon (1989) to estimate total integrated chlorophyll, which in turn is used to estimate the euphotic depth with the equations proposed by Morel and Maritorena (2001).
Model 12: This VGPM variant differs from Model 8 in that P opt B (mg C mg chl − 1 h − 1 ) is estimated as a function of chl 0 (mg chl m − 3 ) only. This empirical relationship was derived from EqPac data (P opt B = 24 × chl 0 + 1).

A.2. DR/WI SatPPMs
Model 13: Photosynthesis per unit chlorophyll was determined using an optimality-based model of nitrogen allocation and photoacclimation (Armstrong, 2006), which is in turn based on the model of Geider et al. (1998). Photoacclimation and nitrogen allocation are determined as a function of light and temperature; therefore both PAR and SST are used in the productivity algorithm. Maximum photosynthetic rates were taken to be an average of maximum NPP and GPP estimated for T. weissflogii in Armstrong (2006), and were assumed to have Eppley temperature dependence. Through the photoadaptation algorithm, chlorophyll reflects nitrogen status, so that no additional assumptions about nutrient limitation are needed. Chlorophyll concentration was assumed constant over the photic zone and equal to surface chlorophyll, so that light decreases exponentially with depth. Photic zone depth (1% light) was determined from chlorophyll concentration and assumed extinction coefficients. The photic zone was assumed to be well-mixed and cells were assumed to be photoacclimated to average (10%) light. Column productivity is the integral over the photic zone of (photosynthesis/chlorophyll) ⁎ chlorophyll.
Model 14: In this model the depth distribution of PAR is determined by chl 0 while the depth distribution of chlorophyll is given by the PAR profile and chl 0 . Total productivity is empirically estimated as a function of surface temperature and depth-dependent PAR and chlorophyll (Asanuma, 2006).
Model 15: This model is based on chlorophyll-specific phytoplankton absorption, which is parameterized empirically as a function of SST (Marra et al., 2003). Absorption by photosynthetic pigments is distinguished from total absorption; the former is used to calculate productivity and the latter is used to estimate light attenuation in the water column. The rate of photosynthesis is quantum yield times absorbed irradiance, obtained from a hyperbolic tangent and a constant maximum quantum yield, ϕ max . The depth profile of chlorophyll is estimated assuming a Gaussian shape with parameters determined by the surface value. The attenuation coefficient of chlorophyll is also SST-dependent.

A.3. DR/WR SatPPMs
Model 16: This model is an implementation of the Morel (1991) model in which the depth distribution of chlorophyll is assumed constant throughout the water column. The broadband incident PAR is spectrally resolved using a lookup table generated from a single run of the Gregg and Carder (1990) marine irradiance model where the effects of clouds and aerosols are essentially linearly scaled. The model uses 60-min time and 10-m depth steps at 5-nm wavelength resolution when run using the global datasets (Smyth et al., 2005).
Model 17: This model is the lookup table described in Smyth et al. (2005) which uses a radiative transfer model with parameterizations for Colored Dissolved Organic Matter (CDOM). The model is based on Morel (1991) with incident irradiance parameterized in the same manner as Model 16.
Model 18: This model follows that of Platt and Sathyendranath (1988) as implemented at global scale by Longhurst et al. (1995). It uses biogeographical provinces to define the values of the parameters to describe the light-photosynthesis curve and the chlorophyll depth profile. Photosynthetic parameters were updated using an extended dataset provided by the Bedford Institute of Oceanography and an extensive literature review. Spectral surface irradiance is first estimated independently with the model of Gregg and Carder (1990) combined with a correction for cloud cover and then scaled to match the PAR values provided for the exercise. Spectral light is subsequently propagated in the water column with a biooptical model with updated parameterizations of the inherent optical properties. All changes to the original implementation of Longhurst et al. (1995) are detailed by .
Model 19: This model derives spectral irradiance from PAR using Tanre et al. (1990), and assumes a vertically uniform chlorophyll profile. Quantum yield is parameterized as a maximum value times both a light dependent term (Waters et al., 1994;Bidigare et al., 1992) and a temperature-dependent term. Temperature dependence was assumed to be sigmoidal, and was based on a vertical profile of temperature derived from SST and MLD.
Model 20: This is a spectral light-photosynthesis model published by Morel (1991). It is formulated using chlorophyllspecific wavelength-resolved absorption and quantum yield. Temperature dependence is given by the parameterization of P max B , which follows Eppley (1972). The chlorophyll profile is determined to be well-mixed or stratified according to the ratio of MLD and the euphotic depth, and if stratified, assigned a Gaussian profile as in Morel and Berthon (1989). Mean photo-physiological parameters are from . The model is run in its 'satellite' version Antoine and Morel (1996), where PP is the product of integral biomass, the daily irradiance, and Ψ⁎ (the cross-section of algae for photosynthesis per unit of areal chlorophyll biomass). Lookup tables for Ψ⁎ were previously generated using the full DR/WR model, and are used to increase computational efficiency.
Model 21: This model represents an expansion in both the physiological attributes of the original CbPM (Model 7) and the space-and time-resolution of the model, and is described in detail in Westberry et al. (2008). The model requires surface chlorophyll concentration data and particulate backscatter coefficients (bbp) at 443 nm. As with Model #7, execution of Model #21 for the current study required the estimation of bbp from the SeaWiFS record, as no coincident field data for bbp were available. Vertical structure in chlorophyll concentration in this model is driven primarily by photoacclimation of cellular chlorophyll, with additional adjustments made for changes in nutrient stress through the water column.

A.4. BOGCMs
Model 22: This BOGCM  uses a 20layer reduced-gravity σ-coordinate model, with the first layer corresponding to the ocean's mixed layer. The grid employed here stretches from 124°E to 76°W with uniform 1°resolution in longitude, and from 30°S to 30°N with maximal latitudinal resolution in the equatorial band (1/3°at latitudes b10°). The model was run using NCEP weekly (6 day) mean wind stress and monthly interannual precipitation. The nitrogen and iron based ecosystem model has two size classes each of phytoplankton, zooplankton, and detritus, and the nutrients nitrate, ammonium, and iron. Phytoplankton growth is simultaneously limited by irradiance as well as nitrogen and iron availability, with fixed Fe/N ratios so that iron in all compartments except the dissolved pool is not modelled explicitly.
Model 23: This BOGCM is a modification of Model 22 (Wang et al., 2005(Wang et al., , 2006 and differs primarily by having a saturating (Ivlev) rather than a nonsaturating (multiplicative) grazer functional response. In addition it utilizes an additional quadratic mortality term that confers stability and suppresses oscillatory solutions in the case of saturable grazing (Wang et al., 2005). Additional minor modifications from the published versions are discussed in Christian et al. (in press).
Model 24: The Biogeochemical Elemental Cycling (BEC) ocean model simulates the biogeochemical cycling of carbon, oxygen, nitrate, phosphate, iron, silicate, and alkalinity. The ecosystem has multiple, potentially growth limiting nutrients (nitrogen, phosphorus, silicon, and iron) and four phytoplankton groups (diatoms, diazotrophs, coccolithophores, and picophytoplankton) (Moore et al., 2002(Moore et al., , 2004Doney et al., 2009this issue). Growth rates can be limited by available nutrients and/or light levels. Full carbonate chemistry, as well as sinking particulate and semi-labile dissolved organic pools are included in a global, 3D context without nutrient restoring (Moore et al., 2004). The BEC runs within the coarse resolution ocean component of the Community Climate System Model (Yeager et al., 2006), driven with time-varying atmospheric forcing from NCEP reanalysis and satellite data products .
Model 25: This prognostic ocean biogeochemistry/ecology model considers 25 tracers including three phytoplankton groups, two forms of dissolved organic matter, heterotrophic biomass, and dissolved inorganic species for coupled C, N, P, Si, Fe, CaCO 3 , O 2 and lithogenic cycling with flexible N:P:Fe stoichiometry. The model includes such processes as gas exchange, atmospheric deposition, scavenging, N 2 fixation and denitrification. Loss of phytoplankton is parameterized through the size-based relationship of Dunne et al. (2005), river inputs, and sediment processes. This biogeochemistry was run in the Modular Ocean Model version 4 (Griffies et al., 2005) with 50 vertical z-coordinate layers and a nominally 1°g lobal resolution horizontal B-grid with tri-polar coordinates to resolve the arctic and finer detail of 1/3°near the equator. The model was forced with 6-hourly and interannually varying forcing from the atmospheric reanalysis of Large and Yeager (2004).
Model 26: The NASA Ocean Biogeochemical Model (NOBM) simulates four phytoplankton groups (diatoms, chlorophytes, cyanobacteria, and coccolithophores) and four nutrients (nitrate, ammonium, silica, and iron) (Gregg and Casey, 2007). The model is approximately 0.8°resolution with 14 vertical layers in quasi-isopycnal configuration. The model was forced by monthly mean winds and shortwave radiation from NCEP for 1979-2004. Model 27: In this NOBM variant (Gregg, 2008), NODC data for 1979-2004 were assimilated using the Conditional Relaxation Analysis Method. The assimilation affected the model representations of total chlorophyll (sum of the 4 phytoplankton groups), but not the individual community distributions directly. Primary production was affected by the change in total chlorophyll, as well as by indirect effects such as subsurface irradiance resulting from absorption and scattering by the changed chlorophyll field.
Model 28: The PISCES-T model is a two phytoplankton, two zooplankton, three detritus, three nutrient Dynamic Green Ocean Model. It has variable Fe:C, Chl:C and Si:C ratios and fixed N:C:O. It is embedded in the OPA8 OGCM, initialized with observations, and forced by daily NCEP reanalysis . No nutrient restoring was used. The model was run from 1948 to (Le Quere et al., 2007. Model 29: The Hamburg Ocean Carbon Cycle (HAMOCC5) model uses monthly climatology forcing with approximately 3.5°× 3.5°horizontal degree resolution and 22 vertical layers with thickness varying from 50 m at the surface to 700 m in the deepest layer (Maier-Reimer, 1993). The model employs an NPZD-type ecosystem model with two phytoplankton functional groups (diatoms and calcifiers) with multi-nutrient (nitrate, phosphate, silicate and iron) limitation. Productivity is simulated with a 3-day time step (Six and Maier-Reimer, 1996;Howard et al., 2006).
Model 30: The physical model is based on the Regional Ocean Modeling System (ROMS), which has been configured for the Pacific Ocean at 50-km horizontal resolution and 20 vertical layers (Wang and Chao, 2004). The biogeochemical model is based on the Carbon, Silicate, Nitrogen Ecosystem (CoSINE) model developed by Chai et al. (2002Chai et al. ( , 2003. The CoSINE model includes silicate, nitrate and ammonium, two phytoplankton groups, two zooplankton grazers, and two detrital pools. The model is initialized with climatological, temperature, salinity, nutrients and TCO2. When forced with daily air-sea fluxes (wind stress, heat flux, and freshwater exchange) derived from NCEP/NCAR reanalysis (Kalnay et al., 1996) for the period of 1984 to 2005, the Pacific ROMS-CoSINE model is able to realistically reproduce the evolution of ENSO events (1992-93; 1997-98) as well as a sharp change of the circulation and thermal structure for the Pacific Ocean in 1998-99 (Liu andChai, 2008).