Skill metrics for confronting global upper ocean ecosystem-biogeochemistry models against field and remote sensing data

We present a generalized framework for assessing the skill of global upper ocean ecosystem-biogeochemical models against in-situ ﬁ eld data and satellite observations. We illustrate the approach utilizing a multi-decade (1979 – 2004) hindcast experiment conducted with the Community Climate System Model (CCSM-3) ocean carbon model. The CCSM-3 ocean carbon model incorporates a multi-nutrient, multi-phytoplankton functional group ecosystem module coupledwithacarbon,oxygen,nitrogen,phosphorus,silicon,andironbiogeochemistrymodule embedded in a global, three-dimensional ocean general circulation model. The model is forced with physical climate forcing from atmospheric reanalysis and satellite data products and time-varying atmospheric dust deposition. Data-based skill metrics are used to evaluate the simulated time-mean spatial patterns, seasonal cycle amplitude and phase, and subannual to interannual variability. Evaluation data include: sea surface temperature and mixed layer depth;satellite-derivedsurface oceanchlorophyll,primary productivity,phytoplanktongrowth rate and carbon biomass; large-scale climatologies of surface nutrients, p CO 2 , and air – sea CO 2 and O 2 ﬂ ux; and time-series data from the Joint Global Ocean Flux Study (JGOFS). Where the data is suf ﬁ cient, we construct quantitative skill metrics using: model – data residuals, time – space correlation, root mean square error, and Taylor diagrams. © 2008 Elsevier B.V. All rights reserved. Keywords

Even with the wealth of new ocean data, the evaluation of basin and global-scale marine ecosystem-biogeochemistry models is challenging.Satellite data provide reasonably high space-time resolution but only for the surface layer and for only a handful of biological properties.Ship-based biological and chemical data are invaluable but comparatively sparse, with the construction of global annual mean climatologies requiring the aggregation of data from multiple years.Routine underway sampling transects on research vessels and volunteer commercial ships are greatly improving data densities but are mostly limited, to this point, to carbon dioxide system variables (e.g., pCO 2 ).Quantitative evaluation is further confounded by the fact that model variables in most ecosystem models are highly aggregated, for example lumping all microzooplankton or mesozooplankton into two model compartments.Model variables can also lack direct analogues with observed values, as in the case of the crucial higher trophic level mortality closure terms that combine effects of mortality and predation.Furthermore, even the best sampled locations (e.g., JGOFS sites) often do not have enough data to constrain model dynamics fully (e.g., Friedrichs et al., 2007).Despite these issues, data-based verification of model skill is fundamental to advancing the science of marine ecological modeling.Quantitative skill assessment is integral to both the model development cycle and defining confidence estimates on model forecasts.As such, most research groups have in place some form of data-based assessment, though as discussed in Stow et al. (this volume) these assessments are often partial and qualitative.
Skill assessment requires significant up-front investment to compile a wide range of field and remote sensing data into appropriate data products (often, but not always gridded) prior to the actual comparison with model results.The problem is not one of simply data management, but rather requires a serious level of data interpretation and analysis effort to combine observations from different researchers, measurement techniques or even satellite platforms.Once created, the utility of such data products is clear, as illustrated by the broad use by the modeling community of data compilations such as the global surface pCO 2 data set and air-sea CO 2 flux estimates of Takahashi et al. (2002) and Takahashi et al. (2008, in press), the GLODAP data products created from the WOCE/JGOFS (Key et al., 2004), and the level 3 gridded satellite ocean color data from SeaWiFS and MODIS.Standard data products and quantitative metrics of model skill, even if imperfect, stimulate critical assessment of model performance and speed model development.
Lessons for developing a systematic model-data skill assessment can be drawn from the experience of similar efforts in related fields.For example, the ocean biogeochemical community has organized model-data skill assessments under the Ocean Carbon Model Intercomparison Project (OCMIP).Within OCMIP researchers compared about a dozen global ocean biogeochemical models against observations of ocean physics (Doney et al., 2004), transient tracers including radiocarbon (Matsumoto et al., 2004) and chlorofluorocarbons (Dutay et al., 2002), and inorganic carbon, nutrients and oxygen (Orr et al., 2005;Najjar et al., 2007).Matsumoto et al. (2004) argue that most of the OCMIP simulations do not adequately match (within error bars) the available ocean transient tracer data.Estimates of less well-constrained model variables (e.g., future ocean uptake of anthropogenic CO 2 ) should therefore include only the subset of "skillful" simulations or should be constructed using a weighting function based on the transient tracer model skill.A second lesson from OCMIP is that model skill assessment is often best done as a partnership between researchers with expertise in modeling and researchers involved in field observations and remote sensing.
Here we argue for a similar generalized model-data framework for characterizing the skill of global ocean ecosystembiogeochemical models against in-situ field data and satellite observations.While all of the major ocean ecosystem modeling groups currently assess model skill, the approaches are often specific to their particular model, and the community lacks a set of agreed upon, objective evaluation metrics that can be used to inter-compare skill across models.Our goal here is to stimulate discussion and dialogue by proposing a prototype scheme that will be open to the community.We realize that a fully comprehensive system will only emerge over time with input from different user groups and that even with a generalized set of skill metrics there will still remain a need for unique verification approaches for different model applications.
As an illustration, we present model-data skill results from a multi-decade  hindcast experiment conducted with the Community Climate System Model (CCSM-3) coupled ocean Biogeochemical Elemental Cycling model (BEC).The BEC model consists of upper ocean ecological (Moore et al., 2004) and full-depth biogeochemical (Doney et al., 2006) modules embedded in a global 3-D Parallel Ocean Program (POP) ocean general circulation model (Smith and Gent, 2004;Collins et al., 2006).The model is forced with physical climate forcing from atmospheric reanalysis and satellite data products (Doney et al., 2007) and time-varying dust deposition (Mahowald et al., 2003).We focus the analysis on three aspects of the simulation: time-mean spatial patterns, the seasonal cycle, and subannual to interannual variability.Evaluation data include satellite-derived surface ocean chlorophyll and primary productivity (SeaWiFS and MODIS), largescale climatologies of surface nutrients and pCO 2 , and timeseries data from JGOFS and other field programs.Where the data is sufficient, we construct quantitative skill scores (timespace correlation and model and data rms variability) (Lima and Doney, 2004).

Ecosystem-biogeochemistry module
The CCSM-3 BEC model is cast as a set of three-dimensional, time-varying advection diffusion equations for a suite of tracers C: The physical transport is partitioned into resolved advection and parameterized eddy mixing terms; all of the ecological-biogeochemical source/sink terms and surface and sediment fluxes are grouped into the right hand side term RHS bio .The marine ecosystem module (Fig. 1) builds on traditional phytoplankton-zooplankton-detritus-nutrient food web models (e.g., Doney et al., 1996;Fasham et al., 2001).The module incorporates multi-nutrient limitation (N, P, Si, and Fe) on phytoplankton growth and specific phytoplankton functional groups (Moore et al., 2002(Moore et al., , 2004)).
There are fourteen main model compartments: small pico/ nanoplankton, diatoms, and diazotrophs; zooplankton; suspended and sinking particulate detritus; and dissolved nitrate, ammonia, phosphorus, iron, silicate, oxygen, dissolved inorganic carbon, and alkalinity.The pico/nanoplankton size class is designed to replicate the rapid and highly efficent nutrient recycling found in many subtropical, oligotrophic (low nutrient) environments.Diatoms model a larger, bloom-forming size class.Phytoplankton growth rates are determined by available light and nutrients using a modified form of the Geider et al. (1998) dynamic growth model.Photoadaptation is parameterized with dynamically adaptive chl/C ratios.The diazotrophs fix all required nitrogen from N 2 gas, and calcification is parameterized as a fraction of the pico/nanoplank-ton production as a function of temperature and nutrients adapted for coccolithophores.Size-structure effects are included by varying key zooplankton (e.g., partitioning of fecal pellets between suspended and sinking detritus) depending on the food source (Lima and Doney, 2004).Many of the biotic and detrital compartments contain multiple elemental pools, in addition to carbon, to track flows through the ecosystem.The model has one adaptive zooplankton class that grazes on phytoplankton and large detritus.
The biogeochemistry module (Doney et al., 2006) is based on an expanded version of the Ocean Carbon Model Intercomparison Project (OCMIP) biotic model (Najjar et al., 2007).The module includes full carbonate system thermodynamics and air-sea CO 2 and O 2 fluxes.Gas transfer velocities are computed from the 6-hourly NCEP winds (http://www.cdc.noaa.gov/cdc/reanalysis/reanalysis.shtml) using the quadratic wind speed relationship of Wanninkhof (1992).A dynamic iron cycle is incorporated with atmospheric dust deposition, water column scavenging and a continental sediment source (Moore et al., 2006; see also Moore and Braucher, 2008 for discussion on refinements of continental sediment source).Denitrification is simulated in oxygen minimum zones following Moore and   Doney (2007), and subsurface particle remineralization is parameterized incorporating the mineral ballast arguments of Armstrong et al. (2002).The model equations are identical to those reported for the 3-D implementation of Moore et al. (2004) with two important modifications as documented in more detail in Moore et al. (2006).First, water column denitrification has been added to the model in order to close the global nitrogen cycle.Second, a number of the parameters associated with the model iron dynamics and scavenging have been adjusted to improve the simulated dissolved iron fields (see Table 1 of Moore et al., 2006) (Moore and Braucher, 2008).

Atmospheric dust deposition
Time-varying mineral aerosol deposition to the ocean is simulated using a 3-D atmospheric chemical transport model (Mahowald et al., 2003;Luo et al., 2003) based on National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis (Kistler et al., 2001).The dust source and deposition scheme is based on the Dust Entrainment and Deposition (DEAD) scheme (Zender et al., 2003), and the chemical transport model is the Model of Atmospheric Transport and Chemistry (MATCH) (Rasch et al., 1997), which has been developed specifically to be used with reanalysis winds (Mahowald et al., 2003).The dust source areas are defined as dry, poorly vegetated regions which have easily erodible sources, using topographic lows as preferential source areas (Ginoux et al., 2001).Dust is removed by wet deposition during precipitation events, and by dry deposition from gravitational settling and turbulent processes.
The ability of the dust model to correctly simulate the annual mean, and seasonal cycle of dust has been compared against in-situ and satellite observations elsewhere (Luo et al., 2003; Mahowald et al., 2003).For the in-situ concentration data at the 10 stations where multiple years are available, the model gets statistically significant correlations for the seasonal cycle and interannual variability at 8 of 10 stations, with the most difficulty and lowest correlations at stations in the southern hemisphere (Table 1, Mahowald et al., 2003).Comparisons of daily averaged concentrations obtain correlation coefficients of 0.31 to 0.84 for the 7 stations with daily averaged data.Correlation coefficients with available satellite data (TOMS AAI and AVHRR optical depth) in regions where dust is adequately sampled are above 0.60 (Fig. 2, Mahowald et al., 2003).Much of the interannual variability in dust concentrations downwind of the source regions is driven by atmospheric transport (or transport/source correlations) and not by source interannual variability (Tegen and Miller, 1998;Mahowald et al., 2003).

Atmospheric forcing and ocean physical hindcasts
The POP is a z-level, hydrostatic, primitive equation model integrated here with a resolution of 3.6°in longitude, 0.8°to 1.8°in latitude, and 25 vertical levels (Yeager et al., 2006).Effects of mesoscale eddy transport are parameterized according to Gent and McWilliams (1990).The Large et al. (1994) K-Profile Parameterization is implemented in the vertical to capture surface boundary-layer dynamics and interior diapycnal mixing.The historical simulation  is integrated with air-sea heat, freshwater, and momentum fluxes derived from a bulk flux forcing method that combines 6-hourly atmospheric surface fields (temperature, humidity, winds) from the NCEP reanalysis (Kistler et al., 2001) with satellite and in-situ derived clouds, precipitation, runoff and sea-ice fraction (Large and Yeager, 2004).Doney et al. (2007) present a quantitative skill assessment of the ocean physical solutions in terms of interannual variability of temperature, sea surface height, and circulation.
Initial conditions for the nutrient and inorganic carbon variables are prescribed from data-based climatologies (e.g., Key et al., 2004).The ecological-biogeochemical simulation is spun-up for several hundred years, prior to initiating the interannual varying forcing, using a repeat annual cycle of physical forcing, dust deposition, and fixed pre-industrial atmospheric CO 2 mole fraction (280 ppm).The full interannual variability in physics and dust forcing is initiated in model year 815 (equivalent to calander year 1979).In the preindustrial simulation atmospheric CO 2 mole fraction remains fixed at 280 ppm over the hindcast .In a companion anthropogenic CO 2 simulation, atmospheric CO 2 starts to evolve over time mid-way through the spin-up following ice-core and historical CO 2 observations from the 1700s forward to 1979; in that simulation, atmospheric CO 2 tracks observed global mean temporal trends over the hindcast .
The model ecosystem components converge to a repeat annual cycle within a few years of spin-up.There remains a slow drift in the subsurface nutrient and inorganic carbonate fields in the pre-industrial simulation.The global net air-sea CO 2 uptake flux is 0.150 PgC y − 1 (mean areal flux 0.025 mol C m − 2 y − 1 ), but the change in the drift over the 26 year integration  is only −0.010 PgC y (mean −0.002 mol C m − 2 y − 1 ) and much smaller than the simulated interannual variability.
In a companion anthropogenic CO 2 simulation, atmospheric CO 2 evolves over time during the latter part of the model spin-up following historical observations.

Evaluation data sets
Table 1 presents details on the specific field and remote sensing data sets used for model evaluation in this study.Information in the table includes each specific variable, temporal resolution of the underlying data set, data source, and reference(s).Most of the data sets are global in extent, which limits us primarily to data climatologies (annual mean and seasonal cycle) and satellite data products (annual mean, seasonal cycle, and subannual to interannual variability).This is not to argue that other data sets are not of value, a fact that we illustrate using an example 1-D water column timeseries.Our emphasis here is mostly on surface water properties.
The data sets are chosen to highlight key aspects of the coupled BEC simulation with regards to physics, chemistry and biology.Physical fields include sea surface temperature (SST), which is important for biological growth and respiration rates as well as air-sea gas exchange, and mixed layer depth (MLD), which influences nutrient entrainment and the average light field observed by the phytoplankton.Biogeochemical fields include surface water dissolved inorganic macronutrients (nitrate (NO 3 ), silicate (SiO 3 ), and phosphate (PO 4 )) and reflect a balance between physical nutrient supply and net biological nutrient drawdown.We also examine the simulated fields of dissolved gases oxygen (O 2 ) and the carbon dioxide partial pressure (pCO 2 ), as well as airsea O 2 and CO 2 fluxes, which reflect physical transport, solubility variations, net community production, and oceanatmosphere exchange.The global biological fields are derived from satellite ocean color data and include chlorophyll (Chl) and vertically integrated primary production (∫PP), which are measures of phytoplankton pigment standing stock and the photosynthesis that fuels the upper ocean food web.We also examine a pair of relatively new remote sensing products, phytoplankton specific growth rate (μ) and phytoplankton carbon concentration (P C ,), as described in more detail below.
The list of evaluation data sets is heavy on bottom-up physical-chemical forcing data, biogeochemical tracers, and phytoplankton responses, but light on many of the higher trophic level dynamics and loss processes that are also integral to the BEC solutions (Fig. 1).For example, we have not yet incorporated measures of zooplankton biomass and grazing rates because of a lack of comprehensive global data.While some macrozooplankton biomass climatologies exist, there is no similar treatment for microzooplankton that are essential to verifying the behavior of our single aggregated zooplankton compartment.For somewhat different reasons, we do not include an explicit measure of export production.While globally gridded export flux maps are available, their construction from satellite data (e.g., Laws et al., 2000) involves a considerable level of model or empirical assumptions; they are essentially derived products from derived products, in this case primary production.Unlike most satellite ocean color algorithms there is no direct link to radiances nor extensive in-situ validation, and it is unclear whether they serve as independent observational assessments or more of a model-model comparison (Najjar et al., 2007).Recent compilations of deep-sea sediment trap data offer another approach for a point by point assessment of simulated export production, but because only a small fraction of surface export reaches the deep ocean such analysis also folds in the skill of the model subsurface remineralization parameterization, which can have large uncertainties.
Some assessment variables require merging multiple observational data sets, which can add potential biases to model-data assessments if particular care is not taken.For example, air-sea CO 2 flux maps (e.g., Takahashi et al., 2002) are commonly created by joining air-sea ΔpCO 2 data with wind speed dependent gas transfer velocities.Observational flux estimates thus scale directly, if non-linearly, with wind speed, and the use of different wind speed products will result in different observational flux estimates even for the same underlying ΔpCO 2 data.A good argument can be made for adjusting the global mean transfer velocity to correct for differences in the global mean wind speed from different wind products (or the global mean wind speed squared in the case of a quadratic wind speed formulation), but there will still be seasonal and regional spatial flux differences introduced by the different wind products.
For models the issues are somewhat more complicated because wind speed products are used to force both physical circulation and biogeochemistry.The model air-sea CO 2 flux and ΔpCO 2 are dynamically coupled in the model solutions in that a change in flux will alter surface water DIC and thus ΔpCO 2 .The use of a different wind speed product between the observational estimate and as forcing for the model could therefore introduce model-data differences in both variables.Despite their interdependence, we include here model skill metrics for both air-sea CO 2 flux and ΔpCO 2 because they are commonly presented observational fields and reflect somewhat different weighting of different regions and seasons within the model.Note, however, that the uncertainties in more derived "observational" products, such as air-sea CO 2 flux, are larger than directly measured fields; in the case of air-sea CO 2 flux, this includes incorporating error in wind speed based gas transfer relationships (Wanninkhof, 1992).
Traditionally, surface chlorophyll concentration has been the primary satellite-derived ecosystem variable related to biological ocean carbon cycling.From chlorophyll concentration, a variety of models have been described for estimating water column net primary production (see reviews by Campbell et al., 2002;Carr et al., 2006).In most cases the estimated primary production is proportional to the satellite-derived chlorophyll multiplied by spatially and temporally varying factors that may depend upon estimates of surface light, nutrients, temperature, mixed layer, etc.Similar to the arguments for including both air-sea CO 2 flux and ΔpCO 2 , we also include both satellite chlorophyll and net primary production as evaluation data sets because of their common usage, their different weighting of the critical process of marine photosynthesis, and the considerable independent effort going into the validation of satellite productivity using 14 C-based field primary productivity data sets (Carr et al., 2006;Friedrichs et al., this issue).
The satellite primary productivity algorithms depend on empirical descriptions (generally temperature-dependent) of phytoplankton assimilation efficiencies that may be somewhat unreliable for detecting regional temporal variability, particularly in response to factors such as aeolian iron deposition.Recently, however, an alternative approach to analyzing satellite ocean color data has been developed that yields not only estimates of phytoplankton mixed layer pigment concentrations but also new and independent information on particulate scattering coefficients (Garver and Siegel, 1997;Maritorena et al., 2002;Siegel et al., 2002).
With this additional information, it is now possible to directly derive phytoplankton carbon biomass, chlorophyll-tocarbon ratios, and phytoplankton growth rates from space (Behrenfeld and Boss, 2003;Behrenfeld et al., 2005), and thus more reliably detect and distinguish physiological-and biomass-dependent responses to changing environmental conditions.Importantly, these three phytoplankton characteristics are directly comparable to ocean model variables.For the current study, we employ a spectrally-resolved version of the Carbon-based Production Model (CbPM) (Westberry et al., 2008) that yields improved descriptions of vertical variability in phytoplankton carbon, Chl:C and growth rates compared to the original CbPM (Behrenfeld et al., 2005).Due to the tight coupling between phytoplankton growth rates and zooplankton grazing, physical/chemical perturbations to mixed layer growth conditions can regularly occur without a detectable signature in phytoplankton biomass.Environmental changes are, however, invariably imprinted in physiological characteristics of the phytoplankton assemblage.

Model-data skill metrics
The skill metric suite includes model-data comparisons of fields of tracers (standing stocks) and biological flows or rates.Monthly averages χ are computed from the model output to match the common temporal resolution of observational climatologies.The observations are interpolated to the horizontal model grid for spatial fields and, where applicable, averaged to monthly resolution.For each observed χ O and model predicted χ P variable, we compute for each grid point a long-term mean 〈 χ 〉 , an annual mean χ P , and a mean annual cycle χ a (e.g., average January, average February, etc.) for the period of analysis 1979-2004.We define a series of anomalies (Doney et al., 2007): where χ′ are the annual mean anomalies, χ″ are the monthly anomalies, and χ⁎ are the monthly deseasonalized anomalies.Standard deviations (σ) are computed for each variable and for the various anomalies, as needed.
We apply a standard suite of univariate model-data skill metrics (e.g., Evans, 2003;Stow et al., this volume) including the model-data correlation coefficient r, the root mean square error ε rms , and the average error or bias ε bias .The metrics are applied to different temporal and spatial domains, depending upon the availability of observations and the question of interest.For example, metrics for a time-series at a single grid point for the full monthly data over the full analysis period of the hindcast simulations  would be: where the summation ∑ is over N = 312 months (26 years × 12 months).For variables where only a seasonal climatology is available, we define the corresponding metrics at the grid point scale using the mean annual cycle r(χ a ), ε rms (χ a ), and ε bias (χ a ) and n = 12.Where appropriate, similar statistics are computed on larger-spatial scales including zonal averages across ocean basins (e.g., Atlantic), global zonal averages, and global averages.Also, the simulated model fields are subsampled in time to match the data sampling when observations exist for only a subset of the model hindcast.
For some variables, that have large dynamic ranges, we may choose to analyze the log-transform of the data: The log-transform tends to give more equal weight to all of the data and not skew the statistics towards the largest data values.
The mean of the log-transformed variable 〈X〉, can be related to the geometric mean of the untransformed variable 〈 χ 〉 G : The geometric bias: gives a measure of the typical bias normalized by the value of the variable, (χ P −χ O )/ 〈 χ O 〉 ; ε bias G (χ) b 1 occurs when the mo- del tends on average to underestimate the observations and ε bias G (χ) N 1 when the model tends on average to overestimate.
The corresponding geometric root mean square error given by: reflects the size of the typical model-data error normalized relative to the typical data value.A perfect model with no error would give a ε rms G value of 1.0; a value of ε rms G of 2.0 would reflect a case where the typical error at any point is comparable in size to the observed value.Note that for geometric averaging, the errors are not symmetric about the mean in the geophysical space.Geometric averaging is used for several of the bio-optical data sets (chlorophyll, primary production, phytoplankton biomass) that have approximate log-normal distributions (Campbell, 1995).We use Taylor diagrams (Taylor, 2001) to display simultaneously information on model-data skill for a suite of variables from the ocean biological model (Lima and Doney, 2004).The Taylor diagram combines global r, ε rms , normalized by the observed standard deviation, and the ratio of the predicted to   the opposite.The distance from the origin to a point on the target diagram is the total normalized ε rms .

Example diagnostic plots to ocean chlorophyll
Given that we are comparing 3-D time-varying model results against observations for more than a dozen variables, the number of potential diagnostic plots quickly expands beyond the scope of a single journal paper.The solution used here is to illustrate a standard set of diagnostic plots for a specific example variable, in this case surface ocean chlor-ophyll, and to make available the entire set of figures on a CCSM BEC diagnostic webpage.
Figs. 2-5 display the results of a comparison of model predicted surface ocean chlorophyll (mg Chl m − 3 ) against results from the SeaWiFS ocean color sensor (Sept. 1997-Dec. 2004).The spatial map of the bias in the long-term mean ε bias (〈Chl〉) (Fig. 2, bottom panel) exhibits large-scale, coherent error patterns.The model surface chlorophyll tends to be too high in the subtropical oligotrophic gyres (ε bias N 0) and too low in the subpolar gyres (ε bias b 0).This error pattern may reflect problems with the single adaptive zooplankton pool; relative to primary production, grazing is too weak in picoplankton dominated subtropics and too strong in bloom environments.In particular, multiple zooplankton pools may allow for a seasonal disconnect in grazing, and thus a stronger bloom, in temperate and high latitudes.
Consistent with previous coarse-resolution global model results, the simulated chlorophyll levels are also underestimated in shallow coastal regimes.The reasons for this coastal bias are numerous but are likely dominated by physical errors: vertical upwelling due to off-shore coastal flow is poorly resolved at coarse scale and without careful treatment of the wind stress curl; the model lacks tidal mixing, an important mechanism for vertical mixing and nutrient supply on continental shelves; the mesoscale eddy parameterizations used in the global model are designed for adiabatic ocean interiors and are not adequate for the highly turbulent, and often topograqphically controlled lateral mixing on shelves.The long-term subtropical/subpolar bias reflects in part the fact that the model does a relatively poor job capturing the magnitude of the peak surface chlorophyll concentrations during summer in the temperate northern hemisphere, as illustrated in a plot of the model and observed zonal average of the seasonal anomalies (Chl a − 〈Chl〉) versus month (Fig. 3).The phasing of the model northern hemisphere spring bloom is approximately correct, but high chlorophyll levels are not sustained over the summer in the simulation.The phasing of the annual cycle in the southern hemisphere mid-latitudes (40-60°S) matches the observations well, but in this case overestimates the amplitude of the annual cycle, with chlorophyll values too high in the southern hemisphere summer and too low in the winter.
The hindcast exhibits substantial subannual to interannual variability ε rms (Chl⁎) beyond the variance introduced from the mean annual cycle ε rms (Chl a ).This is demonstrated in the middle panel of Fig. 4, a spatial map of the root-mean-squared variability in the deseasonalized anomalies in surface chlorophyll.The model hindcast simulation exhibits considerable interannual variability in the tropics, and temperate to subpolar oceans.The temporal correlations between the model and SeaWiFS chlorophyll data r(Chl⁎) (top panel, Fig. 4) are high in the western and central Equatorial Pacific and in the subtropics.However, the model-data correlations tend to drop in the midto high-latitude regions, where correlations are often not statistically different from zero.

Basin and global aggregated skill metrics
A corresponding set of model-data diagnostic plots can be constructed for all variables of interest, but we have also found it convenient to generate multi-variable synthesis plots and a table assessing skill for more aggregated basin zonal means and global averages/integrals.We include the zonal means because the spatial patterns of the model-data bias or residuals is often as or more interesting than the global average/integrated bias.
Fig. 5 illustrates this approach and compares the observed and simulated annual mean zonal averages for a range of variables.The bias, ε bias (χ), in the zonal means is simply the difference of the model and data curves.The CCSM BEC hindcast exhibits a number of large-scale biases, many of which are coherent across multiple variables.The simulation displays excess surface macronutrients in the tropical Pacific, likely the result of a combination of physical circulation errors and too much iron scavenging.Interestingly, however, the model and observed zonal average phytoplankton growth rates for the tropical and subtropical Pacific are similar and model productivity is actually higher that observed, suggesting that errors may also arise from other aspects of the biological cycling (e.g., export flux, subsurface remineralization).
As discussed above the model chlorophyll underestimated the data in mid-to high latitudes.Note that in the zonal average plots, the overall model bias in the northern hemisphere subtropics is negative.The low simulated chlorophyll in the coastal upwelling regions overwhelms the model positive bias in the subtropics.Another region of marked bias in the model simulations is the tropical Atlantic, where simulated phytoplankton biomass, specific growth rates and primary production are low relative to the observations.This region in the model is strongly P-limitation, which may reflect a combination of model errors in circulation, export and subtropical nitrogen fixation.
Global model skill is summarized in the Taylor (Fig. 6) and target (Fig. 7) diagrams and Table 2.The global mean bias ε bias (χ) for most variables is relatively small (Fig. 7, top panel).However, in agreement with Fig. 5, the spatial variation in the long-term mean 〈 χ 〉 is not captured well in many of the ecosystem variables (e.g., chlorophyll, primary production, Geometric statistics using log-transformed data Chl (mg m − 3 ) 0 . 1 2 6 ∫PP (g C m − 2 mon − 1 ) 0 . 1 6 1 P C (mg C m − 3 ) 0.033 phytoplankton growth rate), which exhibit correlation coefficients r(χ) b 0.4.The strongest correlations are for SST and nutrients.The model-data correlations for the seasonal anomalies (χ a − 〈 χ 〉 ) are for the most part between 0.3 and 0.7, and the model tends to overestimate the seasonal amplitude of some variables (e.g., chlorophyll) while underestimating that of others (e.g., phosphate and silicate) (Fig. 6 middle panel).On a global basis, the model exhibits little skill in capturing the interannual anomalies χ a except for SST (Table 3).

Comparison against local (Eulerian) time-series data sets
While powerful evaluation tools, global data sets also often have associated limitations.The compilation of disjoint field data sets from different time period and from different investigators and/or methodologies introduces errors and blurs important natural variability and climate change signals.Uncertainties arise due to spatial/temporal interpolation and extrapolation involved in creating complete global climatologies from sparse data.Because of limitations in sampling the subsurface ocean and dynamical rates from remote sensing and underway sampling, global data sets illuminate only a portion of most ecosystem models.Data-rich, local time-series sites provide a wealth of complementary information on depth profiles, the annual mean cycle, and interannual variability of physical and biogeochemical variables.These observations, although limited in spatial information, may be compared with global models to evaluate the credibility of the simulations.Time-series data have a rich history in marine ecosystem modeling as test-beds for model development and validation (e.g., Doney et al., 1996;Evans, 1999;Moore et al., 2002;Friedrichs et al., 2007).
An example time-series comparison of the CCSM BEC hindcast simulation is presented in Fig. 8 for the Hawaii Ocean Time-series (HOT) site (http://hahana.soest.hawaii.edu/hot/hot_jgofs.html).Monthly average vertical 1-D profiles are sub-sampled from the hindcast versus time at the nearest grid point to the HOT Aloha station.
At the surface, the simulated phasing of the seasonal cycle in chlorophyll (Fig. 8) is approximately correct with a minimum in the summer and a maximum in the winter, but the model tends to overestimate the mean chlorophyll levels as noted previously for the subtropical North Pacific.The seasonal surface chlorophyll cycle in the model mimics the seasonal cycle of mixed layer depth; during winter simulated mixed layer depth deepens to 60-70 m, resulting in the entrainment of nutrients that drives enhanced productivity.The strength of the simulated winter bloom is somewhat stronger than that observed in the data, however, even though the maximum winter observed mixed layer depth in the observations are somewhat deeper (and shifted later in the year).
The simulated magnitude of the deep chlorophyll maximum is slightly smaller and shallower (by about 20 m) than in the observations.The observations also exhibit non-zero chlorophyll levels well below 150 m in very low light conditions.The deeper deep chlorophyll structure in the observations results in a characteristic positive/negative dipole pattern in the model-data difference plot.The deep chlorophyll maximum and subsurface chlorophyll penetration depth are largely set by the initial slope of phytoplankton productivityirradiance curve.The model chlorophyll field can be shifted downward by increasing the initial productivity-irradiance slope (effectively reducing light limitation at low light) but at the expense of replicating surface productivity.The solution, explored in other models, involves incorporating distinct high light and low light phytoplankton populations in the subtropical gyre (Y.Spitz, per. comm.).
The coarse-resolution global model exhibits substantial interannual variability in both mixed layer depth and chlorophyll.Chlorophyll and productivity gradually drift downward from 1988 to 1998 and then begin to increase following a deep mixing event in the winter of 1989.The observed mixed layer depth experiences more variability than in the model, reflecting in part the aliasing of mesoscale eddies passing by the timeseries site (Doney, 1996).There is an indication of stronger winter mixing in the observations in 1998-1999 followed by higher near surface chlorophyll levels (but not for as an extended period as in the model. Similar patterns of model skill (and misfit) are found for aggregated skill metrics for a suite of variables displayed in Taylor diagram format (Fig. 9) and in tabular form (Tables 4  and 5).The model-data correlation values r are greater than 0.95 with a normalized standard deviation near 1.0 for the annual mean vertical profiles of temperature, oxygen, macronutrients, and primary production; the corresponding chlorophyll correlation is smaller (0.56) in large part because of the shallower simulated deep chlorophyll maximum.The modeldata agreement for the seasonal cycle (monthly-time Taylor diagram) is considerably weaker, with all of the model-data correlations less than 0.3 and substantial underestimates in the simulated strength of the seasonal cycle (σ P /σ O b 0.5) for primary production and macronutrients.
The model's interannual variability is much weaker than that in the data, and the hindcast shows essentially no skill on the observed interannual variability at the HOT site.In contrast to much of the subtropics where model skill is relatively high for interannual chlorophyll variability, the region around Hawaii is a region of little skill.A similar analysis for the Bermuda Atlantic Time-Series (BATS) (not shown) results in some what higher but still low model-data correlations, r (Chl) = 0.131 and r(PP) = 0.222.This highlights that care is need in making local data comparisons with a global model, the skill of which can vary significantly from region to region and which is confounded by regional biases in physics and unresolved high frequency variability in the observations.One strategy around these difficulties would be to combine the 3-D model assessment with a complementary assessment of 1-D vertical simulations for targeted time-series stations, so-called "regional test-beds".Surface forcing can be adjusted to give 1-D physical simulations that closely fit specific timeseries records (e.g., Doney, 1996), and 1-D ecosystem simulations are more amenable to parameter opimization, data assimilation, and cross-model intercomparison studies.

Discussion and future directions
A systematic and quantitative approach for assessing model-data skill is an essential tool in model development, evaluation, and data assimilation (Gregg et al., this volume), and emerging global-scale field and satellite data sets provide invaluable opportunities for testing upper ocean coupled ecosystem-biogeochemistry-physical models.The example suite of data sets and skill metrics presented here, while certainly still incomplete, illustrates several general points with regards to the CCSM-3 BEC model.First, overall the skill metrics highlight the fact that the model solutions have a number of deficiencies in replicating the observational data sets.Second, the degree of model skill differs sharply among variables.For example, simulated SST and some of the surface nutrient fields exhibit consistently higher skill across most of the metrics than the simulated ecological fields, chlorophyll, primary production and the new remote sensing products such as phytoplankton growth rate.Surface pCO 2 and CO 2 and O 2 air-sea fluxes are typically intermediate in skill between the physical and ecological variables.Third, regional spatial biases and seasonal cycle errors are often consistent across multiple variables, pointing towards a common dynamical problem within the coupled model.Fourth, model skill is time and space scale dependent; the model solutions of the seasonal cycle are more skillful than interannual variability.And fifth, it is challenging for a global model solution to replicate observations from local time-series because there are many subgridscale processes and representation issues that tend to confound the comparison.
The emphasis here has been primarily on assessing model skill in replicating ecological and biogeochemical metrics.But the upper ocean system is coupled, and the success of the biological and chemical simulations depends critically on a high-quality underlying physical circulation model (Doney et al., 2004(Doney et al., , 2007)).Physical model biases and errors thus should be an integral component of ecological and biogeochemical model-data assessment.Here we focused on two physical properties, SST and mixed layer depth.Other important facets of the circulation from a biological perspective include upwelling rates and near surface water column structure (Doney, 1996).As an example of how physical biases may influence ecosystem behavior, Fig. 10 displays the maximum winter mixed layer depth from the model and an observational estimate (Boyer-Montégut et al., 2004).The CCSM-3 BEC has consistently deeper maximum mixed depths in the tropics and subtropics and shallower maximum mixed depths in the northern hemisphere subpolar gyres and Southern Ocean.The model chlorophyll bias patterns (Fig. 2) may in part reflect these physical errors.In the subtropics, simulated winter chlorophyll is too high and summer levels are closer to observations.One possible explanation is that the nutrient supply from excessively deep mixing in winter induces higher simulated chlorophyll/C ratios.The too weak simulated mixing in the subpolar gyres and Southern Ocean, in contrast, limits the supply of nutrients to the surface (e.g., note the negative nutrient bias around the Antarctic) and may allow for a large overwintering zooplankton population, which can then limit the magnitude of the spring bloom.
As a pilot study, we have restricted our large-scale analysis here primarily to globally gridded synthesis data sets.But there are a number of key variables for which data coverage is still too poor to create global synthesis products but which one would want to consider in a more comprehensive assessments.Much of the same diagnostic machinery, however, can rather straightforwardly be used on a collection of stations.For example global maps and zonal mean plots can be generated to examine spatial biases and seasonal and interannual variability where data are plotted only for the grid points where there are observations.The statistics for the summary Taylor and target diagrams are independent of whether one is using gridded or point data.Examples of additional, nongridded data sets that we plan to include in the future are deep sediment traps (e.g., Gehlen et al., 2006), surface and subsurface iron concentrations (Moore and Braucher, 2008), and phytoplankton taxonomy derived from cell counts or pigments (e.g., Gregg and Casey, 2007; research/oceanbiology/index.php).Particularly for the Southern Ocean, one can also use atmospheric O 2 and CO 2 data sets to assess ocean model behavior, with of course the caveat that the comparison will also incorporate errors in atmospheric transport used to translate surface fluxes to atmospheric fields (e.g., Naegler et al., 2007;Nevison et al., 2008a,b).
Other directions to pursue include pattern analysis and multivariate model-data skill metrics that allow for an investigation of whether the model is correctly capturing the spatial or temporal relationships among the data (Stow et al., this volume).Small temporal and spatial phase shifts between the model simulations and observations can introduce large apparent biases and seasonal to interannual variability errors.Lagged correlation analysis can help identify phase errors.Empirical orthogonal functions can be used to assess the similarity between model and observed time-space variability patterns in the presence of spatial and temporal phase error.Multivariate analyses (e.g., seasonal property-property phase diagrams; factor analysis; binary-discriminatory receiver-operator methods) can be used to identify regions or times when model dynamics diverge from that seen in the observations.The multivariate analyses of model dynamics may be particular useful when the model-data skill assessments are applied to fully coupled ocean-atmosphere climate models (Doney et al., 2006;Schneider et al., 2008).Direct comparisons to observations are more difficult in this case because persistent physical biases in coupled models propagate into the ecological/biogeochemical mean state and seasonal cycle because the coupled ocean-atmosphere models generate their own internal climate variability and thus assessment of simulated interannual to decadal variability can only be done statistically, not directly.

Fig. 1 .
Fig. 1.Schematic of the CCSM Biogeochemical-Ecosystem-Circulation (BEC) model showing the major dissolved inorganic, biological, and detrital tracers and the flows among the tracers.Many of the biological and detrital tracers include multi-element sub-elements to separately track carbon, macronutrients, and iron.

Fig. 2 .
Fig. 2. Comparison of spatial distribution of observed and simulated annual mean, surface chlorophyll (mg Chl m − 3 ) with CCSM-3 BEC model results (top panel), SeaWiFS satellite ocean color data (middle panel), and model minus data residual (bottom panel).

Fig. 3 .
Fig. 3. Comparison of seasonal cycle of observed and simulated surface chlorophyll (mg Chl m − 3 ).Zonal averages of the seasonal anomalies to the zonal mean are displayed (latitude versus month) for CCSM-3 BEC model results (top panel), SeaWiFS satellite ocean color data (middle panel), and model minus data residual (bottom panel).
Both ε bias and ε rms unbiased (χ) are normalized by dividing by the standard deviation of the observations σ O .Additionally, for the target diagram the values of the normally positive definite ε rms unbiased (χ) are treated as positive if σ P N σ O and negative for

Fig. 4 .
Fig. 4. Comparison of the spatial distribution of interannual variability of observed and simulated surface chlorophyll anomalies from the mean seasonal cycle (mg Chl m − 3 ).Top panel displays the temporal correlation coefficient of CCSM-3 BEC model results against the SeaWiFS satellite ocean color data.The middle and bottom panels display the spatial map of the root mean square of the model and observed anomalies, respectively.

Fig. 5 .
Fig. 5. Comparison of observed and CCSM-3 BEC model values for suite of surfaces variables.Zonal averages are displayed (latitude versus month) for data (dashed lines) and model (solid line) for three basins Atlantic (blue), Indian (red), and Pacific (green).

Fig. 6 .
Fig. 6.Comparison of observed and CCSM-3 BEC model values for suite of global surfaces variables using Taylor diagrams.Taylor diagrams display in polar coordinates the model-data correlation coefficient (angle from x-axis) and model standard deviation normalized to observational standard deviation (radius).Diagrams are shown for annual mean spatial distributions (top panel), seasonal anomalies (middle panel), and interannual variability of anomalies from the mean seasonal cycle (bottom panel).

Fig. 7 .
Fig. 7. Comparison of observed and CCSM-3 BEC model values for suite of global surfaces variables using Target diagrams.Target diagrams simultaneously display information on normalized model-data biases and unbiased rms differences (see text for more details).Diagrams are shown for the seasonal climatology (top panel) and hindcast including seasonal dynamics and interannual variability (bottom panel).

Fig. 8 .
Fig. 8.Comparison of observed and simulated vertical profiles of chlorophyll (mg Chl m − 3 ) for a specific time-series station.Left column displays seasonal average climatology (depth versus month) and right column interannual variability for 1988-2004 (depth versus time) for CCSM-3 BEC model results (top panels), Hawaii Ocean Time-Series data (middle panels), and model-data differences (bottom panels).A dashed line gives model and observed mixed layer depth.

Fig. 9 .
Fig. 9. Comparison of observed and CCSM-3 BEC model values for suite of upper ocean variables (0-160 m) for the Hawaii Ocean Time-Series (HOT) site using Taylor diagrams.Taylor diagrams display in polar coordinates the model-data correlation coefficient (angle from x-axis) and model standard deviation normalized to observational standard deviation (radius).Diagrams are shown for annual mean climatological spatial distributions (i.e., vertical profile) (top panel), climatological seasonal anomalies (middle panel), and interannual variability of anomalies from the mean seasonal cycle (bottom panel).

Fig. 10 .
Fig. 10.Comparison of spatial distribution of observed and simulated maximum winter mixed layer depth (m) with CCSM-3 BEC model results (top panel), Boyer-Montégut et al. (2004) field data (middle panel), and model minus data residual (bottom panel).

Table 1
Description of the evaluation data sets used in this study

Table 2
Globally aggregated model-data skill metrics for the mean annual cycle in the CCSM BEC hindcast

Table 4
Aggregated model-data skill metrics for upper ocean variables (0-160 m) at the Hawaii Ocean Time-Series (HOT) site in the CCSM BEC hindcast

Table 5
Aggregated model-data skill metrics for the interannual variability of the monthly deseasonalized anomalies at the Hawaii Ocean Time-Sereis (HOT) site in the CCSM BEC hindcast