The distributed model intercomparison project - Phase 2: Experiment design and summary results of the western basin experiments

Results indicate that in the two study basins, no single model performed best in all cases. In addition, no distributed model was able to consistently outperform the lumped model benchmark. However, one or more distributed models were able to outperform the lumped model benchmark in many of the analyses. Several calibrated distributed models achieved higher correlation and lower bias than the calibrated lumped benchmark in the calibration, validation, and combined periods. Evaluating a number of specific precipitation-runoff events, one calibrated distributed model was able to perform at a level equal to or better than the calibrated lumped model benchmark in terms of event-averaged peak and runoff volume error. However, three distributed models were able to provide improved peak timing compared to the lumped benchmark. Taken together, calibrated distributed models provided specific improvements over the lumped benchmark in 24% of the model-basin pairs for peak flow, 12% of the model-basin pairs for event runoff volume, and 41% of the model-basin pairs for peak timing. Model calibration improved the performance statistics of nearly all models (lumped and distributed).
Analysis of several precipitation/runoff events indicates that distributed models may more accurately model the dynamics of the rain/snow line (and resulting hydrologic conditions) compared to the lumped benchmark model. Analysis of SWE simulations shows that better results were achieved at higher elevation observation sites. Although the performance of distributed models was mixed compared to the lumped benchmark, all calibrated models performed well compared to results in the DMIP 2 Oklahoma basins in terms of run period correlation and %Bias, and event-averaged peak and runoff error. This finding is noteworthy considering that these Sierra Nevada basins have complications such as orographically-enhanced precipitation, snow accumulation and melt, rain on snow events, and highly variable topography. Looking at these findings and those from the previous DMIP experiments, it is clear that at this point in their evolution, distributed models have the potential to provide valuable information on specific flood events that could complement lumped model simulations.


Summary
The Office of Hydrologic Development (OHD) of the U.S. National Oceanic and Atmospheric Administration's (NOAA) National Weather Service (NWS) conducted the two phases of the Distributed Model Intercomparison Project (DMIP) as cost-effective studies to guide the transition to spatially distributed hydrologic modeling for operational forecasting at NWS River Forecast Centers (RFCs). Phase 2 of the Distributed Model Intercomparison Project (DMIP 2) was formulated primarily as a mechanism to help guide the U.S. NWS as it expands its use of spatially distributed watershed models for operational river, flash flood, and water resources forecasting. The overall purpose of DMIP 2 was to test many distributed models forced by high quality operational data with a view towards meeting NWS operational forecasting needs. At the same time, DMIP 2 was formulated as an experiment that could be leveraged by the broader scientific community as a platform for the testing, evaluation, and improvement of distributed models.
DMIP 2 contained experiments in two regions: first, in the DMIP 1 Oklahoma basins, and second, in two basins in the Sierra Nevada Mountains in the western USA. This paper presents the overview and results of the DMIP 2 experiments conducted for the two Sierra Nevada basins. Simulations from five independent groups from France, Italy, Spain and the USA were analyzed. Experiments included comparison of lumped and distributed model streamflow simulations generated with uncalibrated and calibrated parameters, and simulations of snow water equivalent (SWE) at interior locations. As in other phases of DMIP, the participant simulations were evaluated against observed hourly streamflow and SWE data and compared with simulations provided by the NWS operational lumped model. A wide range of statistical measures are used to evaluate model performance on a run-period and event basis. Differences between uncalibrated and calibrated model simulations are assessed.

Overview
The Office of Hydrologic Development (OHD) of the U.S. National Oceanic and Atmospheric Administration's (NOAA) National Weather Service (NWS) led two phases of the Distributed Model Intercomparison Project (DMIP) as cost-effective studies to guide the transition into spatially distributed hydrologic modeling for operational forecasting (Smith et al., 2012a; Smith et al., 2004) at NWS River Forecast Centers (RFCs). DMIP 1 focused on distributed and lumped model intercomparisons in basins of the southern Great Plains (Smith et al., 2004). DMIP 2 contained tests in two geographic regions: continued experiments in the U.S. Southern Great Plains (Smith et al., 2012a,b) and tests in two mountainous basins in the Sierra Nevada Mountains, hereafter called DMIP 2 West. Since the conclusion of DMIP 1, the NWS has used a distributed model for basin outlet forecasts (e.g., Jones et al., 2009) as well as for generating gridded flash flood guidance over large geographic domains (Schmidt et al., 2007). The purpose of this paper is to present the DMIP 2 West experiments and results.
Advances in hydrologic modeling and forecasting are needed in complex regions (e.g., Hartman, 2010; Westrick et al., 2002). Experiments are needed in the western USA and other areas where the hydrology is dominated by complexities such as snow accumulation and melt, orographically-enhanced precipitation, steep and other complex terrain features, and sparse observational networks. The need for advanced models in mountainous regions is coupled with the requirements for more data in these areas. Advanced models cannot be implemented for operational forecasting without commensurate analyses of the data requirements in mountainous regimes.
A major component of the NWS river forecast operations is the national snow model (NSM) run by the NWS National Operational Hydrologic Remote Sensing Center (NOHRSC; Rutter et al., 2008; Carroll et al., 2001). For over a decade, NOHRSC has executed the NSM in real time at an hourly, 1 km scale over the contiguous US (CONUS) to produce a large number of gridded snow-related variables.

Science questions
DMIP 2 West was originally formulated to address several major science questions. They are framed for the interest of the broad scientific community with a corollary for the NOAA/NWS. These science questions and issues are highly intertwined but are listed separately here for clarity.

Distributed vs. lumped approaches in mountainous areas
Can distributed hydrologic models provide increased streamflow simulation accuracy compared to lumped models in mountainous areas? If so, under what conditions? Are improvements constrained by forcing data quality? This was one of the dominant questions in DMIP 1 and the DMIP 2 experiments in Oklahoma. Smith et al. (2012a,b) and Reed et al. (2004) showed improvements of deterministic distributed models compared to lumped models in non-snow, generally uncomplicated basins. The specific question for the NOAA/NWS mission is: under what circumstances should NOAA/NWS use distributed hydrologic models in addition to lumped models to provide hydrologic services in mountainous areas? While many distributed models have been developed for mountainous areas (e.g., Garen and Marks, 2005; Westrick et al., 2002; Wigmosta et al., 1994), there remains a gap in our understanding of how much model complexity is warranted given data constraints, heterogeneity of physical characteristics, and modeling goals (e.g., McDonnell et al., 2007). Several major snow model intercomparison efforts have been conducted in recent years, such as Phases 1 and 2 of the Snow Model Intercomparison Project (SnowMIP; Rutter et al., 2009; Etchevers et al., 2004) and the Project for Intercomparison of Land Surface Process models (PILPS; Slater et al., 2001). In addition, several comparisons of temperature index and energy budget snow models have been conducted (e.g., Debele et al., 2009; Franz et al., 2008a,b; Lei et al., 2007; Walter et al., 2005; Fierz et al., 2003; Essery et al., 1999; WMO, 1986a,b). Comprehensive studies such as the Cold Land Processes Experiment (CLPX; Liston et al., 2008) have also been performed. However, to the best of our knowledge, there have been few specific tests of lumped and distributed modeling approaches in mountainous basins with a focus on improving river simulation and forecasting. One such study was conducted by Braun et al. (1994), who found that finer spatial modeling scales did not lead to performance gains.

Estimation of model inputs and model sensitivity to existing data
What are the advantages and disadvantages associated with distributed vs. lumped modeling in hydrologically complex areas using existing NWS operational precipitation and temperature forcing data? Current NWS RFC lumped streamflow models in mountainous areas rely on networks of surface precipitation and temperature gages to derive mean areal averages of point precipitation and temperature observations as input forcings for model calibration and real time forecasting. The density of these networks varies greatly, with most networks featuring sparse coverage at high elevations. Even for lumped hydrologic modeling, there are uncertainties in the precipitation and temperature observations used by the NWS RFCs in mountainous areas (Hartman, 2010;Carpenter and Georgakakos, 2001). Beyond network density issues, there are problems with observation times, missing data, distribution of multi-day precipitation accumulations, and other difficulties. It is not known if these data uncertainties preclude the application of distributed models, giving rise to the question: can the existing observational networks support operational distributed modeling? Nonetheless, some have attempted to apply distributed models using such existing data (e.g., Shamir and Georgakakos, 2006). The intent in DMIP 2 West was to set up and run the models using Quantitative Precipitation Estimates (QPE) derived from the relatively dense existing gage network. Follow-on experiments would use QPE fields derived from reduced networks to investigate the appropriate density for modeling.

Model complexity and corresponding data requirements
The NOAA/NWS corollary is: what can be improved over the current lumped model approach used in the NWS for operational river forecasting? Is there a dominant constraint that limits the performance of hydrologic simulation and forecasting in mountainous areas? If so, is the major constraint the quality and/or amount of forcing data, or is the constraint related to a knowledge gap in our understanding of the hydrologic processes in these areas? In other words, given the current level of new and emerging data sets available to drive advanced distributed models, can improvements be realized? Or, do we still not have data of sufficient quality in mountainous areas? Additionally, what data requirements can be specified for the NOAA/NWS to realize simulation and forecasting improvements in mountainous areas?
There is a considerable range in the recent literature on the subjects of model complexity and corresponding data requirements for hydrologic modeling in mountainous areas. We provide a sample here to indicate the range of issues and findings.
For hydrologic models driven solely by precipitation and temperature, there are the issues of the gage density and location required to achieve a desired simulation accuracy (e.g., Guan et al., 2010; Tsintikidis et al., 2002; Reynolds and Dennis, 1986). The gage density issue also affects merged-data precipitation estimates (e.g., satellite-radar-gage) because the gage information is very often used to adjust the other (radar or satellite) observations (e.g., Nelson et al., 2010; Boushaki et al., 2009; Guirguis and Avissar, 2008; Young et al., 2000).
Addressing the data limitations in mountainous areas noted by Garen and Marks (2005) and others (at least in terms of coverage), the number of radar, model-based, and satellite-derived products is rapidly growing. Efforts are ongoing to improve the ability of weather radars to observe precipitation in mountainous areas (e.g., Kabeche et al., 2010; Gourley et al., 2009; Westrick et al., 1999). Model-based data include the North American Regional Reanalysis (NARR; Mesinger et al., 2006), Rapid Update Cycle (RUC; Benjamin et al., 2004), and Real Time Mesoscale Analysis (RTMA; De Pondeca et al., 2011). Much work has gone into satellite estimates of precipitation in remote regions (e.g., Kuligowski et al., 2013; Behrangi et al., 2009; Kuligowski, 2002). As these data sets emerge and become more common, users are cautioned to avoid the expectation that increased data resolution in new data sets will translate into increased data realism and accuracy (Guentchev et al., 2010; Daly, 2006).

Rain-snow partitioning
Can improvements to rain-snow partitioning be made? Partitioning between rainfall and snowfall plays a major role in determining both the timing and amount of runoff generation in high altitude basins (Guan et al., 2010; White et al., 2010, 2002; Gourley et al., 2009; Lundquist et al., 2008; Kienzle, 2008; McCabe et al., 2007; Maurer and Mass, 2006; Westrick and Mass, 2001; Kim et al., 1998). The question for the NOAA/NWS is: can distributed models provide improved representation of the spatial variability of rain/snow divisions?
Traditionally, surface temperature observations have been used to determine the form of precipitation, although such data are not the most reliable indicators of surface precipitation type (Minder and Kingsmill, 2013; Minder et al., 2011; Maurer and Mass, 2006). Recently, as part of the western Hydrometeorologic Testbed (HMT-West; Zamora et al., 2011; Ralph et al., 2005; hmt.noaa.gov), instrumentation such as vertically pointing wind profilers and S-Band radars has been used to detect freezing levels by locating the bright-band height (BBH; Minder and Kingsmill, 2013; White et al., 2010, 2002).

Scale issues
What are the dominant hydrologic scales (if any) in mountainous area hydrology? Understanding the variations of snowpacks and the timing and volume of snowmelt that generate streamflow has grown in recent periods but is complicated by difficult scale issues (e.g., Simpson et al., 2004). Blöschl (1999) describes three scales related to snow: process, measurement (observational data), and modeling scale. Process scale is the variability of a snow related variable. Measurement scale covers spacing, extent, and 'support' or area of integration related to an instrument. Modeling scale describes the spatial unit to which the model equations are applied (e.g., grid cell size in a distributed model). Several studies have investigated the impacts of modeling scale (e.g., Merz et al., 2009;Leydecker et al., 2001;Cline et al., 1998). However, to the best of our knowledge, there is scant literature on modeling scales that jointly considers snow and runoff processes. One exception is the work of Dornes et al. (2008), who found a spatially distributed approach provided better late season ablation rates and runoff hydrographs than a spatially aggregated model.
For forecasting agencies like NOAA/NWS, the scale question can be restated as: is there an appropriate operational modeling scale in mountainous areas that captures the essential rain/snow/runoff processes and provides adequate information for forecasting, water resources management, and decision support? For example, can the 4 km grid scale used in the non-mountainous DMIP 1 and 2 test basins be used instead of the current elevation zones for operational forecasting? Or is this 4 km scale too coarse to capture the large terrain variations and resultant hydrometeorological impacts on modeling? Shamir and Georgakakos (2006) used a 1 km grid modeling scale in the American basin, but concluded that significant improvement in simulation quality would result by better representations of the spatial variability of precipitation and temperature, especially at the lower elevations of the snowpack. Some have commented on the difficulty, or even impossibility, of finding an optimum element size that effectively comprises measurement, process, and modeling scales (Dornes et al., 2008;Blöschl, 1999). DMIP 2 West intended to examine simulation performance vs. modeling scale to infer appropriate model spatial resolution.

Internal consistency of distributed models
Another question posed in DMIP 2 West: 'Can distributed models reproduce processes at interior locations (points upstream of basin outlet gages) in mountainous areas?' Inherent in this question is the ability of distributed models to simulate (and therefore hopefully forecast) hydrologic variables such as SWE, soil moisture, and streamflow at points other than those for which observed streamflow data exist. Successful simulation of such variables at interior points supports the idea that the models achieve the right answer (i.e., basin outlet streamflow simulations) for the right reason, i.e., because they are correctly modeling processes in the basin interior (Kirchner, 2006).
The remainder of this paper is organized as follows. Section 2 presents the methodology of the DMIP 2 West experiments, including data derivation and an overview of the modeling instructions. The results and discussion are presented in Section 3. We present our conclusions in Section 4, while recommendations for future work are offered in Section 5.

Participating institutions, models, and submissions
Five groups submitted simulations for analysis. As with other phases of DMIP, the level of participation varied. Some participants submitted all requested simulations, while others submitted only a subset. Table 1 lists the simulations submitted by each DMIP 2 West participant.
The models used in DMIP 2 West feature a range of model structures and approaches to hydrological modeling. Appendix A presents information on the models used in DMIP 2 West, as well as references that describe the models in more detail. The greatest differences amongst the models seem to be in the precipitation-runoff approaches. The snow models all used relationships based on temperature rather than the full energy budget equations. Threshold temperatures are used to partition rain and snow. It should be kept in mind that the results herein reflect the appropriateness of model structure and other factors such as user expertise, parameter estimation, and calibration. It was not the intent of DMIP 2 West to diagnose simulation improvements from specific model structures but rather to examine the performance of the models as applied by the participants.
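As an illustration of the temperature-based approach described above, the following minimal sketch partitions precipitation with a single threshold temperature and melts snow with a temperature index. It is not any participant's snow model: the function names, the 1.0 °C threshold, the melt factor, and the base temperature are hypothetical values chosen only for the example.

```python
def partition_precip(precip_mm, temp_c, threshold_c=1.0):
    """Split precipitation into (rain, snow) using one threshold temperature.

    threshold_c is a hypothetical value, not one taken from a DMIP 2 West model.
    """
    if temp_c > threshold_c:
        return precip_mm, 0.0
    return 0.0, precip_mm


def temperature_index_melt(temp_c, swe_mm, melt_factor=0.1, base_temp_c=0.0):
    """Hourly melt (mm) proportional to degrees above base_temp_c, capped by SWE."""
    potential = max(0.0, melt_factor * (temp_c - base_temp_c))
    return min(potential, swe_mm)


def step_snowpack(swe_mm, precip_mm, temp_c):
    """Advance the snowpack by one hour; return (new_swe_mm, rain_plus_melt_mm)."""
    rain, snow = partition_precip(precip_mm, temp_c)
    swe_mm += snow
    melt = temperature_index_melt(temp_c, swe_mm)
    return swe_mm - melt, rain + melt
```

A full energy budget model would replace `temperature_index_melt` with a balance of radiative, turbulent, and ground heat fluxes; the sketch shows why the temperature-based form needs only the precipitation and temperature forcings.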
Focusing on precipitation-runoff models, the University of Bologna (UOB) used the TOPKAPI model (Coccia et al., 2009), which is based on the idea of combining kinematic routing with a topographic representation of the basin. Three non-linear reservoirs are used to generate subsurface, overland, and channel flow. TOPKAPI includes components that represent infiltration, percolation, evapotranspiration, and snowmelt. The NWS Office of Hydrologic Development (OHD) used the HL-RDHM model 2004). HL-RDHM uses the Sacramento Soil Moisture Accounting model (SAC-SMA; Burnash, 1995) applied to grid cells. Kinematic wave equations are used to route runoff over hillslopes and through the channel system. The University of California at Irvine (UCI) also used the SAC-SMA model but applied it to sub-basins. Kinematic wave routing was used for channel routing of the runoff volumes. The Technical University of Valencia (UPV) used the TETIS model (Vélez et al., 2009; Francés et al., 2007). TETIS is a 6-layer conceptual model linked to a kinematic channel routing module. The GR4J model (Perrin et al., 2003) was used by CEMAGREF (CEM). GR4J is a parsimonious 4-parameter lumped model. For DMIP 2 West, it was applied to 5 elevation zones.

Benchmarks and performance evaluation
Two benchmarks (e.g., Seibert, 2001; Perrin et al., 2006) were used to assess model performance. Observed hourly streamflow data from the U.S. Geological Survey (USGS) were used as 'truth.' Simulations from the NWS operational precipitation/runoff model (hereafter referred to as the lumped (LMP) benchmark) were used as the second benchmark. This model was selected to address the science question regarding the improvement of distributed models compared to lumped models. In addition, the LMP model was chosen for consistency with the DMIP 1 and DMIP 2 Oklahoma experiments (Smith et al., 2012b; Reed et al., 2004).
The LMP model actually consists of several NWS components linked together. The NWS Snow-17 model (Anderson, 2006, 1976) is used to model snow accumulation and melt. Rain and melt water from the Snow-17 model are input to the SAC-SMA model. Runoff volumes are transformed into discharge using unit hydrographs. This model combination is typically applied over two elevation zones, above and below the typical rain/snow elevation. Flow from the upper elevation zone is routed through the lower zone with a lag/k method. Unit hydrographs are used to convert runoff to discharge in the upper and lower zones. Mean areal precipitation (MAP) and temperature (MAT) time series for the elevation zones were defined from the gridded values on the DMIP 2 ftp site. More information on elevation zone modeling can be found in Anderson (2002) and Smith et al. (2003). In the North Fork American River basin, an elevation of 1524 m was used to divide the upper and lower elevation zones for the LMP model. The upper basin comprises 37% of the total basin area, while the lower basin comprises 63%. In the East Fork of the Carson River we used an elevation value of 2134 m to separate the upper and lower zones. These elevation values agree with the configuration of the NWS California-Nevada River Forecast Center (CNRFC) operational models.
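Deriving zone-average forcings from gridded values can be sketched as below. This is only an illustration of the idea, assuming equal-area grid cells with one elevation per cell; the function name and the sample numbers are hypothetical and do not reproduce the NWS procedure.

```python
def zone_means(values, elevations_m, split_elev_m):
    """Average gridded values separately above and below a dividing elevation.

    values and elevations_m are equal-length sequences, one entry per grid cell.
    Cells are assumed to have equal area (an assumption of this sketch).
    Returns (upper_mean, lower_mean, upper_area_fraction).
    """
    upper = [v for v, z in zip(values, elevations_m) if z >= split_elev_m]
    lower = [v for v, z in zip(values, elevations_m) if z < split_elev_m]
    if not upper or not lower:
        raise ValueError("split elevation leaves one zone empty")
    upper_fraction = len(upper) / len(values)
    return sum(upper) / len(upper), sum(lower) / len(lower), upper_fraction


# Hypothetical hourly precipitation (mm) on four cells, split at 1524 m.
precip_mm = [4.0, 6.0, 10.0, 12.0]
cell_elev_m = [600.0, 1200.0, 1600.0, 2100.0]
upper_map, lower_map, upper_frac = zone_means(precip_mm, cell_elev_m, 1524.0)
```

Applying the same function to gridded temperature would give the zone MAT values.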
The LMP model also included the ability to compute the elevation of the rain/snow line using the MAT data and typical lapse rates. This rain/snow elevation is then used in conjunction with the area-elevation curve for the basin in Snow-17 to determine how much of the basin receives rain vs. snow (Anderson, 2006).
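The rain/snow line computation can be illustrated with a short sketch. It assumes that the MAT is valid at the zone mean elevation and that temperature decreases linearly with height; the lapse rate, threshold temperature, and area-elevation points below are hypothetical and are not Snow-17 parameter values.

```python
def rain_snow_elevation(mat_c, ref_elev_m, lapse_c_per_m=0.006, threshold_c=1.0):
    """Elevation at which air temperature drops to the rain/snow threshold.

    mat_c is assumed to apply at ref_elev_m; lapse_c_per_m and threshold_c
    are illustrative values only.
    """
    return ref_elev_m + (mat_c - threshold_c) / lapse_c_per_m


def fraction_below(elev_m, area_elev_curve):
    """Fraction of basin area below elev_m, by linear interpolation.

    area_elev_curve is a list of (elevation_m, cumulative_area_fraction) pairs
    with strictly increasing elevations, running from fraction 0.0 to 1.0.
    """
    if elev_m <= area_elev_curve[0][0]:
        return 0.0
    if elev_m >= area_elev_curve[-1][0]:
        return 1.0
    for (z0, f0), (z1, f1) in zip(area_elev_curve, area_elev_curve[1:]):
        if z0 <= elev_m <= z1:
            return f0 + (f1 - f0) * (elev_m - z0) / (z1 - z0)


# Hypothetical area-elevation curve and a 4 degC MAT referenced to 2000 m.
curve = [(1000.0, 0.0), (2000.0, 0.5), (3000.0, 1.0)]
line_m = rain_snow_elevation(4.0, 2000.0)
rain_fraction = fraction_below(line_m, curve)
```

The area below the computed line receives rain and the remainder receives snow, so a warmer MAT raises the line and enlarges the rain-covered fraction.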
A major consideration in any model evaluation or intercomparison is the question: what constitutes a "good" simulation or an acceptable level of simulation accuracy (Seibert, 2001)? This is the subject of much discussion (e.g., Bennett et al., 2013; Ritter and Muñoz-Carpena, 2013; Pushpalatha et al., 2012; Ewen, 2011; Confalonieri et al., 2010; Andréassian et al., 2009; Clarke, 2008; Gupta et al., 2008; Moriasi et al., 2007; Schaefli and Gupta, 2007; Shamir and Georgakakos, 2006; Krause et al., 2005; Seibert, 2001). These references and others indicate that there is not yet an agreed-upon set of goodness-of-fit indicators for hydrologic model evaluation. Moreover, it has been difficult to specify ranges of values of the goodness-of-fit indicators that determine whether a model simulation is acceptable, good, or very good, although suggested ranges have recently emerged (Ritter and Muñoz-Carpena, 2013; Moriasi et al., 2007; Smith et al., 2003). One potential cause of these difficulties is that one must consider the quality of the input data when judging simulation results. What is a "poor" simulation for a basin with excellent input data may be considered good for a basin having poor quality input data (Ritter and Muñoz-Carpena, 2013; Moriasi et al., 2007; Seibert, 2001). As a result, the interpretation of goodness-of-fit indices continues to be a subjective process.
With this in mind, and consistent with DMIP 1 and 2 (Smith et al., 2012b), we use a number of performance criteria computed over different time periods to evaluate the simulations compared to the benchmarks. These include measures of hydrograph shape (modified correlation coefficient r_mod; McCuen and Snyder, 1975) and volume (%Bias), water balance partitioning, cumulative runoff error, and specific indices to measure the improvement compared to the LMP benchmark. Appendix D presents the statistical measures used herein and target ranges of these measures are given in the discussion below. In addition, we relate our results to those achieved in the relatively simple non-snow DMIP Oklahoma basins (Smith et al., 2012b; Reed et al., 2004).
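For readers without Appendix D at hand, the two headline measures can be sketched as follows. The formulas are assumptions based on their common usage: %Bias as the summed simulation error relative to the summed observations, and the modified correlation coefficient as Pearson's r multiplied by the ratio of the smaller to the larger standard deviation of the two series. The function names are hypothetical.

```python
import math


def percent_bias(sim, obs):
    """Percent bias: 100 * sum(sim - obs) / sum(obs)."""
    return 100.0 * sum(s - o for s, o in zip(sim, obs)) / sum(obs)


def modified_correlation(sim, obs):
    """Pearson r scaled by min(sd_sim, sd_obs) / max(sd_sim, sd_obs)."""
    n = len(obs)
    mean_s, mean_o = sum(sim) / n, sum(obs) / n
    ss = sum((s - mean_s) ** 2 for s in sim)
    oo = sum((o - mean_o) ** 2 for o in obs)
    so = sum((s - mean_s) * (o - mean_o) for s, o in zip(sim, obs))
    if ss == 0.0 or oo == 0.0:
        raise ValueError("constant series: correlation is undefined")
    r = so / math.sqrt(ss * oo)
    sd_s, sd_o = math.sqrt(ss / n), math.sqrt(oo / n)
    return r * min(sd_s, sd_o) / max(sd_s, sd_o)


# Hypothetical hourly flows: the simulation doubles every observed value.
obs_flow = [1.0, 2.0, 3.0, 4.0]
sim_flow = [2.0, 4.0, 6.0, 8.0]
bias = percent_bias(sim_flow, obs_flow)
r_mod = modified_correlation(sim_flow, obs_flow)
```

In this example the ordinary correlation is 1 even though the simulated hydrograph is twice as large as the observed one; the scaling by the standard deviation ratio is what lets r_mod penalize the mismatch in hydrograph size.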

Definitions
For consistency with the results of the DMIP 1 and 2 experiments in Oklahoma, we adopt the definition of Reed et al. (2004) that a distributed model is one that (1) explicitly accounts for spatial variability of meteorological forcings and basin physical characteristics and (2) has the ability to produce simulations at interior points without explicit calibration at those points. Interested readers are referred to Kampf and Burges (2007) for a detailed discussion of definitions and classifications regarding distributed hydrologic models.
A parent basin is defined as a watershed for which explicit calibration can be performed using basin outlet observed streamflow data. In our experiments, these parent basins represent the typical watershed sizes for which forecasts are generated by the NWS RFCs. Interior points are locations within the parent basins where simulations are generated without explicit calibration (hereafter also referred to as 'blind' simulations).
Statistics are computed over two types of time intervals. The term "overall" refers to multi-year run periods such as the calibration, validation, and combined calibration/validation periods. Event statistics are computed for specific precipitation/runoff events.

Description
Two sub-basins in the American and Carson River watersheds located near the border of California (CA) and Nevada (NV), USA, were selected as test basins (Fig. 1). Although these basins are geographically close, their hydrologic regimes are quite different due to their mean elevation and location on either side of the Sierra Nevada divide (Simpson et al., 2004). The Carson River basin is a high-altitude basin with a snow dominated regime, while the American River drains an area that is lower in elevation with precipitation falling as rain and mixed snow and rain (Jeton et al., 1996). These two basins were selected to represent the general hydrologic regimes of western mountainous areas with the hope that our modeling results would be relevant to other mountainous areas. Table 2 presents a summary of the characteristics of the American and Carson River basins.
In the American River basin, we selected the North Fork sub-basin above the USGS gage 11427000 shown in Fig. 2. This gage is at the North Fork dam forming Lake Clementine. Hereafter, this basin is referred to as NFDC1, using the NWS CNRFC basin acronym. This basin is 886 km² in area and rests on the western, windward side of the Sierra Nevada crest. Precipitation is dominated by orographic effects, with mean annual precipitation varying from 813 mm at Auburn (elev. 393 m above msl) to 1651 mm at Blue Canyon (elev. 1676 m above msl) (Jeton et al., 1996). Precipitation occurs as a mixture of rain events and rain-snow events. The basin mean annual precipitation is 1532 mm and the annual runoff is 851 mm (Lettenmaier and Gan, 1990). Streamflow is about two-thirds wintertime rainfall and snowmelt runoff and less than one-third springtime snowmelt runoff. The basin is highly forested and varies from pine-oak woodlands, to shrub rangeland, to ponderosa pine, and finally to sub-alpine forest as one moves up in elevation. Much of the forested area is secondary-growth due to the extensive timber harvesting conducted to support the mining industry in the late 1800s (Jeton et al., 1996). Soils in the basin are predominantly clay loams and coarse sandy loams. The geology of the basin includes metasedimentary rocks and granodiorite (Jeton et al., 1996).
In the Carson River basin, the East Fork sub-basin shown in Fig. 2 was selected for DMIP 2 West. Hereafter, the CNRFC identifier GRDN2 is used for the basin above the gage at Gardnerville, NV, and CMEC1 is used for the interior basin above the stream gage at Markleeville, CA. The Carson River terminates in the Carson Sink. The East Fork Carson River generally flows from south to north so that its average slope is not as steep as it could be if it were to face directly east-west. GRDN2 is a high altitude basin, with a drainage area of 714 km² above USGS stream gage 10-308200 near Markleeville, CA and 922 km² above USGS stream gage 10-309000 at Gardnerville, NV. Elevations in the GRDN2 basin range from 1650 m near Markleeville to about 3400 m at the basin divide. Mean annual precipitation varies from 559 mm at Woodfords (elev. 1722 m) to 1244 mm near Twin Lakes (elev. 2438 m). Fig. 3 shows the rugged, heavily forested terrain in both basins.

Calibration
Participants were free to calibrate their models using strategies and statistical measures of their choice as this process is usually model-dependent. This provision was also an aspect of the Oklahoma experiments in DMIP 1 and 2 (Smith et al., 2012a,b; Reed et al., 2004) and is similar to the Model Parameter Estimation Experiment (MOPEX; Duan et al., 2006) and SnowMIP-2 (Rutter et al., 2009). Appendix B presents a brief description of the strategies followed by the participants to calibrate their models. Three models (OHD, UOB, and UPV) used spatially variable a priori parameters and adjusted the parameter grids uniformly using scalar factors. CEM and UCI maintained the spatially constant parameters in each computational area.
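Uniform adjustment of an a priori parameter grid by a scalar factor can be sketched as follows. The function name, the optional bounds, and the sample grid are hypothetical; they illustrate the general idea rather than the procedure of any particular participant.

```python
def scale_parameter_grid(grid, factor, lower=None, upper=None):
    """Multiply every cell of an a priori parameter grid by one scalar factor.

    Optional lower/upper bounds clip the result to a plausible range.
    Wherever no clipping occurs, the relative spatial pattern is preserved.
    """
    scaled = []
    for row in grid:
        new_row = []
        for value in row:
            v = value * factor
            if lower is not None:
                v = max(lower, v)
            if upper is not None:
                v = min(upper, v)
            new_row.append(v)
        scaled.append(new_row)
    return scaled


# Hypothetical 2 x 2 grid of a storage capacity parameter (mm).
a_priori = [[10.0, 20.0], [30.0, 40.0]]
calibrated = scale_parameter_grid(a_priori, 1.5)
bounded = scale_parameter_grid(a_priori, 1.5, upper=50.0)
```

Calibrating one factor per parameter keeps the number of free values small while retaining the spatial pattern supplied by the a priori estimates.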

Run periods
Specific periods were prescribed for model calibration and validation. An initial one-year 'warm-up' or 'spin-up' period was provided to allow models to equilibrate after a complete annual wetting/drying cycle. Table 3 presents the computational periods. The warm-up and calibration periods for the East Fork of the Carson and North Fork of the American were slightly different due to the availability of the observed hourly USGS streamflow data.

Simulations and modeling instructions
Participants followed specific modeling instructions to generate simulations of streamflow and SWE in order to address the science questions (see http://www.nws.noaa.gov/oh/hrl/dmip/2/docs/sn_modeling_instructions.pdf and Table 1). Modeling Instruction 1 was for NFDC1. Participants generated hourly simulated streamflow at the basin outlet gage using calibrated and uncalibrated model parameters. There were no interior "blind" streamflow gages in NFDC1. During the same run to generate the outlet streamflow hydrographs, participants also generated simulations of snow water equivalent at two locations where snow pillows are operated by the U.S. Bureau of Reclamation (USBR): Blue Canyon and Huysink. Modeling Instruction 2 focused on the GRDN2 basin. Participants generated uncalibrated and calibrated streamflow simulations at the GRDN2 outlet. During the same run to generate GRDN2 outlet simulations, participants generated streamflow simulations at the interior USGS gage at Markleeville, CA. The gage at Markleeville was considered a blind simulation point for this test with no explicit calibration. Hereafter, we refer to this test as CMEC1-2. In addition, the basin above the USGS gage at Markleeville, CA was also treated as an independent basin (Modeling Instruction 3). This test is referred to as CMEC1-3. As part of Modeling Instruction 3, participants also generated SWE simulations for four U.S. Natural Resources Conservation Service (NRCS) snowpack telemetry (SNOTEL) sites (Serreze et al., 1999). Explicit instructions for modeling scale were not provided. Rather, it was hoped that the participants' models would inherently represent a sufficiently broad range of modeling scales from which to make inferences on appropriate model scale.

Data assimilation
Assimilation of observed streamflow or other data to adjust model states was not allowed in DMIP 2 West. The simulations were generated by running models continuously over the warm-up, calibration, and validation periods.

Data
Participants were required to use the hourly precipitation and temperature grids posted on the DMIP 2 West web site. These gridded data were derived from the same types of in situ measurements used by RFCs to construct MAP and MAT time series for model calibration and operational forecasting. As in the other phases of DMIP, basic forms of many other physical basin features and meteorological variables were provided to promote participation.
2.9.1. Precipitation

2.9.1.1. Precipitation data sources
DMIP 2 West used precipitation data collected by the NWS Cooperative Observer Network (COOP; NRC, 1998) available from the National Climatic Data Center (NCDC). Also used were daily precipitation observations from the SNOTEL network. Problems with the COOP and SNOTEL data are well known and include difficulties in distributing multi-day accumulations (Eischeid et al., 2000), data-entry, receiving, and reformatting errors (Reek et al., 1992), observer errors (e.g., NRC, 1998), and dealing with varying daily station observation times (e.g., Hay et al., 1998).
Hourly and daily stations from the COOP and SNOTEL networks were selected inside and near the basins. After initial screening for period-of-record (at least 5-10 years) and windward-leeward effects, 41 stations for NFDC1 and 69 for GRDN2 were selected for further analysis. The locations of the precipitation and temperature stations are shown in Fig. 2.

2.9.1.2. Generation of gridded QPE
Considerable effort was expended to generate a multi-year, hourly, 4 km gridded QPE data set (Smith et al., 2010). One goal in the development of the QPE data was to generate spatially varying gridded precipitation data based on the same types of in situ point measurements currently used by NWS RFCs for lumped model calibration and real time operational forecasting. In this way we could address the science question: can distributed models be operationally implemented with currently available precipitation and temperature data?
An initial QPE data set for 1987-2002 was used to launch the DMIP 2 West experiments, but was found to contain a large inconsistency when extended from 2002 to 2006 (Mizukami and Smith, 2012; Smith et al., 2009). Developing an alternative procedure and QPE data set delayed DMIP 2 West experiments by nearly two years. Finally, an approach was developed using modified NWS procedures as shown in Fig. 4. The method consists of three major steps: (1) data quality control and generation of hourly point precipitation time series, (2) spatial interpolation of the point time series to a 4 km grid, and (3) water balance analyses. Details of these steps can be found on the DMIP 2 West web site: http://www.nws.noaa.gov/oh/hrl/dmip/2/wb_precip.html.
Step 1: Data quality control and generation of point hourly data time series.
The goal of this step was to generate quality-controlled, serially complete, hourly precipitation time series at all the gage locations, including daily stations. The term serially complete means that precipitation was estimated for each station over the entire analysis period beginning in 1987 (Eischeid et al., 2000). The hourly and daily data were quality controlled (QC) using standard NWS procedures described in Smith et al. (2003) and Anderson (2002). Double mass analysis was used to identify and correct for human influences. Missing precipitation data were estimated using weighted observations from the closest station in each of four quadrants. To account for orographic influences, estimates of missing data were conditioned by the ratio of the long-term monthly mean of the estimator station to that of the current station.
Multi-day gage accumulations were distributed over the preceding flagged days using daily amounts from surrounding stations. Daily precipitation accumulations were subsequently distributed to hourly values using the distributions of hourly stations surrounding each daily station. A great deal of effort was expended to correct problems associated with distributing daily observations (Smith et al., 2010). In particular, two types of problems were addressed: (1) distributing multi-day accumulations to each individual day (e.g., Eischeid et al., 2000) and (2) distributing daily totals into hourly values. The precipitation records from 1988 to 2006 were examined month by month to identify and correct such errors.
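The daily-to-hourly distribution described above can be illustrated with a minimal sketch. It assumes that the hourly pattern is the simple average of the normalized patterns at the surrounding hourly gages and that a uniform pattern is used when no neighbor recorded precipitation; both choices, and the function names, are illustrative assumptions rather than the documented NWS procedure. Four time steps stand in for 24 hours to keep the example short.

```python
def average_hourly_pattern(neighbor_series):
    """Average the normalized temporal patterns of the wet neighboring gages."""
    n_steps = len(neighbor_series[0])
    pattern = [0.0] * n_steps
    wet_neighbors = 0
    for series in neighbor_series:
        total = sum(series)
        if total > 0.0:
            wet_neighbors += 1
            for step, amount in enumerate(series):
                pattern[step] += amount / total
    if wet_neighbors == 0:
        # Assumption: with no wet neighbor, fall back to a uniform pattern.
        return [1.0 / n_steps] * n_steps
    return [fraction / wet_neighbors for fraction in pattern]


def disaggregate_daily_total(daily_total, neighbor_series):
    """Split one daily total into sub-daily values that sum to the total."""
    pattern = average_hourly_pattern(neighbor_series)
    return [daily_total * fraction for fraction in pattern]


# Hypothetical daily gage total (mm) and two surrounding hourly gages.
hourly_values = disaggregate_daily_total(
    12.0, [[0.0, 1.0, 3.0, 0.0], [0.0, 2.0, 2.0, 0.0]]
)
```

The disaggregated values preserve the daily total by construction, which is the property that matters for the subsequent water balance checks.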
Step 2: Spatial interpolation of the point time series to a 4 km grid.
A major task is the estimation of precipitation at ungaged gridded points (Tsintikidis et al., 2002). The multi-sensor precipitation estimation (MPE: Seo, 1998) algorithm was used to spatially distribute the point hourly time series onto the DMIP 2 West ~4 km Hydrologic Rainfall Analysis Project (HRAP; Green and Hudlow, 1982; Reed and Maidment, 1999) grid. Data selected from 27 stations in and around NFDC1 and 62 stations for GRDN2 were used to derive the final gridded QPE data sets.
MPE uses PRISM data to adjust the interpolation of point precipitation to grids. The 800 m resolution PRISM monthly climatological precipitation data derived for 1971-2000 (Daly et al., 2008) was selected to be consistent with the river forecasting operations at CNRFC (Hartman, 2010). The DMIP 2 West QPE data did not include any corrections for gage under-catch (Yang et al., 1998a,b). Treatment of gage under-catch is usually model-dependent so participants were free to make adjustments as they chose.
Step 3: Water balance analysis.
A check of the DMIP 2 West precipitation is shown in Fig. 5. This figure presents a Budyko-type plot (Budyko, 1974) of water balance components for a number of basins across the US, representing a broad range of basin climatologies. On the abscissa is plotted the ratio of observed long-term mean precipitation (P_obs) to potential evapotranspiration (PET) while on the ordinate we plot the ratio of observed streamflow (Q) to PET. PET was computed from the NOAA Evaporation Atlas (Farnsworth et al., 1982). The basins in the domain of the NWS Arkansas-Red Basin RFC (ABRFC) were taken from the work of Koren et al. (2006) and range in size from 20 km² to 15,000 km². This figure shows that the climatological variables of PET and precipitation for the DMIP 2 West basins agree with the trend established by the other basins, indicating that the mean annual precipitation values are reasonable.
2.9.2. Temperature

2.9.2.1. Temperature data sources
Hourly 4 km gridded temperature values were derived using daily maximum and minimum (hereafter tmax and tmin) temperature data available from the NWS COOP and SNOTEL networks shown in Fig. 2.

2.9.2.2. Generation of gridded hourly temperature
The underlying interpolation procedure uses an inverse distance weighting algorithm. It also uses PRISM gridded monthly climatological analyses of daily maximum and minimum temperature (Daly et al., 1994). The procedure has the following major steps:
For COOP stations with missing temperature observation times, the observation times were assumed to be the same as those of the corresponding daily precipitation observations. The procedure used to estimate missing observation times for these stations is documented in Schaake et al. (2006).
A daily temperature processor was used to generate daily tmax and tmin grids for each day of the analysis period. Complex terrain in the DMIP 2 West study area generates spatial variations in temperature that are comparable to diurnal temperature variations. As a result, simple spatial interpolation of gage observations to grid locations did not by itself account for the complexity of the actual spatial variations. Therefore, monthly PRISM climatological grids of daily mean tmax and tmin were used as part of the interpolation process.
The spatial interpolation procedure for daily maximum and minimum temperature analysis is as follows. COOP and SNOTEL sites to be used for the given day were selected having at least 5 years of observations with less than 15% missing data. At each site, the algorithm computes the difference between the observed gage value for the given day and the monthly PRISM climatological mean values of tmax and tmin. These differences were interpolated to the HRAP grid used by DMIP 2 West using an inverse distance weighting interpolation procedure with the distance exponent equal to 1.0. Difference values for each of the nearest 2 gages in each of 4 quadrants surrounding each HRAP grid point are used. For each HRAP grid point, the PRISM mean value was added to the analyzed difference value to get the grid point value of the analyzed maximum or minimum daily air temperature.
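The interpolation just described can be sketched as follows. This is a minimal illustration under stated assumptions: planar coordinates, quadrants defined by the sign of the coordinate offsets from the grid point, and a gage that coincides with the grid point taken at face value. The function names and the tuple layout of `gages` are hypothetical and are not taken from the NWS software.

```python
import math


def quadrant(dx, dy):
    """Return the quadrant index (0-3) of an offset from the grid point."""
    if dx >= 0 and dy >= 0:
        return 0
    if dx < 0 and dy >= 0:
        return 1
    if dx < 0 and dy < 0:
        return 2
    return 3


def analyze_grid_point(gx, gy, prism_at_grid, gages, per_quadrant=2, exponent=1.0):
    """Interpolate (observation - PRISM) differences and add the grid PRISM mean.

    gages: list of (x, y, observed_value, prism_value_at_gage) tuples.
    """
    by_quadrant = {0: [], 1: [], 2: [], 3: []}
    for x, y, observed, prism_at_gage in gages:
        distance = math.hypot(x - gx, y - gy)
        difference = observed - prism_at_gage
        if distance == 0.0:
            # Assumption: a co-located gage fully determines the difference.
            return prism_at_grid + difference
        by_quadrant[quadrant(x - gx, y - gy)].append((distance, difference))

    weight_sum = 0.0
    weighted_difference = 0.0
    for members in by_quadrant.values():
        # Keep only the nearest gages in each quadrant.
        for distance, difference in sorted(members)[:per_quadrant]:
            weight = 1.0 / distance ** exponent
            weight_sum += weight
            weighted_difference += weight * difference

    if weight_sum == 0.0:
        return prism_at_grid  # no gages available: climatology only
    return prism_at_grid + weighted_difference / weight_sum


# Hypothetical gages: (x, y, observed tmax, PRISM monthly mean tmax at gage).
example_gages = [(1.0, 0.0, 12.0, 11.0), (-2.0, 0.0, 13.0, 10.0)]
tmax_at_grid = analyze_grid_point(0.0, 0.0, 10.0, example_gages)
```

Interpolating differences from climatology rather than raw temperatures lets the PRISM grid carry the terrain-driven spatial structure, while the gages supply the day-to-day departure.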
The hourly temperature processor uses the daily tmax and tmin grids to generate hourly temperature grids for each hour of each day. This procedure uses the Parton and Logan (1981) algorithm to estimate hourly temperatures from daily max-min values. As a check of the procedure, gridded climatologies of daily tmax and tmin for January for the period 1961-1990, over an area including the American and Carson River basins were generated (not shown). The gage analysis and PRISM climatologies were nearly identical.

Form of precipitation
Participants were free to determine the form of precipitation (rain or snow) for each time step in their modeling. The procedures followed by the participants are shown in Appendix A.

Potential evaporation
Participants were allowed to determine the values of potential evaporation data for their models. As in the DMIP 2 experiments in Oklahoma, estimates of climatological monthly PE for both basins were provided. Koren et al. (1998) used information from seasonal and annual Free Water Surface (FWS) evaporation maps in NOAA Technical Report 33 (Farnsworth et al., 1982) and mean monthly station data from NOAA Technical Report 34 (Farnsworth et al., 1982) to derive parameters for an equation that predicts the seasonal variability of mean daily free water surface (FWS) evaporation. These parameters were used to derive the mean monthly FWS evaporation estimates for DMIP 2 West basins. The data for NFDC1 are in the same range as the estimates derived by Carpenter and Georgakakos (2001) for the entire American River basin.
A link to the NARR data (Mesinger et al., 2006) was also provided, along with guidelines and processing codes for participants to compute other estimates of PE. The NARR project provides relative humidity, wind speed, air temperature, and radiative flux data.

Analysis of simulated hydrographs
As a final check of the DMIP 2 West forcings, the precipitation, temperature, and climatological PE forcing data were tested in hourly lumped and distributed simulations over the project period. Snow correction factors used in Snow-17 for these simulations to compensate for gage undercatch were derived during the calibration of the benchmark LMP model. All suspect simulated hydrograph 'spikes' that did not seem to be consistent with the observed data were investigated (Smith et al., 2010). Cumulative streamflow simulation error plots were also examined.

Digital elevation data
Participants were not required to use any particular digital elevation model (DEM). 15 arc-s and 1 arc-s DEM data were provided via the DMIP 2 West website to encourage participation. The 15 arc-s national DEM was derived by resampling 3 arc-s DEMs (1:250,000 scale) distributed by the U.S. Geological Survey. 1 arc-s DEM data were available from the USGS National Elevation Dataset (NED).

Flow direction data
Flow direction grid files at a 30 m resolution were provided for the convenience of any participants who wished to use them. These 30 m grids were used to define the basin boundaries. The basis for these flow direction grids was the 30 m DEM data from the USGS NED data server. The DEM data were projected and filled, and commercial software was used to calculate flow directions using the D8 algorithms of Jenson and Domingue (1988). Flow directions were also provided at a 400 m resolution. Moreover, we also defined flow directions for coarse resolution model cells that matched or aligned with the grid of available radar-based forcing data (the HRAP grid) using the algorithm described by Reed (2003).

Vegetation and land use data
DMIP 2 West provided a 1 km gridded vegetation/land use dataset covering both basins. These data were originally developed by Hansen et al. (2000). Thirteen classes of vegetation were defined in these data.
2.9.9. Hourly observed streamflow data
Instantaneous hourly flow data were acquired from the USGS. The data included some corrections and shifts but were defined as provisional. Unlike the mean daily flow data available from the USGS National Water Information System (NWIS) web site, the instantaneous observations had not undergone rigorous quality control. However, OHD performed some rudimentary quality control steps. These involved: (1) downloading the approved mean daily flow data from the NWIS web site for the same time periods, (2) computing mean daily flow from the hourly data, (3) visually comparing the derived and approved daily flow time series, hourly streamflow data, mean areal precipitation data, and basic simulations for each basin, and (4) setting any suspicious data in the hourly time series to missing values.
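Steps (2) and (4) of this quality control can be sketched in code. The comparison in the source was visual; the relative tolerance used below to flag a day is a stand-in introduced purely for illustration, and the function names and the use of `None` for missing values are likewise assumptions.

```python
def daily_means(hourly_flow):
    """Mean daily flow from an hourly series (None marks a missing hour)."""
    means = []
    for start in range(0, len(hourly_flow), 24):
        valid = [q for q in hourly_flow[start:start + 24] if q is not None]
        means.append(sum(valid) / len(valid) if valid else None)
    return means


def flag_suspect_days(hourly_flow, approved_daily, rel_tol=0.2):
    """Indices of days whose derived mean departs from the approved value.

    rel_tol is a hypothetical threshold standing in for visual inspection.
    """
    flagged = []
    derived_daily = daily_means(hourly_flow)
    for day, (derived, approved) in enumerate(zip(derived_daily, approved_daily)):
        if derived is None or approved is None:
            continue
        if abs(derived - approved) > rel_tol * max(approved, 1e-9):
            flagged.append(day)
    return flagged


def set_days_missing(hourly_flow, days):
    """Return a copy of the hourly series with the given days set to missing."""
    cleaned = list(hourly_flow)
    for day in days:
        for hour in range(day * 24, min((day + 1) * 24, len(cleaned))):
            cleaned[hour] = None
    return cleaned


# Hypothetical two-day record: the second day disagrees with the approved data.
hourly = [10.0] * 24 + [50.0] * 24
approved = [10.0, 12.0]
suspect = flag_suspect_days(hourly, approved)
cleaned = set_days_missing(hourly, suspect)
```

In practice the flagged days would be reviewed alongside the precipitation data and basic simulations before any values are removed, as the text describes.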

Soils information
State Soil Geographic (STATSGO) texture data covering the two basins were taken from data sets originally derived by Miller and White (1998). These were made available as a grid for each of 11 soil layers. In addition, a link was provided to finer resolution county-level soil information called the Soil Survey Geographic (SSURGO; Zhang et al., 2012) data set. The SSURGO data are typically available at a scale of at least 1:24,000. They are approximately ten times the resolution of STATSGO data in which the soil polygons can be on the scale of 100-200 km². SSURGO data have been recently used to derive a priori estimates of model parameters (e.g., Zhang et al., 2012, 2011). Participants were free to use information from either soil data set to derive any necessary model-specific parameters.

2.9.11. Areal extent of snow cover
NWS RFCs use snow covered area (SCA) data in the calibration of the operational hydrologic models. For DMIP 2 West, SCA data from the NWS NOHRSC were extracted for the basins. The data consisted of gridded 'snap shots' of SCA on available days, with values in each cell indicating clouds, snow, or no snow.

Snow water equivalent
Observed SWE data were also made available to DMIP 2 West participants. SWE data for two USBR sites in NFDC1 were downloaded from the California Data Exchange Center (CDEC). Data from four SNOTEL sites in the GRDN2 basin were also provided. The sites are listed in Appendix C. These data spanned the calibration period. Participants were allowed to use these data in the calibration of their models.

Cross sections
Due to the remote nature of the North Fork basin, we were only able to provide cross section data for one location in the North Fork basin. These data were derived from as-built bridge plans for the Iowa Hill Bridge near Colfax, CA provided by personnel from California State Parks.

Results and discussion
Following the format of the DMIP 2 Oklahoma results (Smith et al., 2012b), we present the results of the experiments from general to specific in order to address the science questions in a coherent manner. A number of statistical measures were used to assess the simulations. It would be impossible to present and discuss all of the analyses that were performed, but the results of the most important and relevant ones are presented below.

Overall water balance
A critical aspect of hydrologic modeling is the partitioning of precipitation into runoff and evapotranspiration/losses. This is especially important in mountainous areas given the large uncertainty in precipitation, temperature, and other meteorological observations. Following other experiments (e.g., Lundquist and Loheide, 2011; Mitchell et al., 2004; Lohmann et al., 2004, 1998; Wood et al., 1998; Duan et al., 1996; Timbal and Henderson-Sellers, 1998; Shao and Henderson-Sellers, 1996), we investigated the ability of the participants' models to partition precipitation into runoff, evaporation, and losses.
The water balance quantities for each model were computed using the general continuity equation:

$$\frac{dS}{dt} = P_{obs} - E - R_{model} - L$$

where S is storage, P_obs is observed mean annual basin-average precipitation in mm, E is evaporation in mm, L represents the intercatchment groundwater transfer (losses or gains), and R_model is the depth of model runoff in mm over the basin. We computed these quantities on an annual basis over a multi-year period and assumed that the change in storage over that period is equal to zero. Observed mean annual precipitation over the basin and computed runoff from each of the models were used to compute a budget-based estimate of evaporation E and losses L:

$$E + L = P_{obs} - R_{model}$$

Fig. 6 can be interpreted as follows. Each diagonal represents the partitioning of observed precipitation into computed runoff and evaporation (plus losses) for a basin, with the x and y intercepts equal to the value of the mean annual areal observed precipitation. On each diagonal, a model's plotting symbol can be projected to the x and y axes to yield that model's basin-averaged mean annual runoff and evaporation plus losses. All models should plot on a single line with a −1 slope and x and y intercepts equal to the observed mean areal precipitation if they have the correct water budget. All models should plot at the same point if they have the same partitioning of water. From Fig. 6, with the exception of one outlier (CEM for NFDC1), it can be seen that the models partition precipitation reasonably well. There is slightly more spread in the results for NFDC1 (156 mm) than in the GRDN2 (80 mm) and CMEC1 (105 mm) basins, which may be due to the difficulty in modeling the rain-snow dominated NFDC1.
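As a small worked illustration of the budget-based estimate, the sketch below assumes zero long-term storage change, so that evaporation plus losses equals observed precipitation minus model runoff. The precipitation and runoff values and the model labels are invented for the example and are not results from the paper.

```python
def evaporation_plus_losses(p_obs_mm, r_model_mm):
    """Budget-based E + L (mm/yr), assuming no long-term change in storage."""
    return p_obs_mm - r_model_mm


# Hypothetical basin-average precipitation and runoff from three models (mm/yr).
p_obs = 1500.0
model_runoff = {"model_a": 850.0, "model_b": 900.0, "model_c": 780.0}

partition = {
    name: (runoff, evaporation_plus_losses(p_obs, runoff))
    for name, runoff in model_runoff.items()
}

# Every (runoff, E + L) pair sums to p_obs, so all models fall on one
# diagonal with slope -1 and intercepts equal to p_obs.
on_diagonal = all(abs(r + el - p_obs) < 1e-9 for r, el in partition.values())
```

Because precipitation is fixed by the common forcing data, differences between models appear only as positions along the diagonal, which is why the spread in runoff is a convenient summary of how differently the models partition water.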

Long-term cumulative simulation error
Overall model performance in terms of long-term runoff simulation error as performed by Reed et al. (2004) and Smith et al. (2012b) was also analyzed. This analysis was meant to examine the consistency of the precipitation estimates and subsequent impacts on multi-year hydrologic model simulations. Consistent precipitation data have been proven to be necessary for effective model calibration (Smith et al., 2012b and references therein). For example, in some cases in the DMIP 2 Oklahoma experiments, the improvements gained by distributed models compared to lumped models were negated when the models were calibrated using inconsistent precipitation data from DMIP 1 (Smith et al., 2012b). Fig. 7 presents the cumulative runoff error plots for the NFDC1, GRDN2, and CMEC1-3 basins. In general, the plots are linear, indicating that the precipitation, temperature, and evapotranspiration forcings are temporally consistent. The under-prediction from CEM shown in Fig. 7 follows from the partitioning shown in Fig. 6: the CEM model generated less runoff volume over time. The cumulative error plots in Fig. 7 are considerably improved compared to the OHD cumulative error plot using the original (flawed) DMIP 2 gridded QPE (Mizukami and Smith, 2012; Smith et al., 2009; Moreda et al., 2006). Moreover, the results in Fig. 7 span approximately the same error range for two basins in the DMIP 2 Oklahoma experiments (see Fig. 10 in Smith et al., 2012b).
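A cumulative runoff error curve of the kind plotted in Fig. 7 can be sketched as a running sum of simulated minus observed runoff depth. The sign convention (simulated minus observed, so that under-prediction gives a declining curve) and the function name are assumptions made for this illustration; the source does not state the exact formula.

```python
from itertools import accumulate


def cumulative_runoff_error(simulated, observed):
    """Running sum of (simulated - observed) runoff depth, e.g. in mm."""
    return list(accumulate(s - o for s, o in zip(simulated, observed)))


# Hypothetical runoff depths for three time steps.
curve = cumulative_runoff_error([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

A roughly linear curve indicates an error that accrues at a steady rate, whereas a break in slope would point to a change in the character of the forcing data partway through the record.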

Comparison of distributed and lumped model results
In this section we begin to address the science question: can distributed hydrologic models provide increased streamflow simulation accuracy compared to lumped models in mountainous areas? Fig. 8 presents the overall performance of the calibrated models in terms of r_mod (McCuen and Snyder, 1975), computed hourly for the calibration, validation, and combined periods. The term 'overall' means that the statistic was computed for each hour over the entire period specified. The r_mod measure was used herein to provide consistency with the DMIP 2 Oklahoma results (Smith et al., 2012b) and DMIP 1 results. The r_mod statistic is a goodness-of-fit measure of hydrograph shape. In Fig. 8, the results are organized in order of increasing computational element size (i.e., ranging from 250 m for UOB to two elevation zones for LMP). Recall that there are two sets of results for the basin CMEC1: one is for an independent calibration of the basin (CMEC1-3) and the other is for the use of the Markleeville gage as a 'blind' interior simulation point (CMEC1-2).
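For readers unfamiliar with the modified correlation coefficient, the sketch below uses the form in which the Pearson correlation is scaled by the ratio of the smaller to the larger of the simulated and observed standard deviations. This form is an assumption here, since the formula itself is not given in this section, and the function name is hypothetical.

```python
import math


def modified_correlation(simulated, observed):
    """Pearson r scaled by min(std_sim, std_obs) / max(std_sim, std_obs)."""
    n = len(observed)
    mean_sim = sum(simulated) / n
    mean_obs = sum(observed) / n
    covariance = sum(
        (s - mean_sim) * (o - mean_obs) for s, o in zip(simulated, observed)
    ) / n
    std_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in simulated) / n)
    std_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in observed) / n)
    if std_sim == 0.0 or std_obs == 0.0:
        return 0.0  # assumption: a constant series carries no shape information
    r = covariance / (std_sim * std_obs)
    return r * min(std_sim, std_obs) / max(std_sim, std_obs)


# A simulation with the right timing but twice the observed amplitude.
observed_flow = [1.0, 2.0, 3.0, 4.0]
simulated_flow = [2.0, 4.0, 6.0, 8.0]
r_mod = modified_correlation(simulated_flow, observed_flow)
```

In this example the ordinary correlation is 1 but the scaled value is about 0.5, showing how the statistic penalizes a hydrograph whose variability differs from the observed one even when its timing is perfect.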
Looking collectively at the results in the top panel of Fig. 8, no single model performed best for all the basins in the calibration period. The models had the most uniform r_mod value for the snow-dominated basin GRDN2, and the most spread in r_mod for NFDC1. All models except LMP showed a decrease in r_mod for the blind simulation test CMEC1-2. Not surprisingly, all models showed an improvement in r_mod when the CMEC1 basin was explicitly calibrated (compare CMEC1-2 with CMEC1-3). Compared to LMP, the UOB model provided the only improved r_mod values for this calibration period for NFDC1, while the OHD, UOB, and CEM models provided equal-to or improved values for GRDN2. No model was able to provide an improvement over the lumped model for the blind test at CMEC1-2. Only the OHD model provided improved r_mod values for the CMEC1-3 test.
The r_mod values for the validation period are shown in the middle panel of Fig. 8. Models typically perform slightly worse in validation periods compared to the calibration period. In our case, mixed results were realized. Some models improved in this period (e.g., CEM in the CMEC1-2 and CMEC1-3 cases), while others showed the expected decline in r_mod. Only the OHD model provided improved r_mod values compared to LMP for the validation period.
The bottom panel of Fig. 8 shows the model performance over the combined calibration and validation periods. The LMP and OHD models had the highest r_mod for this period in all the basins. The CEM model achieved the next-highest values for three out of four basin tests. Only the OHD model provided improved r_mod values compared to LMP (cases of GRDN2 and CMEC1-3) in this period. Very similar results to Fig. 8 (not shown) were achieved when using the Nash-Sutcliffe statistic instead of r_mod.
The %Bias for all the calibrated models is shown in Fig. 9. The %Bias statistic is used to compare the overall simulated and observed runoff volumes. Target values of the %Bias measure vary in the literature from ±5% for NWS model calibration (Smith et al., 2003) to ±25% (Moriasi et al., 2007).
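A percent bias calculation can be sketched as the total simulated minus total observed volume, expressed as a percentage of the observed total. This common definition is assumed here because the formula is not reproduced in this section; the function names and the example values are hypothetical.

```python
def percent_bias(simulated, observed):
    """100 * sum(simulated - observed) / sum(observed)."""
    total_observed = sum(observed)
    if total_observed == 0.0:
        raise ValueError("observed volume is zero; percent bias is undefined")
    return 100.0 * (sum(simulated) - total_observed) / total_observed


def within_target(bias_percent, target_percent):
    """True if the absolute bias is no larger than the chosen target."""
    return abs(bias_percent) <= target_percent


# Hypothetical flows: the simulation over-predicts total volume by about 6.7%.
bias = percent_bias([110.0, 90.0, 120.0], [100.0, 100.0, 100.0])
meets_5_percent = within_target(bias, 5.0)
meets_25_percent = within_target(bias, 25.0)
```

The same simulation can therefore pass or fail depending on which of the published targets is adopted, which is worth keeping in mind when comparing the values in Fig. 9.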
A wide range of values can be seen looking collectively at the plots. For the calibration period, some models were able to achieve a near zero bias for several of the basins (e.g., CEM in GRDN2, LMP in CMEC1-3, OHD in NFDC1). Other models had a consistent positive or negative bias for all the basins and all periods (UOB, CEM). As might be expected, the %Bias values were larger for the CMEC1-2 test compared to the explicitly calibrated CMEC1-3 test. Recall that CMEC1-2 simulations were generated as interior points within GRDN2 with no specific calibration (Modeling Instruction 2). In terms of %Bias, no model performed best in all cases, nor did any distributed model consistently outperform the LMP benchmark. While no one model performed consistently better than the others in Figs. 8 and 9, all models achieved relatively high values of the two statistics. For example, the values of r_mod (0.63-0.95) and %Bias (−20.1 to 5.2) for the combined calibration-validation period in Figs

Analysis of precipitation/runoff events
To further investigate the science question of the performance of distributed and lumped models, statistics were computed for 68 events in the NFDC1 and 92 events in the GRDN2 and CMEC1 basins. These events were selected from the combined calibration and validation period. Event statistics were computed because our experience has shown that the overall run-period statistics can mask the improvement of distributed models over lumped models for individual events (Smith et al., 2012b; Reed et al., 2004). We use the same two measures as in DMIP 1 and DMIP 2 (Reed et al., 2004; Smith et al., 2012b, respectively) to evaluate the model performance for events: the event absolute % runoff error and the event absolute % peak error. These measures evaluate the models' ability to simulate runoff volumes and peak flow rates. Eqs. (7a) and (7b) of Appendix D present the formulas for % runoff error and % peak error, respectively. Low values of % runoff error and % peak error are desired.
The event statistics for the NFDC1 basin are shown in Fig. 10. Each plotting symbol represents the average measure of a specific model for the 68 events. As much as possible, the same plotting symbols in Smith et al. (2012b) and Reed et al. (2004) are used. The LMP and OHD models have near-identical performance with the lowest values of the two statistics. The OHD model provides a slight improvement in % peak error over the LMP model but at the expense of a slightly worse runoff volume. Next in order of performance is the UOB model, followed by the UPV and CEM models.
The event statistics for GRDN2 are plotted in Fig. 11. All models have errors whose magnitudes are similar to Fig. 10. One difference is that there is more spread between the OHD and LMP results here compared to the NFDC1 basin. In this case the OHD model provides improvement compared to the LMP model for both runoff volume error and peak error.
Fig. 12 shows the calibrated event results for the CMEC1-2 and CMEC1-3 tests. The former calls for the CMEC1 basin to be simulated as a blind interior point (Modeling Instruction 2) within the GRDN2 basin. The latter test calls for explicit calibration of the CMEC1 basin as an independent headwater basin (Modeling Instruction 3). The plotting symbols are shown in different sizes to distinguish the results for each participant. The arrows show how explicit calibration of the CMEC1 basin impacts the statistics. Note that UCI only submitted simulations for the CMEC1-3 test, and the UOB used the same simulations for the CMEC1-2 and CMEC1-3 tests. The CMEC1-2 results (small plotting symbols) span approximately the same range of error values as in NFDC1 (Fig. 10) and GRDN2 (Fig. 11). Thus, in this case, the distributed models calibrated at the basin outlet achieved about the same event simulation performance at this interior location. Explicit calibration of the CMEC1 basin (i.e., CMEC1-3) improved the event statistics for the LMP, UPV, and OHD models as one would expect. The OHD model generated a 'blind' simulation that was slightly better than the LMP model in terms of % peak error (23% vs. 26%). These results may be influenced by the large size of the interior basin (714 km²) compared to the parent basin (922 km²). Summarizing the results of Figs. 10-12, only one distributed model (OHD) was able to perform at a level near or better than the LMP benchmark for absolute % runoff error and absolute % peak error. However, all models achieved levels of performance equivalent to those in one of the non-snow Oklahoma basins in DMIP 2.

Improvement of distributed models over lumped models
This analysis specifically addresses the question whether calibrated distributed models can provide improved event simulations compared to lumped models in mountainous regions. Using the same events from Section 3.4, three specific measures of improvement were computed: % improvement in peak flow, % improvement in runoff event volume, and improvement in peak time error in hours (Eqs. (8a)-(8c) in Appendix D). The calibrated simulations from the LMP model were used as the benchmark. Fig. 13 presents the three measures of improvement for the calibrated models, along with the inter-model average of the values. It is desirable to achieve values greater than zero in each of the plots. Each data point is the average value of the measure for a specific model in a specific basin over many events. Looking collectively at the plots, it can be seen that the OHD model provides improvement in peak flow and volume, as seen previously in Figs. 10-12. Interestingly, three models show improvement in peak timing: OHD, CEM and UPV. Taken as a group, distributed models provided improved peak flow simulations in 24% of the 17 model-basin pairs, improved runoff volume in 12% of the 17 model-basin pairs, and peak timing improvements in 41% of the 17 model-basin pairs. These values correspond to 24%, 18%, and 28%, respectively, achieved in the DMIP 2 Oklahoma tests (Smith et al., 2012b). However, caution is advised as the DMIP 2 Oklahoma results were based on a much larger number of model-basin pairs (148 vs. 17).
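To make the small sample size concrete, the reported percentages can be converted back to approximate counts of model-basin pairs. The counts below are back-calculated from the rounded percentages and are not stated in the source, so they should be read as an inference; the function names are hypothetical.

```python
def pairs_from_percent(percent, n_pairs):
    """Nearest whole number of pairs implied by a rounded percentage."""
    return round(percent / 100.0 * n_pairs)


def percent_of_pairs(count, n_pairs):
    """Percentage of pairs, rounded to the nearest whole percent."""
    return round(100.0 * count / n_pairs)


N_PAIRS = 17  # model-basin pairs in DMIP 2 West, from the text
reported = {"peak flow": 24, "runoff volume": 12, "peak timing": 41}

implied_counts = {
    measure: pairs_from_percent(percent, N_PAIRS)
    for measure, percent in reported.items()
}

# Consistency check: the implied counts reproduce the reported percentages.
round_trip = {
    measure: percent_of_pairs(count, N_PAIRS)
    for measure, count in implied_counts.items()
}
```

Under this reading the percentages correspond to roughly 4, 2, and 7 of the 17 pairs, so a change in a single pair shifts a percentage by about 6 points, which supports the caution about comparing with the much larger Oklahoma sample.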
Visual inspection of the hydrograph simulations revealed that the UOB model also provided improved simulations for a few certain events, but these improvements were overwhelmed by other events in the average improvements statistics and not visible in Fig. 13.
While the results in this section may be discouraging, they are entirely consistent with the results from the Oklahoma experiments in DMIP 1 and 2. In DMIP 1, there were more cases when a lumped model out-performed a distributed model than vice versa. The results of the DMIP 2 Oklahoma experiments showed a greater number of cases of distributed model improvement than in DMIP 1 (Smith et al., 2012b).
The convention of Reed et al. (2004) is used herein to identify the cases in which the 'improvement' in Fig. 13 was negative but near zero. This shows that distributed models can perform nearly as well as a calibrated lumped model. The UCI, UOB, and UPV models show 'improvement' values less than −5% for peak volume and flow, and less than 1 h for peak timing.

Specific examples of distributed model improvement
To complement the event statistics presented in Section 3.5, we provide an analysis of two events in December, 1995 and January, 1996 to diagnose the source of distributed model improvement. While other events and models could have been selected, this case is used to intercompare distributed (OHD) and lumped (LMP) models that share the same precipitation/runoff physics. Fig. 14 shows the results for the December, 1995 event in NFDC1. At least some of the improvement from the OHD model (and other distributed models in other cases) may result from improved definition of the rain/snow line, and the subsequent impacts on runoff generation. As an illustration, Fig. 15 shows the time evolution of the hourly rain/snow line as computed by the LMP model for the month of December, 1995. The diurnal variation of the rain/snow line is quite evident. The grey zone marks the four-day period from December 11th to December 14th, during which the LMP rain/snow line drops dramatically in elevation. Fig. 16 shows how the LMP model and the OHD model simulate the rain/snow line using the gridded temperature data at 15Z on December 12, 1995. The white line denotes the rain/snow line at 1758 m computed by the LMP model. Areas higher in elevation (to the right in the figure) than this receive snow, while areas below this line receive rain. The OHD model receives rain over a larger area than LMP as denoted by the red grid cells. Fig. 17 shows the precipitation and runoff for this event from the LMP and OHD models. The top panel shows that the OHD model generates more runoff than the LMP model. The middle two panels of Fig. 17 show how the LMP and OHD models partition total precipitation into different amounts of rain and snow. The bottom panel shows the arithmetic difference between the two rainfall time series. The OHD model partitions total precipitation into a greater percentage of rain than the LMP model, and more runoff is generated by the OHD model compared to the LMP model.
However, our analysis did not determine the exact cause of the increased runoff. The increase could be simply due to the greater amount of total rain input into the OHD model. Another cause could be that distributed models (OHD) preserve the precipitation intensity in each grid as shown in Fig. 18 rather than averaging it over the LMP elevation zones. As a result, fast-responding surface runoff is generated in several cells and is routed down the channel network. In the lumped model, the precipitation is spatially averaged over the elevation zones, resulting in a delayed runoff response. Another cause may be the difference in channel routing schemes between the OHD model (kinematic wave) and LMP (unit hydrographs and lag/k routing).
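The difference in precipitation partitioning described above can be illustrated with a small sketch. This is not the OHD or LMP code. It assumes a hypothetical 1 °C rain/snow threshold, invented cell elevations, temperatures, and precipitation, and equal-area cells. Only the 1758 m rain/snow elevation is taken from the text. The lumped-style variant classifies cells by a single rain/snow elevation, while the gridded variant classifies each cell by its own temperature.

```python
import numpy as np

# Hypothetical equal-area cells: elevation (m), air temperature (deg C), precipitation (mm/h).
elev = np.array([900.0, 1300.0, 1600.0, 1800.0, 2100.0, 2400.0])
temp = np.array([4.0, 2.5, 1.6, 1.2, -0.5, -2.0])
precip = np.array([3.0, 4.0, 5.0, 5.0, 6.0, 6.0])

T_THRESHOLD = 1.0     # assumed rain/snow threshold temperature, deg C (illustrative only)
SNOW_LINE_M = 1758.0  # rain/snow elevation quoted in the text for 15Z on December 12, 1995


def rain_by_snow_line(elev, precip, snow_line):
    """Lumped-style partition: every cell below one elevation receives rain."""
    return np.where(elev < snow_line, precip, 0.0)


def rain_by_cell_temperature(temp, precip, threshold):
    """Gridded partition: each cell is classified by its own temperature."""
    return np.where(temp > threshold, precip, 0.0)


rain_lumped = rain_by_snow_line(elev, precip, SNOW_LINE_M)
rain_gridded = rain_by_cell_temperature(temp, precip, T_THRESHOLD)

print("basin-mean rain, single snow line:", rain_lumped.mean())
print("basin-mean rain, per-cell temperature:", rain_gridded.mean())
```

In this invented example the 1800 m cell is slightly warmer than the threshold, so the per-cell rule treats it as rain while the single elevation line treats it as snow. This is the kind of difference in rain area that Fig. 16 is described as showing.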
The contributions of the spatial variability of runoff generation and routing were also investigated. Hourly OHD surface and subsurface routing volumes shown (as accumulations) in Fig. 18 were averaged over the entire basin, then used as spatially uniform input to each 4 km grid into the OHD routing network. In this case, the resulting hydrograph (not shown) was very similar to the OHD hydrograph in Fig. 14 with the result that routing constitutes only 9% of the difference in root mean square (rms) error statistics between the OHD and LMP hydrographs. This small contribution seems reasonable given the central location of the runoff volumes. However, this is only an approximate comparison as the OHD and LMP runoff volumes prior to routing were not equal (Fig. 17, top panel).
In another case of distributed model simulation improvement (January 25, 1996; not shown), liquid precipitation and surface runoff were concentrated near the basin outlet. The routing contribution to the difference in rms error between the OHD and LMP simulations in this case was much larger, at approximately 36%. In this case, capturing the spatial variability of the liquid precipitation and subsequent runoff may be more important than differences in precipitation partitioning. From our limited analysis, it is not clear which factor, or combination of factors, was the source of improvement of the OHD model. Beyond simply identifying the differences in precipitation partitioning and the spatial variability of precipitation and runoff, it was difficult to isolate the source of distributed model improvement. Further complications arise due to different initial conditions in the OHD and LMP models for the events studied. In one case, the spatial distribution of precipitation appeared to play a major role, while in another case, modeling the rain/snow areas seemed to have a large impact. Dornes et al. (2008) were able to attribute the benefits of their distributed model in two events to capturing the impact of topography on shortwave radiation for the modeling of snow accumulation and melt. However, their study did not address complexities arising from mixed rain and snow.

Effect of model parameter calibration
As in the DMIP 1 and 2 Oklahoma experiments, participants were instructed to submit simulations using calibrated and uncalibrated model parameters. This analysis was designed to assess the efficacy of a priori model parameters as well as schemes to calibrate hydrologic models in mountainous areas. The results shown are for the combined calibration and validation periods. The r mod and %Bias measures are presented to provide an overall view of the impacts of parameter calibration.
The calibration results for NFDC1, GRDN2, and CMEC1-3 are presented in Fig. 19. The r mod and %Bias measures for uncalibrated and calibrated results are shown connected by an arrow indicating the directional change in values (e.g., Viney et al., 2009). For NFDC1, calibration improved the r mod and %Bias measures for several models: UPV, LMP, UOB, and OHD. The results for CEM do not show any gain in either measure, indicating that the calibration process focused on minimizing other error criteria. Indeed, the CEM model was calibrated using root mean square error (RMSE) calculated on square-root-transformed flows, which may explain the lack of improvement in r mod.
For GRDN2, three of the four models achieved improved r mod values, but at the expense of increasing (positive or negative) the overall %Bias (UOB, LMP, and OHD). The most consistent improvements from calibration were seen in the CMEC1-3 test in the bottom panel of Fig. 19. All models realized gains in the r mod and %Bias statistics with the exception of UPV. The uncalibrated r mod results for the OHD and CEM models were better than the calibrated results for the three remaining distributed models.
Although not shown here, parameter calibration resulted in clear improvements in nearly all cases of other statistics such as the Nash-Sutcliffe efficiency and root mean square error.
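As a minimal sketch of the two measures used in this section, the code below computes %Bias and r mod for a made-up flow series. It assumes %Bias is the summed simulated-minus-observed difference divided by the summed observations, and that r mod is the Pearson correlation scaled by the ratio of the smaller to the larger standard deviation, as described in Appendix D. The function names and data are illustrative, and the "uncalibrated" and "calibrated" series are hypothetical.

```python
import numpy as np


def percent_bias(sim, obs):
    """Total volume difference between simulation and observation, in percent."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 100.0 * (sim - obs).sum() / obs.sum()


def modified_correlation(sim, obs):
    """Pearson r reduced by the ratio min(std) / max(std) of the two series."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    s_sim, s_obs = sim.std(), obs.std()
    return r * min(s_sim, s_obs) / max(s_sim, s_obs)


obs = np.array([10.0, 20.0, 40.0, 30.0, 15.0])  # made-up hourly flows
uncal = 0.5 * obs  # hypothetical uncalibrated run: right shape, half the volume
cal = 1.1 * obs    # hypothetical calibrated run: right shape, 10% too much volume

for name, sim in [("uncalibrated", uncal), ("calibrated", cal)]:
    print(name, round(percent_bias(sim, obs), 1), round(modified_correlation(sim, obs), 3))
```

Both hypothetical runs are perfectly correlated with the observations, yet r mod differs because the simulated variability is penalized when it departs from the observed variability. This is why r mod and %Bias are read together in Fig. 19.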

Analysis of interior processes: streamflow and SWE
This part of the DMIP 2 West experiments was designed to investigate how distributed models represent basin-interior processes in mountainous areas. Participants generated simulations of two variables: streamflow and SWE.
Only one interior flow point was available (specified as a ''blind'' test with no explicit calibration; CMEC1-2). The results for calibrated simulations are shown in Fig. 8, Fig. 9 and Fig. 10. The multi-year r mod values for CMEC1-2 in Fig. 8 are only slightly lower than the CMEC1-3 case in which specific calibration was allowed. More of a difference between CMEC1-2 and CMEC1-3 is visible in Fig. 9, where larger multi-year %Bias values can be seen for CMEC1-2. This suggests that specific calibration at the Markleeville gage corrects for locally-generated biases that are not readily removed when calibrating using downstream information at Gardnerville. Fig. 12 shows that specific calibration of the interior point CMEC1-3 led to improved values of peak and runoff volume error statistics for events compared to CMEC1-2 for two models (UPV and OHD). As stated earlier in Section 3.4, the CMEC1-2 results may be influenced by the large size of the CMEC1 basin compared to the parent GRDN2 basin.
Participants were also requested to generate uncalibrated and calibrated hourly SWE simulations at two instrumented points in the NFDC1 basin and four instrumented points in the CMEC1 basin. The SWE simulations represent the grid or other computational area at which the model was run. Given the uncertainties involved in comparing point to grid values of SWE, our goal was to understand the general ability of the models to simulate the character of the snow accumulation and ablation processes (e.g., Shamir and Georgakakos, 2006). Shamir and Georgakakos (2006) defined a ''good'' SWE simulation as one that fell between simulated uncertainty bounds and also had consistent agreement with sensor observations.
We computed an average value of the %Bias of the simulated SWE compared to the observed SWE. These average values represent the entire snow accumulation and ablation periods (approximately October to June) for all years at each of the six snow gage sites. Table 4 presents the results of this overall analysis for calibrated models. For the two stations in the NFDC1 basin, %Bias values were greater in absolute magnitude for the Blue Canyon site compared to the Huysink site for all four models. This agrees with Shamir and Georgakakos (2006) who found that the largest uncertainty was for that part of the snow pack located where the surface air temperature is near the freezing level. With an elevation of 1609 m, the Blue Canyon site is near the elevation of 1524 m typically used by the CNRFC to delineate rain and snow in its river forecast operations.
The %Bias values at the four SNOTEL sites in or near the higher elevation CMEC1-3 basin are generally less than those of the stations in the lower elevation NFDC1 basin. Here, the UOB and OHD models achieved the lowest values of the %Bias measure.
Large %Bias values can be seen in Table 4, highlighting the difficulties of simulating SWE in mountainous areas. For example, the UCI model featured a large over-simulation of SWE for the Spratt Creek SNOTEL site. This is due to the fact that the SWE simulation for this site was generated by the UCI sub-basin model in which the average elevation of the sub-basin containing the site is over 400 m higher than the site itself, thus leading to overly frequent snowfall events. The large %Bias values may also reflect the scale mismatch between point SWE observations and simulations of SWE generated over a grid, sub-basin, or elevation zone.
Fig. 20 presents the simulated and observed SWE for a large snow year for the Blue Canyon site in the NFDC1 basin and the Blue Lakes site in the CMEC1-3 basin. The dates plotted are October 1, 1992 to July 31, 1993. The Blue Lakes site accumulates about twice as much SWE as does the Blue Canyon site. Up to day 151, all models accumulate snow at a rate similar to that observed for Blue Lakes. However, the onset of melt is different amongst models, and all models melt off the snow more quickly than is observed. Problems with snow accumulation are more evident for the Blue Canyon site, perhaps as a result of difficulties in tracking the rain/snow line, intermodel differences in treating precipitation gage undercatch due to wind, and differences in how models determine the form of precipitation.
One issue that arises in modeling studies is whether errors (uncertainty) in precipitation and temperature forcings mask the simulation differences resulting from model physics. Fig. 21 illustrates how intermodel spread in SWE simulations compares to differences in SWE caused by errors in the temperature forcing. We use the work of Lei et al. (2007), who compared the responses of the snow model in the Noah LSM (Koren et al., 1999) and the Snow-17 model to random and systematic errors in temperature, solar radiation, and other meteorologic forcing variables available in the NARR (Mesinger et al., 2006). SWE simulations for the Snow-17 model with different levels of random error in the temperature data are shown in the top panel for water year 1999. The middle panel shows the SWE simulations from the Noah model for the same water year and same levels of random temperature error. The spread in the DMIP 2 West SWE simulations at the Blue Lakes SNOTEL site for water year 1999 is shown in the bottom panel of Fig. 21. It can be seen from this figure that the spread in model-generated SWE could be as great as the spread caused by random errors in the temperature data, depending on the level of data error.
The timing of the overall snow accumulation and melt was evaluated by computing the difference between the observed and simulated SWE centroid dates (SCD; Kapnick and Hall, 2010). The SCD is computed using Eq. (3):
$$ SCD = \frac{\sum_i t_i \, SWE_i}{\sum_i SWE_i} \qquad (3) $$
where SWE is the daily observed or simulated SWE in mm, t is the number of the day from the beginning of snow accumulation, and i denotes an individual SWE value. Fig. 22 shows the SCD difference for each site for each of the years of the combined calibration and validation periods. Each box represents the 25-75% quartile range of SCD differences in days, while the red line is the median of the values. In Fig. 22, it is desirable to have an SCD difference of zero.
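A minimal sketch of the SCD calculation follows, assuming the centroid is the SWE-weighted mean day number (the sum of t·SWE divided by the sum of SWE), with day 1 taken as the first value in the series. The daily SWE values are invented, and the sign convention (simulated minus observed, so that a negative value means an early simulated centroid) is an assumption made here for illustration.

```python
import numpy as np


def swe_centroid_day(swe):
    """SWE-weighted mean day number; day 1 is the first value in the series."""
    swe = np.asarray(swe, float)
    days = np.arange(1, swe.size + 1)
    return (days * swe).sum() / swe.sum()


# Made-up daily SWE (mm): the simulated pack peaks and melts earlier than the observed one.
obs = np.array([0, 50, 100, 150, 200, 150, 100, 50, 0], float)
sim = np.array([0, 60, 120, 180, 150, 80, 20, 0, 0], float)

# Assumed convention: simulated minus observed, so negative means an early simulated centroid.
scd_diff = swe_centroid_day(sim) - swe_centroid_day(obs)
print(f"observed SCD: day {swe_centroid_day(obs):.2f}")
print(f"simulated SCD: day {swe_centroid_day(sim):.2f}")
print(f"SCD difference: {scd_diff:.2f} days")
```

Because the centroid weights each day by the amount of SWE present, a model that melts its pack too quickly shifts the centroid earlier even if peak accumulation is reproduced well.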
Looking collectively at the plots, there is a general trend for the models to have an early SCD compared to the observed SCD. The values for the Huysink site in the NFDC1 basin have the largest and most consistent departure from zero. There is a tendency in the NFDC1 basin for the SCD to be slightly earlier in time compared to the observed value. The participants' SCD values for the Huysink site are consistently about 10-15 days earlier than the Blue Canyon station. Consistent with Shamir and Georgakakos (2006), better results were achieved at higher elevation snow sites. The best SCD results are for the two highest elevation stations, Blue Lakes and Ebbet's Pass. At these sites, four models achieved the best timing as evidenced by the smallest spread in the 25-75% quartile range and SCD differences less than 10 days. Larger errors in timing (Fig. 22) and bias (Table 4) could be the result of scale mismatch between the point observations and the models' computational element size. For example, the larger errors seen in the CEM snow simulations may be the result of using a coarse modeling scale (5 elevation zones). The statistics presented here should be viewed with caution as considerable uncertainty exists in the representativeness of point SNOTEL (and other) observations of SWE to the surrounding area (e.g., Shamir and Georgakakos, 2006; Dressler et al., 2006; Garen and Marks, 2005; Simpson et al., 2004; Pan et al., 2003). Of course, point-to-point comparisons can be made when such data are available (e.g., Rutter et al., 2009; Rutter et al., 2008). However, there is great variability amongst results even when models are run at the point scale and compared to research-quality point observations. For example, the SnowMIP 2 project examined the performance of 33 snow models of various complexities at four point sites (Rutter et al., 2009). One of the conclusions from SnowMIP 2 was that it was more difficult to model SWE in forested sites compared to open sites.
Moreover, there was no 'best' model or subset of models, and models that performed well at forested sites did not necessarily perform well (in a relative sense) at open sites. Along these lines, Mizukami and Koren (2008) noted discrepancies between satellite-estimated forest cover and the description of cover contained in a station's metadata. Such discrepancies could impact model simulations.
Modeling scale
DMIP 2 West intended to examine the science question of appropriate model scale. In the project formulation phase, it was hoped that there would be a sufficient number of models to make inferences between simulation results and model scale. Unfortunately, the number of participants (five plus LMP) did not allow us to investigate this issue as fully as hoped. Nonetheless, a brief discussion is provided here given the void in the literature on this aspect of modeling.
The spatial modeling scales ranged from grids of 250 m (UOB), to 400 m (UPV) to 4 km (OHD) to elevation zones (five in CEM; two in LMP) and to sub-basins (eight sub-basins with an average size of 90 km², UCI). Runoff statistics reflect the integration of many processes including snow accumulation and melt, rain on snow, rain on bare ground, and hillslope and channel routing. Accordingly, we acknowledge that the results portray a mix of both model physics and modeling scales and that definitive conclusions regarding modeling scale cannot be made. Nonetheless, it is interesting that no trends between performance and modeling scale can be seen in the calibrated r mod and %Bias plots of Fig. 8 and Fig. 9, respectively, for the calibration, validation, and total simulation periods. Results are plotted in Fig. 8 and Fig. 9 in order of increasing model resolution. Even where the model physics is similar (i.e., OHD, LMP, and UCI), there is no trend in these run-period statistics.
However, a different picture emerges when looking at the event statistics for calibrated models. Focusing on the two models with common physics which were run in all cases, the average event statistics in Figs. 10-12 for calibrated models show that OHD provides better values than LMP. In all three basin cases, the OHD model provides lower peak error values, and in two out of three cases the runoff volume is better. Recall that LMP is actually the case of two elevation zones, above and below 1524 m in NFDC1 and 1724 m in GRDN2 and CMEC1. The event improvement statistics in Fig. 12 further illustrate the improvement of OHD compared to LMP. Uncalibrated run period and event statistics (not shown) also show improvements of OHD compared to LMP. Comparing models with the same physics, the OHD and LMP results agree with the scope and results of Dornes et al. (2008), who modeled the snow and runoff processes of a basin with lumped and distributed applications of the same precipitation-runoff model. In a limited way, the results herein support the expectation that higher resolution modeling scale will improve simulation performance.

Rain and snow partitioning
DMIP 2 West intended to address the important issue of rain-snow partitioning. Primarily this was to be addressed via the use of radar-detected observations of freezing level from HMT-West (Minder and Kingsmill, 2013; White et al., 2010, 2002) after participants set up and ran their models with the baseline DMIP 2 West gridded precipitation and temperature data. Participants would then make additional simulations using the radar-based estimates of the rain-snow line to note the improvement. Delays in DMIP 2 West caused by the need to generate a new QPE data set precluded the use of HMT-West data in formal experiments with participants. Nonetheless, using the DMIP 2 West modeling framework, Mizukami et al. (2013) tested the OHD and LMP models with and without the HMT-West radar-derived rain-snow data for the 2005-2006 winter period. Mixed simulation results were seen; some runoff events were better simulated while other events were worsened. Interested readers are referred to Mizukami et al. (2013) for more information.

Conclusions
We present the major conclusions generally in order of the science questions listed in Section 1.2. Interspersed among these are additional conclusions and comments.

Distributed vs. lumped approaches in mountainous areas
Overall, no single model performed best in all basins for all streamflow evaluation statistics. Neither was any distributed model able to consistently outperform the LMP benchmark in all basins for all indices. Nonetheless, one or more distributed models were able to achieve better performance than the LMP benchmark in a number of the evaluations. These results are consistent with the findings of DMIP 1 and the DMIP 2 Oklahoma experiments (Smith et al., 2012b). We highlight several aspects of model performance below.
Considering the r mod and %Bias measures computed for the multi-year calibration, validation, and combined calibration-validation periods, mixed results were achieved. No single model performed best in all periods in all basins. In addition, no distributed model consistently performed better than the benchmark LMP model. However, three models (OHD, UOB, and CEM) were able to outperform the LMP model for certain periods in certain basins.
The models were also inter-compared by evaluating specific precipitation/runoff events. Here, only one model (OHD) was able to perform at a level near to or better than the LMP benchmark for peak flow and runoff volume. However, three models (OHD, CEM, and UPV) achieved improvements in peak event timing compared to LMP, highlighting the potential of distributed models to capture spatially-variable precipitation and runoff processes. This evaluation of precipitation/runoff events showed that taken together, distributed models were able to provide improved peak flow values in 24% of the 17 model-basin pairs, improved runoff volume in 12% of the pairs, and improved peak timing in 41% of the pairs. Even though the gains by distributed models over the LMP benchmark were modest, all models performed well compared to those in the less-hydrologically-complex Oklahoma basins in DMIP 2. For example, the r mod and %Bias results of all models in the multi-year run-period tests are commensurate with those in the non-snow-dominated DMIP 2 Oklahoma basins (Smith et al., 2012b). Similarly, the event-averaged absolute % runoff error and absolute % peak error values agree well with the range of values for the DMIP 2 ELDO2 basin (Smith et al., 2012b). These results are noteworthy in that the DMIP 2 West basins have complexities such as orographic enhancement of precipitation, snow accumulation and melt, rain-on-snow events, and highly varied topography which are not present in the DMIP 2 Oklahoma basins. Looking at the results herein and the DMIP experiments overall, it is clear that at this point in their evolution, distributed models have the potential to provide valuable information on specific flood events that could complement lumped model simulations.
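The percentages quoted above can be related back to whole numbers of model-basin pairs. The short sketch below is only an arithmetic check. The pair counts it prints are inferred from the rounded percentages and the stated total of 17 pairs, and they are not reported in the text.

```python
def implied_pair_count(percent, n_pairs):
    """Nearest whole number of pairs consistent with a rounded percentage."""
    return round(percent / 100 * n_pairs)


N_PAIRS = 17  # number of model-basin pairs stated in the text
reported = {"peak flow": 24, "runoff volume": 12, "peak timing": 41}  # percentages from the text

counts = {name: implied_pair_count(pct, N_PAIRS) for name, pct in reported.items()}
for name, count in counts.items():
    print(f"{name}: {count}/{N_PAIRS} = {100 * count / N_PAIRS:.1f}%")
```

Under this reading, the quoted percentages are consistent with roughly four, two, and seven of the 17 pairs for peak flow, runoff volume, and peak timing, respectively.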
Based on these mixed results, care must be taken to examine a range of statistical measures, simulation periods, and even hydrograph plots when evaluating the performance of distributed models compared to lumped models. As in the DMIP 2 Oklahoma basins (Smith et al., 2012b), our results actually reflect the model/user combination and not the models themselves.
It proved difficult to determine the dominant factors which led to the improvement of the OHD distributed model over the benchmark LMP model for mixed rain/snow events. Our limited analyses on one of two study basins identified complex interactions of precipitation partitioning, spatial variability of liquid precipitation, runoff generation, and channel routing.

Estimation of model inputs and model sensitivity to existing data
The distributed models used gridded forms of the precipitation and temperature data widely used by NWS RFCs in mountainous areas for hydrologic model parameter calibration. In the study basins, the density of precipitation and temperature gauges was sufficient to develop useful gridded estimates of these variables over a 20-year span. A sufficient number of hourly (recording) rain gauges were available to distribute daily precipitation observations. These data were able to support effective model calibration and good simulations through the validation period, as evidenced by %Bias values within or near the ±5% criterion for NWS model calibration (Smith et al., 2003), low cumulative runoff errors, and high values of r mod.
For this study, careful quality control of the raw precipitation data was essential. This seemed especially warranted given the sensitivity of the hydrologic models noted in the development of the QPE data. Numerous errors, especially in the precipitation observations, were identified and corrected. The OHD and LMP models were sensitive to these errors for hourly time step simulations of mixed rain/snow events. Such errors manifested themselves as anomalous hydrograph peaks. The impact of such precipitation data errors may not be as evident in streamflow hydrographs that are dominated by snow melt.

Internal consistency
The ability of distributed models to simulate snow accumulation and melt was investigated at six SWE observation sites. The best results in terms of timing and volume were seen at the higher elevation stations. Larger errors in simulated SWE were apparent at a station near the typical elevation separating rain and snow. In addition, larger errors in timing and volume of snow accumulation and melt were seen in the distributed models featuring larger computational element sizes. This result may reflect the scale mismatch between the point observations and the computational element size. Our findings should be viewed in light of the considerable uncertainty that exists in the SWE observations and their representativeness of the surrounding area.
A limited test with one interior flow point showed that some distributed models calibrated at the outlet were able to achieve valid simulations of streamflow at the interior location. In particular, the overall r mod statistics for the blind interior point CMEC1-2 were commensurate with those achieved through explicit calibration at the interior point (CMEC1-3). However, these good results may hinge on the large size of the interior basin compared to the parent basin.

Scale issues
Scale issues continue to be perplexing in mountainous areas. While our study was limited in scope, the results address a void in the literature regarding modeling scales that consider snow accumulation, melt, and runoff generation. Even in the highly variable terrain of NFDC1, a range of modeling scales led to relatively good streamflow simulations. Among the models that shared the same snow/rainfall/runoff schemes, better event statistics were achieved at higher resolution modeling scales. Considering all the models, which admittedly represented a mix of physics, user knowledge, and model scales, it was surprising that more apparent trends did not appear given the range of modeling resolution from 250 m to two elevation zones. We were not able to pinpoint an optimal modeling scale.

Parameter calibration
While not an explicitly identified science question, our results show that parameter calibration led to improved goodness-of-fit statistics for nearly all model-basin pairs. This suggests that calibration strategies can be effective in areas with complex hydrology. It also suggests that calibration strategies are needed even with advances in model structure and the development of a priori parameter estimates.

Recommendations
While DMIP 2 West provided interesting and informative results, much work remains to further address the science questions posed in DMIP 2 West and other issues that have bearing on mountainous area hydrologic simulation and forecasting.
Our results should be further examined in the context of uncertainty in forcing data and model parameters. For SWE, one simple method would be to use the approach of Shamir and Georgakakos (2006). In their approach, the uncertainty bounds were defined by running the snow model on an adjacent south-facing grid cell and a nearby north-facing grid cell. The resultant SWE simulations formed the lower and upper uncertainty bounds, respectively. Another idea is to use the results from Molotch and Bales (2006) to understand the relationship between SNOTEL SWE observations and the SWE simulations generated over the computational units within the participants' models. The amount of forest cover at each SWE site should be derived so that our results can be placed in the context of the SnowMIP 2 results (Rutter et al., 2009).
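The bounding idea attributed to Shamir and Georgakakos (2006) can be sketched as a simple envelope test. The code below is an illustration only. The three SWE series are invented, the south-facing run is taken as the lower bound and the north-facing run as the upper bound as described above, and the function name is hypothetical.

```python
import numpy as np


def fraction_within_bounds(series, lower, upper):
    """Fraction of time steps at which a SWE series lies inside the envelope."""
    series, lower, upper = (np.asarray(a, float) for a in (series, lower, upper))
    inside = (series >= lower) & (series <= upper)
    return inside.mean()


# Invented SWE values (mm) at seven time steps.
south = np.array([0, 40, 90, 120, 80, 20, 0], float)     # south-facing cell run: lower bound
north = np.array([0, 60, 130, 180, 160, 90, 30], float)  # north-facing cell run: upper bound
site = np.array([0, 50, 100, 150, 170, 60, 10], float)   # series being checked against the envelope

frac = fraction_within_bounds(site, south, north)
print(f"{100 * frac:.0f}% of time steps fall inside the bounds")
```

A test of this kind could give the %Bias and SCD statistics some context, since departures that stay inside the envelope may be no larger than the variability expected from aspect alone.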
Continued efforts are necessary to diagnose the causes of differences between distributed and lumped model simulations. These efforts will require detailed analyses, probably along the lines of the hydrologic and meteorological studies of Lundquist et al. (2008) and Minder et al. (2011), respectively. While additional work can be done with the data on hand, the advanced data available from the HMT-West program will undoubtedly aid in this process.
DMIP 2 West was formulated as a general evaluation of distributed and lumped models in complex terrain without specific tests to highlight the benefits of model structure. To address this limitation, experiments are recommended to uncover and diagnose the impacts of model structure on performance (e.g., Clark et al., 2011; Butts et al., 2004).
The benefits of using the HMT-West data sets of additional surface temperature (Lundquist et al., 2008) and precipitation, optical disdrometer, vertically pointing radar-based freezing level (Mizukami et al., 2013; Minder and Kingsmill, 2013; Lundquist et al., 2008; White et al., 2010, 2002), soil moisture, and gap-filling radar-derived QPE (e.g., Gourley et al., 2009) should continue to be explored. These data sets should also aid in the diagnosis of modeling improvements. DMIP 2 West was always intended to be a multi-institutional and multi-model evaluation of the QPE, disdrometer, soil moisture, radar-freezing level, and other observations afforded by the rich instrumentation deployments in HMT-West. The intent was to first generate streamflow and SWE simulations using the 'basic' DMIP 2 West gage-only QPE and temperature fields. After calibrating and running their models with the basic data, it was planned to have participants rerun their models using the HMT-West data (radar QPE, snow level, and soil moisture) to note the improvements gained by advanced observations. However, both DMIP 2 West and HMT-West experienced major delays, with the unfortunate result being that the HMT-West data sets could not be explored in formal DMIP 2 West experiments.
Based on our experience with deriving precipitation, temperature, and evaporation forcing data sets for DMIP 2 West, continued work in deriving these forcings in complex terrain is of near-paramount importance for model testing, development, and calibration. Continued work is needed to address gage network density issues in mountainous areas. This is true for gage-only QPE and for the use of rain gages to bias-adjust radar estimates of precipitation. In spite of the enormous effort involved, data sets covering a large number of basins would support additional experiments and lead to broader conclusions (e.g., Andréassian et al., 2009, 2006). River Forecast Centers within the NWS should consider the use of the OHD model. Other operational forecasting agencies should consider the use of distributed models in complex terrain. For the foreseeable future, such models should be viewed as complements to existing lumped forecast models rather than outright replacements.

Table A1
Participating groups and major model characteristics.

Howard of NSSL assisted with processing the PRISM data. The careful reviews and comments from the journal reviewers and editors have contributed to the clarity of this paper.
Appendix D. Statistical equations used in the analysis of DMIP 2 West results

D.1. Percent bias, PB (%)
PB is a measure of total volume difference between two time series. PB is computed as: where S i is the simulated discharge for each time step i, O i is the observed value, and N is the total number of values within the time period of analysis. While not used explicitly in the DMIP West results analysis, we present the formula for the correlation coefficient as background Table B1 Calibration strategies for the DMIP 2 West models.

LMP
Systematic manual adjustment of parameters starting with baseflow and proceeding to fast response flow generation processes (Smith et al., 2003). Several statistical measures used at different points in the process to evaluate the fit of the simulation OHD Start with a priori parameters defined from soil texture. Revise a priori parameters using lumped calibrated parameters (derived using procedures in Smith et al., 2003): scale gridded a priori values by ratio of the SAC-SMA parameter value from the lumped calibration to the average parameter value from the a priori grid. Evaluate if this initial scaling is appropriate. Then use scalar multipliers to uniformly adjust each parameter field while maintaining spatial variability. Scalars are calibrated manually and/or automatically. Automatic calibration uses a multi-time scale objective function (Kuzmin et al., 2008) CEM The parameters were estimated using a steepest-descent type method combined with overall prior screening of the parameter space (see e.g. Mathevet, 2005). The resulting parameters were applied identically to all sub-basins UPV Use correction factors to globally modify each parameter map, assuming the prior spatial structure and thus reducing drastically the number of variables to be calibrated. In the used TETIS configuration, there were a total of nine correction factors: eight affecting the runoff production parameter maps and one for the stream network velocity. The TETIS model includes an automatic calibration module based on the SCE-UA algorithm (Duan et al., 1994). For this application, the objective function was the Nash-Sutcliffe efficiency index. 
The calibration was carried out in three steps: (1) calibration of rainfall-runoff parameters using the 1990 summer period; (2) calibration of snowmelt parameters using the 1992-93 and 1994-95 winter periods; (3) refinement of rainfall-runoff parameters using the period 1989-1993 UCI The optimal parameter set was estimated through calibration of the lumped SAC-SMA and SNOW-17 models (over the entire watershed) using SCE-UA calibration algorithm (Duan et al., 1993) and MACS calibration scheme (Hogue et al., 2000). The resultant parameter set was then applied identically to all subbasins in the distributed model configuration in order to generate streamflow at the outlet and interior points UOB Soils parameters derived from STATSGO texture classes in each grid. Adjusted horizontal and vertical saturated hydraulic conductivity and soil depth.
The calibration process was carried out with a 'trial and error' methodology, focusing on the highest flood events in order to obtain the best model performance according to the Nash and Sutcliffe coefficient (Coccia et al., 2009).

[...] for the discussion on the modified correlation coefficient. The correlation coefficient r is defined as: [...]

Modified correlation coefficient, $r_{mod}$ (McCuen and Snyder, 1975)

In this statistic, the normal correlation coefficient is reduced by the ratio of the standard deviations of the observed and simulated hydrographs. The minimum standard deviation (numerator) and maximum standard deviation (denominator) are selected so as to derive an adjustment factor less than unity:

$$r_{mod} = r \cdot \frac{\min\{\sigma_{sim}, \sigma_{obs}\}}{\max\{\sigma_{sim}, \sigma_{obs}\}}$$

D.6. Root mean square error (%)

[...] × 100

D.7. The following aggregate statistics were generated for selected individual events

a. Percent absolute event runoff error, $E_r$, %

This is the absolute value of the runoff bias from several events, expressed as a percentage:

$$E_r = \frac{\sum_{i=1}^{N} |B_i|}{N \cdot Y_{avg}} \cdot 100$$

b. Percent absolute peak error, $E_p$, %

This is the absolute value of the error in peak discharge for several events, expressed as a percentage:

$$E_p = \frac{\sum_{i=1}^{N} |Q_{p,i} - Q_{ps,i}|}{N \cdot Q_{p,avg}} \cdot 100$$

c. Percent absolute peak time error, $E_t$, h

This is the absolute value of the error in peak time, averaged over several events and expressed in hours:

$$E_t = \frac{\sum_{i=1}^{N} |T_{p,i} - T_{ps,i}|}{N}$$

where $B_i$ is the runoff bias of the ith flood event, mm; $Y_{avg}$ the average observed flood event runoff, mm; $Q_{p,i}$ the observed peak discharge of the ith flood event, m³ s⁻¹; $Q_{ps,i}$ the simulated peak discharge of the ith flood event, m³ s⁻¹; $Q_{p,avg}$ the average observed peak discharge, m³ s⁻¹; $T_{p,i}$ the observed time to the ith peak, h; $T_{ps,i}$ the simulated time to the ith peak, h; and $N$ the number of selected events.
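As an illustration only, the modified correlation coefficient and the three event statistics of D.7 can be computed as in the following Python sketch. It is not code from any DMIP 2 participant. The function names are hypothetical, the correlation coefficient is taken to be the standard Pearson form, and the runoff bias $B_i$ is assumed to be the simulated minus the observed event runoff; neither assumption is confirmed by the text above.

```python
from math import sqrt
from statistics import mean, pstdev


def correlation(sim, obs):
    """Pearson correlation coefficient r (standard form, assumed here)."""
    ms, mo = mean(sim), mean(obs)
    cov = sum((s - ms) * (o - mo) for s, o in zip(sim, obs))
    var_s = sum((s - ms) ** 2 for s in sim)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / sqrt(var_s * var_o)


def modified_correlation(sim, obs):
    """r_mod = r * min(sd_sim, sd_obs) / max(sd_sim, sd_obs)."""
    # Population standard deviations; the min/max ratio is the same
    # for sample standard deviations because both series share N.
    sd_s, sd_o = pstdev(sim), pstdev(obs)
    return correlation(sim, obs) * min(sd_s, sd_o) / max(sd_s, sd_o)


def event_errors(y_obs, y_sim, qp_obs, qp_sim, tp_obs, tp_sim):
    """Return (E_r in %, E_p in %, E_t in hours) over N selected events."""
    n = len(y_obs)
    # Assumption: runoff bias B_i = simulated - observed event runoff (mm).
    bias = [s - o for s, o in zip(y_sim, y_obs)]
    e_r = 100.0 * sum(abs(b) for b in bias) / (n * mean(y_obs))
    e_p = 100.0 * sum(abs(o - s) for o, s in zip(qp_obs, qp_sim)) / (n * mean(qp_obs))
    e_t = sum(abs(o - s) for o, s in zip(tp_obs, tp_sim)) / n
    return e_r, e_p, e_t
```

Because only absolute values of $B_i$ enter $E_r$, the sign convention assumed for the bias does not affect the result.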
D.8. Statistics to measure improvement over the LMP benchmark

a. Flood runoff improvement, $I_y$, %

This statistic measures the improvement in computed runoff volume:

$$I_y = \frac{\sum_{i=1}^{N} \left( |Y_i - Y_{s,i}| - |Y_i - Y_{z,i}| \right)}{N \cdot Y_{avg}} \cdot 100$$

b. Peak flow improvement, $I_p$, %

This statistic quantifies the gain in simulating the peak event discharge:

$$I_p = \frac{\sum_{i=1}^{N} \left( |Q_{p,i} - Q_{ps,i}| - |Q_{p,i} - Q_{pz,i}| \right)}{N \cdot Q_{p,avg}} \cdot 100$$

c. Peak time improvement, $I_t$

This statistic measures the improvement in simulated peak time:

$$I_t = \frac{\sum_{i=1}^{N} \left( |T_{p,i} - T_{ps,i}| - |T_{p,i} - T_{pz,i}| \right)}{N}$$

where $Y_i$ is the observed runoff volume of the ith flood, mm; $Y_{s,i}$ the (distributed model) simulated runoff volume of the ith event, mm; $Y_{z,i}$ the (lumped model) simulated runoff of the ith flood to compare with, mm; $Y_{avg}$ the average observed flood event runoff volume of N events, mm; $Q_{p,i}$ the observed peak discharge of the ith event, m³ s⁻¹; $Q_{ps,i}$ the (distributed model) simulated peak discharge of the ith event, m³ s⁻¹; $Q_{pz,i}$ the (lumped model) simulated peak discharge, m³ s⁻¹; $Q_{p,avg}$ the average observed peak discharge of N events, m³ s⁻¹; $T_{p,i}$ the observed time of the ith peak, h; $T_{ps,i}$ the (distributed model) simulated time of the ith peak, h; $T_{pz,i}$ the (lumped model) simulated time to the ith peak, h; and $N$ the number of selected events.
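The three improvement statistics share one structure: the mean, over events, of the distributed-model absolute error minus the lumped-model absolute error, normalized by the average observed value for $I_y$ and $I_p$. A minimal Python sketch under that reading follows. The function and argument names are hypothetical, and the term order is the one in the surviving $I_p$ and $I_t$ expressions of this appendix.

```python
from statistics import mean


def _mean_error_difference(obs, dist, lump):
    """Mean over events of |obs - dist| - |obs - lump|."""
    return sum(abs(o - d) - abs(o - l)
               for o, d, l in zip(obs, dist, lump)) / len(obs)


def improvement_stats(y_obs, y_dist, y_lump,
                      qp_obs, qp_dist, qp_lump,
                      tp_obs, tp_dist, tp_lump):
    """Return (I_y in %, I_p in %, I_t in hours) relative to the lumped model.

    With the term order used here, a negative value means the distributed
    model has the smaller absolute error for that quantity.
    """
    i_y = 100.0 * _mean_error_difference(y_obs, y_dist, y_lump) / mean(y_obs)
    i_p = 100.0 * _mean_error_difference(qp_obs, qp_dist, qp_lump) / mean(qp_obs)
    i_t = _mean_error_difference(tp_obs, tp_dist, tp_lump)
    return i_y, i_p, i_t
```

Sharing a single helper keeps the three statistics consistent with one another, so a change in sign convention would need to be made in only one place.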