Multi-model ensemble hydrologic prediction using Bayesian model averaging

The multi-model ensemble strategy is a means to exploit the diversity of skillful predictions from different models. This paper studies the use of the Bayesian model averaging (BMA) scheme to develop more skillful and reliable probabilistic hydrologic predictions from multiple competing predictions made by several hydrologic models. BMA is a statistical procedure that infers consensus predictions by weighting individual predictions based on their probabilistic likelihood measures, with the better performing predictions receiving higher weights than the worse performing ones. Furthermore, BMA provides a more reliable description of the total predictive uncertainty than the original ensemble, leading to a sharper and better calibrated probability density function (PDF) for the probabilistic predictions. In this study, a nine-member ensemble of hydrologic predictions was used to test and evaluate the BMA scheme. This ensemble was generated by calibrating three different hydrologic models with three distinct objective functions, chosen so as to force the models to capture particular aspects of the hydrograph well (e.g., peaks, mid-flows and low flows). Two sets of numerical experiments were carried out on three test basins in the US to explore the best way of using the BMA scheme. In the first set, a single set of BMA weights was computed to obtain BMA predictions, while the second set employed multiple sets of weights, with distinct sets corresponding to different flow intervals. In both sets, the streamflow values were transformed using the Box–Cox transformation to ensure that the probability distribution of the prediction errors is approximately Gaussian. A split-sample approach was used to obtain and validate the BMA predictions. The test results showed that the BMA scheme has the advantage of generating more skillful and equally reliable probabilistic predictions than the original ensemble.
The performance of the expected BMA predictions in terms of the daily root mean square error (DRMS) and daily absolute error (DABS) is generally superior to that of the best individual predictions. Furthermore, the BMA predictions employing multiple sets of weights are generally better than those using a single set of weights.


Introduction
The prevailing practice among hydrologists to date has been to rely on a single hydrologic model for hydrologic predictions. Despite the tremendous resources invested in developing ever more hydrologic models, no one can convincingly claim that any particular model in existence today is superior to the others for all types of applications and under all conditions [42,43,35,3,13]. Different models have strengths in capturing different aspects of the hydrologic processes. Relying on a single model often leads to predictions that represent some phenomena or events well at the expense of others. Furthermore, a proper accounting of the uncertainty associated with these predictions has not received adequate attention. Ensemble approaches based on multiple parameter sets and ensemble hydrologic forcing inputs can help improve uncertainty estimation [4,20,41,23], but the structural error inherent in any single model cannot be avoided by this kind of ensemble strategy [17]. This has motivated a number of researchers to advocate multi-model methods for hydrologic predictions [33,17,1,3].
Multi-model methods have been used in various forecasting applications, such as economic and weather forecasting, since as early as the 1960s [2,9,10,28,38]. Shamseldin and colleagues were probably the first to explore the use of multi-model methods for hydrologic predictions [33]. Georgakakos et al. [17] recently used a multi-model combination approach to analyze the simulation results from multiple models that participated in the distributed model intercomparison project (DMIP) [35]. These multi-model techniques provide consensus predictions by linearly combining individual model predictions according to different weighting strategies. The weights can be equal for all models in the simplest case, or be determined through certain regression-based methods, in which case the weights are the regression coefficients. Shamseldin and O'Connor [34] also explored the use of artificial neural network (ANN) techniques to estimate the model weights. Raftery et al. [29] pointed out that the weights determined by these regression-based techniques are hard to interpret because they take on arbitrary negative or positive values and are not connected to model performance. Furthermore, the reliability of the multi-model predictions from these approaches is not satisfactory. Nevertheless, multi-model ensemble averages produced by these methods have been shown to consistently perform better than single-model predictions when evaluated with various predictive skill and reliability scores [33,34,19,45,17,1].
Recently, Bayesian model averaging (BMA) has gained popularity in diverse fields such as statistics, management science, medicine and meteorology [18,40,15,29,30]. Like predictions from other multi-model methods, BMA predictions are weighted averages of the individual predictions from competing models. But unlike some multi-model methods, BMA also provides a more realistic description of the predictive uncertainty, one that accounts for both between-model and within-model variances. The BMA weights, all positive and summing to 1, reflect relative model performance because they are the probabilistic likelihood measures of a model being correct given the observations. In various case studies, BMA has been shown to produce more accurate and reliable predictions than other multi-model techniques (Raftery et al., 1997; [8,40,31,30,14]). Recently, BMA methods have also been applied to hydrologic applications such as groundwater modeling by Neuman and Wierenga [27,26].
This study explores the use of BMA for hydrologic streamflow predictions. We are interested in how the BMA scheme can be used to improve both the accuracy and reliability of streamflow predictions. In particular, we investigate different ways to apply the BMA scheme so as to fully exploit the strengths of the individual models. This paper is organized as follows. Section 2 presents the BMA methodology. Section 3 discusses the generation of the hydrologic model ensemble and the design of the numerical experiments and test data sets. Section 4 describes the test and validation results of multi-model predictions using BMA schemes. Section 5 provides a summary and conclusions.

Bayesian model averaging (BMA)
Bayesian model averaging (BMA) is a statistical scheme to infer a probabilistic prediction that possesses more skill and reliability than the original ensemble members produced by several competing models [22,29]. BMA has been used primarily in generalized linear regression applications. Recently, Raftery et al. [29,30] successfully applied BMA to dynamical modeling applications (i.e., numerical weather predictions). In this study, we apply BMA to streamflow prediction problems. The BMA scheme is briefly described as follows.
Consider a quantity $y$ to be the forecast variable (or predictand), let $D = [y_1^{obs}, y_2^{obs}, \ldots, y_T^{obs}]$ be the training data with data length $T$, and let $f = [f_1, f_2, \ldots, f_K]$ be the ensemble of all considered model predictions. $p_k(y \mid f_k, D)$ is the posterior distribution of $y$ given model prediction $f_k$ and observational data set $D$. According to the law of total probability, the probability density function (PDF) of the BMA probabilistic prediction of $y$ can be represented as:

$$p(y \mid D) = \sum_{k=1}^{K} p(f_k \mid D)\, p_k(y \mid f_k, D) \qquad (1)$$

where $p(f_k \mid D)$ is the posterior probability of model prediction $f_k$, also known as the likelihood of model prediction $f_k$ being the correct prediction given the observational data $D$. This term reflects how well a particular ensemble member matches the observations. If we denote $w_k = p(f_k \mid D)$, then $\sum_{k=1}^{K} w_k = 1$. The posterior mean and variance of the BMA prediction can be expressed as [29,30]:

$$E[y \mid D] = \sum_{k=1}^{K} w_k f_k \qquad (2)$$

$$\mathrm{Var}[y \mid D] = \sum_{k=1}^{K} w_k \left( f_k - \sum_{l=1}^{K} w_l f_l \right)^2 + \sum_{k=1}^{K} w_k \sigma_k^2 \qquad (3)$$

where $\sigma_k^2$ is the variance associated with model prediction $f_k$ with respect to observation $D$. In essence, the expected BMA prediction is the average of the individual predictions weighted by the likelihood that each model is correct given the observations. The BMA prediction has several attractive properties. First, better performing models receive higher weights, since the likelihood of a model is essentially a measure of the agreement between its predictions and the observations. Second, the BMA variance is an uncertainty measure of the BMA prediction. It contains two components: the between-model variance and the within-model variance, given by the first and second terms on the right hand side of Eq. (3). This measure describes the predictive uncertainty better than a non-BMA scheme, which estimates uncertainty from the ensemble spread alone (i.e., only the between-model variance is considered) and consequently produces under-dispersive predictions [29].
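As a concrete illustration, the posterior mean and variance described above can be computed directly once the weights $w_k$ and variances $\sigma_k^2$ are known. The sketch below uses assumed numerical values for a single time step; all numbers are illustrative, not from the study.

```python
import numpy as np

# Hypothetical example: K = 3 model predictions at a single time step,
# with assumed BMA weights w_k and per-model variances sigma_k^2.
f = np.array([10.0, 12.0, 11.0])    # individual model predictions f_k
w = np.array([0.5, 0.3, 0.2])       # BMA weights, all positive, sum to 1
sigma2 = np.array([1.0, 2.0, 1.5])  # within-model variances sigma_k^2

# Expected BMA prediction: likelihood-weighted average of the f_k (Eq. (2))
bma_mean = np.sum(w * f)

# BMA variance: between-model spread plus within-model variance (Eq. (3))
between = np.sum(w * (f - bma_mean) ** 2)
within = np.sum(w * sigma2)
bma_var = between + within
```

Note how the total variance exceeds the between-model spread alone; an ensemble-spread-only uncertainty estimate would omit the `within` term and be under-dispersive.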
Before we present the BMA algorithm, it is assumed that the conditional probability distribution $p_k(y \mid f_k, D)$ is Gaussian. Because the probability distribution of streamflow error is non-Gaussian, both modeled and observed streamflow data are pre-processed using the Box–Cox transformation prior to the BMA procedure, so that the transformed variables are close to Gaussian. The Gaussian assumption is made for computational convenience; the BMA scheme can also be applied with other probability distributions. Statistical techniques such as the Markov chain Monte Carlo (MCMC) method are capable of simulating any complex probability distribution and can therefore serve as a strategy for conducting BMA without the Gaussian approximation [21]. However, this is beyond the focus of this paper. We work with the log-likelihood function since it is more convenient to compute than the likelihood function itself. If we denote $\theta = \{w_k, \sigma_k;\ k = 1, 2, \ldots, K\}$, the log-likelihood function can be approximated as:

$$\ell(\theta) = \sum_{t=1}^{T} \log \left( \sum_{k=1}^{K} w_k \, g(y_t^{obs} \mid f_{k,t}, \sigma_k^2) \right) \qquad (4)$$

where $g(\cdot)$ denotes the Gaussian distribution. Obviously, it is impossible to obtain an analytical solution for $\theta$, and an iterative procedure must be used. Following the recommendation of Raftery et al. [30], we used the Expectation-Maximization (EM) algorithm for this purpose. In brief, the EM algorithm casts the maximum likelihood problem as a "missing data" problem. The missing data need not be actual data; rather, they can be a latent variable that needs to be estimated. For this study, a latent variable $z_{k,t}$ is introduced: if the $k$th ensemble member is the best prediction at time $t$, then $z_{k,t} = 1$; otherwise $z_{k,t} = 0$. At any time $t$, exactly one $z_{k,t}$ equals 1 and the rest equal 0. As its name suggests, the EM algorithm alternates between the E (expectation) step and the M (maximization) step. It starts with an initial guess, $\theta^{(0)}$, for the parameter $\theta$. In the E step, $z_{k,t}$ is estimated given the current guess of $\theta$. In the M step, $\theta$ is estimated given the current values of $z_{k,t}$. The EM steps are repeated until certain convergence criteria are satisfied. The EM algorithm is illustrated in Box 1. Note that the EM algorithm can only find a local optimum, and the solution is sensitive to the initial guess of the optimizing variables. For a more detailed description of the EM algorithm, readers are referred to McLachlan and Krishnan [24].

Box 1: EM algorithm

A. Initialization: set $Iter = 0$, $w_k^{(0)} = 1/K$, and $\sigma_k^{2(0)} = \frac{1}{KT} \sum_{k=1}^{K} \sum_{t=1}^{T} (y_t^{obs} - f_{k,t})^2$, where $T$ is the total number of data points in the training period.

B. Computing the initial likelihood:
$$\ell(\theta^{(Iter)}) = \sum_{t=1}^{T} \log \left( \sum_{k=1}^{K} w_k^{(Iter)} g(y_t^{obs} \mid f_{k,t}, \sigma_k^{2(Iter)}) \right) \qquad (B1)$$
where $g(\cdot)$ denotes the Gaussian distribution.

C. Executing the expectation step: set $Iter = Iter + 1$ and compute
$$z_{k,t}^{(Iter)} = \frac{w_k^{(Iter-1)}\, g(y_t^{obs} \mid f_{k,t}, \sigma_k^{2(Iter-1)})}{\sum_{l=1}^{K} w_l^{(Iter-1)}\, g(y_t^{obs} \mid f_{l,t}, \sigma_l^{2(Iter-1)})}$$

D. Executing the maximization step: compute the weight
$$w_k^{(Iter)} = \frac{1}{T} \sum_{t=1}^{T} z_{k,t}^{(Iter)}$$
update the variance
$$\sigma_k^{2(Iter)} = \frac{\sum_{t=1}^{T} z_{k,t}^{(Iter)} (y_t^{obs} - f_{k,t})^2}{\sum_{t=1}^{T} z_{k,t}^{(Iter)}}$$
and update the likelihood using Eq. (B1).

E. Checking convergence: if $\ell(\theta^{(Iter)}) - \ell(\theta^{(Iter-1)})$ is less than or equal to a pre-specified tolerance level, stop; else go back to Step C.
With proper estimates of $\theta = \{w_k, \sigma_k;\ k = 1, 2, \ldots, K\}$ and $p_k(y \mid f_k, \theta, D)$, we can easily generate probabilistic predictions based on Eq. (1). An algorithm to generate BMA probabilistic predictions is presented later in Section 4.3.2.
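The EM iteration in Box 1 can be sketched compactly in NumPy. The following is a minimal illustration under the paper's Gaussian assumption, not the authors' exact implementation; the function name `bma_em` and the synthetic demo data are our own.

```python
import numpy as np

def bma_em(y, F, tol=1e-6, max_iter=500):
    """EM estimation of the BMA weights w_k and variances sigma_k^2.

    y : (T,) observations (assumed already Box-Cox transformed);
    F : (T, K) ensemble member predictions. A sketch of Box 1.
    """
    T, K = F.shape
    w = np.full(K, 1.0 / K)                        # A. initialization
    sigma2 = np.full(K, np.mean((F - y[:, None]) ** 2))

    def g(y, mu, s2):                              # Gaussian density g(.)
        return np.exp(-0.5 * (y - mu) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

    ll_old = -np.inf
    for _ in range(max_iter):
        dens = w * g(y[:, None], F, sigma2)        # (T, K)
        ll = np.sum(np.log(dens.sum(axis=1)))      # B. likelihood, Eq. (B1)
        if ll - ll_old <= tol:                     # E. convergence check
            break
        ll_old = ll
        z = dens / dens.sum(axis=1, keepdims=True)   # C. expectation step
        w = z.mean(axis=0)                           # D. maximization step
        sigma2 = np.sum(z * (y[:, None] - F) ** 2, axis=0) / z.sum(axis=0)
    return w, sigma2, ll

# Synthetic demo: member 0 tracks the observations far better than member 1,
# so EM should assign it the larger weight and the smaller variance.
rng = np.random.default_rng(42)
obs = rng.normal(size=300)
ens = np.column_stack([obs + 0.1 * rng.normal(size=300),
                       obs + 2.0 * rng.normal(size=300)])
w, s2, ll = bma_em(obs, ens)
```

Note that the E step here uses the "soft" posterior responsibilities rather than hard 0/1 values of $z_{k,t}$, which is the standard EM treatment of the latent variable.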

Generation of hydrologic model ensemble
To test the BMA scheme for streamflow predictions, an ensemble of competing predictions from several hydrologic models was produced. For this study, we employed three conceptual hydrologic models: the Sacramento Soil Moisture Accounting (SAC-SMA) model, the Simple Water Balance (SWB) model, and the HYMOD model. SAC-SMA is the most complicated model among the three, with 16 model parameters, and remains the most widely used operational hydrologic model in the National Weather Service (NWS) for river and flood forecasting purposes [7]. SWB, also developed by the NWS, is a simple hydrologic model used operationally in the Nile River Forecast System in Egypt [32]. HYMOD is a simple hydrologic model developed for research purposes at the University of Arizona [5]. Both SWB and HYMOD have five tunable model parameters. All three models require precipitation and potential evapotranspiration data as forcing inputs. A detailed description of these individual models is outside the scope of this paper; interested readers should refer to the appropriate literature for a more in-depth understanding of these models.
The three hydrologic models were calibrated using the Shuffled Complex Evolution method (SCE-UA) [11,12]. Three distinct objective functions were used to force the hydrologic models to favor different phases of the hydrograph:

• Daily root mean square error (DRMS),
• Daily absolute error (DABS),
• Heteroscedastic maximum likelihood estimator (HMLE).

The analytical expressions of these objective functions are:

$$\mathrm{DRMS} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( y_t^{obs} - y_t^{est} \right)^2}$$

$$\mathrm{DABS} = \frac{1}{T} \sum_{t=1}^{T} \left| y_t^{obs} - y_t^{est} \right|$$

$$\mathrm{HMLE} = \frac{\frac{1}{T} \sum_{t=1}^{T} \omega_t \left( y_t^{obs} - y_t^{est} \right)^2}{\left[ \prod_{t=1}^{T} \omega_t \right]^{1/T}}$$

where $y_t^{obs}$ and $y_t^{est}$ are the observed and estimated streamflow values, $\omega_t = f_t^{2(\lambda - 1)}$, and $\lambda$ is the unknown Box–Cox transformation parameter [36]. The first objective function, DRMS, forces the models to fit the high flows well, while the last objective function, HMLE, tends to push the models to match the low flows well. The second objective function, DABS, places equal emphasis on all parts of the hydrograph and is a compromise between DRMS and HMLE.
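The three objective functions are straightforward to compute. The sketch below uses assumed flow values, takes the estimated flows as the $f_t$ in the HMLE weights, and fixes the Box–Cox parameter at an illustrative value of 0.3; none of these choices is prescribed by the paper.

```python
import numpy as np

def drms(obs, est):
    # Daily root mean square error: emphasizes fitting the high flows
    return np.sqrt(np.mean((obs - est) ** 2))

def dabs(obs, est):
    # Daily absolute error: equal emphasis on all parts of the hydrograph
    return np.mean(np.abs(obs - est))

def hmle(obs, est, lam=0.3):
    # Heteroscedastic maximum likelihood estimator: the weights
    # w_t = f_t^(2*(lam-1)) de-emphasize large flows when lam < 1,
    # pushing the fit toward the low flows. lam = 0.3 is an assumed value.
    wts = est ** (2.0 * (lam - 1.0))
    return np.mean(wts * (obs - est) ** 2) / np.exp(np.mean(np.log(wts)))

obs = np.array([1.0, 5.0, 20.0])   # assumed observed flows
est = np.array([1.2, 4.0, 25.0])   # assumed simulated flows
```

The geometric mean in the HMLE denominator is computed via `exp(mean(log(w)))`, which is numerically safer than forming the product of the weights directly.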
We carried out hydrologic model calibration on three hydrologic basins located in the United States: the Bird Creek basin, near Sperry, OK; the Leaf River basin, near Collins, MS; and the French Broad River basin at Blantyre, NC. These basins were chosen because they have been widely studied and their hydrologic data sets were carefully prepared by the NWS Hydrology Laboratory [37,32]. Table 1 lists the geophysical and climatic characteristics of these test basins, along with their hydrologic data periods. These basins span different hydroclimatic regimes, from semi-arid (e.g., Bird Creek), to moderate (e.g., Leaf River), to wet (e.g., French Broad River).
The combination of three models and three objective functions yields a nine-member ensemble of distinct model predictions for each test basin. This ensemble forms the basis for testing the BMA scheme in the next sections. Table 2 summarizes the DRMS statistics over the calibration period for each ensemble member in different flow intervals for the test basins. The entire flow range was broken into a number of flow intervals based on pre-specified non-exceedance thresholds (e.g., 10%, 25%, 50%, 75%, 90%, 100%). As expected, different ensemble members exhibit different goodness-of-fit statistics in different flow intervals, with SAC-SMA clearly the better model among the three in most cases. It is worth noting that, with a few exceptions, the ensemble members calibrated using DRMS tend to produce better statistics in the high flow ranges, while the ensemble members calibrated using HMLE generally produce better statistics in the low flow ranges. The ensemble members calibrated with DABS tend to favor the middle ranges.
A necessary condition for obtaining unbiased, optimal results using the BMA scheme, as outlined in Section 2, is that the likelihood function of the prediction error be properly computed. In this study, we employed a likelihood function that assumes the underlying variable is normally distributed, for computational simplicity. It is well known that the error in streamflow prediction is heteroscedastic and non-Gaussian [36,39]. To deal with this problem, we performed a Box–Cox transformation on the streamflow values prior to BMA testing so that the streamflow prediction error is approximately Gaussian.
In the following sections, we carry out several numerical experiments to assess the usefulness of the BMA scheme for streamflow prediction. In particular, the BMA scheme was evaluated using two strategies. In the first strategy, we applied the BMA scheme to obtain a single set of BMA weights for the entire Box–Cox transformed time series. In the second strategy, we broke the Box–Cox transformed streamflow values into several flow ranges and then applied the BMA scheme to each flow range separately. There is an intuitive advantage to the multi-flow-interval approach: the strengths of individual models in capturing different aspects of the hydrograph (i.e., peak flows and low flows) are reflected in the computation of the model weights. A model that predicts high peak flows better than the other models would be assigned a higher weight during peak flow periods. Conversely, a model that represents low flows better would be given a higher weight during low flow periods.
The results presented below seek to answer the following questions: (1) How do we exploit the diversity in skill levels of different predictions over different flow periods? (2) Do the BMA weights as defined in Eqs. (2)–(4) reflect the model performance statistics? (3) How consistent are the BMA predictions when the BMA weights obtained from the training periods are applied to independent validation periods?

Statistical verification criteria
Before the results from the numerical experiments are presented, we first define the criteria used to evaluate the performance of the model predictions. For hydrologic predictions, the common goals are to maximize predictive accuracy and reliability. There are many ways to measure these goals. In this study, we employed a number of criteria: DRMS and DABS (as defined in the previous section), the ranked probability score (RPS) and the reliability diagram (REL). DRMS and DABS are commonly used for evaluating the accuracy of deterministic predictions; here they are used to evaluate the agreement of the expected BMA predictions with the observations. For probabilistic predictions, it is desired that the probability density function (PDF) be sharp, subject to calibration. By "calibration", we mean that the PDFs of the predictions and the observations are consistent. RPS and REL are widely used measures for assessing the quality of probabilistic predictions. RPS is essentially the mean squared error of the probability forecasts averaged over multiple events. A small RPS value means that the PDF is sharp and well calibrated. In streamflow prediction, the probability forecast is often expressed as a non-exceedance probability forecast in pre-specified categories (i.e., 5%, 10%, 25%, 50%, 75%, 90%, 95%, 100% non-exceedance). The observed value for a given forecast category takes on the value 1 if the observed flow is less than the threshold for that category; otherwise, the observed value is 0. The analytical expression of RPS for an event is:

$$\mathrm{RPS}^{(t)} = \sum_{j=1}^{M} \left( F_j^{(t)} - O_j^{(t)} \right)^2$$

where $F_j^{(t)}$ is the forecast probability and $O_j^{(t)}$ is the observed value, $j = \{1, 2, \ldots, M\}$ is the probability category, and $t$ is the event index. Here, we treat the flow value for each day as an event. The average of $\mathrm{RPS}^{(t)}$ over an evaluation period $t = \{1, 2, \ldots, T\}$ gives the overall RPS:

$$\mathrm{RPS} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{RPS}^{(t)}$$

The reliability diagram (REL) measures how well the forecast probability matches the observations for all forecast categories.
In probability terms, REL is the conditional distribution of an observation given a particular forecast, $p(O \mid F)$. A perfect forecast implies $p(O = 1 \mid F) = F$. We will explain the interpretation of REL later when we present the results.
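A small worked example of the single-day RPS helps fix the definition. The thresholds and forecast probabilities below are assumed values for illustration; the categories are non-exceedance levels, so the forecast probabilities are non-decreasing and the last one is 1.

```python
import numpy as np

# Non-exceedance thresholds defining the forecast categories (assumed flow
# values; the last category covers everything, so its bound is infinite).
thresholds = np.array([2.0, 5.0, 10.0, 20.0, 35.0, np.inf])

def rps_one_day(forecast_probs, obs_flow):
    # O_j = 1 if the observed flow is below the j-th threshold, else 0;
    # the day's RPS is the sum of squared probability errors over categories.
    O = (obs_flow < thresholds).astype(float)
    return np.sum((forecast_probs - O) ** 2)

# Forecast non-exceedance probabilities per category (assumed values)
F = np.array([0.05, 0.20, 0.55, 0.80, 0.95, 1.0])

mid_range = rps_one_day(F, obs_flow=8.0)   # observation in the middle range
extreme = rps_one_day(F, obs_flow=50.0)    # observation the forecast missed
```

Averaging `rps_one_day` over all days of an evaluation period gives the overall RPS; a sharp, well-calibrated forecast keeps both terms small.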
For a more general discussion of verification statistics, readers are referred to Murphy et al. [25,44]. For a more detailed discussion of the verification of probabilistic hydrologic forecasts, readers are referred to Franz et al. and Bradley et al. [16,6]. Verification statistics such as DRMS, DABS and RPS are not meaningful when viewed in absolute terms, which is why skill scores are widely used in the verification literature [25,44]. The skill scores DRMSS, DABSS, and RPSS are computed as the percentage improvement over a reference:

$$\mathrm{DRMSS} = \frac{\mathrm{DRMS}^{*} - \mathrm{DRMS}}{\mathrm{DRMS}^{*}}, \qquad \mathrm{DABSS} = \frac{\mathrm{DABS}^{*} - \mathrm{DABS}}{\mathrm{DABS}^{*}}, \qquad \mathrm{RPSS} = \frac{\mathrm{RPS}^{*} - \mathrm{RPS}}{\mathrm{RPS}^{*}}$$

where DRMS, DABS, and RPS are the verification statistics of a prediction, and $\mathrm{DRMS}^{*}$, $\mathrm{DABS}^{*}$, and $\mathrm{RPS}^{*}$ are the reference verification statistics. In this study, $\mathrm{DRMS}^{*}$ and $\mathrm{DABS}^{*}$ are the verification statistics associated with the best individual prediction among the original ensemble, while $\mathrm{RPS}^{*}$ is the RPS value computed from the original ensemble. Note that for skill scores, the larger the value, the better the prediction.
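The skill-score computation is a one-liner; the numbers below are assumed purely for illustration.

```python
def skill_score(stat, stat_ref):
    # Percentage improvement of a verification statistic over its reference;
    # positive values mean the prediction beats the reference.
    return 100.0 * (stat_ref - stat) / stat_ref

# Assumed values: a BMA DRMS of 8.0 against a best-individual DRMS* of 10.0
drmss = skill_score(8.0, 10.0)
```

A prediction worse than its reference yields a negative score, so the sign alone indicates whether the multi-model prediction adds value.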

Box-Cox transformation of streamflow values
Before we applied the BMA scheme described in Section 2 to the nine-member ensemble shown in Table 2, a Box–Cox transformation was first performed on both the ensemble members and the observations. The Box–Cox transformation is given as:

$$z_t = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[4pt] \ln y_t, & \lambda = 0 \end{cases}$$

where $y_t$ is the original variable, $z_t$ is the transformed variable, and $\lambda$ is the Box–Cox coefficient. For each basin, we derived a common optimal estimate of $\lambda$ for all ensemble members and the observations based on Kolmogorov–Smirnov test statistics. Fig. 1 displays the normal probability plots of the original and transformed ensemble members. It is clear that the original ensemble members are highly non-Gaussian, while the transformed members appear much closer to normally distributed. Still, we notice that a few transformed ensemble members depart from the Gaussian distribution at the lower tail. The BMA scheme was applied to the transformed ensemble members to obtain a single set of BMA weights for each basin using all data points from the training period. The weights for the different ensemble members are shown in Fig. 2. Before we examine the BMA weights, let us first look at the performance statistics of the expected BMA predictions (denoted BMA 1 predictions hereafter), which are really alternative deterministic predictions to the individual predictions. Note that all statistics were computed on streamflow values in the original space (not the Box–Cox transformed space). The DRMSS and DABSS statistics of the expected BMA 1 predictions, along with those of the simple model average (SMA) predictions, are shown in Fig. 3a and b. Fig. 3a shows that the DRMSS statistics of the expected BMA 1 predictions are better than those of the best individual predictions, substantially so in the cases of Bird Creek and French Broad. In terms of the DABSS statistics, the BMA 1 predictions are slightly better than the best individual prediction in the Bird Creek basin.
But in the two other basins, the DABSS statistics of the BMA 1 predictions are slightly worse than those of the best individual predictions (Fig. 3b). The SMA predictions generally performed much worse than the best individual predictions, except in one basin (i.e., Bird Creek). This indicates that simply averaging the original ensemble predictions does not necessarily improve the accuracy of the predictions.
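The Box–Cox transformation and the selection of a common $\lambda$ by Kolmogorov–Smirnov (KS) statistics can be sketched as follows. The grid of candidate $\lambda$ values, the synthetic lognormal "flows", and the hand-rolled KS statistic are all our own illustrative choices, not the study's data or exact procedure.

```python
import math
import numpy as np

def boxcox(y, lam):
    # z_t = (y_t**lam - 1)/lam for lam != 0, and z_t = ln(y_t) for lam = 0
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def ks_normal(z):
    # Two-sided Kolmogorov-Smirnov distance between the standardized sample
    # and the standard normal CDF
    z = np.sort((z - z.mean()) / z.std())
    n = z.size
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    return max(np.max(np.arange(1, n + 1) / n - cdf),
               np.max(cdf - np.arange(0, n) / n))

# Choose a common lambda by minimizing the KS statistic over a grid; the
# synthetic "flows" are lognormal, so a lambda near 0 should win.
rng = np.random.default_rng(1)
flows = rng.lognormal(mean=1.0, sigma=0.8, size=500)

lams = np.linspace(0.0, 1.0, 21)
scores = [ks_normal(boxcox(flows, lam)) for lam in lams]
best_lam = lams[int(np.argmin(scores))]
```

In the study itself a single $\lambda$ is shared by all nine ensemble members and the observations of a basin, so the search would minimize a joint criterion rather than the one-series statistic shown here.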
One premise of the BMA scheme is that the BMA weights should reflect relative model performance. From Fig. 2, we quickly notice that the BMA weights do indeed roughly reflect individual model performance, with the SAC-SMA model weighted more heavily than the other models. The correlation coefficients between the BMA 1 weights and the DRMS and DABS statistics were computed for each basin and are shown in Table 3. All of these correlation coefficients are strongly negative, indicating that higher weights are associated with lower DRMS and DABS values. This confirms that the BMA weights do indeed reflect model performance.

Verification of the skill and reliability of the BMA 1 probabilistic predictions
One feature of the BMA scheme is that it can derive probabilistic ensemble predictions from competing individual deterministic predictions. Box 2 briefly describes how the BMA probabilistic ensemble predictions are generated (see also [30]). For this study, we generated 100 BMA ensemble predictions to obtain a reasonable empirical PDF at each time step. Fig. 4 displays the expected BMA 1 predictions along with the 90% confidence interval of the BMA 1 ensemble for one typical calendar year for each of the test basins. The corresponding observations are shown as dots. To put Fig. 4 in proper perspective, we also show the corresponding SMA predictions along with the 90% confidence interval of the original ensemble spread in Fig. 5. Fig. 6 shows the RPSS statistics. This figure clearly indicates that the RPS values of the BMA 1 predictions are significantly better (>30%) than those of the original ensemble predictions. This implies that the PDFs of the BMA 1 predictions are sharper and more consistent with the observations than those of the original ensemble predictions. Therefore, the BMA 1 predictions are much more skillful than the original ensemble. Fig. 7 shows the reliability diagrams of the BMA 1 predictions and the original ensemble predictions for three flow ranges: low flow (i.e., bottom 25% quantile based on observations), middle flow (i.e., middle 50% quantile) and high flow (i.e., top 25% quantile). The reliability of the BMA m predictions is also included in the figure. We will discuss the reliability results in Section 4.4.2.
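The generation of BMA ensemble members from a fitted Gaussian mixture can be sketched as a simple Monte Carlo procedure in the spirit of Raftery et al. [30]: at each time step, pick a model index $k$ with probability $w_k$, then draw from the conditional Gaussian $g(y \mid f_k, \sigma_k^2)$. This is an assumed illustration, not a reproduction of Box 2; all numerical values below are made up.

```python
import numpy as np

def bma_ensemble(F, w, sigma2, n_members=100, seed=0):
    """Draw BMA ensemble members at each time step.

    At every step, pick model k with probability w_k, then sample from the
    conditional Gaussian N(f_k, sigma_k^2). F is (T, K); returns
    an (n_members, T) array.
    """
    rng = np.random.default_rng(seed)
    T, K = F.shape
    out = np.empty((n_members, T))
    for m in range(n_members):
        k = rng.choice(K, size=T, p=w)   # which member is "correct" each day
        out[m] = rng.normal(F[np.arange(T), k], np.sqrt(sigma2[k]))
    return out

# Assumed values: T = 3 time steps, K = 2 members, member 0 dominant
F = np.array([[10.0, 11.0], [12.0, 13.0], [11.0, 12.0]])
w = np.array([0.8, 0.2])
sigma2 = np.array([0.5, 1.0])
ens = bma_ensemble(F, w, sigma2, n_members=100)
```

Empirical quantiles of `ens` along the member axis then give the confidence bands of the kind plotted in Fig. 4. When the fitting was done in Box–Cox space, the sampled members would be back-transformed before computing statistics in the original flow space.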

Testing of BMA scheme using multiple sets of weights
In the previous section, the BMA scheme was applied to the entire Box–Cox transformed time series. In this section, we broke the streamflow values from the training data period into several flow intervals. We then applied the BMA scheme to each flow range to obtain a distinct set of weights for that range. The BMA predictions for each flow range were computed individually using the BMA weights corresponding to that particular flow range. Afterwards, the BMA predictions for the different flow ranges were combined to obtain the BMA prediction for the entire training data period. In the following sections, the verification statistics were again computed in the original space (i.e., not in the Box–Cox transformed space).

Verification of the accuracy of the expected BMA probabilistic predictions obtained using multiple sets of weights
The streamflow values were broken into several flow ranges based on non-exceedance thresholds, as explained in Section 2. For this study, the flow ranges for the Leaf River and French Broad basins correspond to the 10%, 25%, 50%, 75%, 90% and 100% non-exceedance levels for each basin. Because about 23% of the streamflow values for the Bird Creek basin are equal to 0, the flow ranges for this basin correspond to 50%, 75%, 90% and 100%. For each flow range, we used the BMA scheme to estimate a distinct set of BMA weights. Table 4 shows the correlation coefficients between the BMA weights and the DRMS and DABS statistics averaged over the different flow ranges for each basin. These high correlation values re-affirm the finding in Section 4.3 that the BMA weights reflect model performance. Let us now evaluate the DRMSS and DABSS statistics of the combined BMA predictions obtained using multiple sets of weights (denoted BMA m predictions hereafter). Fig. 9 exhibits the DRMSS and DABSS statistics of both the BMA 1 and BMA m predictions. This figure indicates that the BMA m predictions not only improve on the best individual predictions, but also do better than the BMA 1 predictions in terms of the DRMSS and DABSS statistics. This suggests a potential advantage in using multiple sets of weights over a single set, probably because the BMA m predictions are more capable of taking advantage of the diversity of the ensemble members.
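The partition of the record into flow intervals can be sketched as follows. The synthetic gamma-distributed "flows" and the `fit_bma` name in the comment are assumptions for illustration; in the study the interval bounds come from the observed record of each basin.

```python
import numpy as np

# Partition a flow record by non-exceedance thresholds (10%, 25%, 50%, 75%,
# 90%, 100%); a separate set of BMA weights would then be fitted to the
# time steps falling in each interval.
rng = np.random.default_rng(3)
obs = rng.gamma(shape=2.0, scale=5.0, size=1000)   # synthetic daily flows

levels = [10, 25, 50, 75, 90, 100]
edges = np.percentile(obs, levels)                 # interval upper bounds
interval = np.searchsorted(edges, obs)             # interval index, 0..5

# Per-interval fitting would then look like (fit_bma is hypothetical):
#   weights[i] = fit_bma(obs[interval == i], F[interval == i])
counts = np.bincount(interval, minlength=len(levels))
```

Because the error statistics are more homogeneous within an interval, each fitted set of weights can favor whichever members do best in that part of the hydrograph.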

Verification of the skill and reliability of the BMA m probabilistic predictions
Using the procedure described in Box 2, we again created 100-member BMA m ensemble predictions for each flow range using the associated BMA m weights. The combined BMA m predictions for all flow ranges are shown in Fig. 10, along with the 90% confidence interval. Again, the 90% confidence interval encompasses the observations very well. Fig. 11 shows the RPSS statistics for the BMA predictions using both the single and multiple sets of weights. The RPSS statistics indicate that the BMA m predictions are significantly more skillful and reliable than the original ensemble predictions, but the RPSS statistics of the BMA m and BMA 1 predictions are essentially the same.
The reliability diagram of the BMA m predictions is shown in Fig. 7, along with those of the BMA 1 and original ensemble predictions. Based on Fig. 7, the reliability of all three sets of ensemble predictions is comparable. All of the ensemble predictions have good resolution, as indicated by the full coverage of the observed probability range by all predictions. The reliability is excellent for some middle flow and all high flow ranges, as indicated by the reliability curves closely wrapped around the 45° line. The high reliability score is probably due to the fact that all of the streamflow predictions have been calibrated to observed streamflow data. The reliability for the low flow ranges is not as good as for the high flow ranges. In these low flow cases, the original ensemble predictions tend to have an over-prediction bias. The reliability of the two BMA ensemble predictions is similar: both tend to over-predict the low end and under-predict the high end of each flow range. Note that the low flow reliability diagram for Bird Creek is not shown, because almost all of the observed and predicted flow values in the low flow range are equal to 0. Note also that the reliability diagram of the middle flow range for Bird Creek is similar to that of the low flow range for the two other basins.
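The construction behind a reliability diagram can be sketched as follows: bin the forecast probabilities and compare each bin's mean forecast probability with the observed relative frequency $p(O = 1 \mid F)$. The forecasts below are synthetic and perfectly calibrated by construction, so the resulting curve should hug the 45° line; the sample size and bin count are assumed.

```python
import numpy as np

rng = np.random.default_rng(7)
fprob = rng.uniform(size=20000)              # forecast probabilities
occur = rng.uniform(size=20000) < fprob      # events drawn to match them

bins = np.linspace(0.0, 1.0, 11)
idx = np.clip(np.digitize(fprob, bins) - 1, 0, 9)   # 10 probability bins
mean_f = np.array([fprob[idx == b].mean() for b in range(10)])
obs_freq = np.array([occur[idx == b].mean() for b in range(10)])
# A reliable forecast plots obs_freq against mean_f along the diagonal;
# points above the diagonal indicate under-prediction, points below
# indicate over-prediction, as discussed for the low flow ranges above.
```
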

Validation of BMA predictions using data from independent periods
The previous sections show that the BMA scheme is a promising tool for generating probabilistic predictions. A natural question is how robust these results are. In this section, we designed a set of experiments to evaluate how the BMA predictions perform when they are evaluated with data from an independent validation period.
The validation periods for the test basins are listed in the last row of Table 1. We used the weights obtained from the training periods to compute BMA predictions for the validation periods. The performance statistics, including DRMSS, DABSS, RPSS and reliability diagrams, are again employed to examine the consistency of the BMA predictions. Table 5 lists these statistics for the training and validation periods for the BMA 1 predictions, while Table 6 provides the same information for the BMA m predictions. In terms of the DRMSS statistics, the performance of both the BMA 1 and BMA m predictions in the validation period is somewhat degraded from that in the training period. The DABSS statistics indicate a mixed picture: the performance of both BMA 1 and BMA m predictions in the validation period improves over that in the training period for two basins (i.e., Bird Creek and French Broad), while the reverse is true for the Leaf River. For Bird Creek, the RPSS statistic for the validation period is about half of that for the training period. These results indicate some degradation in performance when the weights generated from the training period are applied to the validation period. However, the advantage of the BMA approach is still obvious compared to the benchmarks used (i.e., the best individual predictions and the original ensemble). Fig. 12 shows the reliability diagrams for the low, middle and high flow ranges (as defined in Section 4.3.2) for all three basins. The reliability of all three sets of ensemble predictions is very good, and there is no degradation in the reliability measure between the validation and calibration periods for the test basins.

Summary and conclusions
All models are imperfect representations of real-world processes, and different models have strengths in capturing different aspects of those processes. It is therefore highly desirable to construct consensus predictions that take advantage of the diverse skills of the individual predictions. The BMA scheme has been shown to be a useful statistical scheme for generating probabilistic predictions from different competing predictions. This is accomplished through a weighting strategy based on the likelihood of an individual prediction being correct given the observations. We have illustrated how the BMA scheme can be used to generate probabilistic hydrologic predictions from several competing individual predictions. The major findings of this study are as follows.
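The likelihood-based weighting strategy described above can be sketched in code. The sketch below assumes Gaussian predictive PDFs centred on each member's prediction and a single shared error variance estimated by expectation-maximization, a common simplification; the study's actual implementation details (bias correction, Box–Cox transformation of flows) are not reproduced here.

```python
import numpy as np

def bma_em(preds, obs, n_iter=200):
    """Estimate BMA weights and a shared error variance by EM.
    `preds` has shape (K members, T times); Gaussian member PDFs assumed."""
    preds, obs = np.asarray(preds, float), np.asarray(obs, float)
    K, T = preds.shape
    w = np.full(K, 1.0 / K)                # start from equal weights
    var = np.mean((obs - preds) ** 2)      # pooled initial error variance
    for _ in range(n_iter):
        # E-step: posterior probability that member k "explains" obs[t]
        dens = np.exp(-0.5 * (obs - preds) ** 2 / var) / np.sqrt(2 * np.pi * var)
        z = w[:, None] * dens
        z /= z.sum(axis=0, keepdims=True)
        # M-step: better-performing members accumulate more posterior mass,
        # hence higher weight -- the core BMA assumption
        w = z.mean(axis=1)
        var = np.sum(z * (obs - preds) ** 2) / T
    return w, var

def bma_mean(preds, w):
    """Expected BMA prediction: weight-averaged member predictions."""
    return w @ np.asarray(preds, float)

# Two toy members: the one closer to the observations earns the larger weight
obs = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.stack([obs + 0.1, obs + 2.0])
w, var = bma_em(preds, obs)
print(w)
```

Running the toy example drives nearly all of the weight onto the more accurate member, which mirrors finding (3) below: weights track model performance.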
(1) The expected BMA 1 predictions, obtained using a single set of weights computed over the entire training period, are better than or comparable to the best individual predictions in terms of the DRMSS and DABSS statistics. The advantage of the BMA 1 predictions over the original ensemble predictions is very significant, at 30% or better. For Leaf River, the improvement in the DRMSS and DABSS scores is relatively weak, indicating that the advantage of the BMA approach can be limited in certain cases.

(2) The use of multiple sets of BMA weights to generate BMA m predictions is a way to accentuate the strengths of individual models in capturing different phases of the hydrograph. This is achieved by breaking the streamflow records into a number of flow ranges so that the statistical properties of the predictive errors within each flow range are more homogeneous. We found that the expected BMA m predictions are markedly improved in terms of the DRMSS and DABSS statistics compared to the best individual model predictions. More significantly, the BMA m probabilistic predictions are generally better than the BMA 1 predictions. Both BMA 1 and BMA m predictions are much better than the original ensemble based on the RPSS statistic. The original ensemble, BMA 1 and BMA m predictions are all very reliable, as the observed values are reliably contained within the ensemble ranges.

(3) In both the single-weight and multi-weight BMA studies, we found that the BMA weights were highly correlated with the model performance statistics, confirming one of the central assumptions of the BMA scheme: better performing models receive higher weights because their likelihood of being correct given the observations is higher.

(4) In the validation studies, we found some degradation of performance in the validation period in terms of the DRMSS statistic. The DABSS statistic, however, sends a mixed message, with two basins showing improvement and one basin showing degradation. Furthermore, the RPSS statistic still indicates a clear advantage for the BMA predictions. There is no degradation in reliability between the calibration and validation periods.
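The multi-weight idea in finding (2) amounts to a lookup: select the weight set whose flow interval contains the current flow, then form the weighted average. The breakpoints, weight values and the use of the ensemble mean as the flow classifier in the sketch below are all invented for illustration; the paper's actual intervals and weights differ.

```python
import numpy as np

# Hypothetical flow-range breakpoints (low/mid/high) and one weight set
# per range for a three-member ensemble -- illustrative values only.
breaks = np.array([5.0, 20.0])
weights = np.array([
    [0.6, 0.3, 0.1],   # low flows:  member 1 trusted most
    [0.3, 0.4, 0.3],   # mid flows
    [0.1, 0.3, 0.6],   # high flows: member 3 trusted most
])

def bma_m_mean(preds, flow):
    """Expected BMA_m prediction: pick the weight set whose flow interval
    contains `flow` (here classified by the ensemble mean), then average."""
    preds = np.asarray(preds, float)
    k = int(np.searchsorted(breaks, flow, side="right"))
    return float(weights[k] @ preds)

preds = np.array([22.0, 26.0, 30.0])     # member predictions at one time step
print(bma_m_mean(preds, preds.mean()))   # ensemble mean 26 -> high-flow weights
```

Using more homogeneous error statistics within each range is what lets the per-range weights favor the member that is most skillful in that part of the hydrograph.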
This study was based on three hydrologic basins with limited lengths of hydrologic data. Unless more basins and longer data sets are used, the results may not generalize to other basins. An interesting future study would be to determine the minimum number of years of data needed to obtain consistent results.
We must note that the results from this study are essentially an analysis exercise involving post-processing of existing retrospective model simulations or predictions. The scheme can be easily implemented within any existing operational or research prediction system where multiple models are available, and it is quite suitable for application to real-time ensemble hydrologic forecasting. However, care must be taken, as the uncertainty associated with retrospective simulations (as in this study) is usually much smaller than that of real-time predictions, in which predicted meteorological forcing data such as precipitation and air temperature are used.