Quantification of the upstream‐to‐downstream influence in the Muskingum method and implications for speedup in parallel computations of river flow

The mathematical formulation of the Muskingum method, like that of many numerical schemes used for river routing, requires that all upstream river reaches be updated prior to updating the flow rate of any given reach. Due to this topological constraint, such numerical schemes have traditionally been solved in an upstream‐to‐downstream manner which imposes inherent limitations on the speedup that can be achieved in a parallel computing environment because each computing core has to wait for completion of all cores addressing upstream subbasins prior to starting its own subbasin. The research presented in this paper quantifies the exact influence among river reaches during the update step of the Muskingum method and shows that the influence decreases with increasing distance between two reaches until it becomes too small to be accounted for by floating‐point arithmetic. A formal definition of the minimal distance from which the relative influence becomes numerically inexistent—the radius of influence—is presented. Based on this distance, expressed as a number of river reaches, a new estimate of the maximum theoretical speedup that can be achieved by the Muskingum method or by similar numerical schemes is presented and implies large potential gains in computing time when domains are much larger than the radius of influence. An application to the approximately 180,000 river reaches of the Upper Mississippi River Basin at a 15 min time step over 2004 shows a radius of influence on the order of 150 river reaches. The speedup obtained for this application is much higher than previously thought possible but also much lower than could be attained, suggesting that further investigations are necessary.


Introduction
[2] Over the past decade, parallel computers have become increasingly accessible to scientists, and their use for studying water resources is progressively becoming common. Generally, parallel computing is used to address problems larger than could otherwise be tackled in which case scalability is of concern ; or it is used to solve problems of constant size but in less time in which case speedup is studied. Many aspects of the terrestrial water cyclesuch as the surface and subsurface water balances-have already benefitted from parallel computing, and the reader is referred to Vivoni et al. [2011] or David et al. [2011b] for a more detailed general overview of parallel computing in terrestrial hydrology. Few studies, however, focus on the use of parallel computing for river flow modeling and on related specific speedup issues as done in this paper. One way to leverage parallel computing for river routing is to assign disconnected river basins to different computing cores, therefore requiring no time-consuming intercore communication, but this ensues in load imbalance, and performance becomes limited by the largest river basins [e.g., Larson et al., 2007]. Hence, the study of parallel computing methods applied to river modeling within large basins is of interest. The main challenge of such an endeavor is that rivers flow mostly from upstream to downstream; therefore, one may have to account for network connectivity and direction of flow when assigning tasks to different computing cores, depending on the type of numerical scheme used for river routing. These schemes generally update a given river reach based on a combination of some or all of the terms corresponding to the same reach at previous time step, the upstream reaches at previous time step, and the upstream reaches at update time step. The simplest numerical schemes-referred to as type 1 here-update a given river reach based on the same reach at previous time step and on the upstream reaches at previous time step and hence allow computations to be updated all at once [e.g., Larson et al., 2007;von Bloh et al., 2010]. In contrast, more accurate numerical schemes-referred to as type 2 here-update a given reach based on the same reach at previous time step and on upstream reaches at both the previous and the updated time step and hence require that computations be carried from upstream to downstream in order [e.g., Li et al., 2011]. Both type 1 and type 2 numerical schemes necessitate information concerning those river reaches that are located directly upstream of a given reach, but at the previous time step for type 1, and at the previous and update time steps for type 2. The differentiation made here therefore only reflects on if the terms corresponding to the upstream reaches at update time step are used (type 2) or are not (type 1). The maximum theoretical speedup of type 1 schemes is ideal, i.e., the computing time decreases by a factor exactly equal to the number of computing cores. In the case of type 2 schemes, however, the estimates of maximum theoretical speedup have traditionally accounted for the necessity to wait for prior upstream updates and are therefore much lower that the speedup of type 1 schemes. Larson et al. [2007] used a type 1 scheme on disconnected basins which resulted in load imbalance. von Bloh et al.
[2010] also used a type 1 scheme but split computations of each separate basin into a few processing cores to balance computing loads and accounted for necessary communications at subbasin boundaries. Li et al. [2011] used a type 2 scheme in which the subbasin that each processing core addresses changes at each time step in an upstream-to-downstream sequence. In that case, careful algorithms for choosing the ordering of computations [e.g., Li et al., 2010;Veitzer and Gupta, 2001] can help maximize parallel performance, but speedup is traditionally assumed to be limited by river network topology. David et al. [2011b] also used a type 2 scheme but virtually avoided sequential upstream-to-downstream computations through the use of an iterative solver on multiple computing cores and showed better speedup results than was predicted by existing theory. David et al. [2011b] proposed that such speedup may be explained by flow waves not being fast enough to propagate through the entire domain within any given time step but failed to provide a proof.
[3] The purpose of this paper is to quantify the exact upstream-to-downstream influence among river reaches during the update between two consecutive time steps of the Muskingum method. We put the magnitude of this influence in the perspective of precision in computer operations and present a new estimate of the maximum theoretical speedup that can be obtained when performing river routing computations that follow a numerical scheme in which the update of any given river reach requires prior update of all directly upstream reaches.
[4] This paper first provides a theoretical approach, followed by a small theoretical example. A practical application focusing on the Upper Mississippi River Basin over the year 2004 at a 15 min time step and using approximately 180,000 river reaches ( Figure 1) is then presented. Finally, practical results, discussion, and conclusions are given.

Muskingum Method as a Linear System of Equations
[5] The Muskingum method [McCarthy, 1938] is a river routing scheme in which flow waves propagate in a river reach j during the computing time step Át as a function of a temporal parameter k j and of a dimensionless parameter x j following equation (1).
where t is time and Át is strictly positive. The variable Q up j is the flow entering reach j and coming from upstream river reaches. The formulation of the Muskingum method given in equation (1) includes the inflow of water Q e j coming from outside of the river network into a given reach j (e.g., lateral inflow from runoff and river/aquifer exchanges) and added upstream. Such inclusion of Q e j in the Muskingum method as un upstream contribution is in effect identical to approaches using time-averaged lateral inflow [National Environment Research Council, 1975;Ponce, 1986] or time-varying lateral inflow [O'Donnell, 1985;Orlandini and Rosso, 1998]. These previous approaches only differ by the time at which Q e j is specified in equation (1), i.e., Q e j t ð Þ as done here, Q e j t þ Át ð Þas in Orlandini andRosso [1998], or both as in O'Donnell [1985]. In this study, Q e j is computed by a land surface model and is assumed to be slowly time-varying; the impact of the time at which Q e j is specified is hence small. The variable Q j is the flow exiting reach j and therefore represents the flow at the node downstream of a given river reach j, although it is loosely referred to here as the flow at reach j in order to simplify the wording. The parameters C 1 j , C 2 j , and C 3 j are constant parameters computed using equation (2).
[6] One should note that the sum C 1 j þ C 2 j þ C 3 j always leads to a value of one. With the Muskingum method, updating the flow in any given river reach necessitates access to the updated flow in river reaches that are located directly upstream as can be seen from the first element of the right-hand side in equation (1). For this reason, solving the Muskingum method using computers has traditionally been done sequentially from upstream to downstream on one unique computing core.
[7] Adapting equation (1) using a matrix notation, David et al. [2011b] showed that the Muskingum method can be adapted to equation (3), which can be solved in parallel by splitting the vectors and matrices on multiple computing cores and allowing for intercore communication.
where the bold notation is used for vectors and square matrices which are all of size m, the total number of river reaches in a river network. The matrix I is the identity matrix; C 1 , C 2 , and C 3 are diagonal matrices in which diagonal elements are C 1 j , C 2 j , and C 3 j , respectively ; N is the network matrix in which-assuming a maximum of one downstream river reach being allowed for each river reach-a value of one is used at row i and column j if reach j flows into reach i and zero is used elsewhere. The vector Q (respectively, Q e ) is made of the corresponding elements Q j (respectively, Q e j ). The partial temporal uniformity of Q e j simplifies the river routing model formulation, limits the quantity of input data, facilitates the coupling with land surface models, and is a valid assumption if such information is made available less often that the river routing time step as in this study. One should note that equation (1) allows for solving ''all river reaches, one time step after another'' or for solving ''all time steps, one river reach after another,'' whereas equation (3) is limited to the former. By combining all elements on the right-hand side of equation (3) into a single vector b, the following equation is obtained: [8] Solving for the unknown Q t þ Át ð Þ in equation (4) can be done by providing the highly sparse matrix I À C 1 Á N and the right-hand side b t ð Þ to a linear system solver. Equations (3) and (4) are valid regardless of the ordering of N. However, N can be made strictly lower triangular-i.e., all elements on and above the diagonal are zero-by ordering the river reaches in an upstream-to-downstream manner which will be assumed in the following. More information on the derivation of this matrix-based Muskingum method can be found in David et al. [2011b].

Inversion of This Linear System of Equations
[9] The matrix C 1 being a diagonal matrix, and N being a strictly lower triangular matrix, the product C 1 Á N is strictly lower triangular (see Appendix A) and I À C 1 Á N is a lower unit triangular matrix (i.e., a lower triangular matrix for which all diagonal elements have a value of one). The determinant of a triangular matrix is the product of its diagonal elements ; therefore, det I À C 1 Á N ð Þ¼1 which is not null, and hence, the inverse of I À C 1 Á N exists and is unique. This is not surprising because the Muskingum method can be solved explicitly from upstream to downstream, as done traditionally.
[10] The characteristic polynomial P IÀC 1 ÁN y ð Þ of I À C 1 Á N is defined by [11] Accounting for C 1 Á N being strictly lower triangular and for the determinant of a triangular matrix being the product of its diagonal elements, equation (5) can be simplified as [12] Following the Caley-Hamilton theorem, I À C 1 Á N satisfies its own characteristic polynomial : where 0 is the null matrix. Substituting equation (7) into equation (6) leads to [13] Taking equation (8) into account, one can then show that where p is an integer. The unique inverse of I À C 1 Á N has therefore been determined. This inverse can be seen as the Muskingum operator M, which like I À C 1 Á N, is a unit lower triangular matrix: [14] Finally, the linear system of equation (4) can be inverted as shown in equation (11).
[15] Despite being presented here for time-invariant Muskingum parameters, one should expect similar results if C 1 varied with time in which case the linear system matrix I À C 1 Á N and the Muskingum operator M should be updated at every time step, although such is beyond the scope of this study.
[16] Albeit being lower triangular, M is much denser than its highly sparse inverse I À C 1 Á N. Therefore, for computer performance purposes, a river routing scheme should solve equation (4) using a linear system solver rather than solving equation (11) directly. However, despite its high associated computational cost, M can be used to quantify the exact relative influence among river reaches and is therefore useful in the context of this study.

Concerning the Precision That Can Be Expected From the Muskingum Method
[17] The Muskingum method-whether in its traditional form of equation (1) or in the matrix-based form of equation (4)-is a linear system of equations in which the matrix I À C 1 Á N and the right-hand side b t ð Þ are given and the unknown Q t þ Át ð Þis to be determined. The properties of this linear system are properties of the linear equations to be solved (i.e., the physical problem being considered) and not of the mathematics used to solve them. One such property is the relationship linking small relative variations of I À C 1 Á N and of b t ð Þ to those of Q t þ Át ð Þ. The corresponding inherent stability (or instability) of the linear system is generally referred to as its condition (or ill condition) and measured by a quantity called the condition number defined by where kk is a norm. The paper by Turing [1948] and the book by Wilkinson [1963] have both been helpful to the authors in understanding the inception of the condition number, and the reader is referred to these for additional information. However, we found that the formulation of the relationship linking small relative variations of I À C 1 Á N, b t ð Þ and Q t þ Át ð Þto the condition number is more clearly presented in Higham [1990] than in these earlier studies and is hence better suited here. The formulation of Higham [1990] is where " is a small real number, and the same norm type is used in equations (12) and (13) for both vectors and matrices. The numerical value of the condition number therefore depends on the linear system being considered as well as on the norm being used. In the case of the 2-norm, equation (12) can be expressed as where min and max are, respectively, the minimum and maximum singular values of I À C 1 Á N. The determination of the exact singular values for I À C 1 Á N is nontrivial and outside of the scope of this study. However, a well-established scientific library is used in section 5 to estimate the singular values and hence the condition number, therefore providing an estimate of the bounds relating relative variations of I À C 1 Á N, b t ð Þ, and Q t þ Át ð Þ.
[18] Variations of I À C 1 Á N and b t ð Þ are to be expected among various computer models. For example, the value of b t ð Þ is prone to small relative variations depending on the ordering of the floating-point operations used for its computation.
would not lead to the exact same results in a floating-point environment (see section 2.7), despite these two expressions being mathematically equivalent. Therefore, the exact floating-point value of b t ð Þ will differ from one computer model to another. Such differences are also bound to exist for I À C 1 Á N. The resulting relative differences in the values of I À C 1 Á N and b t ð Þ will generate relative differences in Q t þ Át ð Þof a magnitude linked to the condition number of the Muskingum method. Consequently, one should not hope to achieve better accuracy in the computation of Q t þ Át ð Þthan that allowed by equation (13). Such differences are not only expected, they are also unavoidable.
[19] In addition to these precision issues which are inherent to the physical problem represented by the Muskingum method, one should also expect small variations in the values computed for Q t þ Át ð Þ from the various mathematical ways of solving the linear system, i.e., directly with equation (1) or equation (11); or using a linear system solver with equation (4). Such matters are also discussed in general terms in Turing [1948] and Wilkinson [1963]. However, variations in Q t þ Át ð Þresulting from the mathematics of resolution that are of the same order of magnitude as those due to small variations of the linear system matrix or of the right-hand side can therefore be deemed acceptable.

Calculation of the Muskingum Operator on Computers
[20] In river networks with m river reaches, computing the elements of M based on equation (10) necessitates on the order of m þ 1 ð ÞÁm=2 multiplications of matrices of size m. In large river networks-e.g., m on the order of 100,000-equation (10) would therefore be highly computationally demanding, and an alternate method is preferable. The matrix I À C 1 Á N being the inverse of M, their product is the unit matrix, which can be expended in the following equality : [21] By definition of sums and products of matrices, applying equation (15) where i; j ð Þ is the Kronecker function which has a value of one if i and j are equal, and zero otherwise. The quantity q is an integer representing all river reaches, and C 1 Á N ½ i; q ð Þ is the element of C 1 Á N located at row i and column q. Accounting for a maximum of one downstream river reach per given reach and for related consequences on the properties of N, C 1 Á N, and their powers, equation (16) leads to (detailed derivation given in Appendix A): where N þ represents all positive integers, 9 is the mathematical symbol for existence, ! is used for uniqueness, and () is for material equivalence. The integer quantity r is therefore the unique river reach (if it exists) that is located directly upstream of reach i and that is also located p reaches downstream of river reach j. Physically, equation (17) means that if a river reach i is located directly downstream of a reach r which is itself somewhere downstream of a reach j, then one can compute M ½ i; j ð Þ as a function of M ½ r; j ð Þ. Also, M ½ i; j ð Þ is null if reach i is not anywhere downstream of reach j which makes sense because flow only goes downstream in the Muskingum method. One can therefore travel downstream through the river network to determine the elements of M. Equation (17) is much more computationally manageable than equation (10) and is used herein to determine the elements of M.

Magnitude of Elements in the Muskingum Operator
[22] By mathematical induction, equation (17) leads to where i; j ð Þ represents the ensemble of river reaches l that are both strictly upstream of river reach i and strictly downstream of river reach j. The computation of M from equation (18) involves the product of elements C 1 j , and their magnitude is therefore of importance. According to Cunge [1969], the Muskingum method is stable for x j 2 0; 0:5 ½ , regardless of the strictly positive value of Át, which associated to equation (2) allows to derive the following inequalities: which can also be expressed as [23] Therefore, being the product of elements C 1 j , the subdiagonal elements of M ½ i; j ð Þ are all bounded by the interval À 1; 1½, i.e., the strict interval between À1 and 1 that excludes both À1 and 1. The parameter C 1 j approaches unity for flow waves of infinite celerity, i.e., infinitesimally small values of k j with regard to the modeling time step Át, but such is usually avoided by picking a time step smaller than k j . Additionally, each element M ½ i; j ð Þ becomes increasingly closer to zero as a reach i downstream of a given reach j becomes increasingly distant from it.

Quantification of Upstream-to-Downstream Influence in the Muskingum Method
[24] By definition of the matrix-vector product, equation (11) can be expressed as [25] The vector b t ð Þ is by definition a combination of the inflow N Á Q t ð Þ from upstream, of the inflow Q e t ð Þ from outside of the river network, and of the outflow Q t ð Þ; all defined at the previous time step t. Therefore, equation (21) shows that the contribution from each reach j at the previous time t to the updated flow Q i t þ Át ð Þat reach i can be seen in the product M ½ i; j ð Þ Á b j t ð Þ. Additionally, one can show (detailed derivation given in Appendix B) that the contribution from each reach j at the updated time t þ Át to the updated flow Q i t þ Át ð Þat a different reach i can be seen in the product M ½ i; j ð Þ Á Q j t þ Át ð Þ. Given that the magnitude of the elements in the Muskingum operator decreases with distance between river reaches, one can therefore wonder how far does the update need be accounted for.

Radii of Influence
[26] Given " the bound for the relative error when a real number is rounded to its closest floating-point number (i.e., the machine epsilon), the precision that can be expected from the floating-point addition of two floating-point numbers y and z is [Goldberg, 1991] where F designates all floating-point numbers. The approximation of floating-point additions can also be expressed as [27] Applying equation (23) to the contribution from each reach j to each reach i during the update step, one can Þj is always smaller than 0:5 Á " Á jQ i t þ Át ð Þj, then from a floating-point arithmetic perspective, the river reach j has never an influence on the update of river reach i. Therefore, the river reaches i and j could be computed independently, and such would be true for both equations (11) and (4). Since the elements M ½ i; j ð Þ become increasingly closer to zero as the distance between reaches i and j increases, the independence of reaches i and j is bound to happen starting at a largeenough distance which will be here referred to as the radius of influence. A radius of downstream influence R down j can be defined as [28] The radius of downstream influence R down j is defined here as an integer for each river reach, but one could also think of other definitions where R down j varies not only in space but also in time. Physically, this means that the influence that any river reach further away than R down j downstream from reach j is smaller than can be accounted for by floating-point arithmetic. Similarly, one can define a radius of upstream influence : [29] R down j therefore allows looking downstream at what river reaches are influenced by a given reach, and R up i allows looking upstream at what river reaches have influence on a given reach. One should note that the values of both M ½ i; j ð Þ and Q j t þ Át ð Þneed be accounted for in computing the radii of influence since instantaneous values of Q j t þ Át ð Þ can be of greater magnitude than that of Q i t þ Át ð Þ-e.g., when a flow wave is propagating-even if such is not true for their respective temporal averages.

Implications for Parallel Computing
[30] Most studies looking at parallel speedup of river flow computations using a simple upstream-to-downstream river routing scheme of type 2 as done here assume that the update of any river reach cannot be done prior to the update of all of its upstream river reaches [e.g., Li et al., 2011]. Under such an assumption, the maximum theoretical speedup S 1 N ð Þ that can be reached when assigning N subbasins to N different computing cores while keeping loads balanced is where N b is the largest number of subbasins (i.e., computing cores) crossed-out of all possible paths going from upstream to downstream-in the entire river network.
[31] Based on the work presented in this paper, using subbasins that are longer than the radii of influence and accounting for the corresponding independence of river reaches, one can therefore compute each subbasin entirely in two iterations and guarantee that the downstream-most elements are computed exactly in the first iteration while others are still inexact, and be assured that a second iteration is sufficient to update all other elements of each subbasin. Such an approach can allow reaching the following maximum theoretical speedup : [32] One should note here that equation (27) assumes that each core computes all its corresponding elements twice-as done in this study-which is not necessary, and therefore, S 2 N ð Þ could be higher. Alternatively, one could update all elements in a first iteration, again knowing that only the downstream-most river reaches are exactly computed, and limit the second iteration to those reaches that were inexact, i.e., those located within a distance of R down À 1 downstream of the connections between subbasins. The integer quantity R down can be picked, for example, as the maximum value of all R down j . In doing so, one could expect the following ideal maximum theoretical speedup: [33] In the case where the radius of influence or the number of subbasins are much smaller than the total number of river reaches in the entire domain, i.e., R down À 1 À Á Á N=m << 1, S 3 becomes close to ideal. As a consequence, if the modeling domain allows for more than two consecutive subbasins that are large enough with regard to the radii of influence, the maximum theoretical speedup of equations (27) and (28) can be much higher than that of equation (26).

Theoretical Example
[34] Figure 2 shows an example river network made out of 12 river reaches. The corresponding network and linear system matrices are given by equations (29) and (30), respectively, in sparse format (only the nonzero values are shown here and saved by the computer program): : [35] The Muskingum operator being much denser, only the first column is shown here, for clarity: [36] One can see that the last element of the first column is the product of seven elements C 1 j and is likely to be of small magnitude given the bounds of C 1 j themselves.
[37] Applying the traditional Muskingum method to river reach j ¼ 7 and substituting into the same equation applied to j ¼ 9 lead to [38] One should note here that if Q 6 t þ Át ð Þwere to be substituted by its value as a function of b 6 t ð Þ, the same Figure 2. A river network made out of 12 river reaches and 7 nodes split into four subbasins and used as a theoretical example.
multiplying scalar M ½ 9; 6 ð Þ ¼ ½ C 1 Á N ð Þ 2 9; 6 ð Þ ¼ C 1 9 Á C 1 7 would appear before b 6 t ð Þ as it does before Q 6 t þ Át ð Þin equation (32); as was previously demonstrated in section 2.6. Assuming a radius of downstream influence of two for all river reaches, any given river reach only has accountable influence on the updated flow rates for itself and for the immediate downstream reach. Consequently, the value of C 1 9 Á C 1 7 Á Q 6 t þ Át ð Þcan always be neglected with regard to the value of Q 9 t þ Át ð Þ. Therefore, one could compute an estimateQ 9 of Q 9 using [39] Such an estimate would actually be exactly correct from a floating-point arithmetic perspective, and therefore, one would not need to wait for Q 6 t þ Át ð Þto be computed in order to compute Q 9 t þ Át ð Þ. Hence, if subbasins are constructed larger than the radius of downstream influence, then the downstream-most element-at the least-can be computed exactly without accounting for the inflow from upstream subbasins, and new computing algorithms can be developed.
[40] Figure 3 shows three ways of solving the Muskingum method if splitting the entire river network in the four subbasins of same size that are shown in Figure 2, and Figure 4 shows the corresponding time sequences. In both of Figures 3 and 4, method (a) corresponds to the traditional Muskingum method, method (b) corresponds to an iterative linear system solver, and method (c) is raised as a potential new way to solve the Muskingum method.
[42] Another way of solving the Muskingum method is to start computing everywhere at the same time and to allow for two iterations as shown in Figures 3b and 4b. Using core 3 as an example, a first iteration gives whereQ j t þ Át ð Þ is used for a temporary update of Q j t þ Át ð Þ. Notice that Q 6 t þ Át ð Þ¼0 is used in equation (34) for the computation of Q 7 t þ Át ð Þ because its real value has not yet been computed by core 1. A second iteration gives [43] The values ofQ 9 t þ Át ð Þ and of Q 9 t þ Át ð Þ are exactly the same in equations (34) and (35), because with a radius of downstream influence of two, C 1 9 Á C 1 7 Á Q 6 t þ Át ð Þ is too small to be accounted for by computer operations. This method allows obtaining a maximum theoretical speedup of S 2 4 ð Þ ¼ 4=2 ¼ 2:00.
[44] Finally, the method of Figures 3c and 4c could be used to update everywhere in a first iteration knowing that the upstream-most elements in each subbasin are inexact, and finish the update of these upstream elements in a second iteration which would lead to S 3 4 ð Þ ¼ 4= 1 þ 2 À 1 ð ÞÁ4=12 ð Þ ¼ 3:00.
[45] Accounting for the relative upstream-to-downstream influence between river reaches when solving the Muskingum method can therefore double or even triple the maximum theoretical speedup for this particular example. The value of two picked here for the radius of downstream influence is smaller than the values obtained in the practical application presented later, for clarity, but the same concept applies.

Practical Application
[46] The Upper Mississippi River Basin is used here as a test bed for the theoretical approach presented in section 2.
The enhanced version of the National Hydrography Dataset (NHDPlus) [U.S. Environmental Protection Agency and U.S. Geological Survey, 2010] provides all river reaches of this large network as well as the corresponding contributing catchments. Runoff data from the second phase of the North American Land Data Assimilation System (NLDAS2) [Xia et al., 2012a[Xia et al., , 2012b are used here to estimate the inflow of water from surface and subsurface into the river network. The computations of river flow using the Muskingum method are performed with the Routing Application for Parallel computatIon of Discharge (RAPID) [David et al., 2011b], and Muskingum parameters are optimized based on gauge observations from the U.S. Geological Survey (USGS). Our analysis focuses on days from 1 January to 31 December 2004, and all tools used are briefly described in this section.

NHDPlus Description of Upper Mississippi River Basin
[47] The Upper Mississippi River Basin is available as NHDPlus Region 07 ( Figure 1) and has 191,646 river reaches of which 182,240 have known network connectivity and are used in this study. These 182,240 river reaches vary in size from 0.001 to 27.353 km (median: 1.64 km, mean: 1.87 km, standard deviation 1.64 km). The total basin area of 491,777.837 km 2 is divided into 180,787 contributing catchments, each corresponding to a unique river reach with known network connectivity. The smallest of the river reaches with known network connectivity do not have an associated contributing catchment. The contributing catchments vary in size from 0.001 to 318.91 km 2 (median: 1.58 km 2 , mean: 2.72 km 2 , standard deviation 4.63 km 2 ).

Inflow of Surface and Subsurface Water Into NHDPlus River Reaches
[48] NLDAS2 [Xia et al., 2012a[Xia et al., , 2012b offers hourly land surface data from four land surface models from 1979 to current times and on a 1/8 degree grid. Out of the four NLDAS2 land models, only the Variable Infiltration Capacity (VIC) model originally developed by Liang et al. [1994], and now in its version 4.0.5 was used in this study. Three-hourly inflows of water into NHDPlus river reaches were estimated from the sum of VIC surface and subsurface runoff by means of the conceptual translation between the gridded NLDAS2 environment and the vector-based NHDPlus catchments that is described by David et al. [2013] and that uses catchment centroids.

Flow Observations on NHDPlus Rivers
[49] Flow observations are needed in this study for optimization of river routing parameters and for comparisons between computations and observations. Here we use the gauges of the USGS National Water Information System (NWIS) of which the exact location on the NHDPlus river network is available as part of the NHDPlus deployment. A total of 1277 NWIS gauges exist in the NHDPlus description of the Upper Mississippi River Basin but only 1251 are located on river reaches with known network connectivity. Out of these 1251 gauges, only 419 have full daily record from 1 January to 31 December 2004. Each of these 419 gauges is located on a unique NHDPlus river reach, except for four which are located on two river reaches (two stations per reach). The comparisons between computed and observed flow are done here using the flow at the outlet of reaches that are host to a gauge, and hence only the most downstream gauge was retained in case of duplicates. Therefore, daily data from 417 USGS gauges are used in this study. The selection and downloading of data were performed using HydroExcel (http://his.cuahsi.org/hydroexcel.html) and HydroGET (http://his.cuahsi.org/hydroget.html).

Routing Application for Parallel Computation of Discharge
[50] The RAPID [David et al., 2011b] is a river routing model. Given surface and groundwater inflow to rivers, this model can compute the flow and volume of water everywhere in river networks made out of many thousands of reaches. The design of RAPID allows it to be adapted to any river network, if given basic connectivity information as is the case in NHDPlus. Computations in RAPID can be performed using the traditional Muskingum method of equation (1) and accounting for surface/groundwater inflow or with the matrix-based Muskingum method of equation (3). RAPID also has an automated parameter estimation procedure that allows finding optimal model parameters based on available gauge measurements. This model uses the Fortran programming language along with the Portable, Extensible Toolkit for Scientific Computation (PETSc) [Balay et al., 1997[Balay et al., , 2010[Balay et al., , 2012 and the Toolkit for Advanced Optimization [McInnes et al., 2011] and can be run on personal computers, as well as on massively parallel supercomputers. The model code for RAPID as well as related documents and animations can be found at http://www.ucchm.org/david/rapid.htm. Calculations in RAPID are performed using double-precision floating points.
[51] As in David et al. [2011bDavid et al. [ , 2013, a river routing time step of Át ¼ 900 s is used. This time step should properly capture the flow dynamics at least for river reaches that are 0.9 km long or longer (i.e., more than two thirds of river reaches in this study) if flow wave celerities are c ¼ 1 m=s or slower, as in this paper.

Calibration of RAPID
[52] In this study, the optimization of the Muskingum parameters k j (time) and x j follows the simple approach of David et al. [2011b] using a squared-error cost function 1 computed using all 417 available gauges over 1 January to 30 December 2004. This procedure leads to the following values: where L j is the length of each river reach j, c 0 ¼ 1 km=h is a constant flow wave celerity, and k j is expressed in seconds. The value of the flow wave celerity c ¼ c 0 =0:3087 ¼ 0:90 m=s obtained here for the Upper Mississippi River Basin is close to c ¼ 0:78 m=s that was presented by David et al. [2013] for the Texas Gulf Coast Hydrologic Region; a domain of similar size but with smaller rivers. Figure 5 shows an example hydrograph obtained for the most downstream gauge of the Upper Mississippi River Basin. The more advanced optimization methods presented by David et al. [2011aDavid et al. [ , 2013 could be used here to improve the match between observed and computed hydrographs. However, such is beyond the goal of this paper, and the simple parameters of equation (36) will be used for the analysis in this study.

Computation of the Muskingum Operator and Quantification of the Upstream-to-Downstream Influence
[53] The values of the elements C 1 j for the Upper Mississippi River Basin are computed based on equations (2) and (36). Figure 6 shows the probability density function and the cumulative distribution function of C 1 j for the entire study area. As expected, the elements C 1 j are bounded by the interval À 1; 1½. In this particular case, no negative values are obtained for C 1 j because x j is small.
[54] Equation (17) was then used to compute the Muskingum operator M. One should note here that storing the values of the Muskingum operator in a file requires saving 182240 2 double-precision floating points, each necessitating 8 B, which represents 247 GB; i.e., the equivalent of approximately 60 years of three-hourly river flow data for the same domain. Allowing for a maximum of three potential upstream river reaches for each river reach as done in NHDPlus, the sparse linear system matrix I À C 1 Á N only necessitates a maximum of 182240 Á 3 þ 1 ð Þ floats which represents only 5 MB. It is therefore much easier to use I À C 1 Á N for river routing on computers than it is to use M not only because M is much denser and hence requires a higher number of computations, but also because of its associated higher memory needs. Because of the large size of M, only the part corresponding to all 3916 river reaches located upstream of the Mississippi River at Brainerd, MN, is shown here for clarity of the graphics (Figure 7). As expected, the magnitude of M gets smaller as one moves away from the main diagonal, which confirms that the relative upstream-to-downstream influence decreases with the distance between river reaches.

Radii of Influence
[55] Figure 8 shows all river reaches located upstream of the Mississippi River at Brainerd, MN, as well as those river reaches that have accountable influence on and from the Mississippi River at Aitkin, MN. Equation (24) was used to select all river reaches for which the influence from Aitkin is large enough to be accounted for by a computer model, and 24 river reaches match this criterion ; so R down Aitkin ¼ 24. Equation (25) was used to select all the river reaches that have a large-enough influence on Aitkin, and a total of 472 river reaches match this other criterion but include many branches upstream of Aitkin and therefore provide an upper estimate for R up Aitkin . The number of river reaches for which floating-point arithmetic allows accounting for the relative influence among river reaches is therefore much smaller than the domain size in this study.
[56] Equation (24) was then used to compute the radius of downstream influence for the entire domain, and values are shown in Figure 9. Values of R down j range between 0 and 155 and have an average of 36 which is again much smaller than the domain size. The spatial repartition of R down j does not seem to particularly match any topological network structure, which can be explained by the dependence of Q j t þ Át ð Þ on inflow of water from surface and subsurface. Similar values are found for the radii of influence when replacing Q j t þ Át ð Þby b j t ð Þ in equations (24) and (25), although these results are not shown here.
[57] One should note that the computations of both radii of influence are made here using the worst-case scenario, i.e., the maximum absolute value of Q j t þ Át ð Þ and the minimum absolute value of Q i t þ Át ð Þover the entire duration of study, both evaluated at the 15 min time step. The radii of influence reported here are therefore safe overestimates since the ratio of Q j t þ Át ð Þ over Q i t þ Át ð Þ is always smaller-at any time during the simulation-than if computed based on their respective maximum and  minimum as done here. The value of " ¼ 2 À53 % 1:11 Â 10 À16 corresponding to double-precision floatingpoint operations was used.

Speedup of Parallel Computations
[58] Similar to David et al. [2011b], RAPID was run with various computing methods all using the same input data and the same routing parameters ; the corresponding computation times were compared among themselves (Figure 10). The traditional Muskingum method can be directly compared with a noniterative matrix-based Muskingum method and performed similarly; but the two methods can only be applied to a unique processing core when all river reaches of the domain are fully interconnected as is the case in the Upper Mississippi River Basin. However, an iterative matrix-based Muskingum method can be used on one or several processing cores by splitting the domain in subbasins and assigning each subbasin to a different processing core, as shown in Figure 11. The domain decomposition used here is the same as David et al. [2011b] and uses a native NHDPlus field informing on the relative upstream/ downstream position of river reaches. Despite an initial overhead brought by the iterative matrix-based Muskingum method ( Figure 10)-mainly due to the time required to compute an initial residual error in order to provide the linear system solver with a convergence criterion-the computation time quickly decreases with the number of processing cores before being limited by intercore communication and increased number of iterations at 32 cores.
[59] The experimental speedup between the iterative matrix-based Muskingum method applied on 16 cores and the traditional Muskingum method is S 16 ð Þ ¼ 250=72 ¼ 3:46 which is higher than S 1 16 ð Þ ¼ 16=13 ¼ 1:23 but also much lower than S 2 16 ð Þ ¼ 16=2 ¼ 8:00 or than S 3 16 ð Þ ¼ 16= 1 þ 155 À 1 ð Þ 16=182240 ð Þ ¼ 15:79. The real quantities S 2 N ð Þ and S 3 N ð Þ therefore give a new upper limit on what can be achieved for parallel computing of river flow with the Muskingum method depending on how the equations are solved. The computations of the iterative matrix-based Muskingum method converge in less than two iterations for all runs with 16 cores or less, so the domain decomposition used can be considered large enough with regard to the radii of influence. At 32 cores, an additional iteration is needed, suggesting that the domain decomposition is not any more appropriate. Detailed explanations on the differences between the methods used here can be found in David et al. [2011b].

Magnitude of Differences in Results Obtained Among Computing Methods
[60] David et al. [2011b] reported exact match-on a byte-to-byte basis-of all output files among computing   methods tested. However, despite computations being performed using double-precision floating-point operations, the outputs of David et al. [2011b] were saved using single-precision floating points. In this study, comparisons among computing methods are made using double-precision output files in order to provide a more in-depth description of differences. The various three-hourly outputs from RAPID corresponding to each computation method of Figure 10 were compared among themselves for each three-hourly average in order to quantify the differences among flow rates obtained in all methods. Results of these comparisons are presented in Table 1. The use of threehourly outputs as opposed to computations at the 15 min routing time step is a simplification but one can argue that such high-resolution computations are generally not created by computer models anyway. With one unique computing   David et al. [2011b]). core, differences exist between the computations of the traditional Muskingum method represented by equation (1) and those of the matrix-based Muskingum method represented by equation (4), and their relative magnitudes have a maximum of 1:04 Â 10 À15 . Such differences are independent of an assumption concerning the radius of influence since a unique computing core is used. These dissimilarities can therefore be attributed to two different numerical ways of solving the same linear system. Also on one unique computing core, no differences exist between the matrix-based Muskingum method and the iterative matrix-based Muskingum method, as expected. Using multiple computing cores, differences in computations of the iterative matrix-based Muskingum method exist, and their relative magnitudes have a maximum of 1:34 Â 10 À15 . Two iterations of the linear system solver being sufficient (for all simulations using between 2 and 16 processing cores), the corresponding subbasins can be considered large enough with regard to the radius of influence, and such differences are again independent of assumptions based on the radius of influence. These differences can therefore also be attributed to diverse numerical ways of solving the same linear system when using multiple processing cores. The reader is here again referred to David et al. [2011b] for further explanations on the differences between the computing methods used here. In any case, comparisons among all methods tested show that the maximum relative difference for any three-hourly average is max 8t kÁQk 2 =kQk 2 ð Þ¼ 1:34 Â 10 À15 , and the maximum absolute differences for any three-hourly average and any river reach is max 8t;8j ÁQ ½ j ð Þ ð Þ¼4:00 Â 10 À11 m 3 =s .
[61] The precision that can be expected from the Muskingum method can be estimated by means of its condition number I À C 1 Á N ð Þ following equation (13). In this study, the Scalable Library for Eigenvalue Problem Computations (SLEPc) [Hernandez et al., 2005] was used to estimate I À C 1 Á N ð Þwith the Muskingum parameters k j and x j of equation (36). The results obtained by SLEPc are [62] With relative errors in kbk 2 and kI À C 1 Á Nk 2 on the order of " and using double-precision floating-point operations, one should therefore deem relative differences in flow rate computations on the order of 2 Á Á "= 1 À Á " ð Þ%2 Â 11:31 Â 1:11 Â 10 À16 = 1 À 11:31 Â 1:11 ð Â10 À16 Þ % 2:52 Â 10 À15 as acceptable in this study. The maximum relative difference observed among all experiments performed in this study (see Table 1) can therefore be deemed acceptable. Additionally, the maximum absolute differences among flow rates at any river reach and any three-hourly average are on the order of 10 À11 m 3 =s and are hence much smaller than can be expected from the precision of river flow observations or attained by current modeling frameworks.

Discussion
[63] The study presented in this paper was motivated by results presented by David et al. [2011b] for which the parallel speedup was much higher than predicted by existing theory, suggesting that the speedup theory for parallel computing of river flow should be revised. Through the inversion of a large matrix, we are able to quantify the worstcase (largest) relative contribution of a given river reach on the updated flow rate of any other reach between two consecutive time steps of the Muskingum method. We show that such contribution is smaller than can be accounted for by floating-point arithmetic if the reaches are far-enough away from each other. Therefore, even if they are mathematically related-albeit by an infinitesimally small amount-these river reaches become independent from a computer modeling perspective. The minimum distance for which the relative influence between reaches becomes insignificant-the radius of influence-is defined and quantified here. It is important to note that we are not suggesting that river reaches located further away than the radius of downstream influence from a given river reach do not ever feel the influence of this given reach. Given enough time, water will flow from a given river reach to any of its downstream reaches ; unless it is removed from the river network by evaporation, infiltration, or human withdrawals which are beyond the purpose of the modeling addressed in this study. However, between two consecutive time steps, a given river reach has no accountable influence on far downstream elements in a floating-point environment. Therefore, in effect, flow waves are not fast enough to travel through an entire modeling domain within one time step if the domain is large enough. Such was not previously considered and explains why parallel speedup can be attained that is much higher than previously predicted. To our knowledge, the new upper limits for parallel speedup presented here have not been shown earlier. In order to achieve such speedup, one has to make sure that the subbasins assigned to different computing cores are large enough with regard to the radii of influence, and it would therefore be interesting to study network decomposition techniques such as those of Veitzer and Gupta [2001] or Li et al. [2011] but including a limitations of subbasin sizes based on the radii of influence. However, the computations of radii of influence presented here have challenges because these radii depend on the model time step, model parameters (themselves related to spatial/temporal resolution and to flow wave celerities which partly depend on landscape geometry) but also on the magnitude of water inflow into each river reach. Additionally, such computations can become very demanding as larger domains are addressed.
[64] Lastly, the parallel performance shown here is much lower than the new estimate of maximum parallel speedupdespite large-enough subbasins for almost all experimentswhich suggests that other limitations currently exist. Such limitations deserve further investigations which could focus on current parallel computing technology, programming methods used, etc.

Conclusions
[65] Many river routing methods use a numerical scheme in which updating the flow rate for a given river reach depends on the updated flow rate of the reaches that are located directly upstream. By construction, such numerical schemes generate constraints on the ordering of computations and have therefore traditionally been solved by using an upstream-to-downstream approach. The maximum speedup that such an approach can attain in a parallel computing environment has previously been estimated and is based on the intuitive assumption that one has to wait for the update of all upstream river reaches to be performed prior to updating a given reach. The work presented here uses one of these numerical schemes-the Muskingum method-and quantifies the relative influence among river reaches during the update of flow rates between two consecutive time steps. This influence can be quantified exactly using linear algebra and is shown to be null from a floatingpoint arithmetic perspective, given that two river reaches are located far-enough away from each other with regard to a distance referred to here as the radius of influence. This rather counterintuitive finding can be explained because the relative upstream-to-downstream influence becomes increasingly smaller with distance, so small that it becomes insignificant in the addition performed by a computer. Physically, this means that flow waves are not fast enough to cross a large modeling domain within one unique time step. Therefore, one does not really have to wait for all upstream elements to be updated prior to updating a given river reach. Based on this finding a new maximum theoretical speedup for parallel computation of river flow is presented. The application of the proposed theoretical framework to the 182,240 river reaches of the Upper Mississippi River Basin over the year 2004 at a 15 min routing time step allows the estimation of the radius of influence on the order of 150 computing elements (river reaches), i.e., about three orders of magnitude smaller than the domain size which suggests large potential gains in computing times. The value found here for the radius of influence depends on the model time step, the spatial resolution of the river network, the local flow wave celerities and corresponding model parameters, and the magnitude of water inflow in our study domain, but the theoretical approach presented can be adapted to other applications. Also, despite the use of a domain decomposition where subbasins assigned to different computing cores are large enough when compared to the radii of influence, the speedup of the application presented remains far from the new upper limit developed in this study. The reasons for this less-than-optimal speedup remain to be determined, but we show an experimental speedup that is already much higher than what was previously considered an upper limit. As we address river modeling experiments of ever-increasing computing sizes, increases in the upper limits of parallel speedup such as the one presented here are likely to have valuable impacts. Finally, the work presented in this paper could be adapted to other river routing methods in which the updated flow rate depends on the prior update of upstream (and/or downstream) elements, and similar results are to be expected.
System. The authors also wish to thank the PETSc developers, especially Victor Eijkhout for recommending the investigation of the magnitude of subdiagonal elements of the linear system matrix, Jed Brown for input concerning the precision that can be expected from computations of linear system solvers in PETSc, and Matthew Knepley for mentioning the importance of the condition number. Gilbert Strang deserves appreciation because his Fall 2008 Lecture 14-MIT 18.085 shared freely through the Massachusetts Institute of Technology Open Course Ware was particularly helpful to the authors. Finally, the authors are thankful to the three anonymous reviewers, to the Associate Editor, and to Graham Sander, the Editor, for their valuable comments and suggestions that helped improve the original version of this manuscript.