A Hybrid Constrained Coral Reefs Optimization Algorithm with Machine Learning for Optimizing Multi-reservoir Systems Operation

The continuously growing demand for water, prolonged periods of drought, and climatic uncertainties attributed mainly to climate change mean that surface water reservoirs need, more than ever, to be managed efficiently. Several optimization algorithms have been developed to optimize the operation of multi-reservoir systems, mostly during severe dry/wet seasons, to mitigate the consequences of extreme events. Yet convergence speed, the presence of local optima, and calculation-cost efficiency remain challenging when searching for the global optimum. In this paper, the problem of finding an efficient optimal operation policy for multi-reservoir systems is discussed. The complexity of the long-term operating rules and the reservoirs' joint upstream and downstream demands, projected into recursive constraints, make this problem formidable. The original Coral Reefs Optimization (CRO) algorithm, a meta-heuristic evolutionary algorithm, and two modified versions have been used to solve this problem. The proposed modifications reduce the calculation cost by narrowing the search space (the constrained CRO, or CCRO) and by adjusting the reproduction operators with a reinforcement learning approach, namely the Q-Learning method (the CCRO-QL algorithm). The modified versions search for the optimum solution in the feasible region instead of the entire problem domain. The models' performance has been evaluated by solving five mathematical benchmark problems and a well-known continuous four-reservoir system (CFr) problem. The obtained results have been compared with those in the literature and with the global optimum achieved by Linear Programming (LP). The CCRO-QL is shown to be very cost-effective in locating the global optimum or near-optimal solutions and efficient in terms of convergence, accuracy, and robustness.


Introduction
Reservoirs are usually designed to serve multiple users: urban, environmental, industrial, agricultural, and hydropower. Optimizing a reservoir's operation while considering the quantitative flow characteristics and water demands along with climatic conditions is a complex process that deals with several parameters. Adding these parameters to long-term planning and management would result in a variety of decision criteria and objective functions that include hundreds of variables and constraints.
In a multi-reservoir water supply system, joint operating rules should address both the total release from the system and the amounts to be released from each reservoir in every period, according to the reservoirs' upstream and downstream demands. This optimization problem in water resources systems is too complex to be solved using classical optimization methods. With dimensionality and non-linearity as the primary causes of complexity, researchers now use heuristic optimization methods to derive policies that best serve multiple water uses and help identify the tradeoffs between various objectives. Since classical optimization approaches are limited in managing the search space structure, modern meta-heuristic methods have been at the core of recent research dealing with "hard" optimization problems. For example, bio-inspired algorithms have been developed that use aspects of natural evolution, based on the continued survival of the fittest individuals, to find an optimal or near-optimal solution. Ehteram et al. (2017a, b) applied the Shark Algorithm and a new hybrid algorithm combining the GA with the Krill Algorithm to maximize the generated hydropower and water supply benefits in four-reservoir and ten-reservoir benchmark systems [16,17]. Samadi-koucheksaraee et al. (2019) applied a Gradient Evolution (GE) algorithm to optimize single- and multi-reservoir systems; GE was employed to optimize several case studies demonstrating the proposed method's superior ability [41]. Mohammadi et al. (2019) applied the TOPSIS technique to compare the genetic algorithm's performance with a Hybrid Whale-Genetic Algorithm (HWGA) for optimizing multi-reservoir systems, which shows the advantage of the hybrid method over the GA and the Whale Optimization Algorithm (WOA) alone [27].
In the most recent studies [7,26], a constrained version of the Improved Artificial Bee Colony (IABC) algorithm and a hybrid Cellular Automata-Simulated Annealing (CA-SA) method were implemented to optimize reservoir operation and hydropower operation, respectively. Both proposed methods show relative improvements in benchmark problems and real-world case studies. Evolutionary optimization algorithms also contribute significantly to flood control and susceptibility assessment [13,29] and to uncertainty assessment methods [4,5].
A wide variety of meta-heuristic methods, which are approximate and usually non-deterministic, have been proposed and investigated. However, meta-heuristic algorithms do not guarantee a globally optimal solution, especially in certain classes of problems. As such, further approaches are being investigated that may provide a sufficiently good solution to an optimization problem, especially for complex problems for which limited computational capacity is available [44].
As a novel bio-inspired algorithm, the Coral Reefs Optimization (CRO) algorithm was first introduced in 2013 by Salcedo-Sanz et al. [34,37]. This new method has been used to solve several continuous and discrete benchmark problems as well as practical ones. One notable variant, the CRO with a substrate layer (CRO-SL), is a competitive co-evolution algorithm that has been applied to control structures' vibration via tuned mass dampers; this approach benefits from exploiting the combination of different types of searching mechanisms [32]. Some recent research has improved the CRO's efficiency using hybridizing techniques to modify its exploration and exploitation abilities. Most of these studies have focused on CRO-SL as a co-evolutionary algorithm [9,23,31,43]. Others have combined CRO with game theory [18], statistical approaches [14], simulated annealing [48], and a memetic strategy [15], while some still use the original version of CRO [3,6,19]. An expanding list of CRO algorithm applications generating reliable results in various fields reveals its potential capability in water resources optimization problems. At the same time, the proposed modifications are another confirmation of the correctness of the No Free Lunch (NFL) theorems in search [47] and optimization [46] problems. While almost all previous studies have focused on problem-solving methods, the current research has also attempted to address how to describe the problem mathematically.
To the best of the authors' knowledge, this study is the first to apply the CRO algorithm to solve water resources system optimization problems. This work investigates the CRO algorithm's effectiveness in reservoir operation problems by solving a hypothetical multi-reservoir system. Two new modifications to the algorithm are proposed to reduce the calculation cost by narrowing the search space when the algorithm deals with recursive equations. Moreover, the reproduction operators have been adjusted by applying a reinforcement learning approach (i.e., the Q-Learning method). This method can improve solutions in terms of convergence, accuracy, and calculation costs. The proposed model's performance was investigated by solving five mathematical benchmark problems and applying the model to a multi-reservoir system.

Coral Reefs Optimization (CRO) model description
The CRO is a novel meta-heuristic search approach based on coral reef formation and reproduction [40]. It simulates different types of coral reproduction (i.e., internal/external and sexual/asexual) and reef formation, including the competition between new corals to settle in the reef and the depredation of the weak ones.
The algorithm is based on the artificial modeling of a coral reef (Λ) consisting of an M×N square grid. Each square (i,j) can hold a coral (Ξi,j) representing a possible solution to the optimization problem at hand. The primary reef forms by assigning some squares in Λ to be occupied by corals that are random solutions to the problem, leaving the other squares empty so that new corals (polyps) in the next generations can freely settle and grow [34,35].
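As a concrete illustration, the reef initialization described above can be sketched as follows; the grid dimensions, occupation rate ρ0, and variable bounds used here are illustrative assumptions, not values from the paper:

```python
import numpy as np

def init_reef(M, N, dim, bounds, rho0, rng):
    """Initialize an M x N reef; a fraction rho0 of squares holds random corals."""
    lo, hi = bounds
    reef = np.full((M, N, dim), np.nan)      # NaN marks an empty square
    occupied = rng.random((M, N)) < rho0     # which squares start occupied
    reef[occupied] = rng.uniform(lo, hi, (occupied.sum(), dim))
    return reef, occupied

rng = np.random.default_rng(0)
reef, occ = init_reef(10, 10, dim=4, bounds=(-5.0, 5.0), rho0=0.6, rng=rng)
```

The empty squares (NaN entries) are where larvae of later generations may settle freely.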

Each coral is associated with a function that denotes its health status, indicating the fitness of the solution at hand (f(Ξi,j): I → ℝ). The reef progresses if better corals, or fitter solutions, outlive the inferior ones. In iterative generations, new corals reproduce by sequentially applying several operators until a given stop criterion is met, for instance, a maximum number of iterations or a maximum number of objective function evaluations [40]. Such operators model sexual reproduction (broadcast spawning and brooding), asexual reproduction (budding), and polyp depredation.

Broadcast spawning (External Sexual Reproduction, ESR)
The ESR operation consists of the following steps [30,36]: 1. At a given step k of the reef formation phase, a fraction of the existing corals is selected uniformly at random to be broadcast spawners (Fb). Corals that are not selected for broadcast spawning will reproduce later via brooding during the algorithm execution.
2. Each of the couples permitted to be parents will breed one or two coral larvae (depending on the operator) by sexual crossover. The couple selection can be either uniform random or any fitness proportionate selection approach.

Brooding (Internal Sexual Reproduction, ISR)
The fraction of corals that will reproduce via brooding is 1-Fb. The modeling consists of forming a coral larva by a random mutation in the brooder coral (i.e., self-fertilization in hermaphrodite corals). The produced larvae are then released into the water as well as the offspring generated in step 2 [33].

Larvae setting
Once all larvae are formed at step k, either through broadcast spawning or by brooding, they try to settle and grow in the reef based on their health function. The struggle for survival starts when each larva tries to settle on a random square (i,j) of the reef. If this square is empty, the coral grows therein independently of the value of its health function. Otherwise, when the given position is already occupied, the new larva settles only if it is healthier than the existing coral. Each larva has a limited number of chances to compete with existing corals; after unsuccessful attempts, it is discarded and eliminated from the population [34].
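The settlement contest can be sketched as follows for a maximization problem; the `max_attempts` limit and the NaN convention for empty squares are illustrative assumptions:

```python
import numpy as np

def try_settle(reef_fitness, larva_fitness, max_attempts, rng):
    """Attempt to place a larva on a random square of the reef.

    reef_fitness: 2-D array of coral fitness values, np.nan for empty
    squares (maximization assumed). Returns the (i, j) square where the
    larva settled, or None if it was discarded.
    """
    M, N = reef_fitness.shape
    for _ in range(max_attempts):
        i, j = rng.integers(M), rng.integers(N)
        if np.isnan(reef_fitness[i, j]):        # empty square: settle freely
            reef_fitness[i, j] = larva_fitness
            return i, j
        if larva_fitness > reef_fitness[i, j]:  # displace a weaker coral
            reef_fitness[i, j] = larva_fitness
            return i, j
    return None                                 # discarded after all attempts
```

A healthier larva displaces the occupant on its first successful comparison; a weaker one is eliminated once its attempts are exhausted.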

Budding or fragmentation
In this asexual reproduction process, a small fraction of healthier corals (Fa) duplicates itself and tries to settle in a different part of the reef, following the configuration process described in step 2.1.3. A predefined maximum number (μ) of equal corals are allowed in the reef. Mutations are also introduced with probability Pa to obtain variability [40].

Depredation
A small number of corals (applying a very small probability Pd, exclusively to a fraction Fd of the least fit corals) drops at the end of each iteration k for the sake of liberating space in the reef for the next generation. Figure 1 illustrates the flowchart of the CRO algorithm, along with the operators described above. External sexual reproduction (i.e., broadcast spawning process) has the primary role of exploration, whereas internal sexual reproduction (i.e., brooding) is essential for local search. The budding (or asexual reproduction) ensures that the best solutions replicate and span over the reef; so, this process facilitates the exploitation characteristic of the CRO algorithm while maintaining elitism.

CRO evaluation in Continuous Benchmark Problems (CBP)
In applied mathematics, test functions, known as artificial landscapes, are used to evaluate optimization algorithms' characteristics. This includes the convergence rate, precision, robustness, and general performance.
Five well-known continuous benchmark functions (Table 1) have been applied to evaluate the proposed CRO algorithm against other evolutionary algorithms [21]. Figure 2 illustrates the functions: f1 through f3 are unimodal, while f4 and f5 have many local minima. For the sake of visualization, a fixed number of variables (n=2) has been applied for f2 ~ f5. The parameters controlling the CRO model (Table S) were identical to those in the previous study [40]. The number of function evaluations was the same for all compared algorithms: 20,000 in f1, f2, and f4, and 10,000 in f3 and f5.
The number of function evaluations (NFE), as a hardware-independent index, was used to assess the computational cost (effort) and the complexity of the different CRO versions. Due to the non-deterministic nature of the studied algorithms, comparisons should be made on a large set of results obtained from independent executions of each algorithm. The Kruskal-Wallis test can validate the statistical significance of the results obtained by different algorithms; it is a non-parametric method that extends the Mann-Whitney U test to determine whether samples originate from the same distribution [45].
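A minimal sketch of such a significance check, using SciPy's implementation of the Kruskal-Wallis test; the sample data below are synthetic stand-ins for the 30 independent runs, not the paper's actual results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical best-fitness samples from 30 independent runs of two algorithms
runs_a = rng.normal(308.1, 0.05, 30)   # e.g. a modified CRO version
runs_b = rng.normal(307.4, 0.30, 30)   # e.g. the original CRO

h_stat, p_value = stats.kruskal(runs_a, runs_b)
if p_value < 0.01:
    print("difference significant at the 99% level")
```

Rejecting the null hypothesis (small p-value) indicates the two sets of runs do not come from the same distribution.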
At each simulation, 30 executions of each algorithm are launched to obtain well-sampled performance statistics (best, average, and standard deviation for all iterations).
The original version of CRO takes advantage of Gaussian and Cauchy mutations (G+C) [40]. In this research, a simulated binary crossover (SBX)-type operator, which uses the polynomial probability distribution P(δ) [12], is studied for the brooding of the corals:

c = b + δ̄ · Δmax,  with  δ̄ = (2u)^(1/(η+1)) − 1 if u < 0.5, and δ̄ = 1 − [2(1 − u)]^(1/(η+1)) otherwise   (1)

where P(δ) = 0.5(η + 1)(1 − |δ|)^η is the polynomial probability distribution, δ is the perturbance factor, u is a random number in the range (0,1), c is the mutant coral, b is the brooder, and Δmax is the maximum permissible range of the decision variable. The polynomial distribution produces neighbors by keeping its mean at the current value and its variance as a function of the distribution index η. A large value of η makes brooding more exploitative, and a small value of η makes it more explorative [11].
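A minimal sketch of brooding with Deb's polynomial mutation, assuming the standard form of the operator (the paper's own notation for Eq. (1) is only partially recoverable):

```python
import random

def polynomial_brooding(b, delta_max, eta, u=None):
    """Perturb brooder value b with Deb's polynomial mutation.

    eta is the distribution index: large eta -> small perturbations
    (exploitative), small eta -> large perturbations (explorative).
    """
    if u is None:
        u = random.random()                       # u ~ U(0, 1)
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0          # negative side
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))  # positive side
    return b + delta * delta_max
```

At u = 0.5 the perturbation is exactly zero, so the mutant stays at the brooder's value; the spread of the offspring around b shrinks as eta grows.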

Multi-reservoir benchmark problem
Larson first formulated the hypothetical continuous four-reservoir (CFr) problem in 1968 [24], and it was later adopted by Heidari in 1971 [22]. The current benchmark problem is the one modified by Chow and Cortes-Rivera (1974) [10] and resolved using differential dynamic programming by Murray and Yakowitz (1979) [28].
As shown in Figure 3, the water resources system consists of four reservoirs controlling the flow in two streams, primarily for hydropower production and irrigation purposes. The reservoirs' operation is subject to seasonal storage requirements for flood control, which dictate the reservoirs' maximum storage capacity, and to recreation and fish conservation requirements, which impose a minimum storage capacity. The objective is to maximize the total benefit (B) from the system over 12 monthly operation periods, defined as follows:

B = Σ_{r=1..R} Σ_{t=1..T} b_r(t) · u_r(t)   (4)

where R is the total number of reservoirs, T is the total number of periods, b_r(t) is the benefit function in period t for the r-th reservoir, and u_r(t) is the release in period t from reservoir r. The fundamental constraints include the continuity equation for each reservoir over each operating period t, defined as:

S_r(t + 1) = S_r(t) + I_r(t) + Σ_{m=1..R} M_{r,m} u_m(t)   (5)

where S_r(t) and S_r(t + 1) are the storages in reservoir r at times t and t + 1, respectively, and I_r(t) is the inflow volume at period t into reservoir r. Constraints on releases from the reservoirs and on reservoir storages are:

u_min,r(t) ≤ u_r(t) ≤ u_max,r(t)   (7)

S_min,r(t) ≤ S_r(t) ≤ S_max,r(t)   (8)

S_r(1) = S_initial,r ,  S_r(T + 1) = S_target,r   (10)

where u_min,r(t), u_max,r(t), S_min,r(t), and S_max,r(t) are the minimum and maximum release from, and storage of, reservoir r at time t, respectively, and S_initial,r and S_target,r are the initial and target volumes of the r-th reservoir at the beginning and the end of the operation period, respectively. The reservoirs connectivity matrix (RCM), M = [M_{r,m}], describes how releases from upstream reservoirs accrue to downstream ones. Data required for modeling the system, such as inflows, reservoir storages, and the constraints posed on the CFr problem, are presented in Table 2 and Table 3. Further details for this problem can be found in Murray and Yakowitz (1979) [28]; note that the reported value for the global optimum varies in some of the literature.

According to the search domain, two approaches have been studied to solve constrained optimization problems. The first is searching in a broader domain by applying a penalty method, the most common approach.
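The continuity equation with a connectivity matrix can be propagated as in the sketch below; the two-reservoir chain and its data are hypothetical stand-ins, not the CFr values from Tables 2 and 3:

```python
import numpy as np

def simulate_storages(s0, inflows, releases, rcm):
    """Propagate the continuity equation S(t+1) = S(t) + I(t) + M.u(t).

    s0: (R,) initial storages; inflows, releases: (T, R) arrays;
    rcm: (R, R) connectivity matrix with -1 on the diagonal (a reservoir's
    own release leaves it) and +1 where an upstream release enters a
    downstream reservoir.
    """
    T, R = releases.shape
    s = np.zeros((T + 1, R))
    s[0] = s0
    for t in range(T):
        s[t + 1] = s[t] + inflows[t] + rcm @ releases[t]
    return s

# Illustrative 2-reservoir chain: reservoir 1 spills into reservoir 2
rcm = np.array([[-1.0,  0.0],
                [ 1.0, -1.0]])
s = simulate_storages(np.array([5.0, 5.0]),
                      inflows=np.full((3, 2), 1.0),
                      releases=np.full((3, 2), 1.0),
                      rcm=rcm)
```

With equal inflow and release, the upstream reservoir's storage stays constant while the downstream one accumulates the upstream release on top of its own balance.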
This can be done by taking some of the constraints into account via a term added to the objective function that prescribes a relatively high cost for violating the constraints. In this approach, the search algorithm may offer solutions that are unacceptable in terms of constraint violation. In contrast, the feasible direction method (primal direction method) works on the original problem by considering all restrictions and searching the feasible region for an optimal solution. Therefore, the search domain is kept as narrow as possible.
Formulating the problem and setting up the algorithm to deal with the feasible space is challenging, especially in non-convex optimization problems, like the one at hand.

CFr operation using CRO algorithm with the penalty method
All initial corals in the reef are generated in the interval [u_min,r(t), u_max,r(t)], satisfying the corresponding constraint in Eq. (7), when using the penalty approach. During sexual reproduction, this range is similarly used to limit the reproduced larvae. For this purpose, each larva enters a procedure named larvae correction, which applies the associated limitation by projecting out-of-bound solutions onto the edge of the feasible interval. Temporary solutions that violate the state variable's constraints defined by Eqs. (8) and (10) are allocated a lower benefit by the operation policy. This can be achieved by a penalty method in which the operation's total benefit is taken as the operation benefit defined by Eq. (4) minus a penalty cost, defined as:

B′ = B − K · Σ_{r=1..R} Σ_{t=1..T} P(v_r(t))   (11)

where B′ is the total penalized benefit function, P(·) is a penalty function related to the constraints in Eqs. (8) and (10), v_r(t) is the violation function, and K represents the penalty parameter, a large enough positive value to force the search into the feasible region. It is worth mentioning that the penalty term in Eq. (11) is zero for feasible solutions.
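A minimal sketch of the penalty approach for the storage constraints; the quadratic penalty shape and the value of K are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def penalized_benefit(benefit, storages, s_min, s_max, K=1e3):
    """Total benefit minus a penalty proportional to storage-bound violations.

    storages: array of storage values; s_min, s_max: bounds (scalars or
    arrays of the same shape). The penalty is zero for feasible solutions.
    """
    violation = (np.maximum(s_min - storages, 0.0)
                 + np.maximum(storages - s_max, 0.0))
    return benefit - K * np.sum(violation ** 2)
```

A feasible solution passes through unchanged, while any bound violation is charged heavily enough to steer the search back into the feasible region.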

CFr operation using constrained CRO algorithm
Combining Eq. (16) with the original box constraint of Eq. (7) yields tightened bounds on the release volumes for period t. This approach can achieve better solutions and improved convergence characteristics by reducing the size of the search space. Furthermore, it eliminates the problem of tuning the penalty parameter to improve the method's performance. Since the Larvae Correction operator always produces permissible solutions, the constrained CRO's search effort to converge to an optimal or near-optimal solution is expected to be much smaller than with other approaches where any value is allowed during the optimization process.

Improving the CCRO algorithm using the Q-Learning method (CCRO-QL)
Reinforcement learning is an area of machine learning concerned with how learning agents should take actions in an environment to achieve a goal, such as maximizing cumulative reward. During its course of learning, the agent experiences various situations in the environment (states). In each state, the agent may choose an allowable action that may earn a reward (or penalty). Over time, the agent learns to maximize these rewards so as to behave optimally in any given state. In problems where a model of the environment is not available or cannot be learned, model-free methods, which are explicit trial-and-error algorithms, should be used. "Q-learning is a model-free reinforcement learning algorithm to learn the quality of actions telling an agent what action to take under what circumstances" [42]. Q-Learning uses Q-values (also called action-values) to iteratively improve the behavior of the learning agent. The Q-value, Q(S, A), is an estimation of how good it is to take action A in state S. This estimation is iteratively computed using the Temporal Difference (TD) update rule, which is applied at every time step of the agent's interaction with the environment:

Q(S, A) ← Q(S, A) + α [R + γ · max_{A′} Q(S′, A′) − Q(S, A)]

where S is the agent's current state and A is the current action picked according to some policy; S′ is the next state, where the agent ends up; A′ is the next best action, selected using the current Q-value estimation; and R is the current reward observed from the environment. γ is a parameter, 0 ≤ γ ≤ 1, called the discount rate, which determines the present value of future rewards. α is a positive constant step-size parameter that influences the rate of learning.
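The TD update above can be written directly as code; the dictionary-of-dictionaries Q-table used here is just one illustrative representation:

```python
def td_update(Q, s, a, reward, s_next, alpha, gamma):
    """One temporal-difference update:
    Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A)).

    Q maps state -> {action: value}; unseen next states contribute 0.
    """
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])
    return Q[s][a]
```

For example, with Q(s0, a) = 0, Q(s1, a) = 1, reward 1, α = 0.5, and γ = 0.9, the update moves Q(s0, a) to 0.5 · (1 + 0.9 · 1) = 0.95.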
As was mentioned in section 2.3.3, the recombination of old solutions to build offspring solutions is not an effective strategy in recursive optimization problems, since violation of the constraints is common. If the operator used for external sexual reproduction (ESR) is not carefully selected, the rate of change in larvae resulting from the larvae correction phase will be such that larvae do not benefit from their broadcast spawners, as if the crossover had produced a random offspring. Under such conditions, the exploitation part of the algorithm is lost. As a solution to this issue, the authors propose a hybrid method that makes the exploitation phase more efficient using a machine learning approach. The authors believe that eliminating the ESR operator while upgrading the ISR operator with the Q-Learning method can improve the CRO algorithm's exploitation efficiency. The aim is to select some bits in a brooder purposefully, rather than randomly, to participate in the ISR process. The brooding operator remains a polynomial probability distribution (Eq. (1)). Since the broadcast spawning operator is removed, all corals in the reef enter the brooding stage in every iteration.
Assume that each solution to the problem (i.e., a set of water release amounts) is a state (S). The possible actions are sets of nominated bits out of a coral's bits (4×12=48). For each coral, the Q-values form another 4×12 matrix that indicates the importance of each bit in the solution's matrix. The benefit rates in Table S.2 provide suitable initial action-values after the reef initialization. The agent is reward-motivated and learns how to pick bits through feedback from the environment; hence, appropriate rewards (including penalties) and their magnitudes are needed. The agent receives a positive reward if the objective function improves by ΔB (i.e., the total benefit in Eq. (4)); otherwise, it is penalized for decreasing the total benefit.
Accordingly, the reward can be defined as the normalized ratio of ΔB to the bit's shift amount. It should also be noted that a high value of the discount factor (γ) captures the effective long-term reward, whereas a low discount factor makes the agent consider only the immediate reward.
The tradeoff between exploration and exploitation is controlled by a policy called ε-greedy, preventing possible overfitting during training. Under an ε-greedy policy, the algorithm most of the time chooses the action with the maximal estimated action-value, yet with a small probability (ε) the action is selected randomly. The values of α, γ, and ε are mostly based on intuition and partly on trial and error. Generally, all three should decrease over time, as the agent builds up more resilient priors as it continues to learn. Figure 4 illustrates the pseudo-code of the CCRO-QL algorithm with its different phases.
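The ε-greedy selection itself reduces to a few lines; this sketch picks an action index from a list of action-values:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the index of the best-valued action with probability 1 - epsilon,
    otherwise pick a uniformly random index (exploration)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))            # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
```

With ε = 0 the policy is purely greedy; decaying ε over iterations shifts the agent from exploration toward exploitation, matching the schedule described above.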
Due to the interaction between the learning agent and its environment in terms of states, actions, and rewards, the CCRO-QL method uses the formal framework of Markov decision processes (MDP). This framework represents the essential features of the artificial intelligence problem, such as a sense of cause and effect, uncertainty and non-determinism, and an explicit goal [42]. In CCRO-QL, the whole population of agents learns according to a common policy; therefore, it can be considered multi-agent reinforcement learning.

Hyperparameters optimization
The conventional way to achieve hyperparameter optimization is to perform a grid search or parameter sweep: an exhaustive search through a manually designated subset of the feasible space. In grid search, the set of trials is formed by assembling every discrete combination of possible values. In both approaches (CRO and CCRO), a two-stage grid search has been performed over the model operators and their parameters. In the first stage, based on the CRO evaluation on the mathematical test functions, three selection methods (i.e., roulette wheel selection, tournament selection, and uniform random selection) were used to choose broadcast spawners. Two conventional methods, the geometric and arithmetic crossover operators, were investigated as well. Three well-known probability distributions, the Gaussian, Cauchy, and polynomial distributions, were designated for the brooding operator. In the second stage, a search through a fine grid was performed for all parameters, following Yang et al. [49] and Salcedo-Sanz et al. [38-40]. Figure 5 shows the evolution of the PSO solutions and the two CRO versions for two functions out of the five CBP.
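Such a sweep reduces to exhaustive enumeration over a discrete grid; the parameter names and the toy objective below are hypothetical stand-ins for a full CRO run:

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustive sweep over a discrete parameter grid; returns the best
    setting and its score (maximization assumed)."""
    best_params, best_score = None, float("-inf")
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for a full algorithm evaluation (hypothetical)
grid = {"Fb": [0.7, 0.8, 0.9], "Pd": [0.01, 0.05, 0.1]}
best, score = grid_search(lambda p: -(p["Fb"] - 0.8) ** 2 - p["Pd"], grid)
```

The coarse first stage narrows the operator choices; the same routine run on a finer grid then tunes the numeric parameters.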

CRO evaluations in the CFr problem
Except for the reef size, the CRO parameters and operator values are initialized based on the previous step's results, as shown in Table S. Almost all recent studies that applied meta-heuristic methods to solve the CFr benchmark problem are listed in Table S. In the forward method, feasible releases are generated period by period so as to satisfy Eqs. (7) and (8). In the backward method, these steps are inverted with the initial assumption S_r(T + 1) = S_target,r, and Eq. (5) is rewritten in terms of u_r(t). To keep the diversity, both forward and backward approaches were used based on a generated uniform random number. From Table 5, it can be deduced that CCRO is more robust than the original CRO, because its STD value is seven times smaller than that of the CRO, which implies that the algorithm's uncertainty decreased by adopting the primal direction method. The p-value obtained from Student's t-test rejects the null hypothesis; thus, there is a significant difference between the CRO results and the CCRO's at the 99% confidence level.
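A simplified single-reservoir sketch of the forward feasible-generation pass (the coupling through the connectivity matrix is omitted, and the bounds and data are illustrative):

```python
import numpy as np

def forward_feasible_releases(s0, inflows, u_min, u_max, s_min, s_max, rng):
    """Generate one feasible release schedule period by period (forward pass).

    At each period t a release is drawn inside the box constraint and then
    clipped so that the resulting storage s + I(t) - u stays in
    [s_min, s_max], keeping every intermediate state feasible.
    """
    T = len(inflows)
    u = np.empty(T)
    s = s0
    for t in range(T):
        draw = rng.uniform(u_min, u_max)
        lo = max(u_min, s + inflows[t] - s_max)   # keep storage below s_max
        hi = min(u_max, s + inflows[t] - s_min)   # keep storage above s_min
        u[t] = np.clip(draw, lo, hi)
        s = s + inflows[t] - u[t]
    return u
```

The backward variant runs the same clipping logic from the target storage at T + 1 toward the first period.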

CCRO-QL Performance
Statistics, including the mean, standard deviation, and p-value in Table 5, show that CCRO-QL outperforms the other proposed methods: the agent learns to pick the more effective bits at each step. In multi-agent reinforcement learning, the scenario in which all agents try to maximize their own reward signal is known as a competitive setting. In this situation, conflicts of interest among the agents are likely, implying that actions proper for some agents are improper for others. That is why fixed hyperparameters (i.e., step-size, discount rate, and probability) did better than adaptive ones, which tended to converge to local optima, for the CFr problem.
The best objective function value obtained with the CCRO-QL method equaled 308.29 to two decimal places, the same as the maximum global value achieved via linear programming (LP). Since the precise global optimum is guaranteed through linear programming, decimal accuracy is necessary to distinguish local optima. Figure 6 displays three different paths with an equal total benefit (B) at two-decimal-place accuracy, although two of them do not correspond to the global optimum. Figures 7a and 7b compare the optimal release volumes and storage volumes obtained using the CCRO-QL algorithm (with B = 308.2906 and NFE = 300,000) with those of the LP method. The fact that the constrained algorithm searches within the feasible region of the search space is best illustrated by the convergence of the solutions shown in Figure 8. It is evident that both the CCRO and CCRO-QL algorithms start with B above 275, while the CRO algorithm could only produce solutions of this level later in the search. The maximum and average results with NFE = 100,000 were 308.2628 and 308.0355, respectively, which is still reliable. The proposed method obtained the exact global optimum (i.e., 308.2915) by performing 880,000 function evaluations. Table S.5 shows the release amounts obtained by LP and CCRO-QL for the CFr problem, which have never been provided in the literature. As seen from the table, all decision variable values converge to those from LP. Therefore, it can be concluded that the CFr problem has only one global optimum (i.e., it is unimodal), the same as the result obtained via LP.
It should be noted that almost all results reported as optimal in the literature (Table S.6) have nearly the same decision-variable values according to the best fitness at two decimal places of accuracy. Yet any results better than LP would be questionable, since linear programming, as a deterministic numerical method, produces the exact global optimum.

Conclusions
This paper has solved the continuous four-reservoir problem as a benchmark problem in surface water resources management. Three versions of a meta-heuristic algorithm, the so-called Coral Reefs Optimization algorithm, which simulates the reproduction and formation of coral reefs, have been used to tackle the problem. The first step has shown that the proposed approach can deal with continuous problems through experiments carried out on five well-known benchmark functions. The CRO has obtained reliable and robust solutions, which is supported by previous research [39,40]. The results indicate that the uncertainty of the optimized solutions decreases after narrowing the search space and modifying the reproduction operators, because the uncertainty was mainly caused by penalized offspring and an inefficient recombination method.