A linear surrogate for optimising functions of an orthogonal matrix with applications in wave function theory

The technique of surrogate optimisation is to use a simpler function to approximate a complex function that is time-consuming to evaluate. We show that the maximum of a special type of surrogate function, f(U) = Tr(AU) with U an orthogonal matrix, is at U = A^T(AA^T)^{-1/2}, and that there is one and only one local maximum both in SO(n) and in O(n) − SO(n). This function has been found useful in various aspects of electronic structure theory, including proving the Carlson-Keller theorem and localising orbitals. As one further example, we apply it here to optimise the ground state of molecules using the generalised valence bond wavefunction.


Introduction
Optimisation problems are ubiquitous in quantum chemistry, and the objective function is often expensive to evaluate. For example, in geometry optimisation we seek local minima (or saddle points) on molecular potential energy surfaces, and in self-consistent field (SCF) orbital optimisation we seek the minimum on a Grassmann manifold. Both problems are non-trivial due to the computational cost of evaluating the objective function and its gradients at a large number of trial points. Therefore, we can achieve a considerable speedup if we construct an approximate function that is fast to evaluate; this is the idea of surrogate optimisation. Recently, this idea has attracted some interest in chemistry, with research emphasis particularly on using machine learning for geometry optimisation [1][2][3][4]. There are also applications in experimental design [5], predicting molecular properties [6] and calculating the electronic structure [7,8].

CONTACT Martin Head-Gordon mhg@cchem.berkeley.edu Chemical Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
It is uncommon to see surrogate optimisation discussed as such in electronic structure theory, but there is one common example: the Roothaan step in SCF optimisation [9] updates the density matrix by finding the global minimum (or a saddle point in exotic cases [10][11][12]) of a surrogate function, E = Tr(PF), where F is the current Fock matrix (taken as constant) in an orthogonal basis. This surrogate function is optimised directly by diagonalising F, yielding the density matrix P = ∑_{j=1}^{O} C_j C_j^T, where C_j are the orthonormal eigenvectors corresponding to the O (number of electrons) lowest eigenvalues: FC_j = C_j ε_j. Apart from this classic example, and the two machine-learning-relevant papers mentioned above, there is only one highly developed idea: the second-order convergence method [13]. For instance, in multi-configuration SCF (MCSCF) [13][14][15][16] and generalised valence bond (GVB) theory [17,18], we want to minimise the system energy E, which is a function E(U) of an orthogonal matrix. U can be written as the matrix exponential of an antisymmetric matrix, U = exp(Δ), and second-order convergence methods expand E either to second order in Δ and minimise the surrogate analytically, or to second order in T = U − I and minimise it iteratively. In this paper, however, we will find the exact extremum of a U-linearised surrogate function Tr(AU), without the need to further approximate U by a polynomial or to iteratively solve a set of non-linear equations.
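The Roothaan step just described can be sketched in a few lines of NumPy; the toy "Fock" matrix below is an assumption for illustration only:

```python
import numpy as np

def roothaan_step(F, n_occ):
    """One Roothaan step: minimise the surrogate E = Tr(PF) for fixed F
    by diagonalising F and occupying the n_occ lowest eigenvectors."""
    eps, C = np.linalg.eigh(F)          # eigenvalues in ascending order
    C_occ = C[:, :n_occ]                # the O lowest eigenvectors
    P = C_occ @ C_occ.T                 # density matrix P = sum_j C_j C_j^T
    return P

# toy symmetric "Fock" matrix in an orthogonal basis
rng = np.random.default_rng(0)
F = rng.standard_normal((6, 6))
F = 0.5 * (F + F.T)
P = roothaan_step(F, n_occ=2)
# P is an idempotent projector (P^2 = P) with trace equal to n_occ
```

The minimum property follows from the Ky Fan inequality: no other rank-2 projector gives a smaller Tr(PF) than the one built from the two lowest eigenvectors.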
Based on the following lemma, we can find the exact extremum of f(U) = Tr(AU). Lemma 1.1: Given an invertible matrix A ∈ R^{n×n} and f : SO(n) → R, f(U) = Tr(AU), then f has one and only one local maximum. Following standard notation [19], SO(n) refers to the special orthogonal group of rotation matrices with determinant equal to 1, in contrast to the more general orthogonal group, O(n), which also includes e.g. reflection matrices.
Subotnik and co-workers attempted to prove it [20], and claimed that the maximum point of f is

U_1 = A^T(AA^T)^{-1/2}.

However, this is incorrect: whenever det A < 0, the resulting U_1 has determinant −1 and therefore lies outside SO(n), which is clearly contradictory. In this paper, we will give an accurate statement and complete proof of this lemma in Section 2. Then, several applications are considered in Section 3, including orbital orthogonalisation, where another proof of the Carlson-Keller theorem [21] is given; orbital localisation, where we show that the hidden conjecture made by Subotnik et al. is actually reasonable; and finally orbital optimisation in GVB theory. This lemma, as pointed out by an anonymous reviewer, is equivalent to the 'orthogonal Procrustes problem' [22] used in many other fields, including psychometrics [23], crystallography [24,25], computer science [26] and bioinformatics [27], and is a special case of Wahba's problem [28]. But as we have mentioned before, this type of surrogate approach is rare in electronic structure theory. We hope this paper might attract some interest in proposing and analysing new surrogate optimisation models in this field.

Lemma and its proof
Write the singular value decomposition of A as A = FΣG^T, where F, G ∈ O(n) and Σ = diag(σ_1, σ_2, ..., σ_n) holds the singular values sorted in descending order. Then

f(U) = Tr(AU) = Tr(FΣG^T U) = Tr(Σ G^T UF),

where G^T UF ∈ O(n). As G^T UF is orthogonal, its diagonal elements must not have absolute values larger than 1, so

f(U) = ∑_i σ_i (G^T UF)_{ii} ≤ ∑_i σ_i,

and the equality is achieved only when G^T UF = I_n; here I_n is the identity matrix. Actually, as AA^T = FΣ^2 F^T, we know that the global maximum over O(n) is located at U_1 = GF^T = A^T(AA^T)^{-1/2} (Lemma 2.2). For the case that f is restricted to SO(n), the location of the maximum depends on whether det A is positive or negative.
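This argument is easy to check numerically; the random test matrix below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))            # invertible with probability 1

# SVD in the paper's convention A = F Sigma G^T
Fm, sigma, Gh = np.linalg.svd(A)           # A = Fm @ diag(sigma) @ Gh
U1 = Gh.T @ Fm.T                           # maximiser U_1 = G F^T over O(n)

# the same point written as A^T (A A^T)^{-1/2}
w, V = np.linalg.eigh(A @ A.T)
U1_alt = A.T @ (V @ np.diag(w**-0.5) @ V.T)

# the maximum value equals the sum of the singular values
assert np.isclose(np.trace(A @ U1), sigma.sum())
```

Both expressions for the maximiser agree, and Tr(AU_1) reproduces the bound ∑_i σ_i.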

Local maximum when det A > 0
To find any local maximum, we first consider the case det A > 0. In this case, because G^T SO(n) F = SO(n), the problem is equivalent to finding the local maximum of f(U) = Tr(ΣU) on SO(n), where Σ is a positive sorted diagonal matrix, i.e.

Σ = diag(σ_1 I_{k_1}, σ_2 I_{k_2}, ..., σ_m I_{k_m}),   (1)

where k_1, k_2, ..., k_m are the multiplicities of the distinct singular values σ_1 > σ_2 > ... > σ_m > 0, and I_k is the identity matrix of dimension k. Any local maximum must satisfy two conditions: first, it must be a stationary point; second, its function value must not be smaller than that of any point in its neighbourhood. The condition for a stationary point is presented in Lemma 2.3 below.

Lemma 2.3: Given a positive sorted diagonal matrix Σ (the definition of such matrices is given above), the stationary points of the function f(U) = Tr(ΣU) on SO(n) are exactly the block diagonal matrices

U = diag(U_{k_1}, U_{k_2}, ..., U_{k_m}),   (2)

where U_{k_1}, U_{k_2}, ..., U_{k_m} are symmetric orthogonal matrices of dimensions k_1, k_2, ..., k_m.
Proof: At a certain stationary point candidate U_0, we consider the shifted function f_0 : SO(n) → R, f_0(U) = f(U_0 U) = Tr(ΣU_0 U), and we write U = exp(Δ), where Δ is antisymmetric. Differentiating with respect to the elements of Δ at the point Δ = 0 gives

∂f_0/∂Δ_{ij} |_{Δ=0} = (ΣU_0)_{ji} − (ΣU_0)_{ij} = 0.

Therefore, a stationary point U_0 must satisfy ΣU_0 = (ΣU_0)^T. We claim that every U satisfying ΣU = (ΣU)^T has the form in Equation (2), and the lemma follows once that is verified. As ΣU is symmetric, the squared norm of its first row must be equal to that of its first column, i.e.

σ_{11}^2 (u_{11}^2 + u_{12}^2 + ... + u_{1n}^2) = σ_{11}^2 u_{11}^2 + σ_{22}^2 u_{21}^2 + ... + σ_{nn}^2 u_{n1}^2.
However, U is orthogonal, so u_{11}^2 + u_{12}^2 + ... + u_{1n}^2 = u_{11}^2 + u_{21}^2 + ... + u_{n1}^2 = 1, which means

(σ_{11}^2 − σ_{22}^2) u_{21}^2 + ... + (σ_{11}^2 − σ_{nn}^2) u_{n1}^2 = 0,

so, for i = 2, ..., n, either σ_{ii}^2 = σ_{11}^2 or u_{i1} = u_{1i} = 0. Considering that Σ has the form in Equation (1), we know that U must now be of the form

U = [ U_{k_1}  0 ;  0  * ],

where U_{k_1} is a k_1 × k_1 matrix and * stands for a part that is not yet determined. Because the first k_1 rows of U are orthonormal, U_{k_1} is orthogonal. Also, as ΣU is symmetric, σ_1 U_{k_1} is symmetric, and thus U_{k_1} is symmetric. Applying the same argument to the (k_1 + 1)th row and (k_1 + 1)th column, and so on, we finally arrive at

U = diag(U_{k_1}, U_{k_2}, ..., U_{k_m}),

where U_{k_1}, U_{k_2}, ..., U_{k_m} are symmetric orthogonal matrices whose determinants are not necessarily positive.
We know that the function value at a local maximum cannot be smaller than that at any point in its neighbourhood, and we will test this requirement at all stationary points; the result is Lemma 2.4. By Lemma 2.3 and Lemma 2.4, among all stationary points only U = I_n can be a local maximum, and as U = I_n is the global maximum point (Lemma 2.2), it is indeed a local maximum.

Lemma 2.4: Given the matrix Σ and the function f as in Lemma 2.3, the only stationary point of f that can be a local maximum is U = I_n.
Proof: Firstly, U can be diagonalised: U = WΛW^T, where W = diag(W_{k_1}, ..., W_{k_m}) is block diagonal and Λ is a diagonal matrix with elements 1 or −1; we write Λ in block diagonal form Λ = diag(Λ_{k_1}, ..., Λ_{k_m}). Notice that W_{k_1}, ..., W_{k_m} are all orthogonal matrices, and U_{k_i} = W_{k_i} Λ_{k_i} W_{k_i}^T for all i. We will prove this lemma in two steps: supposing U is a stationary point and a local maximum of f, we will first show that no block of U can have two −1 eigenvalues, and then we will show that all −1 eigenvalues are in the same block of U. With these two claims, U can have at most one −1 eigenvalue; considering that det U > 0, all eigenvalues must be 1, and thus U = I_n.

To see that no block of U can have two −1 eigenvalues, without loss of generality we can assume that the first two eigenvalues of the first block U_{k_1} are −1, i.e.

Λ_{k_1} = diag(−1, −1, ±1, ..., ±1),

where ±1 stands for elements that can be either 1 or −1. We can build, for θ ∈ [0, 2π), the path U(θ) = WΛ(θ)W^T, where Λ(θ) is obtained from Λ by replacing the leading diag(−1, −1) with the rotation block

[ cos θ  sin θ ; −sin θ  cos θ ],

so that U(π) = U and U(θ) ∈ SO(n) for all θ. It is clear that for any neighbourhood of U in SO(n), there exists a θ close but not equal to π such that U(θ) is in that neighbourhood. But we see that

Tr(ΣU(θ)) − Tr(ΣU) = 2σ_1 (1 + cos θ) > 0 for θ ≠ π,

so U cannot be a local maximum.

In the case that two different blocks have −1 eigenvalues, we can assume that the first eigenvalues of the first two blocks U_{k_1}, U_{k_2} are −1. A path U'(θ) can be built similarly as before by rotating between the two corresponding eigenvectors, for a θ ∈ [0, 2π) close but not equal to π, and it can be ensured that such θ exists for any neighbourhood of U such that U'(θ) is in that neighbourhood. We can calculate that

Tr(ΣU'(θ)) − Tr(ΣU) = (1 + cos θ) (σ_1 ∑_i w_{k_1,1i}^2 + σ_2 ∑_i w_{k_2,1i}^2) = (1 + cos θ)(σ_1 + σ_2) > 0 for θ ≠ π,

where w_{k_1,1i} is an element in the first row of W_{k_1} and w_{k_2,1i} is defined similarly; the last equality follows from the fact that W_{k_1}, W_{k_2} are orthogonal. The fact that Tr(ΣU'(θ)) > Tr(ΣU) means U cannot be a local maximum, so all −1 eigenvalues must be in the same block, and the proof is thus complete.
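The rotation-path argument above can be checked numerically in a small case; the specific Σ and U below are assumptions chosen for illustration:

```python
import numpy as np

Sigma = np.diag([2.0, 2.0, 1.0])
U = np.diag([-1.0, -1.0, 1.0])      # stationary: Sigma @ U symmetric, det U = +1

def U_theta(theta):
    """Path in SO(3) through U: replace diag(-1, -1) by a rotation; U(pi) = U."""
    c, s = np.cos(theta), np.sin(theta)
    out = np.eye(3)
    out[:2, :2] = [[c, s], [-s, c]]
    return out

# moving slightly away from theta = pi strictly increases the trace,
# so U is a stationary point but not a local maximum
for theta in (np.pi - 0.1, np.pi + 0.1):
    assert np.trace(Sigma @ U_theta(theta)) > np.trace(Sigma @ U)
```

Here Tr(ΣU(θ)) = 4 cos θ + 1, which is minimised, not maximised, at θ = π.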
With Lemma 2.3 and Lemma 2.4, we know that Lemma 1.1 is true when det A > 0, and the only local maximum is U_1 = A^T(AA^T)^{-1/2}.

Local maximum when det A < 0
When det A < 0, G^T SO(n) F ≠ SO(n), but

Tr(AU) = Tr(FΣG^T U) = Tr(ΣL · LG^T UF),

where the matrix L = diag(1, ..., 1, −1), and now LG^T SO(n) F = SO(n). So, by defining Σ̃ = ΣL, it is sufficient to prove that the function f(U) = Tr(Σ̃U) has only one local maximum in SO(n); here

Σ̃ = diag(σ_1 I_{k_1}, ..., σ_{m−1} I_{k_{m−1}}, σ_m I_{k_m − 1}, −σ_m),

and we need σ_m to be non-degenerate to ensure that the global maximum is unique. The proof of Lemma 2.3 does not require Σ to be positive; it only requires the elements with larger absolute values to be in the upper left part of the matrix, which is also the case for Σ̃, so nothing needs to be done there. The first part of Lemma 2.4, showing that two −1 eigenvalues cannot be in the same block, requires the relevant σ to be positive; but the block in Σ̃ that has a negative σ is only one-dimensional, so finding two −1 eigenvalues there is naturally impossible. The second part of Lemma 2.4 shows that all −1 eigenvalues are in the same block. That proof only requires the sum of the σ in the relevant blocks to be positive, and since σ_1 > σ_2 > ... > σ_m > 0, the requirement σ_i − σ_m > 0 is satisfied. Therefore, Lemma 2.3 and Lemma 2.4 are applicable to f(U) = Tr(Σ̃U), giving the only local maximum at LG^T UF = I_n ⇔ U = GLF^T = U_{−1}.
Finally, the complete form of Lemma 1.1 is summarised below.

Lemma 2.5: Given an invertible matrix A ∈ R^{n×n} with singular value decomposition A = FΣG^T, the function f(U) = Tr(AU) has one and only one local maximum in each of SO(n) and O(n) − SO(n), with different function values. The maximum in the component containing U_1 = GF^T = A^T(AA^T)^{-1/2} is located at U_1, with value ∑_i σ_i (the global maximum over O(n)); the maximum in the other component is located at U_{−1} = GLF^T with L = diag(1, ..., 1, −1), with value ∑_i σ_i − 2σ_n, and it is unique provided the smallest singular value is non-degenerate.
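The maximiser over SO(n), including the sign flip for det A < 0, can be sketched compactly (the random test matrix is an assumption for illustration):

```python
import numpy as np

def max_trace_AU(A, special=True):
    """Location of the unique local maximum of f(U) = Tr(AU).
    special=True restricts U to SO(n); otherwise U ranges over O(n)."""
    Fm, sigma, Gh = np.linalg.svd(A)           # A = F Sigma G^T
    signs = np.ones(len(sigma))
    if special and np.linalg.det(A) < 0:
        signs[-1] = -1.0                       # L = diag(1, ..., 1, -1)
    return (Gh.T * signs) @ Fm.T               # U = G L F^T

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
if np.linalg.det(A) > 0:
    A[:, 0] *= -1.0                            # force det A < 0 for the example
U = max_trace_AU(A)
assert np.isclose(np.linalg.det(U), 1.0)       # the maximiser stays inside SO(n)
```

With `special=False` the function returns GF^T, the global maximiser over O(n), whose trace value is never smaller than the SO(n)-restricted one.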

Löwdin symmetric orthogonalisation
Löwdin symmetric orthogonalisation [29] is a method to orthogonalise orbitals. For a set of n linearly independent orbitals {φ_i}_{i=1}^n with coefficient matrix C_0 and overlap matrix S_0, an orthogonalised set can be conveniently written as

C = C_0 S_0^{-1/2} U,

where U is an arbitrary matrix in O(n), and U = I gives the symmetric orthogonalisation. The Carlson-Keller theorem [21,[30][31][32] claims that the symmetrically orthogonalised orbital set resembles the original set most closely, i.e. U = I minimises the function

g(U) = ∑_i ||φ̃_i − φ_i||^2,

where {φ̃_i} are the orthogonalised orbitals. Expanding g shows that minimising it is equivalent to maximising Tr(S_0^{1/2} U), and we know directly from Lemma 2.5 that the global minimum point is indeed U = I, and that U = I is also the only local minimum in SO(n).
This result can be directly generalised to weighted orthogonalisation [33]. In that case, the function to be minimised is

g_W(U) = ∑_i w_i ||φ̃_i − φ_i||^2,

where the positive weights are {w_i}_{i=1}^n and the matrix W is diagonal, W = diag(w_1, w_2, ..., w_n).
From Lemma 2.5, we know that the global minimum point of g_W is U = S_0^{1/2} W (W S_0 W)^{-1/2}, so the coefficient matrix of the weighted orthogonalised orbital set is

C = C_0 S_0^{-1/2} U = C_0 W (W S_0 W)^{-1/2},

which is the same as the classic result, and we also know that this U is the only local minimum in SO(n). The fact that the local minimum is unique justifies iterative algorithms designed for these processes [34].
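Both orthogonalisations can be sketched in NumPy; the toy coefficient matrix, the orthonormal AO basis, and the weights are assumptions for illustration:

```python
import numpy as np

def inv_sqrt(M):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w**-0.5) @ V.T

rng = np.random.default_rng(3)
C0 = rng.standard_normal((6, 4))               # 4 linearly independent orbitals
S_ao = np.eye(6)                               # toy AO overlap (orthonormal AOs)
S0 = C0.T @ S_ao @ C0                          # orbital overlap matrix

C_lowdin = C0 @ inv_sqrt(S0)                   # symmetric orthogonalisation (U = I)

w_list = np.array([4.0, 3.0, 2.0, 1.0])        # positive weights
W = np.diag(w_list)
C_weighted = C0 @ W @ inv_sqrt(W @ S0 @ W)     # weighted orthogonalisation
```

Both results are orthonormal sets, and each minimises its respective (weighted) distance to the original orbitals among all orthonormal sets spanning the same space.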

Surrogate optimisation in orbital localisation
Orbital localisation techniques are crucial to understanding chemical bonding, and also underpin most of the linear-scaling algorithms in quantum chemistry. Three important examples are Boys [35,36], Pipek-Mezey [37], and Edmiston-Ruedenberg [38] localisation, which maximise the following functionals of U ∈ O(n), given a set of orthonormalised orbitals {φ_i}_{i=1}^n:

Boys: f_B = ∑_i |⟨φ_i|r|φ_i⟩|^2,
Pipek-Mezey: f_PM = ∑_i ∑_{A=1}^{N_A} ⟨φ_i|P_A|φ_i⟩^2,
Edmiston-Ruedenberg: f_ER = ∑_i (φ_i φ_i | φ_i φ_i),

where, in f_PM, N_A is the number of atoms and P_A is the projection operator onto the space spanned by the atomic orbitals of atom A. These functionals are not linear with respect to U, so Lemma 2.5 cannot be utilised directly. However, we can easily write surrogate functions and hope that a series of surrogate optimisations will converge to the global maximum. In other words, we write surrogate functions that are linear in U by freezing part of the U-dependence at a fixed set of orbitals {φ_{i,0}}. All three surrogate functionals can be recast into the form Tr(AU); for example, the surrogate for f_B can be written as

f_B(U) = ∑_i ⟨φ_i|r|φ_{i,0}⟩ · ⟨φ_{i,0}|r|φ_{i,0}⟩ = Tr(A_B U),

where the elements of A_B are (A_B)_{ij} = ⟨φ_{j,0}|r|φ_{i,0}⟩ · ⟨φ_{i,0}|r|φ_{i,0}⟩.

Similarly, the A matrices for Pipek-Mezey and Edmiston-Ruedenberg are

(A_PM)_{ij} = ∑_{A=1}^{N_A} ⟨φ_{j,0}|P_A|φ_{i,0}⟩ · ⟨φ_{i,0}|P_A|φ_{i,0}⟩,
(A_ER)_{ij} = (φ_{j,0} φ_{i,0} | φ_{i,0} φ_{i,0}).
Therefore, a primitive iterative algorithm looks like:
(1) For the current set of orbitals (matrix C), calculate A;
(2) Compute the optimal U based on Lemma 2.5;
(3) Update C = CU; if U ≈ I, stop, otherwise go back to Step 1.
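The steps above can be sketched for the Boys surrogate; the random symmetric matrices standing in for the dipole integrals ⟨φ_i|r_d|φ_j⟩ are assumptions for illustration, and the maximiser is taken over O(n):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
# toy dipole-integral matrices R_d[i, j] ~ <phi_i | r_d | phi_j>, d = x, y, z
R = [0.5 * (M + M.T) for M in rng.standard_normal((3, n, n))]

def boys(R):
    """Boys objective: sum_i |<phi_i| r |phi_i>|^2."""
    return sum(np.diag(Rd)**2 for Rd in R).sum()

for it in range(200):
    # surrogate f~(U) = Tr(A U) with A_ij = sum_d <phi_j|r_d|phi_i><phi_i|r_d|phi_i>
    A = sum(Rd.T * np.diag(Rd)[:, None] for Rd in R)
    Fm, _, Gh = np.linalg.svd(A)
    U = Gh.T @ Fm.T                       # exact maximiser of Tr(AU) over O(n)
    R = [U.T @ Rd @ U for Rd in R]        # update orbitals: C <- C U
    if np.allclose(U, np.eye(n), atol=1e-10):
        break                             # U ~ I: surrogate is stationary
```

Each pass solves its surrogate exactly via the SVD (a polar decomposition), which is the O(N^3) step noted below.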
The algorithm can be accelerated by extrapolation techniques, for example DIIS, as studied in Subotnik et al.'s paper [20]. It is worth mentioning that although Lemma 2.5 guarantees that each iteration moves to the global maximum of the surrogate function, the algorithm is not guaranteed to converge to the global maximum of the original problem. The computational cost of the algorithm for all three localisation techniques scales as O(N^3), where N is the number of basis functions.
As the determinant of A is not necessarily positive here (cf. the previous section, where det A is always positive), this point deserves a brief discussion. Since the occupied orbitals are orthogonal to each other, one can expect A_PM and A_ER to be strictly diagonally dominant, so they must have positive determinants in most chemical systems, because the diagonal elements of these two matrices are always positive. On the other hand, A_B, being heavily dependent on the position of the origin, does not share this property and may have a negative determinant (Figure 1). It is only because we linearise f to the surrogate that such a parity-violating difference occurs.

Surrogate optimisation in generalised valence bond theory
The idea of preparing a linearised surrogate function, as in the previous section, can also be applied to other branches of quantum chemistry. We will briefly look at an example: the active-active orbital mixing in minimal-basis generalised valence bond (GVB) theory [39]. In contrast to conventional SCF orbital optimisation, where only occupied-virtual orbital mixings affect the energy (a Grassmann manifold), all active-active rotations affect the GVB energy (a Stiefel manifold), which typically makes the optimisation more challenging [18,40,41]. For simplicity, we will study a closed-shell system with frozen cores. The spatial part of the perfect-pairing (PP) GVB wavefunction is

Ψ = Â [ Φ_core Π_{i=1}^{n_p} (c_{gi} φ_{gi}(1) φ_{gi}(2) − c_{ui} φ_{ui}(1) φ_{ui}(2)) ],

where Â is the antisymmetriser, n_p is the number of electron pairs, c_{gi}, c_{ui} > 0 and c_{gi}^2 + c_{ui}^2 = 1. All φ orbitals are normalised and mutually orthogonal. Goddard et al. [17,42] have derived the energy associated with this GVB-PP wavefunction,

E = 2 ∑_{i=1}^n f_i h_{ii} + ∑_{i,j=1}^n (a_{ij} J_{ij} + b_{ij} K_{ij});

here n = 2n_p is the number of active orbitals, f_i = c_i^2 are the orbital occupation coefficients, h_{ii} = (φ_i|ĥ_eff|φ_i) and

ĥ_eff = ĥ + ∑_{j=1}^{n_f} (2Ĵ_j − K̂_j).

In the last formula, n_f is the number of frozen (doubly occupied) orbitals, ĥ is the usual one-electron Hamiltonian, and Ĵ_j, K̂_j are the standard Coulomb and exchange operators for frozen orbital j; J_{ij}, K_{ij} are Coulomb and exchange integrals over the active orbitals, with coupling coefficients a_{ij}, b_{ij} determined by the occupations. As Lemma 2.5 only finds the maximum of a function of an orthogonal matrix, using the lemma to simultaneously optimise {φ_i}_{i=1}^n and {f_i}_{i=1}^n is not possible. We can, however, propose a two-step algorithm:
(1) Fix {f_i}_{i=1}^n and optimise the orbitals {φ_i}_{i=1}^n, which can be achieved with a surrogate function as described below;
(2) Fix the orbitals and optimise {f_i}_{i=1}^n, which is an algebraic, n_p-variable, constrained optimisation problem;
(3) Check whether self-consistency is reached; if yes, stop, otherwise go back to Step 1.
We will not discuss Step 2 in detail, as it is not the main topic of this paper. For Step 1, we can linearise the energy function with respect to U about the fixed current orbitals {φ_{i,0}}, eliminating the overall factor of 2 to compensate for the fact that the two terms in E(U) have different orders in U (quadratic and quartic, respectively). Collecting the terms linear in U and defining

G_{ij} = (φ_{j,0}|F̂_i|φ_{i,0}),

where F̂_i is the Fock operator of orbital i, the surrogate function becomes Ẽ(U) = Tr(GU), which Lemma 2.5 maximises exactly. The scaling of the algorithm is O(n^3 N), where N is the number of atomic orbital basis functions, as we need to do an integral transformation to build G.
We have tested the algorithm on CH_4 with the minimal STO-3G basis set (Figure 2). To achieve stable convergence, on every iteration of Step 1, instead of transforming by the optimal U, we use U^{1/2}, and we update {f_i}_{i=1}^n after every 5 iterations of orbital optimisation. The {f_i}_{i=1}^n are optimised by the Rosen gradient projection method, with the golden-section method for the line searches. From the figure, it is clear that the 1-pair GVB treatment is enough to achieve a qualitatively correct potential energy surface, although of course the 4-pair GVB wavefunction recovers more correlation energy than the 1-pair wavefunction. At a bond length of 2.00 Å, it takes about 1500 iterations to converge, which is much slower than the roughly 25 iterations needed by the geometric direct minimisation (GDM) method [18]; how to stably accelerate the algorithm is a direction for future study. It is worth pointing out that this algorithm can, in principle, also be applied to HF calculations, where {f_i}_{i=1}^n is fixed and only the orbitals need optimising. However, this is not likely to be competitive with the Roothaan step, which exactly extremises a surrogate function that is quadratic in U.
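The damped orbital update can be sketched as follows; the stand-in surrogate matrix G and the toy coefficient matrix are assumptions for illustration (building the true G requires the transformed integrals described above), and we restrict U to SO(n) so that its principal square root is real:

```python
import numpy as np

def orthogonal_sqrt(U):
    """Principal square root of a special orthogonal matrix via its complex
    eigendecomposition; real up to round-off when no eigenvalue equals -1."""
    w, V = np.linalg.eig(U)
    w = w.astype(complex)
    return np.real((V * np.sqrt(w)) @ np.linalg.inv(V))

rng = np.random.default_rng(5)
G = rng.standard_normal((4, 4))              # stand-in for the surrogate matrix G
Fm, _, Gh = np.linalg.svd(G)
signs = np.ones(4)
if np.linalg.det(G) < 0:
    signs[-1] = -1.0                         # keep U inside SO(n) (Lemma 2.5)
U = (Gh.T * signs) @ Fm.T                    # maximiser of Tr(GU) in SO(n)

C = np.eye(4)                                # toy orbital coefficient matrix
C = C @ orthogonal_sqrt(U)                   # damped update C <- C U^{1/2}
```

Rotating by U^{1/2} rather than U halves each step in the sense of the rotation angles, which is one simple way to stabilise the fixed-point iteration.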

Conclusion
In this paper, we have presented the correct form and a complete proof of a lemma (Lemma 2.5), establishing that the function f(U) = Tr(AU) has exactly one local maximum in each of SO(n) and O(n) − SO(n), with different function values. Various applications of the lemma in quantum chemistry were given, including orbital orthogonalisation (the Carlson-Keller theorem and its weighted counterpart), orbital localisation (Boys, Pipek-Mezey, and Edmiston-Ruedenberg), and orbital optimisation (generalised valence bond). The surrogate optimisation technique associated with this lemma proved effective in those cases. Although the lemma itself only gives the precise optimum of a very specific type of surrogate function, and surrogate optimisation in SCF is routine, we believe that the full power of surrogate optimisation has not yet been utilised in quantum chemistry. There is scope for future research applying the lemma to different chemical problems, as well as proposing other surrogate functions and studying their properties.