Optimizing Earthquake Nowcasting With Machine Learning: The Role of Strain Hardening in the Earthquake Cycle

Abstract Nowcasting is a term originating from economics, finance, and meteorology. It refers to the process of determining the uncertain state of the economy, markets or the weather at the current time by indirect means. In this paper, we describe a simple two‐parameter data analysis that reveals hidden order in otherwise seemingly chaotic earthquake seismicity. One of these parameters relates to a mechanism of seismic quiescence arising from the physics of strain‐hardening of the crust prior to major events. We observe an earthquake cycle associated with major earthquakes in California, similar to what has long been postulated. An estimate of the earthquake hazard revealed by this state variable time series can be optimized by the use of machine learning in the form of the Receiver Operating Characteristic skill score. The ROC skill is used here as a loss function in a supervised learning mode. Our analysis is conducted in the region of 5° × 5° in latitude‐longitude centered on Los Angeles, a region which we used in previous papers to build similar time series using more involved methods (Rundle & Donnellan, 2020, https://doi.org/10.1029/2020EA001097; Rundle, Donnellan et al., 2021, https://doi.org/10.1029/2021EA001757; Rundle, Stein et al., 2021, https://doi.org/10.1088/1361-6633/abf893). Here we show that not only does the state variable time series have forecast skill, the associated spatial probability densities have skill as well. In addition, use of the standard ROC and Precision (PPV) metrics allow probabilities of current earthquake hazard to be defined in a simple, straightforward, and rigorous way.

can be extended to forecasting methods as well. These methods have begun to be applied in India (Pasari & Sharma, 2020), Japan (Nanjo, personal communication, 2020) and Greece (Chouliaras, personal communication, 2019).
In a series of recent papers, Rundle and Donnellan (2020) and , Rundle, Stein et al. (2021) used small earthquake seismicity, together with machine learning, to produce earthquake nowcasting curves that mimic the cycle of stress accumulation and release. These methods were based on identifying correlations present in the seismicity itself.
We note that other investigators have adopted a different approach to anticipating earthquakes using machine learning. Rouet-LeDuc et al. (2017) developed a prediction technique for acoustic emissions from events in laboratory experiments on regular, nearly periodic stick-slip friction. They also applied a similar method for episodic tremor and slip events in the Pacific Northwest (Rouet-LeDuc et al., 2019), which are also relatively regular in time.

Data
The data we used for the analysis was downloaded from the USGS earthquake catalog [1]. Shown in Figure 1 are events having magnitude M > 3.29 for a region of 5° latitude × 5° longitude centered on Los Angeles.
In Figure 1a, the small black dots are all the events during the time period 1 January 1970 to 31 December 2022, superposed on the geography of the state, and the earthquake fault system (dark green lines). The blue circles are the events having magnitudes 6.9 > M ≥ 6.0. The larger red circles are all events having magnitudes M ≥ 6.9. The yellow star denotes a location for InSAR and GNSS observations that we discuss below.
In Figure 1b, we show the monthly number of earthquakes of all magnitudes as a function of time (light cyan vertical lines). In addition, Figure 1b shows the major earthquakes that occurred from January 1970 to 31 December 2022 as vertical lines and inverted triangles. Black dotted lines a black inverted triangles indicate earthquakes having magnitudes 6.9 > M ≥ 6.0, corresponding to the blue circles in Figure 1. Red dashed lines and red inverted triangles indicate earthquakes having M ≥ 6.9, corresponding to the red circles in Figure 1.
Note that the earthquake numbers are shown in the form Log 10 (1 + monthly number), the one being present to eliminate the singularity at Log 10 (0). Also, note that 1 month is defined as 4/52 years, equal to 0.07692 years. One can therefore think of this measure of time as a "lunar month." Most importantly, Figure 1b shows a dashed blue line labeled as "EMA" for "Exponential Moving Average." This EMA curve is striking: it shows a cycle of higher activity following large mainshocks, followed by an increasing quiescence prior to the occurrence of the next mainshock. The higher activity just after mainshocks is of course due to the aftershocks. Quiescence prior to the mainshock has been pointed out by a variety of authors (e.g., Chouliaras, 2009;Kanamori, 1981;Rundle et al., 2011;Weimer & Wyss, 1994).
Furthermore, note that the EMA curve, discussed in more detail in the following section, uses a number of weights N = 36. Here N = 36 was selected based on a supervised machine learning method, using the area under the Receiver Operating Characteristic (ROC [2]) as a loss function.
The important point to note is that the EMA curve resembles an "upside down" version of the classic earthquake cycle of stress accumulation and release. This cycle has been discussed in many works on seismology, for example, Scholz, 2019. It is this filtered time series to which we now direct our focus.
We improved the method by adding an additional parameter to the model, R min , which is an assumed minimum rate of small earthquakes occurring monthly. These two parameters, N and R min , represent a filter applied to the seismicity time series. Henceforth we refer to time series of this type as "state variable" time series, denoted by Θ(t). As we show in the following, values of the time series indicate the current state of the earthquake cycle in the region.
The addition of the parameter R min to the model is based on the hypothesis that a strain-hardening process occurs as the next mainshock is approached. This idea implies that there is a transition in small earthquake mechanics from unstable stick-slip early in the earthquake cycle to stable sliding ("silent earthquakes") late in the cycle. These "silent" small earthquakes would not be observable by seismographs, but would contribute to the overall crustal straining that would be observable by GNSS and InSAR observations. We discuss these issues in a later section.

Results
We first show some of the most important results. We emphasize that the model in this paper is trained on all the data, so results do not represent back tested "predictions." This is in part due to the possibility of catalog completeness and uniformity issues prior to full automation of the seismic network, that occurred after about 1980 (Hutton et al., 2010), a point that we discuss later. Figure 2 shows the optimized ROC curve (a), the time series Θ(t) ≡ Log 10 (1 + Monthly Number) (b), and the "Precision" (PPV [2] c) of the time series. Note that Θ(t) is the optimized EMA time series from Figure 1b flipped "upside down" so that Θ(t) decreases at the time of the most recent major earthquake, then increases as the time of the next large earthquake is approached. This curve is very similar to other such time series found in our previous works (Rundle & Donnellan, 2020;Rundle, Stein et al., 2021). In those works, we also pointed out that these time series are reminiscent of the cycle of tectonic stress accumulation and release.
The nowcast PPV [2]) is shown in Figure 2c. As discussed in the following, PPV is a quantitative measure of the success of nowcasting a large mainshock having M ≥ 6.75 within a future time window T W = 3 years.
The nowcasting process involves "predicting" that a large mainshock will occur during the future time window T W . One then tabulates the fraction of such time windows in which a large mainshock did occur. Success in this case is defined as the fraction of windows T W in which a large earthquake did occur divided by the total number of time windows, as a function of the current value of the state variable Θ(t). The two parameters in the model, N and R min , are optimized by use of the skill score of the Receiver Operating Characteristic (ROC), as shown in Figure 2a and discussed below.
At each value of Θ(t), such as the red dot in Figure 2b, the value of PPV can be calculated. The value corresponding to the red dot in Figure 2b left is shown by the red dot in Figure 2c, with the value of 52.9% shown in the PPV plot at the right. Figures 2a and 2b are a frame from a movie that is available for downloading from the Supporting Information S1, and corresponds to the month beginning 27 March 1999, about 6 months prior to the M7.1 Hector Mine, California earthquake.  Figure 2 shows that it is possible to build an optimized two-parameter model directly from the seismicity data, then compute the probability of major earthquakes by means of a rigorous process involving ROC methods. Improvements in the model can be made as we discuss in the following, but at this time, we propose a simple, transparent model.
Corresponding to the temporal model, we can also construct spatial probability density functions (PDFs) or Relative Total Intensity I R (x i ,t), indicating geographically where large earthquakes are most likely to occur. The basic assumption for these spatial calculations is that large earthquakes tend to occur at locations with the largest number of small earthquakes (Holliday et al., 2005(Holliday et al., , 2006a(Holliday et al., , 2006b(Holliday et al., , 2007Lee et al., 2011;Rundle et al., 2002).
To compute these I R (x i ,t) functions, we tesselate the spatial region into a local grid of boxes. The results are shown in Figure 3b for spatial probability densities I R (x i ,t) for two different grid box sizes, 0.5° × 0.5°, and 1° × 1°, for the same month as in Figure 2. Note that the two I R (x i ,t) functions are both normalized over the entire region to 100%, after which the function 10 (1 + ( , )) is contoured.
A question arises as to which of the two I R (x i ,t) functions is "better." Visually, it might seem that the finer-scale grid boxes convey more information. However, as we will discuss in the following section, this question can be answered at least partially by computing the spatial ROC, shown in Figure 3a. The result is (apparently) that finer-scale (smaller grid box) spatial partitions tend to have worse skill.
As discussed previously, skill is computed as the area under the spatial ROC curve when built using all future earthquakes having magnitudes M ≥ 3.29 within the future T W = 3 years. The future earthquakes are shown as dots and circles in the images. With the ROC method, we show below that the I R (x i ,t) function in Figure 3a (bottom, grid size: 1° × 1°) has better skill than Figure 3b (top, grid size: 0.5° × 0.5°).

Exponential Moving Average (EMA)
Consider a time series T(t j ) representing the number of small earthquakes occurring in a month, that is indexed at regular monthly intervals j = 1 ,…, J. With justification discussed previously and below, we introduce a new parameter R min such that, if T(t j ) < R min , we set T(t j ) = R min .  (c) represents the chance that an earthquake M ≥ 6.75 will occur during the following 3 years. The chance of such an earthquake M ≥ 6.75 at that moment is 52.9% as shown, both by the number in the green box and the red triangle at bottom. Skill is the area under the ROC curve in (a), which we find as 0.708. The no skill line is the diagonal line from lower left to upper right. Since the no skill area is obviously 0.5, this implies that the nowcast time series has significant skill. The cyan lines in (a) are ROC curves for 200 random time series, whose mean is the diagonal no skill line. The dotted lines bracketing the no skill line from lower left to upper right represent the ±1σ value for the random time ROC series. More results are shown in Table 1.
As described in [1]) the EMA averages over the preceding N times with a diminishing weight for more remote times in the past. More specifically: Here, ≤ 1 represents the amount of weighting decrease. A higher value of discounts older observations faster. The most common choice for is in terms of the number of weights used is [1]:  Table 1.

Temporal Receiver Operating Characteristic
As discussed in Rundle and Donnellan (2020) and , Rundle, Stein et al. (2021), the ROC diagram, first developed with the advent of radar technology, uses a threshold varying from high to low (or low to high) values to classify the data into categories quantified in a "confusion matrix" or "contingency These quantities are defined using the state variable time series from Figure 2b, Θ(t) ≡ Log 10 (1 + Monthly Number). Here we adopt the future time window T W = 3 years and classify all points on Θ(t) for a given threshold value into the four possibilities, TP, FP, FN, TN using the confusion matrix. The threshold is then varied over many values to produce an ensemble of confusion matrices. This then yields TPR(T h ), which is a function of the threshold value T h , as an implicit function of FPR(T h ).
The resulting temporal ROC is shown in Figure 2a for the time interval 1970-2022. The red curve is the ROC curve for Θ(t). The diagonal line from lower left to upper right is the "no skill" line. We also show as cyan lines ROC curves for 200 random time series by sampling randomly from Θ(t) with replacement (bootstrap method). The dashed black lines represent the standard deviation of the random time series.
ROC skill is represented by the area under the ROC curve. No skill is represented by the area of 0.5 under the diagonal no skill line. In Figure 3, the area under the red ROC curve for the time interval 1970-2022 is found by numerical integration to be 0.708, indicating the presence of significant skill.
Note that the values for N = 36 and R min = 30 were found by using the ROC skill as a loss function, and then optimizing these parameters with a supervised learning method. , but it can be seen that even one-parameter examples with R min = 0 have skill significantly better than random, which is 0.5, as seen by the one-tailed P-statistics in the table. (Bottom) Spatial Receiver Operating Characteristic (ROC) values for a series of spatial probability density functions. It can be seen that there is a tradeoff between spatial resolution and skill, with lower spatial resolutions having better skill than higher spatial resolutions. All of the examples in the table have significant skill, as seen by the one-tailed P-statistics. It can also be seen that even one-parameter examples with R min = 0 have skill significantly better than random. Another result from Table 1 is that the addition of the R min parameter offers only a relatively modest improvement in skill.

Parameter R min
Recall that the nowcast parameters were obtained by training on all the data from 1970 to 2018. R min was computed by first computing: We then defined a parameter , optimized using supervised learning with ROC skill as the loss function, to compute R min = . Optimization yielded the value = 18.

Relative Total Intensity Contours
Whereas the Θ(t) time series shown in Figure 2b uses all the M ≥ 3.29 earthquakes in the defined geographic region, the spatially gridded relative intensity time series I R (x i ,t) consists of all time series in the ensemble of grid boxes. As shown by Rundle et al. (2002), Tiampo, Rundle, McGinnis, Gross, et al. (2002), , and Holliday et al. (2005Holliday et al. ( , 2006aHolliday et al. ( , 2006bHolliday et al. ( , 2007, large earthquakes tend to occur where the most small earthquakes occur. This fact provides the initial justification for the method. We then defined a quantity ( , t) | , where ( , t) | is the exponential moving average (N = 36) of ( , t) and ( , t) is required to have a lower bound R min (x i ) = ( ) . Here ( ) is the mean of ( , t) over the years 1970-2018. The values of ( , t) | were then plotted and contoured as in Figure 3. is the same parameter as used in the global R min . We then normalized ( , t) | over space to 100%, so that it represents a spatial probability density.
We found that the addition of the lower bound R min (x i ) only modestly improves the skill of the spatial contours, as determined from the spatial ROC diagrams in Figure 3a. As for the temporal skill, we also found that using only the EMA for the spatial contours still produced a nowcast with skill better than random (Table 1 bottom).

Spatial Receiver Operating Characteristic
Using the same procedure as for the temporal ROC (Figure 3), we apply a varying spatial threshold ("waterline") and compute spatial TPR and FPR, as was done in Holliday et al. (2005). The results are shown for the two grid box sizes, 0.5° × 0.5° and 1° × 1° (Figure 4). It can be seen that the grid box size 1° × 1° has higher skill than the 0.5° × 0.5° grid box size.
The ROC curves in Figure 3 were computed at 1-year intervals from 1970 to 2018, producing the 48 cyan-colored curves. Similar to the temporal ROC, a future time interval T W = 3 years was used, and the skill was computed using all future events with M ≥ 3.29 within T W . The red curve is the mean of the 48 cyan-colored curves. Using the skill obtained by integrating the area under the 48 cyan-colored curves, we find the mean skill and standard deviation over the 48 years.
ROC skill for the curves in Figure 3, together with some additional calculations, are shown in Table 1 (bottom). There we include calculations for 3 different grid sizes, including one-tailed P-statistics, from which it can be seen that the at higher spatial resolutions, the skill is worse. We find a tradeoff between skill and spatial resolution. The P-statistics show that all skill scores are significant relative to random (no) skill.
We also include calculations in which we assume = 0 for the two grid sizes. It can be seen that while skill is not as high as for nonzero , there is spatial skill nonetheless. Nonzero does improve skill, but only modestly.

Summary and Discussion
We again emphasize that we have trained the two-parameter model on all the data from 1970 to 2018. Thus, the results are meant to illustrate consistency of model with data, rather than being a validation study with disjoint training and test data.
As discussed previously, we have proposed a two-parameter model, using N weights for the exponential moving average, and the bias term R min . The parameter R min improves ROC skill significantly for the temporal skill, but only modestly for the spatial ROC skill. We now discuss the physical motivation for including R min .
We find that this minimum earthquake rate R min becomes important in the lead-up to the large earthquakes as (presumably) the regional tectonic stress increases prior to the next large earthquake. This process is similar to the laboratory observed process of strain-hardening (Kandula et al., 2019) or strain-stiffening (Jiang & Srinivasan, 2013). In the former, more of the deformation is taken up by inelastic (plastic) straining as failure is approached. In the latter the laboratory rock sample becomes increasingly rigid, leading to a transition from unstable stick-slip on microcracks to stable sliding. In either case, the stable slip or plastic straining is modeled by the excess rate R min .
Transition from unstable stick-slip to stable sliding has been observed in laboratory friction experiments (Beeler, 2004;Beeler et al., 2001;Chen & Lapusta, 2009;Dieterich, 1986;Heslot et al., 1994). It is seen as the stiffness of the testing machine increases, as well as in experiments in failure in triaxial stress experiments and for observations of seismicity. Support for this application to earthquakes can be seen in both InSAR and GNSS observations of deformation prior to major earthquakes in California. In Figure 4 Figure 1, Cerro Coso Community College-CCCC). Using LiCSBAS, a spatial-temporal filter was applied that is high-pass in time and low-pass in space with a Gaussian kernel that is identical to filters used in other time series InSAR software packages such as StaMPS. The spatial filter width used is 2 km and the temporal filter width used is 0.16 years. After applying the filter, the velocity computed is 4.9 mm/yr. The fact that the deformation over the years prior to the Ridgecrest event is linear suggests that small earthquakes did not stop or decrease with time. When compared to the decrease in small seismic earthquakes prior to the large earthquakes seen in Figure 2, it suggests that there was a transition from unstable stick-slip events to stable sliding. This is because unstable events can only be seen with seismometers, whereas both unstable and stable slip can be observed with geodetic data. et al., 2020; [3]). Note that the frame ID is 064A_05410_131313, and other details can be found at the COMET Sentinel InSAR portal [4]. The GNSS data were processed at JPL by Heflin et al. (2020) [5]. Additional GNSS data for the 4 April 2010 M7.2 El Mayor Cucupah earthquake also shows the same lack of anomalous precursory deformation. Examples of these data are shown on the ESSOAR archive [6] under the Supporting Information.
Aside from seasonal fluctuations in the InSAR displacement, the trends are linear, showing no sign of the type of quiescence implied by the seismically observable data. This fact suggests an approach to potentially locate the epicenter of an impending earthquake, by comparing the trend in surface deformation, observed by InSAR and/ or GNSS, to the trend in Benioff strain due to the unstable slip-slip, small earthquakes. These observations might be focused on the areas of higher probability density seen in Figure 4.
Note that to calculate the velocity from the InSAR images, we adopt the following procedure. The velocity is computed using the method of small baseline inversion. Specifically, a stack of unwrapped interferograms is used to calculate the incremental displacements which are the displacements between two consecutive times. Then, in order to get the cumulative displacements and form the displacement time series, the incremental displacements are simply summed up for each acquisition. The velocity is then derived from the time series using the least squares method. Looking at the time series and ignoring the seasonal effects which manifest themselves as an increase in displacement during the Winter times and a decrease in the Summer time, we are able to observe a steady linear trend in displacement leading up to the earthquake.
As a final remark, we have found that the mean rate of small earthquakes has been highly variable in time, generally increasing. This time dependence could be explained either by the physical processes generating the earthquakes changing over time, or alternately by incompleteness of the catalog prior to the full implementation of the digital observation network in the mid-1990s. Hutton et al. (2010) have a discussion of the completeness level that is informative in this regard. Further refinements of the method presented here might include taking account of this time dependence, but this is a subject that we leave to future research.

Data Availability Statement
Python code that can be used to reproduce the results of this paper can be found at a GitHub repository (DOI: https://doi.org/10.5281/zenodo.7186635) [6]. Data for this paper was downloaded from the USGS earthquake catalog for California, and are freely available there. The Python code mentioned above can be used to download and model these data for analysis [6].
Note: from the US Department of Energy to UC Davis. GCF was supported by NSF CINES Grant CINES 1835598. Portions of this research were also carried out at the Jet Propulsion Laboratory, California Institute of Technology under contract with NASA. "LiCSAR contains modified Copernicus Sentinel data [Year of data used] analyzed by the Centre for the Observation and Modeling of Earthquakes, Volcanoes, and Tectonics (COMET). LiCSAR uses JASMIN, the UK's collaborative data analysis environment (http://jasmin.ac.uk).