Data are an essential component of the analysis of complex earth system problems. As high-quality data become more accessible to researchers, data-intensive computational and data-driven approaches can reveal more details of the system. Measurement and data collection devices have advanced dramatically, especially those used in earth system science. High sampling resolution across spatial, temporal and spectral scales has moved the analysis of earth system problems into a data-driven era. Meanwhile, the rapid growth of computational capability and resources has allowed innovative data-driven methods to emerge (e.g., information theory, traditional statistical learning models, deep learning models). The data-driven approach differs from physically based (or knowledge-based) modeling. It emphasizes learning and generalizing rules from large amounts of representative data. It tries to fit a probability distribution function to the question at hand, supported by large numbers of observations and with few of the constraining conditions imposed by physically based models. However, before relying on purely data-driven methods, it is essential to remember that earth systems are nonlinear, complex and dynamic, with couplings and feedback among components and subsystems.
Additionally, these coupled processes change depending on the state of the system and the spatial and temporal scales at which the system is analyzed. To understand the underlying mechanisms that drive complex systems, it is useful to conceptualize the system as a network of variables undergoing interactions and feedback. Traditional statistical analysis methods are ill-suited to capturing the key attributes of such feedback processes because of the stochasticity of the variables, the nonlinearity of the couplings and the non-stationarity of the system. Limitations in the data (in terms of resolution and extent in both space and time) and in computational ability further reduce the effectiveness of those methods.
The various scientific communities now face a new challenge. 1) More and more data are being collected, 2) the ability to depict the status of a system and to describe in detail the relationships among its components has improved significantly, and 3) the computational capacity and resources needed to handle this large volume of data are now available. Together, these developments motivate the use of data-driven methods.
In this dissertation, I examine the potential for integrating data-driven techniques into earth system science to improve our understanding of earth-surface processes. Specifically, I focus on applying data-driven techniques to resolve causal interactions in several complex earth systems over multiple spatial and temporal scales. Four complex earth system problems with different spatial and temporal scales are discussed. First, we apply the data-driven methods to a regional, decadal-scale problem, streamflow prediction, as a case study. Our findings suggest that while information flow identifies dominant streamflow controls, the results should not be limited to only “critical hydrologic timescales”; instead, they should guide a range of timescales over which inputs, stores, and losses are filtered into catchment discharge. Second, we analyzed a regional, yearly-scale problem: the feedback process between vegetation and topography in a lake delta ecosystem. The transfer entropy analysis suggests that different vegetation communities play functionally different roles in landscape evolution, and that these roles should be differentiated in ecogeomorphic models. Within such models, it is most important to resolve detailed flow characteristics at lower to low-middle island elevations.
Furthermore, within elevation zones, it is likely essential to differentiate between the roles of multiple vegetation communities rather than treating the entire elevation zone as a single ecogeomorphic entity. Third, we analyzed a global, millennial-scale problem: the interaction among climate variables over 420,000 years. We show that, during the past 420,000 years, orbital forcings triggered temperature and CO2 responses at short (5-kyr) time lags. Over longer timescales, internal feedback, mediated by interactions with dust, also played a significant role in governing temperature and CO2 concentrations. The short-term influence of CO2 on temperature was stronger than dust’s long-term impact, consistent with radiative forcing. However, dust remained an essential driver of temperature over 50-kyr time lags, the amount of time between sequential glacial maxima and minima during the latter portion of the Pleistocene. Last, we analyzed a global, decadal-scale problem: the interaction between the ocean and precipitation on land. We quantitatively demonstrate that Sea Surface Temperature (SST) over the Gulf of Guinea controls moisture advection and transport to the West Sahel region, and that strong bidirectional interaction exists between local vegetation dynamics and rainfall patterns. The spatial distribution map of the time lag with the most significant transfer entropy also shows a clear trend for each climate index tested in this research. The Niño 3+4 and Niño 4 indices have a relatively short time lag with significant transfer entropy to the west coast and transfer insignificant information to the central US. The Niño 1+2 and Niño 3 indices have a relatively short time lag with significant information transferred to the central region but insignificant information transferred to the west coast.
By testing the effectiveness and efficiency of data-driven methods on complex earth system problems over multiple spatial and temporal scales, we verified the ability of those methods to identify and quantify the strength, statistical significance, directionality and critical time lags of feedback (as well as one-way forcing) among variables. With these data-driven methods, we can identify which components comprise the system and which dominate changes within it. With that knowledge, and once the rules and connections of a system are understood, we can further predict the behavior of an element of interest or the stationarity of the whole system, and simulate the future behavior of the system under different scenarios.
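The quantities named above (strength, statistical significance, directionality and critical time lag) can be illustrated with a minimal transfer entropy sketch. The estimator used in the dissertation is not specified here, so everything below is an assumption for illustration: a simple histogram (quantile-binned) estimator, a one-step history of the target, shuffled-source surrogates for the significance threshold, and the hypothetical names `transfer_entropy` and `lagged_transfer_entropy`.

```python
import numpy as np


def _discretize(x, bins):
    """Map a series to integer bin labels 0..bins-1 using quantile edges."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(x, edges)


def transfer_entropy(source, target, lag=1, bins=4):
    """Histogram estimate (in bits) of transfer entropy from source to target.

    TE = sum p(y_t, y_{t-1}, x_{t-lag})
             * log2[ p(y_t | y_{t-1}, x_{t-lag}) / p(y_t | y_{t-1}) ]
    """
    if lag < 1:
        raise ValueError("lag must be >= 1")
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)

    # Align the three series so that index i refers to the same time t.
    y_now = _discretize(target[lag:], bins)
    y_past = _discretize(target[lag - 1:-1], bins)
    x_past = _discretize(source[:len(source) - lag], bins)

    # Joint probability p(y_now, y_past, x_past) from bin counts.
    counts = np.zeros((bins, bins, bins))
    np.add.at(counts, (y_now, y_past, x_past), 1)
    p = counts / counts.sum()

    p_yp = p.sum(axis=(0, 2))    # p(y_past)
    p_yn_yp = p.sum(axis=2)      # p(y_now, y_past)
    p_yp_xp = p.sum(axis=0)      # p(y_past, x_past)

    num = p * p_yp[None, :, None]
    den = p_yn_yp[:, :, None] * p_yp_xp[None, :, :]
    nz = p > 0                   # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p[nz] * np.log2(num[nz] / den[nz])))


def lagged_transfer_entropy(source, target, max_lag=10, bins=4,
                            n_shuffles=100, alpha=0.05, seed=0):
    """Return (lag, TE, threshold, is_significant) for lags 1..max_lag.

    The threshold is the (1 - alpha) quantile of TE values computed after
    shuffling the source, which destroys any time-ordered coupling.
    """
    rng = np.random.default_rng(seed)
    source = np.asarray(source, dtype=float)
    results = []
    for lag in range(1, max_lag + 1):
        te = transfer_entropy(source, target, lag, bins)
        null = [transfer_entropy(rng.permutation(source), target, lag, bins)
                for _ in range(n_shuffles)]
        threshold = float(np.quantile(null, 1 - alpha))
        results.append((lag, te, threshold, te > threshold))
    return results
```

Under these assumptions, calling `lagged_transfer_entropy(x, y)` and then `lagged_transfer_entropy(y, x)` gives a rough picture of directionality: the lag with the largest significant value in each direction plays the role of the critical time lag. The bin count and the number of shuffles are arbitrary choices here and would need tuning for real records.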