Data science has become a fundamental tool in hydrology. Its significance lies in its ability to confront the multifaceted challenges posed by global warming by deepening understanding of hydrological processes and improving the accuracy of runoff predictions. This dissertation applies data science techniques to advance insight into hydrological processes, improve the physical consistency of runoff predictions, and address the difficult task of predicting hydrological behavior in ungauged basins. It comprises three main bodies of work. The first contribution (Chapter 2) centers on the synthesis of extensive hydrological datasets and the subsequent analysis of hydrological trends under recent warming. Chapter 3 develops a physics-informed machine learning model for streamflow prediction and tests it across different scenarios. Finally, Chapter 4 evaluates the effectiveness of various watershed clustering approaches for prediction in ungauged basins (PUB).
Chapter 2 addresses a long-standing limitation in comparative hydrology: the scarcity of geographically extensive, inter-compatible monitoring data on comprehensive water balance stores and fluxes. This scarcity has, for example, restricted comprehensive assessment of the multiple dimensions of wetting and drying related to climate change and hampered understanding of why widespread changes in precipitation extremes are uncorrelated with changes in streamflow extremes. This chapter addresses both needs: developing a new data synthesis product and using it to detect trends in the frequencies and magnitudes of a comprehensive set of hydroclimatic and hydrologic extremes. The Comprehensive Hydrologic Observatory Sensor Network (CHOSEN), a database encompassing hydroclimatic and hydrologic variables from 30 diverse study areas across the United States, is introduced, and a reproducible data pipeline that ensures data quality and accessibility is developed. Analysis of the CHOSEN dataset uncovers hotspots of hydroclimatic extremes in regions such as the Pacific Northwest, New England, Florida, and Alaska. The analysis also reveals regional coherence in extreme streamflow wetting and drying trends, shedding light on the complex interplay between climate-induced changes and hydrologic processes.
Chapter 3 builds on the CHOSEN dataset to develop further analyses and a new runoff prediction model. It confronts the lack of interpretability and physical consistency in machine learning models used for streamflow prediction. To address this issue, a physics-informed long short-term memory (PILSTM) model that incorporates water balance constraints for runoff prediction is proposed. The model combines a physical rainfall-runoff model with a long short-term memory (LSTM) model and is applied to eight intensively monitored watersheds in the United States, selected based on data length and hydroclimatic diversity. The LSTM, physical, and PILSTM models are tested under non-stationary scenarios and data-scarce situations. Results show that the PILSTM performs similarly to or better than its LSTM counterpart across multiple metrics and scenarios. Additionally, a feature importance analysis suggests that adding physical constraints can guide machine learning models toward predictions that are more consistent with known physical processes.
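As an illustration only, one common way to impose a water balance restriction on a learned runoff model is to add a soft penalty to the training loss. The sketch below is not necessarily the formulation used in Chapter 3: the function names, the penalty weight `lam`, and the simple closure P - ET - Q - dS = 0 are all assumptions made for the example.

```python
def water_balance_residual(precip, evap, runoff, storage_change):
    """Per-step closure error of P - ET - Q - dS (all in the same units, e.g. mm/day)."""
    return [p - e - q - ds
            for p, e, q, ds in zip(precip, evap, runoff, storage_change)]


def mean_squared_error(pred, obs):
    """Mean squared difference between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(pred, obs)) / len(obs)


def physics_informed_loss(q_pred, q_obs, precip, evap, storage_change, lam=0.1):
    """Data misfit plus a weighted penalty on water balance violations.

    `lam` (hypothetical) trades off fit to observed runoff against
    closure of the assumed water balance.
    """
    data_term = mean_squared_error(q_pred, q_obs)
    residual = water_balance_residual(precip, evap, q_pred, storage_change)
    physics_term = sum(r ** 2 for r in residual) / len(residual)
    return data_term + lam * physics_term
```

With `lam=0` the loss reduces to an ordinary mean squared error, so the penalty weight controls how strongly predictions are pulled toward water balance closure.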
Chapter 4 explores the effectiveness of watershed clustering, a conventional practice in watershed regionalization, in combination with neural networks for predicting in ungauged basins. Traditionally, watershed clustering groups basins with similar characteristics to facilitate knowledge transfer from monitored to ungauged basins within the same cluster. Recent advances in data science, however, suggest that clustering may not be necessary. This chapter investigates that suggestion through a comparative analysis of various watershed clustering methodologies, using models that directly integrate static watershed attributes into streamflow prediction (entity-aware LSTM). The analysis covers 415 sites from the CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) dataset. Results indicate that pre-clustering generally does not enhance the performance of entity-aware LSTM models for predicting in ungauged basins. Models that incorporate clustering results match or underperform global models that directly use the clustering features as static inputs. Notably, among the different features used for clustering, hydrological signatures prove most effective in extracting information for use in the LSTM model.
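To make the global-model setup concrete, the following sketch shows two ingredients of a PUB experiment under simplifying assumptions: attaching static attributes to a basin's time series, and holding out whole basins for testing. Both helper names are hypothetical, and plain concatenation is a simplification; an entity-aware LSTM handles static attributes inside the network rather than by appending them to each input step.

```python
def attach_static_attributes(forcing, attributes):
    """Append a basin's static attributes to every time step of its forcing series.

    `forcing` is a list of per-step feature lists; `attributes` is one list
    of static values (e.g. area, mean elevation) for the basin.
    """
    return [list(step) + list(attributes) for step in forcing]


def split_for_pub(basin_ids, n_folds, fold):
    """Hold out every n_folds-th basin so test basins are never seen in training."""
    train = [b for i, b in enumerate(basin_ids) if i % n_folds != fold]
    test = [b for i, b in enumerate(basin_ids) if i % n_folds == fold]
    return train, test
```

In a clustered variant, the same split would be applied within each cluster and one model trained per cluster; in the global variant, a single model is trained on all training basins.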
Chapter 2 addresses crucial gaps in data availability, while the subsequent chapters explore novel approaches for forecasting streamflow across diverse scenarios and in ungauged basins. Chapter 3 pursues the integration of physical and machine learning models, and Chapter 4 focuses on data science methodologies for predicting in ungauged basins. Collectively, these chapters explore the intersection between data science and hydrology. The dissertation emphasizes the transformative potential of interdisciplinary strategies that bridge data-driven insights with the dynamics of hydrological systems.