Advances in machine learning and growing computational power are enabling large-scale data analysis experiments. To support these experiments, raw data must be cleaned and transformed into a form suitable for analysis. Automatic data pipelines facilitate large-scale experiments by automating this cleaning and transformation. In this thesis, automatic data pipelines are developed for two separate projects: “Machine Learning Assisted Sampling of SERS Substrates Improves Data Collection Efficiency” and “The Multiscale Atomic Zeolite Simulation Environment (MAZE): A Python Package for Improved Zeolite Structural Manipulations”. The two projects are related in that both automate key sections of the experimental data analysis, laying the groundwork for future autonomous experiments.
Machine Learning Assisted Sampling of SERS Substrates Improves Data Collection Efficiency: Surface-enhanced Raman scattering (SERS) is a powerful technique for sensitive, label-free analysis of chemical and biological samples. Although much recent work has established sophisticated automation routines via machine learning (ML) and related artificial intelligence (AI) methods, these efforts have largely focused on downstream processing (e.g., classification tasks) of previously collected data. Fully automated analysis pipelines are desirable, but current progress is limited by cumbersome and manually intensive sample preparation and data collection steps. Specifically, a typical lab-scale SERS experiment requires the user to evaluate the quality and reliability of the measurement (i.e., the spectra) as the data is being collected. This need for expert user intuition is a major bottleneck that limits the applicability of SERS-based diagnostics in point-of-care clinical settings, where trained spectroscopists are likely unavailable.
While application-agnostic numerical approaches (e.g., signal-to-noise thresholding) are useful, there is an urgent need to develop algorithms that leverage expert user intuition and domain knowledge to simplify and accelerate data collection steps. To address this challenge, we introduce an ML-assisted method at the acquisition stage. We tested six common algorithms to determine which performs best at judging spectral quality. For adoption into future automation platforms, we developed an open-source Python package tailored for rapid expert user annotation to train ML algorithms. We expect that this new approach of using ML to assist data acquisition can serve as a useful building block for point-of-care SERS diagnostic platforms.
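For reference, the application-agnostic baseline mentioned above, signal-to-noise thresholding, can be sketched as follows. This is a minimal illustration and not the ML-assisted method developed in this work. The function names, the definition of signal-to-noise ratio (peak height above baseline divided by the standard deviation of a peak-free region), the choice of noise region, and the threshold of 3.0 are all assumptions made for the example.

```python
from statistics import mean, stdev


def signal_to_noise(spectrum, noise_slice):
    """Estimate SNR as (peak height above baseline) / (noise standard deviation).

    `noise_slice` selects a region of the spectrum that is ASSUMED to be
    free of Raman peaks; it is used to estimate both baseline and noise.
    """
    noise_region = spectrum[noise_slice]
    baseline = mean(noise_region)
    noise = stdev(noise_region)
    peak_height = max(spectrum) - baseline
    if noise == 0:
        # Perfectly flat noise region: any peak is infinitely above the noise.
        return float("inf") if peak_height > 0 else 0.0
    return peak_height / noise


def passes_threshold(spectrum, noise_slice, threshold=3.0):
    """Accept a spectrum only if its estimated SNR reaches the threshold."""
    return signal_to_noise(spectrum, noise_slice) >= threshold


# Hypothetical intensity values: the first five points are treated as noise.
with_peak = [10.0, 10.5, 9.5, 10.2, 9.8, 60.0, 10.1, 9.9]
noise_only = [10.0, 10.5, 9.5, 10.2, 9.8, 10.3, 10.1, 9.9]
noise_region = slice(0, 5)

print(passes_threshold(with_peak, noise_region))   # strong peak is accepted
print(passes_threshold(noise_only, noise_region))  # noise-only trace is rejected
```

A rule of this kind uses no domain knowledge: it cannot distinguish a meaningful analyte peak from a high-intensity artifact, which is the gap that expert-annotated training data is meant to close.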
The Multiscale Atomic Zeolite Simulation Environment (MAZE): A Python Package for Improved Zeolite Structural Manipulations: Zeolites are nanoporous materials with widespread industrial applications as catalysts and gas separators. Because the zeolite chemical space is enormous and zeolite structures are complex, computational experiments are needed to identify high-performing zeolites and interpret zeolite characterization data. These computational experiments are enabled by software packages, most notably the Atomic Simulation Environment (ASE), which provides an easy-to-use Python interface to drive low-level simulation code. ASE has some limitations that make certain zeolite simulation workflows challenging and labor-intensive. These limitations motivated the creation of the Multiscale Atomic Zeolite Simulation Environment (MAZE) package, which builds on top of ASE to facilitate common zeolite structural manipulations that are challenging with the base ASE package. The improved interface of MAZE, compared to ASE, is demonstrated by applying application programming interface (API) design heuristics and by showcasing the ability of MAZE to facilitate common zeolite workflows. It is demonstrated that the MAZE package has an improved API for zeolite workflows and that this can facilitate the creation of large databases of zeolite structural derivatives.