The development of a materials synthesis route is usually based on heuristics and experience. This thesis proposes to apply data-driven approaches to learn the patterns of synthesis from past experience and use them to predict the syntheses of novel materials. This route, however, is impeded by the lack of a large-scale database of synthesis formulations. Scientific publications represent the largest repository of knowledge about materials synthesis and can serve as a reliable source of data, but human-written descriptions of syntheses require additional levels of interpretation before they can be converted into a codified, machine-operable format. Therefore, this thesis aims to achieve two objectives: 1) constructing a text-mining pipeline that extracts synthesis datasets from scientific publications, and 2) validating a novel synthesis hypothesis, minimum thermodynamic competition, using the text-mined dataset and systematic synthesis experiments.
To fulfill the first objective, we need to build a text-mining pipeline to extract essential information from scientific publications. Extracting synthesis information, and synthesis actions in particular, is challenging because there is no comprehensive labeled dataset built on a robust, well-established ontology for describing synthesis procedures. In order to extract synthesis actions (Chapter 2), we propose the first unified language of synthesis actions (ULSA) for describing inorganic synthesis procedures. We created a dataset of 3,040 synthesis procedures annotated by domain experts according to the proposed ULSA scheme, then built a neural network-based model that maps arbitrary inorganic synthesis paragraphs into ULSA, and used it to construct flowcharts of synthesis procedures.
We constructed the first large dataset of solution-based inorganic materials synthesis procedures by designing an advanced text-mining pipeline (Chapter 3) that combines the ULSA synthesis action extraction framework with other natural language processing (NLP) and deep learning techniques. The dataset consists of 35,675 solution-based synthesis procedures. Each procedure contains essential synthesis information, including the precursors and target materials, their quantities, and the synthesis actions and corresponding attributes. Every procedure is also augmented with the reaction formula.
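As a rough illustration of what one extracted procedure might look like in machine-operable form, the sketch below models the fields listed above (precursors, targets, quantities, actions with attributes, and a reaction formula). All class names, field names, and the placeholder values are assumptions made for illustration; they are not the schema of the actual dataset, and the action labels are not claimed to be the ULSA vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Material:
    """A precursor or target material (hypothetical structure)."""
    formula: str
    quantity: Optional[str] = None  # kept as free text, e.g. an amount with its unit


@dataclass
class SynthesisAction:
    """One synthesis step with its attributes (hypothetical structure)."""
    action_type: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class SynthesisProcedure:
    """One synthesis procedure record (hypothetical structure)."""
    precursors: List[Material]
    targets: List[Material]
    actions: List[SynthesisAction]
    reaction: str

    def flowchart(self) -> str:
        """Render the ordered actions as a linear flowchart string."""
        return " -> ".join(a.action_type for a in self.actions)


# Placeholder record: "A", "B", and "AB" stand in for real formulas.
example = SynthesisProcedure(
    precursors=[Material("A", "placeholder amount"), Material("B")],
    targets=[Material("AB")],
    actions=[
        SynthesisAction("Mixing", {"medium": "placeholder solvent"}),
        SynthesisAction("Heating", {"temperature": "placeholder value"}),
    ],
    reaction="A + B -> AB",
)
print(example.flowchart())
```

Keeping quantities and attributes as free text in this sketch avoids committing to a unit system that the source does not specify.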
Digitizing and systematizing the large synthesis corpus of existing materials science publications provides a foundation not only for building machine learning models, but also for empirically validating fundamental physical theory. Thermodynamics has strong predictive power for materials synthesis because it identifies the stability regions of target phases, which can guide synthesis planning for computationally designed materials. However, a stability domain does not give explicit information about the relative competitiveness of undesired byproduct phases, nor does it identify a precise synthesis condition with optimized kinetics for producing the target phase.
To fulfill the second objective, in Chapter 4, we define thermodynamic competition as the difference in driving force between one phase and its competing phases, and we hypothesize that one approach to optimizing the kinetics of phase-pure synthesis is to minimize the thermodynamic competition between the desired target phase and its competing phases. We systematically validate this hypothesis with two approaches: (1) we analyze large-scale solution synthesis procedures text-mined from the literature and show that experimentally optimized synthesis conditions lie near our predicted thermodynamic optimum, and (2) we directly evaluate the synthesis of LiIn(IO3)4 and LiFePO4 by experiment, showing that phase-pure synthesis occurs only when thermodynamic competition is minimized. Our work demonstrates that thermodynamic competition is an effective descriptor for synthesis optimization and a promising tool for optimizing aqueous solution-based experimental synthesis conditions.
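The definition above can be made concrete with a small numerical sketch. It assumes a sign convention that the abstract does not state: each phase has a free-energy-like value at a given condition, a more negative value means a larger driving force, and the competition is the target's value minus the most negative value among its competitors. Under that assumption, minimizing the competition maximizes the target's driving-force advantage. The function names, phase labels, and numbers are all hypothetical.

```python
from typing import Dict, Hashable, Mapping, Tuple


def thermodynamic_competition(phi: Mapping[str, float], target: str) -> float:
    """Target value minus the most negative competitor value.

    Assumed convention: more negative phi = larger driving force, so a more
    negative result means the target is favored over every competing phase.
    """
    competitors = [value for phase, value in phi.items() if phase != target]
    if not competitors:
        raise ValueError("at least one competing phase is required")
    return phi[target] - min(competitors)


def best_condition(
    grid: Mapping[Hashable, Mapping[str, float]], target: str
) -> Tuple[Hashable, float]:
    """Return the condition in the grid with minimum thermodynamic competition."""
    scored: Dict[Hashable, float] = {
        condition: thermodynamic_competition(phi, target)
        for condition, phi in grid.items()
    }
    condition = min(scored, key=scored.get)
    return condition, scored[condition]


# Made-up values indexed by a single condition variable (for example, pH).
grid = {
    3.0: {"target": -0.10, "phase_A": -0.25, "phase_B": -0.05},
    5.0: {"target": -0.40, "phase_A": -0.20, "phase_B": -0.10},
    7.0: {"target": -0.30, "phase_A": -0.28, "phase_B": -0.35},
}
condition, competition = best_condition(grid, "target")
print(condition, round(competition, 3))
```

In this toy grid the target is merely stable relative to phase_A at two of the three conditions, but only one condition gives it a clear driving-force advantage over all competitors, which is the distinction the stability domain alone does not capture.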
Finally, Chapter 5 summarizes the main findings of the dissertation and provides an outlook on future directions for data-driven approaches to synthesis design.