Biobanks linked to Electronic Health Records (EHRs) herald a new era of opportunities for etiological research on Severe Mental Illness (SMI). However, because EHRs are not primarily designed for research, translating these opportunities into actionable insights demands innovative frameworks and accurate phenotyping tools. This dissertation harnesses the potential of EHRs from psychiatric hospitals for in-depth studies of SMI. I set the stage by contextualizing the relevance of EHRs in psychiatric genetic research. I then describe the organizational structure and data types of the EHR of the Clinica San Juan de Dios in Manizales, a regional psychiatric hospital in Colombia. The subsequent chapters explore transdiagnostic phenotypes by combining clinical notes and diagnostic codes, leading to the delineation of disease trajectories in SMI. Next, I examine the extraction and validation of psychiatric diagnoses through both rule-based and machine learning strategies. Finally, I present the design and validation of a Clinical Natural Language Processing (cNLP) tool for extracting highly detailed psychiatric phenotypes from unstructured text.
Three strengths of EHRs are emphasized throughout this work: the integration of multi-dimensional data, enabling a comprehensive perspective of patient phenotypes; the innovative application of cNLP for symptom extraction from clinical narratives in Spanish; and the capacity of EHRs to provide longitudinal insights into patients' course of illness. Taken together, these chapters not only highlight the potential of EHRs but also navigate the intricacies of employing them for psychiatric genetic research.