Statistical topic models have been widely studied in Text Mining as an effective approach for extracting latent topics from unstructured text documents. We present a robust and computationally efficient hierarchical Bayesian model for topic correlation modeling based on the Generalized Dirichlet (GD) distribution. GD-LDA effectively avoids over-fitting as the number of topics increases. We report results using Empirical Likelihood (EL) on four public datasets. We show the application of topic models in two different domains: (1) Information Retrieval and (2) Dynamic Prediction Models applied to health care.
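To make the role of the Generalized Dirichlet distribution concrete, the following minimal sketch draws one topic-proportion vector from a GD prior using the standard stick-breaking construction (independent Beta draws). It only illustrates the distribution itself, not the GD-LDA inference procedure; the function name and the example parameter values are hypothetical and chosen purely for illustration.

```python
import random


def sample_generalized_dirichlet(alphas, betas, rng=random):
    """Draw one proportion vector from a Generalized Dirichlet distribution.

    Stick-breaking construction: K-1 independent Beta(alpha_i, beta_i) draws
    are converted into K non-negative proportions that sum to 1.
    The parameter names are illustrative, not taken from the source.
    """
    if len(alphas) != len(betas):
        raise ValueError("alphas and betas must have the same length")
    proportions = []
    remaining = 1.0  # part of the unit stick not yet assigned to a topic
    for a, b in zip(alphas, betas):
        v = rng.betavariate(a, b)      # fraction of the remaining stick
        proportions.append(v * remaining)
        remaining *= 1.0 - v
    proportions.append(remaining)      # the last topic takes what is left
    return proportions


# Example: proportions over 4 topics (3 Beta pairs), with a fixed seed.
rng = random.Random(0)
theta = sample_generalized_dirichlet([2.0, 1.0, 3.0], [5.0, 2.0, 1.0], rng)
print(theta, sum(theta))
```

With K topics the GD prior has 2(K-1) free parameters instead of the K parameters of a standard Dirichlet, which is what allows a richer covariance structure between topic proportions.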
In Information Retrieval, we propose leveraging statistical topic modeling techniques in relevance feedback to obtain a better estimate of context by incorporating corpus-level information about the document. We report results on the OHSUMED dataset for three different variants and obtain improvements of up to 12.5% in Mean Average Precision (MAP).
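For readers unfamiliar with the metric, Mean Average Precision can be computed as in the short sketch below. This is the textbook definition under the assumption of binary relevance judgments and rankings without duplicate documents; the function names and the toy document identifiers are hypothetical and unrelated to the OHSUMED experiments.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant docs."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example with two queries (made-up identifiers).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(mean_average_precision(runs))
```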
Patients often search the web for information about treatments and diseases after they are discharged from the hospital. However, searching for medical information on the web poses challenges because of related terms and synonyms for the same disease and treatment. We present a method to retrieve healthcare-related documents using the patient discharge document. We show that the proposed framework outperforms the winner of the CLEF eHealth 2013 retrieval challenge by 68% in MAP and by 13% in NDCG.
We present a method to dynamically estimate the probability of mortality in the Intensive Care Unit (ICU) by combining heterogeneous data. The method is based on Generalized Linear Dynamic Models and treats the probability of mortality as a latent state that evolves over time. This framework allows us to combine different types of features (lab results, vital signs readings, doctor and nurse notes, etc.) into a single state, which we update each time new patient data is observed. We test the proposed approach using 15,000 Electronic Medical Records (EMRs) obtained from the MIMIC II public data set.
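The idea of a latent risk state that is updated whenever new data arrives can be illustrated with the simplified sketch below. It is not the model used in this work: as a stand-in, it assumes a scalar random-walk state on the logit scale with a Gaussian (Kalman-style) update, and it assumes each data source has already been mapped to a noisy logit-scale risk score. The class name, parameters, and example scores are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class LatentRiskFilter:
    """Scalar latent state (log-odds of mortality) with a random-walk prior.

    Assumption: a linear-Gaussian update stands in for the generalized
    linear dynamic model described in the text.
    """

    def __init__(self, prior_mean=0.0, prior_var=1.0, process_var=0.05):
        self.mean = prior_mean          # current estimate of the log-odds
        self.var = prior_var            # uncertainty of that estimate
        self.process_var = process_var  # how fast the state may drift

    def update(self, observation, obs_var):
        """Fold in one noisy logit-scale score; return the risk probability."""
        self.var += self.process_var                 # predict step
        gain = self.var / (self.var + obs_var)       # Kalman gain
        self.mean += gain * (observation - self.mean)
        self.var *= 1.0 - gain
        return sigmoid(self.mean)


# Example: scores from three hypothetical sources with different noise.
risk = LatentRiskFilter()
for score, noise in [(0.8, 1.0), (1.5, 0.5), (-0.2, 2.0)]:
    print(round(risk.update(score, noise), 3))
```

Giving each source its own observation variance is one simple way to let reliable measurements move the state more than noisy ones.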
We extend this dynamic mortality estimation model in two ways. First, we estimate the probability that a patient is readmitted after being discharged from the ICU and transferred to a lower-level care unit. Second, we present a method to dynamically predict the failure of physiological subsystems in patients admitted to the ICU using heterogeneous data. We model the probability of failure in each subsystem as a latent state and then estimate the probability of patient mortality as a combination of the estimated failure propensities of all subsystems. We also propose a method for imputing missing values that takes into account the non-ignorable nature of the patient data. Experimental results show that our method outperforms other approaches in the literature in terms of AUC, sensitivity, and specificity. In addition, we show that combining different feature types (numerical and text) increases the prediction performance of the proposed approach.
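The three evaluation measures named above can be computed as in the following minimal sketch, which assumes binary outcome labels (1 = event, 0 = no event) and predicted probabilities. The function names, the 0.5 threshold, and the toy data are illustrative assumptions and do not reproduce the reported experiments.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Return (sensitivity, specificity) at a given decision threshold."""
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def auc(y_true, y_score):
    """AUC as the probability that a positive outranks a negative.

    Pairwise (Mann-Whitney) formulation; ties count as one half.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (made-up labels and scores).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2]
print(auc(labels, scores), sensitivity_specificity(labels, scores))
```

The pairwise AUC is quadratic in the number of cases, which is fine for illustration; a rank-based computation would be preferable for large cohorts.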