In this thesis, we consider statistical issues in disease classification using diagnostic testing. We discuss two aspects of disease classification: the construction of testing algorithms that combine multiple diagnostic tests while addressing both accuracy and cost, and the application of an imperfect diagnostic test to determine case status in a case-control study.
Motivated by the problem of combining multiple biomarkers to identify recent HIV infection (<1 year), we first develop methods for identifying "serial testing algorithms" to reduce the cost of diagnosis. These algorithms are characterized by the ability to make a classification determination before all diagnostic markers have been acquired, which allows them to maintain accuracy while controlling the cost of diagnostic testing.
We present two approaches to this problem. The first is a logic regression approach in which serial testing algorithms are developed by means of logical combinations of dichotomous tests; testing costs are optimized through a permutation algorithm on the logical rule. The second is a serial risk score classification approach. In this method, we establish multiple ordered stages of classification determined by a risk score model. In each stage, one or more diagnostic tests are added to the risk score model from the previous stage, and each observation is either classified as positive or negative or continues on for further testing.
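The staged structure of the serial risk score approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the fitted models from the thesis: the function serial_classify, the stage layout, the logistic coefficients, and the thresholds are all hypothetical. Each stage acquires its new tests, recomputes the risk score from everything acquired so far, and either classifies the observation or passes it to the next stage. The final stage is assumed to use equal lower and upper thresholds so that every observation is eventually classified.

```python
import math
from typing import Callable, Dict, List, Tuple

# A stage is: (tests acquired at this stage, risk score function,
#              lower threshold, upper threshold).
Stage = Tuple[List[str], Callable[[Dict[str, float]], float], float, float]


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def serial_classify(measure: Callable[[str], float],
                    stages: List[Stage]) -> Tuple[str, List[str]]:
    """Classify one observation, acquiring tests only as needed.

    `measure(name)` returns the value of the named test; it is called
    only when a stage actually requires that test.
    Returns the label and the list of tests that were acquired.
    """
    acquired: Dict[str, float] = {}
    for new_tests, score_fn, lower, upper in stages:
        for name in new_tests:
            acquired[name] = measure(name)
        score = score_fn(acquired)
        if score >= upper:
            return "positive", list(acquired)
        if score < lower:
            return "negative", list(acquired)
        # Otherwise the score is indeterminate: continue to the next stage.
    raise ValueError("final stage must have lower == upper")


# Hypothetical two-stage rule; coefficients and cutoffs are made up.
stages: List[Stage] = [
    (["bed"],
     lambda t: logistic(2.0 - 2.5 * t["bed"]), 0.2, 0.8),
    (["avidity"],
     lambda t: logistic(3.0 - 2.5 * t["bed"] - 0.04 * t["avidity"]), 0.5, 0.5),
]

early = serial_classify({"bed": 0.2, "avidity": 50.0}.__getitem__, stages)
late = serial_classify({"bed": 0.8, "avidity": 30.0}.__getitem__, stages)
print(early)  # classified after the first stage
print(late)   # required both stages
```

Because the second test is measured only for observations whose first-stage score falls between the two thresholds, the expected testing cost depends on how wide that indeterminate band is, which is the trade-off between cost and accuracy that the serial approach is meant to manage.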
The methods are studied in simulations and compared with logistic regression. We apply the methods to data from HIV cohort studies to identify HIV-infected individuals who were recently infected (<1 year) using assays for multiple biomarkers. The biomarkers used in the classification rules are CD4 count, viral load, the BED assay, and an avidity assay. We find that serial testing algorithms can maintain accuracy while achieving a reduction in cost compared to testing all individuals with all assays.
We then investigate the application of a non-gold-standard test to a case-control study. This work was motivated by case-control studies of risk factors associated with recent (<1 year) HIV infection when the duration of infection cannot be directly observed. In this type of study, recently (<1 year) and chronically (>1 year) infected people represent two types of cases. When the case type is misclassified, the standard estimate of the odds ratio associated with one of the case types can be biased. We discuss methods to adjust the odds ratio from a case-control study using the performance characteristics of a classification rule. In particular, we discuss a matrix adjustment method that adjusts the observed counts of each case type, and an adjustment method based on a multinomial logistic regression model. These methods are shown to reduce bias in the estimation of the odds ratio.
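As a rough illustration of the matrix adjustment idea, the Python sketch below assumes a two-type case classification whose sensitivity and specificity for recent infection are known and do not depend on exposure. The function names, the counts, and the accuracy values are hypothetical and are not taken from the thesis; the sketch shows the general back-calculation of case-type counts, not necessarily the exact estimator developed here.

```python
import numpy as np


def adjust_case_counts(observed, sensitivity, specificity):
    """Back-calculate expected true (recent, chronic) case counts.

    `observed` holds the counts classified as (recent, chronic).
    The misclassification matrix maps true counts to expected observed
    counts, so the adjustment solves that linear system.
    Requires sensitivity + specificity != 1.
    """
    m = np.array([[sensitivity, 1.0 - specificity],
                  [1.0 - sensitivity, specificity]])
    return np.linalg.solve(m, np.asarray(observed, dtype=float))


def odds_ratio(cases_exposed, cases_unexposed,
               controls_exposed, controls_unexposed):
    return (cases_exposed * controls_unexposed) / (
        cases_unexposed * controls_exposed)


# Hypothetical data: counts classified as (recent, chronic) by exposure.
sens, spec = 0.9, 0.8
obs_exposed = [48, 52]
obs_unexposed = [34, 66]
controls_exposed, controls_unexposed = 50, 50

adj_exposed = adjust_case_counts(obs_exposed, sens, spec)
adj_unexposed = adjust_case_counts(obs_unexposed, sens, spec)

# Odds ratio for recent infection versus controls, before and after adjustment.
naive_or = odds_ratio(obs_exposed[0], obs_unexposed[0],
                      controls_exposed, controls_unexposed)
adjusted_or = odds_ratio(adj_exposed[0], adj_unexposed[0],
                         controls_exposed, controls_unexposed)
print(naive_or, adjusted_or)
```

In this constructed example the observed counts were generated from true recent-case counts of 40 (exposed) and 20 (unexposed), so the unadjusted odds ratio is attenuated toward 1 and the adjusted one recovers the value implied by the true counts. With real data the adjusted counts are estimates and can even be negative in small samples, which is one reason a model-based adjustment may be preferred.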
We conclude with a discussion of the described methods for disease classification, which were motivated by problems in HIV research. These problems include the cost of diagnostic tests and the fact that dates of infection often cannot be determined. The methods we developed may also apply to other settings, especially when the cost of diagnostic testing is high and there are multiple types of cases that cannot be distinguished with complete accuracy.