Classification Accuracy of Neuroimaging Biomarkers in Attention-Deficit/Hyperactivity Disorder: Effects of Sample Size and Circular Analysis

BACKGROUND: Motivated by an inconsistency between reports of high diagnosis-classification accuracies and known heterogeneity in attention-deficit/hyperactivity disorder (ADHD), this study assessed classification accuracy in studies of ADHD as a function of methodological factors that can bias results. We hypothesized that high classification results in ADHD diagnosis are inflated by methodological factors. METHODS: We reviewed 69 studies (of 95 studies identified) that used neuroimaging features to predict ADHD diagnosis. Based on reported methods, we assessed the prevalence of circular analysis, which inflates classification accuracy, and evaluated the relationship between sample size and accuracy to test whether small-sample models tend to report higher classification accuracy, also an indicator of bias. RESULTS: Circular analysis was detected in 15.9% of ADHD classification studies, lack of an independent test set was noted in 13%, and insufficient methodological detail to establish its presence was noted in another 11.6%. Accuracy of classification ranged from 60% to 80% in the 59.4% of reviewed studies that met criteria for independence of feature selection, model construction, and test datasets. Moreover, there was a negative relationship between accuracy and sample size, implying additional bias contributing to reported accuracies at lower sample sizes. CONCLUSIONS: High classification accuracies in neuroimaging studies of ADHD appear to be inflated by circular analysis and small sample size. Accuracies on independent datasets were consistent with known heterogeneity of the disorder. Steps to resolve these issues, and a shift toward accounting for sample heterogeneity and prediction of future outcomes, will be crucial in future classification studies in ADHD.

A significant challenge in the assessment and treatment of neuropsychiatric disorders is that diagnosis is typically based on subjective behavioral criteria, a process that is time-consuming and requires considerable expertise and training. The need for objective diagnostic indicators has fueled efforts to define neuropsychiatric biomarkers, particularly based on structural and functional features of the brain, and with increasing deployment of machine learning methods. Results of these efforts have been variable; recent reviews indicate that classification accuracy is distributed broadly between chance and near 100% (1-3). Such variability can lead to puzzling outcomes, as is evident in the case of attention-deficit/hyperactivity disorder (ADHD). On the one hand, reports of accuracies in excess of 90% (4-17) have culminated in the electroencephalography-based theta/beta ratio (TBR) metric (18) gaining Food and Drug Administration support as an adjunct to clinical assessment of ADHD (19,20). On the other hand, the variability echoes increasing awareness of heterogeneity in ADHD in symptom presentation (21), neurocognitive impairment (22,23), persistence (24-26), treatment response (27,28), and putative mechanistic pathways (29-31), and supports the existence of independent subgroups within ADHD (32-37). The incompatibility between such heterogeneity and a diagnostic tool validated by existing ADHD diagnosis has contributed to discussion over the utility of neuroimaging in diagnosis of ADHD (38-40). It also raises a conceptual question: if current diagnosis of ADHD is too clinically variable for classification, how are high classification accuracies achieved? The answer to this question is important if it lies in methodological limitations, which may continue to be a concern in future studies. Thus, we examine this question using ADHD as an exemplar, given the large existing literature base on neuroimaging classifiers of diagnosis.
Potential pitfalls of applying classification approaches to neuropsychiatric data have been discussed extensively (1,3,41,42). Two are particularly relevant here: circular analysis and sample size. First, to be useful in clinical medicine, a machine learning classifier needs good generalizability, defined as good performance on patients not included in the study (i.e., new patients). In the experimental setting, this is assessed by cross-validation, whereby a subset of a dataset is withheld from construction of the classification model ("training") and subsequently used to assess the performance of the model ("testing"). However, testing accuracy can be inflated by a common error: including all data when selecting the features to be used for classification (i.e., before training). For instance, a t test may be performed on all subjects' data, before cross-validation, to identify the brain regions that are most discriminative of two groups. This step is typically performed to reduce the number of features (e.g., brain regions) included in the model. However, including all subjects' data in feature selection (rather than performing this step on the training subset only) creates circularity, or "peeking," in the training model that can inflate reported test accuracy (43). Simulations suggest that accuracy inflation can reach 40%, depending on model parameters (3,44,45) (also see the Supplement for simulation results). In 2008, a reported 42% of functional magnetic resonance imaging (MRI) studies in high-impact journals were subject to circular analysis, with another 14% lacking the methodological detail needed to reach a judgment (43,46), suggesting that such practice is not uncommon. A second concern is small sample size, which can drastically inflate both the estimate and the variability of cross-validation accuracy (41,42,47). Simulations show that accuracy estimates in models designed for neuropsychiatric diagnostics can become unstable when total sample size is less than 100 to 150 (41,47-50), and the problem is most severe when combined with circular analysis (45).
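To make the circularity pitfall concrete, the following sketch (illustrative only, not drawn from any reviewed study; it uses scikit-learn and pure-noise data) compares feature selection performed on the full dataset before cross-validation with feature selection nested inside each training fold. Because there is no true signal, any accuracy above the 50% chance level from the circular pipeline is inflation.

```python
# Illustrative simulation of circular analysis ("peeking") vs. proper nested feature selection.
# Pure-noise data: the circular pipeline typically reports accuracy well above chance,
# whereas the nested pipeline hovers near 0.50.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
n_subjects, n_features = 40, 5000          # small sample, many candidate features
X = rng.standard_normal((n_subjects, n_features))
y = np.repeat([0, 1], n_subjects // 2)     # arbitrary "ADHD" vs. "control" labels
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Circular analysis: features chosen using ALL subjects, including future test folds.
X_peek = SelectKBest(f_classif, k=20).fit_transform(X, y)
acc_circular = cross_val_score(SVC(kernel="linear"), X_peek, y, cv=cv).mean()

# Correct analysis: feature selection refit inside every training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="linear"))
acc_nested = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"circular: {acc_circular:.2f}  nested: {acc_nested:.2f}  (chance = 0.50)")
```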
The objective of this study was to review neuroimaging-based studies of ADHD classification to assess the contribution of circular analysis and sample size to classification accuracy, thereby testing for accuracy-inflating effects of these two factors and whether these effects have changed over time. The results provide a more accurate portrayal of classification accuracies in ADHD and expose methodological weaknesses that should be addressed in future studies, weaknesses that generalize to studies of any neuropsychiatric disorder.

METHODS AND MATERIALS
We performed a literature search using multiple databases (PubMed, Web of Science) and search engines (Google Scholar), with key words including ADHD, ADD, classification, machine learning, classifier, prediction, and accuracy, retaining publications that explicitly described a classification framework to distinguish between ADHD and comparison groups based on neuroimaging features (N = 95 studies). Studies were excluded if 1) no control group was examined (ADHD only or ADHD vs. other disorder groups) (n = 5); 2) sample size per class or age group was not specified (n = 5); 3) total sample was <6, limiting within-group variance (n = 2); 4) accuracy was shown graphically but not reported in the text (n = 3); 5) the model did not use neuroimaging features (n = 9); or 6) classification was not performed based on original ADHD diagnostic labels (n = 1). One study was excluded due to a retraction. This exclusion protocol yielded a final total of 69 studies (Table 1; see Supplement for the list of excluded studies).

Study Characteristics
For each study we identified sample size, population (adult, pediatric), feature type, and classifier model. We used a cutoff of 18 years of age for classifying studies as adult versus child populations. For simplicity, studies with participants up to and including 18 years of age were labeled as child studies, and studies with participants 18 years of age and older were labeled as adult studies. An exception was the 2017 study of Duffy et al. (51), which used a range of 2 to 22 years of age and was labeled as "children" in Table 1 for simplicity. If studies performed separate analyses for adults and for children, we report the study twice, treating each group as a separate population.

Frequency of Circular Analysis
To assess the frequency of circular analysis, we evaluated the methods section of each study. We identified procedures for feature selection and for classification, with the goal of determining whether the same data were used both for feature selection and for testing of the classification model. If this was unambiguously the case, the study was labeled as nonindependent with respect to model testing (see Table 1). In many instances, the methods description and/or presented workflow left ambiguity regarding nonindependence; such studies were labeled as unknown with respect to nonindependence. For all such studies, we contacted the primary author to seek additional details and reduce the size of the unknown category. Some studies presented a rationale for including all subjects' data in feature selection, arguing that the feature selection algorithm was independent of the classifier analysis and thus should not affect classifier performance (14,51). However, because true independence in such cases can be difficult to ascertain (43), we included such studies in the nonindependent category. We therefore adopted a strict criterion: to be labeled free of circular analysis, a study had to use a completely different set of subjects for feature selection versus testing. This definition subsumes cases where features were defined based on prior knowledge (i.e., prior studies de facto use independent data to define the features). It also implies that, for studies using an iterative cross-validation scheme, feature selection must be either based on prior knowledge or performed within the training set of every iteration for the classifier to be guaranteed free of circular analysis. Finally, we also identified studies in which no test set was defined (all data were used in feature selection and model construction) and thus no cross-validation was performed. Such studies may suggest potentially useful features but provide no test of model generalizability. At the other extreme, we also identified studies that used an additional, completely independent testing dataset (which we refer to as the validation set to distinguish it from the test set), not involved in feature selection, which provides an additional objective, external check of model generalizability.

Sample Size and Accuracy
For each study, we obtained the total sample size and classifier specificity, sensitivity, and accuracy. Where multiple models were examined, we took the best-performing model. Where accuracy was unreported, we calculated it from the reported sample sizes, sensitivity, and specificity (Table 1). We then evaluated the relationship between sample size and accuracy using a logistic regression model, treating accuracy as the probability of a binary outcome (correct classification) and sample size as a predictor.
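For studies reporting only sensitivity and specificity, accuracy follows directly from the class sizes; a minimal sketch of that conversion (the function name and example values are ours, not taken from any reviewed study):

```python
def accuracy_from_sens_spec(sensitivity: float, specificity: float,
                            n_adhd: int, n_control: int) -> float:
    """Overall accuracy as the class-size-weighted average of sensitivity and specificity."""
    correct = sensitivity * n_adhd + specificity * n_control
    return correct / (n_adhd + n_control)

# Example: sensitivity 0.80, specificity 0.70, 60 ADHD and 40 control participants
print(accuracy_from_sens_spec(0.80, 0.70, 60, 40))  # 0.76
```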

Time Analysis
Finally, we sought to establish whether the methodological factors of concern (small sample size and circular analysis) are current problems, or whether their presence (if established) is restricted to older studies preceding awareness of these issues in the field. To do so, we analyzed 1) an analogous logistic regression model with accuracy as the probability of a binary outcome and year of publication as a predictor; 2) a linear regression model with sample size as the dependent variable and year of publication as a predictor; and 3) contingency tables for presence of circular analysis (yes/no/unknown) and time windows constructed by binning years of publication by median split (<2013, ≥2013) and, in a second analysis, by the top and bottom 33rd percentiles (≤2011, >2014).
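A sketch of how these three analyses could be set up in Python (a reconstruction under stated assumptions, not the authors' code; the data file and column names are hypothetical):

```python
# Hypothetical reconstruction of the time analyses; column names are assumed.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2_contingency

df = pd.read_csv("reviewed_studies.csv")   # assumed columns: year, n_total, accuracy, circularity

# 1) Accuracy as the probability of a binary outcome (correct vs. incorrect classification),
#    modeled with a binomial GLM; each study contributes n_total "trials".
n_correct = np.round(df["accuracy"] * df["n_total"])
endog = np.column_stack([n_correct, df["n_total"] - n_correct])
logit_year = sm.GLM(endog, sm.add_constant(df["year"]),
                    family=sm.families.Binomial()).fit()

# 2) Linear regression of sample size on publication year.
ols_year = sm.OLS(df["n_total"], sm.add_constant(df["year"])).fit()

# 3) Contingency table: circularity status (yes/no/unknown) by median-split time window.
window = np.where(df["year"] < 2013, "<2013", ">=2013")
table = pd.crosstab(df["circularity"], window)
chi2, p, dof, _ = chi2_contingency(table)

print(logit_year.summary(), ols_year.summary(), (chi2, p), sep="\n")
```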

RESULTS

Study Set Characteristics
Of the 69 studies reviewed (Table 1), almost none used the exact same set of features, with the exception of studies of TBR. Among the algorithms chosen, support vector machines were the most common, used in 26 (37.6%) studies, followed by linear discriminant analysis (13 studies, 18.8%), neural networks (8 studies, 11.6%), and logistic regression (5 studies, 7.3%). Four studies (5.8%) employed receiver operating characteristic curve analysis to draw conclusions regarding the ability of features to discriminate between groups.

Prevalence of Circular Analysis
A total of 15.9% of studies (11 of 69) presented methods consistent with circular analysis, whereby feature selection was performed on the full dataset including the test data. Nine studies (13.0%) did not employ any cross-validation; hence, the reported accuracies were untested with respect to generalizability. In 8 of 69 studies (11.6%), independence was unclear (unknown); that is, the methods provided insufficient information to determine whether circular analysis was present. For example, some studies used linear discriminant analysis trained on half the dataset, but t tests were used to determine which features were considered by the linear discriminant analysis, and it was not specified which data were used to perform the t tests (training sample only or full sample). We note that before active author inquiry, we encountered a total of 17 studies (24.6%, 17 of 69) with methodological detail insufficient to make a determination regarding feature selection.
In sum, we identified 41 studies (59.4%) that met our criteria for independence of the test set relative to training and feature selection. Of these, most (29 of 41, or 70.7%) were studies using functional MRI features [25 as part of the ADHD-200 competition (54)]. Only 26.8% (11 of 41) used EEG features. Thus, where an assessment could be performed, circular analysis was more prevalent in EEG studies than in MRI studies, χ²(1, n = 51) = 8.52, p < .004.

Sample Size and Classifier Accuracy
In studies that met independence criteria, the relationship between sample size and accuracy was significant (Wald χ² = 18.9, p < .001; odds ratio = 0.9987; 95% confidence interval, 0.9983-0.9993) (Figure 2); for a one-unit increase in sample size, the odds of correct classification decreased by 0.12%. This translates into a predicted drop of approximately 5.9% in classifier accuracy when increasing a sample from n = 10 to n = 300, or 25.4% when increasing a sample from n = 10 to n = 1000. A sample size-accuracy relationship was not significant for studies that failed to meet independence criteria (Wald χ² = 0.03, p = .88) (Figure 2), possibly because of inflated accuracy across sample sizes. Confirming these effects, the mean accuracy of the 25% largest independent test set studies was significantly lower than the mean accuracy of the 25% smallest studies (mean largest = 68.1%, mean smallest = 84.5%; t(18) = 4.4, p < .0001), and also significantly lower than that of the nonindependent studies (mean nonindependent = 83.6%; t(18) = 3.3, p < .005). As a larger portion of MRI than EEG studies used independent testing, we repeated the analysis for each modality to test whether this relationship is largely driven by MRI studies. As expected, for MRI studies, the negative association of sample size and classification accuracy was significant (Wald χ² = 17.0, p < .001; odds ratio = 0.9988; 95% confidence interval, 0.9983-0.9995). For EEG studies, the relationship was not significant (Wald χ² = 0.01, p = .91).
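To see how a per-participant odds ratio of 0.9987 maps onto the reported accuracy drops, a back-of-the-envelope calculation can be used (a sketch; the baseline accuracy at n = 10 is our assumption, taken near the smallest-quartile mean reported above, so the exact figures depend on the fitted intercept):

```python
def predicted_accuracy(acc_at_n0: float, odds_ratio: float, n0: int, n1: int) -> float:
    """Apply a per-unit-sample-size odds ratio from logistic regression to a baseline accuracy."""
    odds0 = acc_at_n0 / (1 - acc_at_n0)
    odds1 = odds0 * odds_ratio ** (n1 - n0)
    return odds1 / (1 + odds1)

base = 0.845                      # assumed baseline accuracy at n = 10
for n in (300, 1000):
    acc = predicted_accuracy(base, 0.9987, 10, n)
    print(f"n = {n}: predicted accuracy {acc:.3f}, drop {base - acc:.3f}")
# Yields drops of roughly 0.05-0.06 and 0.24-0.25, in line with the ~5.9% and ~25.4% figures.
```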
In the Snyder et al. (18,20) studies, the TBR thresholds were defined based on an external database in the 2008 study (18) and based on the 2008 result in the 2015 study (20). Thus, the Snyder et al. (18,20) studies and the 2001 Monastra et al. study (10) can be considered independent cross-validation and by this definition do not fall under circular analysis. However, these studies had limitations with respect to estimation of specificity. The non-ADHD comparison sample size averaged 16 individuals per age group [i.e., n = 7, 11, and 15 per tested age group in Monastra et al. (10); n = 9, 20, and 33 per tested age group in Snyder et al. (18)]. Finally, in the 2015 study of Snyder et al. (20), accuracy based on TBR alone was not reported. In all, test results are either lacking or underpowered for effective assessment of TBR classification generalizability.

DISCUSSION
The aim of our study was to assess the contribution of circular analysis and small-sample bias to the accuracy of diagnostic classification studies in ADHD using neuroimaging biomarkers. We found circular analysis in 15.9% of ADHD classification studies, lack of cross-validation in 13%, and insufficient methodological detail to establish its presence in another 11.6%. Our results reveal that accuracy of classification is 60% to 80% in the 59.4% of studies that met our criteria for independence of feature selection, model construction, and test datasets. There was a negative relationship between accuracy and sample size even in the presence of independent testing, suggesting that small-sample accuracies may be subject to bias.

Methodological Factors and Classification Accuracy
A key conclusion from our analysis is that in 28.9% of the studies reviewed, reported accuracy was likely inflated owing to the presence of circular analysis or the lack of internal validation (test set). In some cases, the use of the full dataset for feature selection was justified by using an analysis thought to be independent of the contrast of ADHD patients versus control subjects [e.g., mean effect across all subjects within a condition (14), principal component analysis (51)]. However, the independence of such approaches is difficult to guarantee and can still contribute to bias during testing (43), and such practices should therefore be avoided. External validation, an even stronger test of generalizability, was absent in 55 studies (79.7%), suggesting that our estimates of true accuracy in classification of ADHD may still be optimistic. Time analyses did not support the conclusion that rates of circular analysis are decreasing across publication year. However, our estimate of 15.9% of studies reviewed is roughly a third of that reported in 2008, when 42% of functional MRI studies in high-impact journals were subject to circular analysis (43,46), supporting an awareness of these methodological issues in the community. Nevertheless, the frequency of insufficient methodological detail (24.6% before author inquiry, 11.6% after author inquiry) was high and highlights a need for systematic review criteria for classification studies. There are now a number of excellent reviews, many specifically targeting biomarker studies in neuropsychiatry, that provide such guidance (1-3,41,42).
Replicating recent review findings of Varoquaux (41), classification accuracy in studies with independent test data decreased with sample size. This suggests that bias is at play in small-sample studies, particularly given that, in unbiased analyses, accuracy is known to increase with sample size (47,49,59). Sources of this bias likely include publication bias, with small-sample studies that fail to obtain high classification accuracy being unlikely to be published, leading to underestimation of accuracy variance in small-sample studies. In classification of psychiatric conditions such as ADHD, a pertinent source of bias may be sample homogeneity in small samples that is not representative of the broader population (48). An important caveat to our observations is that the interaction between sample size and accuracy may be additionally affected by the choice of cross-validation scheme (e.g., k-fold vs. leave-one-out), data preprocessing (e.g., control for motion artifacts), and classifier. An exhaustive analysis of these factors fell outside the scope of the current study, owing to variability in these factors among studies, but a preliminary analysis did not reveal differences in choice of classifier or cross-validation scheme across sample size (see Supplement). It is also notable that accuracy did not appear to decrease across year of publication, whereas sophistication in machine learning has certainly improved. The decrease in accuracy with sample size that we observed therefore appears robust to these alternative methodological choices. Critically, the solution to small-sample problems lies in rigorous statistical assessment of classifier accuracy. This can be achieved using the binomial test (for two-class problems) and permutation testing (50). Permutation testing, in particular, is a reliable, flexible, and readily available tool to assess the significance and variability of a given accuracy (53,60). Reporting of both significance and an estimate of variability, such as confidence intervals, is perhaps the most important recommendation because, independent of the availability of larger samples, such reporting continues to be done inconsistently, based on a 2017 review of 237 classification studies across brain disorders (1). Finally, although difficult to quantify, the amount of data per subject inherently varies from study to study, and thus reliability varies depending on the neuroimaging measure employed. This further underscores the importance of data quality, in addition to data quantity, in predictive modeling.
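Both tests are straightforward to run with standard tooling; a minimal sketch (the dataset and classifier here are placeholders, not taken from any reviewed study):

```python
# Assessing whether an observed accuracy exceeds chance: binomial test and permutation test.
from scipy.stats import binomtest
from sklearn.datasets import make_classification
from sklearn.model_selection import permutation_test_score, StratifiedKFold
from sklearn.svm import SVC

# Binomial test: e.g., 70 of 100 test cases correct vs. a 50% chance rate (two-class problem).
print(binomtest(k=70, n=100, p=0.5, alternative="greater").pvalue)

# Permutation test: re-estimate cross-validated accuracy under shuffled labels.
X, y = make_classification(n_samples=120, n_features=50, random_state=0)  # placeholder data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
score, perm_scores, p_value = permutation_test_score(
    SVC(kernel="linear"), X, y, cv=cv, n_permutations=1000, random_state=0)
print(f"accuracy = {score:.2f}, permutation p = {p_value:.3f}")
```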

Value of Biomarkers in ADHD Diagnosis and Beyond
This study was motivated by an apparent inconsistency between reports of high classification accuracies and known heterogeneity in ADHD. We found that the subset of studies with independent test sets reported accuracies in the range of 60% to 80%. The fact that these values were significantly above 50% suggests that neuroimaging-based biomarkers were associated with ADHD and therefore have some value. However, these accuracies are too low to be used without other supporting information in clinical practice because they would result in substantial false positive and false negative rates [also see Loo and Barkley (61)]. We also note that the test set was difficult to define in the studies of TBR (10,18,20,55), which is significant because TBR is a Food and Drug Administration-approved adjunct to clinical assessment (19,20). These studies also did not include control samples large enough to accurately estimate the standard error, which could mean that the specificity of TBR has been overestimated. Such a conclusion is consistent both with reported variability in the group-difference effect size of TBR (38,62,63) and, in particular, with the observation that the decreasing effect size of TBR across studies appears to be correlated with a change in TBR in the control sample rather than the ADHD sample (62). More broadly, the low and variable accuracies are consistent with the inherent heterogeneity of ADHD, documented in symptom presentation (21), neurocognitive impairment (22,23), persistence (24-26), treatment response (27,28), and putative mechanistic pathways (29-31). Given a heterogeneous population, classification models will learn to accurately identify those individuals with features that are shared among subpopulations but will be less successful in identifying individuals who have features specific to a subpopulation. However, as argued by Schnack and Kahn (48), a drop in accuracy in a new testing sample in this context carries information about the mutual homogeneity of the sample and may help to identify shared versus nonshared features.
What is the future of biomarkers in ADHD? Echoing recent reviews, we suggest that the primary goals within ADHD ought to include parsing of heterogeneity and prediction of future outcomes, rather than diagnosis. Addressing heterogeneity, dimensional analysis approaches [e.g., the Research Domain Criteria initiative (64,65)] seek to identify novel subgroups based on shared neuroimaging (and other feature) profiles. A promising example of this approach is that of Bansal et al. (66), who developed an automated routine to first discover natural groupings based on brain morphology. Using these novel groupings, they achieved classification sensitivity of 93.6% and specificity of 88.5% on an independent testing set including children with ADHD and control subjects. In complement, a shift toward using machine learning and biomarkers to predict future outcomes (development and aging, education, learning, criminality, health-related behaviors, response to treatments) is likely to have a greater impact than prediction of diagnosis on personalized clinical practices that can directly improve patients' lives (67-69). For instance, brain network connectivity associated with sustained attention performance has been shown to predict ADHD symptoms in an independent sample (70-73), defining a potential tool for diagnosis-independent assessment of attentional integrity.

Conclusions
In this study, we found that unbiased classification accuracy in ADHD diagnosis lies in the range of 60% to 80%, too low to be viewed as an independently useful biomarker of disease but consistent with known heterogeneity in this disorder. These data are also consistent with contributions of circular analysis and small-sample bias to the inflation of higher reported accuracies, thus accounting for the discrepancy. We conclude that steps to resolve these issues, as well as a shift toward accounting for sample heterogeneity and prediction of future outcomes, will be crucial in increasing the utility of classification in ADHD.

Figure 3. Classification across publication year. Neither accuracy (top panel) nor sample size (bottom panel) could be predicted from publication year. The relationship between the two (Figure 2) was also significant with publication year as a covariate (see text). Frequency of circular analysis also did not vary by year. Shading indicates 95% confidence interval.

Table 1. Neuroimaging Classification Studies of ADHD
a When the study did not provide the accuracy, we estimated it from the sample size, specificity, and sensitivity.