Estimating sample sizes for pre-dementia Alzheimer’s trials based on the Alzheimer’s Disease Neuroimaging Initiative

This study modeled predementia Alzheimer’s disease (AD) clinical trials. Longitudinal data from cognitively normal (CN) and mild cognitive impairment (MCI) participants in the AD Neuroimaging Initiative were used to calculate sample size requirements for trials using the following outcome measures: the Clinical Dementia Rating scale sum of boxes (CDR-sb), Mini-Mental State Examination (MMSE), AD Assessment Scale-cognitive subscale with and without delayed recall, and the Rey Auditory Verbal Learning Task (RAVLT). We examined the impact on sample sizes of enrichment for genetic and biomarker criteria, including cerebrospinal fluid protein and neuroimaging analyses. We observed little cognitive decline in the CN population at 36 months, regardless of the enrichment strategy. Nonetheless, in CN subjects, using RAVLT total as an outcome at 36 months required the fewest subjects across enrichment strategies, with apolipoprotein E genotype ε4 carrier status requiring the fewest (n=499 per arm to demonstrate a 25% reduction in disease progression). In MCI, enrichment reduced the required sample sizes for trials, relative to estimates based on all subjects.


Background
Studies of the biology of Alzheimer's disease (AD) have identified an array of targets for potential disease-modifying therapies [39], but clinical trials in patients with dementia have been unsuccessful so far [14,24,50,52,53]. Biological substrates of AD can be identified before patients become demented [46] and some AD biomarkers reach peak levels of abnormality prior to diagnosis [29,37]. Failed dementia trials may have intervened too late in the disease process to be effective [62].
Clinical trials of investigational drugs targeting AD biology can enroll patients earlier in the disease, before criteria for dementia are fulfilled. Primary prevention trials enroll volunteers with no clinical or biological signs of AD at baseline but require thousands of participants and take many years to complete, because only a fraction of participants will develop AD [15]. To date, few primary AD prevention trials have been conducted and no agent has been shown to delay or prevent dementia onset. Secondary prevention trials can enroll participants at increased risk for dementia, affording decreased sample sizes and trial lengths. Secondary prevention trials have included individuals with mild cognitive impairment (MCI), a clinical syndrome defined by memory impairment or other cognitive problems, relative to age- and education-matched norms, in the absence of functional decline [49]. Even so, some trials enrolling MCI participants have encountered low rates of disease progression [21].
Biological markers of AD predict clinical progression and may be used to identify potential trial participants at greatest risk for dementia. Low levels of amyloid beta (Aβ) or elevated levels of total tau (tTau) or phosphorylated tau (pTau) in the cerebrospinal fluid (CSF; e.g. [40]); evidence of cerebral atrophy on magnetic resonance imaging (MRI; e.g. [6]); and brain glucose hypometabolism observed with fluorodeoxyglucose positron emission tomography (FDG PET) (e.g. [33]) identify MCI patients at increased and more immediate risk for AD dementia. Even in asymptomatic individuals, the presence of biological evidence of AD significantly increases the risk for future cognitive impairment and AD dementia [7,17,47].
Thus, it is likely that using AD biomarkers as enrollment criteria can reduce the number of participants needed and study duration for AD prevention trials. Using AD biomarkers as outcome measures in AD trials can similarly improve trial efficiency [9,27,28,31,35,57,59]. The US Food and Drug Administration (FDA), however, has not accepted any biomarker as a surrogate suitable for use as a primary outcome measure in AD trials. Moreover, FDA guidance outlines the use of clinical measures to achieve marketing approval [30,34]. Therefore, registration trials, even those conducted in very mild disease, continue to use the AD Assessment Scale-cognitive subscale (ADAS-cog) and other clinical scales as primary outcome measures.
The statistical power of predementia trials may be improved by population enrichment strategies using biomarkers. These trials might be able to employ a single primary outcome measure (rather than dual primary outcomes, as is the case in dementia trials [3]). Using the AD Neuroimaging Initiative (ADNI) dataset, we sought to identify the best enrichment strategies for predementia trials in relation to outcome measures to optimize statistical power. We hypothesized that enriching cognitively normal (CN) and MCI trial populations through biomarker criteria would reduce required sample sizes.

ADNI
Data used in the preparation of this article were obtained from the ADNI database (adni.loni.ucla.edu). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the FDA, private pharmaceutical companies and nonprofit organizations, as a $60 million, five-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.
The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California, San Francisco. ADNI is the result of efforts of many coinvestigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 adults, ages 55 to 90, to participate in the research: approximately 200 cognitively normal older individuals to be followed for three years, 400 people with MCI to be followed for three years, and 200 people with early AD to be followed for two years. For up-to-date information, see www.adni-info.org.
The current analyses focused on the first iteration of ADNI, which enrolled a cohort of volunteers who were CN or who had MCI or AD dementia at baseline. Clinical and biological data were collected, including MRI volumetric measures, FDG PET, and CSF protein analysis. The current analyses used data from CN and MCI ADNI subjects. CN subjects had no subjective memory complaints at baseline. CN and MCI subjects scored between 24 and 30 on the Mini-Mental State Examination (MMSE). CN subjects had a global Clinical Dementia Rating scale (CDR) score of 0 at baseline. MCI subjects had a global CDR score of 0.5, with a required memory box score of 0.5 or higher at baseline. Subjects were also required to meet criteria for memory performance on the Wechsler Memory Scale-Revised Logical Memory II subscale: CN subjects ≥9 for 16 years or more of education, ≥5 for 8-15 years of education, and ≥3 for 0-7 years of education; MCI subjects ≤8 for 16 years or more of education, ≤4 for 8-15 years of education, and ≤2 for 0-7 years of education. ADNI CN subjects could not have impairment in activities of daily living and MCI subjects could not meet criteria for dementia.
All ADNI participants had a modified Hachinski Scale score <4 and a Geriatric Depression Scale (abbreviated 15-item version) score <6, were fluent in English or Spanish, had a suitable study partner who could accompany them to study visits, and lived at home. They had no significant neurologic or psychiatric disease, no history of alcohol or substance abuse, no clinically significant lab abnormalities in B12 level, rapid plasma reagin, or thyroid function tests, and no contraindication to neuroimaging. They did not take psychoactive drugs, including antidepressants with anticholinergic properties, or warfarin. They had not participated in a clinical trial of an investigational medication within one month of baseline or for the duration of their participation in ADNI and they were not involved in other studies that included neuropsychological testing that could interfere with the ADNI-related testing.
ADNI was designed to parallel AD clinical trials, employing a variety of psychometric outcome measures that are common in AD registration trials. We examined data collected from all CN and MCI participants at baseline and at 12, 24, and 36 months. We focused on outcome measures that included assessments of memory, including the ADAS-cog with (ADAS12) and without (ADAS11) the delayed recall component [51]; the MMSE [22]; the Rey Auditory Verbal Learning Task [63] total score summing the number recalled over the initial 5 learning trials (RAVLT total) and the recall score after a 30-minute delay (RAVLT delayed recall); and the CDR sum of the boxes score (CDR-sb) [45]. Longitudinal clinical data and baseline biomarker data were downloaded from the ADNI public database (http://www.loni.ucla.edu/ADNI/Data/) on May 2, 2011.

Enrichment strategies
Multiple strategies were used to limit the data used to calculate sample sizes. Enrichment was hypothesized to create a sample in which there would be a greater magnitude of cognitive decline over time and/or reduced variance, thus reducing the sample size necessary to detect a drug effect, at a specific level of statistical power. Each enrichment strategy was applied to both the CN and the MCI populations.
Apolipoprotein E (ApoE) ε4 carrier status: ApoE genotype is a well-described genetic risk factor for AD [12]. ADNI ApoE genotyping was performed using blood samples at the University of Pennsylvania AD Biomarker Laboratory. Subjects were divided into those who did and those who did not carry at least one ε4 allele.
CSF protein analysis: CSF collection and analysis have been described elsewhere [60]. CSF protein levels included here are Aβ 1-42 (Aβ), total Tau (tTau), and Tau phosphorylated at threonine 181 (pTau). In addition, the ratios tTau/Aβ and pTau/Aβ were examined as enrichment criteria. Criteria for inclusion are described below.
Hippocampal volume: Hippocampal volumes were measured using a machine learning method based on adaptive boosting (AdaBoost), as described previously [43]. Briefly, this automated method performs brain MRI segmentation and quantifies hippocampal volume. It uses a pool of 14,000 features, such as image intensity; tissue classification maps of gray matter, white matter, and CSF; and neighborhood-based features from each voxel, and designs an algorithm that can optimally segment the hippocampus (or another brain structure) from a limited region within brain MRIs standardized against a registered template. A weighted voting algorithm combines "weak learners" into a "strong learner." Prior work has shown this method consistently agrees with expert human rater tracings [42,43,44].
Lateral ventricle volume: Ventricular volumes were assessed using a semi-automated, multi-atlas segmentation technique that was developed at UCLA [10,11]. A small number (n=6) of lateral ventricles from the sample were manually traced and used to create ventricular models that could be converted into parametric surface atlases. Fluid registration of these atlases to every subject was performed, and an averaging technique combined the surface atlases for each image volume. The choice of the number of templates was empirically based on optimizing the False Discovery Rate. This technique distinguishes AD from normal controls and also demonstrates differences in ventricular volume based on ApoE carrier status [10].
Cerebral metabolism: Predefined ROI analysis of FDG PET cerebral metabolism was conducted as described previously [32]. Metabolic signal was intensity-normalized within subjects against the cerebellar vermis and pons. FDG uptake was extracted for left and right temporal lobes as regions of interest. The average glucose uptake for this region across hemispheres was used to produce a single value for each subject at baseline. Only a subset (102 CN and 206 MCI) of participants in ADNI underwent FDG PET imaging.

Biological enrichment criteria cutoff points
To decide upon inclusion criteria for each enrichment strategy, we performed receiver operating characteristic (ROC) analyses, using data from the ADNI AD and CN populations. The inclusion criteria for MRI measures of volume (hippocampus and lateral ventricles) and FDG measures of metabolism were set to the threshold value for those measures that maximized the Youden index (the sensitivity plus the specificity minus 1 [66]) for discrimination between AD and CN groups. For CSF measures, we applied previously determined cutoffs based on ROC analyses of neuropathologically confirmed diagnoses of AD and normal controls [60]. Specifically, the following criteria for inclusion were used: Aβ <192 pg/mL; tTau >93 pg/mL; pTau >23 pg/mL; ratio of tTau/Aβ >0.39; ratio of pTau/Aβ >0.1. These cutoffs have been used in prior ADNI analyses [48]. Baseline CSF samples were available from 200 MCI and 114 CN ADNI participants.
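The threshold selection described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `youden_cutoff`, the `lower_is_abnormal` flag, and the toy values are assumptions made for illustration, and the numbers are invented rather than taken from ADNI. The sketch scans every observed value as a candidate cutoff and keeps the one that maximizes sensitivity plus specificity minus 1.

```python
def youden_cutoff(ad_values, cn_values, lower_is_abnormal=True):
    """Return (cutoff, J) maximizing the Youden index J = sens + spec - 1.

    ad_values / cn_values: biomarker values for the AD and CN groups.
    lower_is_abnormal: True for measures where low values indicate disease
    (e.g. hippocampal volume), False where high values do (e.g. ventricles).
    """
    best_cutoff, best_j = None, -1.0
    for c in sorted(set(ad_values) | set(cn_values)):
        if lower_is_abnormal:
            sens = sum(v <= c for v in ad_values) / len(ad_values)
            spec = sum(v > c for v in cn_values) / len(cn_values)
        else:
            sens = sum(v >= c for v in ad_values) / len(ad_values)
            spec = sum(v < c for v in cn_values) / len(cn_values)
        j = sens + spec - 1
        if j > best_j:  # ties keep the first (smallest) candidate cutoff
            best_cutoff, best_j = c, j
    return best_cutoff, best_j


# Invented toy values, for illustration only (not ADNI data).
toy_ad = [1.0, 1.1, 1.2, 1.3]
toy_cn = [1.25, 1.4, 1.5, 1.6]
cutoff, j = youden_cutoff(toy_ad, toy_cn, lower_is_abnormal=True)
```

Participants whose baseline value falls on the abnormal side of the selected cutoff would then be counted as meeting that enrichment criterion.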

Sample size calculations
We examined the mean decline in cognitive outcome measures at 12, 24, and 36 months. Participant data were included for all available outcome measures (i.e., missing values for an outcome of interest did not preclude inclusion for another). Sample size estimates under an assumption of normality and known variance were calculated from the equation:

$$n = \frac{2\sigma^{2}\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{(\Delta\mu)^{2}}$$

Here, $n$ is the required sample size per arm; $z_{1-\beta} = 0.842$ to provide 80% power; $z_{1-\alpha/2} = 1.96$ to test at the 5% level; $\Delta\mu$ is the mean change in score on the outcome of interest, relative to baseline, multiplied by the drug effect (0.25) to reflect the estimated mean difference between placebo group change scores and drug group change scores; and $\sigma$ is the SD of the change scores in the groups (assuming SD is the same in treatment and placebo groups). This sample size equation is well described in the literature and has been used previously by others to estimate sample sizes in AD clinical trials [23,36]. We report sample sizes per trial arm, powered to detect a 25% drug effect (slowing of cognitive decline).
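As a worked illustration of this calculation, the following Python sketch assumes the standard two-arm normal-approximation formula, n = 2σ²(z_{1−α/2} + z_{1−β})² / (Δμ)², with the constants stated above. The function name `sample_size_per_arm` and the example SD are hypothetical; only the 25% drug effect and the z values come from the text.

```python
import math


def sample_size_per_arm(mean_change, sd_change, drug_effect=0.25,
                        z_alpha=1.96, z_beta=0.842):
    """Participants per arm needed to detect a proportional slowing of decline.

    mean_change: mean change from baseline in the reference population
    sd_change:   SD of the change scores (assumed equal in both arms)
    drug_effect: fractional reduction in decline (0.25 in the text)
    """
    delta = drug_effect * mean_change  # treatment-vs-placebo difference
    if delta == 0:
        return math.inf  # no mean change, so no finite sample size exists
    n = 2 * sd_change ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)


# Hypothetical inputs: a mean decline of 2.4 points with an assumed SD of 6.0.
n_example = sample_size_per_arm(mean_change=2.4, sd_change=6.0)
```

Because n scales with (σ/Δμ)², an enriched population that declines twice as fast with the same variance would need roughly a quarter as many participants per arm, which is the rationale for enrichment.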
To assist in the comparison of sample size estimates, we calculated the 95% confidence intervals (CI) for the sample size. These confidence intervals were estimated by first calculating the 95% confidence interval for the effect size Δμ/σ through the noncentrality parameter t score [13]. These limits were then used in the equation above to calculate the 95% CI of the sample sizes. In cases where the confidence interval for the effect size crossed 0, the upper bound of the sample size CI is denoted as ∞. We also calculated 95% confidence intervals using bootstrap resampling, using 1000 iterations for each scenario. We found these confidence intervals to be on average 25% narrower than those calculated with our formula. We present only the more conservative estimates. Formal statistical comparisons of sample size outputs were not performed.
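The bootstrap check mentioned above can be sketched as a percentile bootstrap over participants' change scores. This is a minimal illustration, not the authors' procedure: the function names, the percentile method, the fixed seed, and the synthetic data are all assumptions. Only the 1000 iterations, the 95% level, and the sample size constants come from the text.

```python
import random
import statistics


def n_per_arm(changes, drug_effect=0.25, z_alpha=1.96, z_beta=0.842):
    """Per-arm sample size (unrounded) from a list of change scores."""
    delta = drug_effect * statistics.fmean(changes)
    if delta == 0:
        return float("inf")  # effect size of zero: unbounded sample size
    sd = statistics.stdev(changes)
    return 2 * sd ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2


def bootstrap_ci(changes, iterations=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the per-arm sample size."""
    rng = random.Random(seed)
    estimates = sorted(
        n_per_arm(rng.choices(changes, k=len(changes)))  # resample with replacement
        for _ in range(iterations)
    )
    lo_idx = int((1 - level) / 2 * iterations)
    hi_idx = int((1 + level) / 2 * iterations) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Synthetic change scores for illustration only (not ADNI data).
_data_rng = random.Random(42)
toy_changes = [_data_rng.gauss(1.0, 3.0) for _ in range(200)]
lower, upper = bootstrap_ci(toy_changes, seed=1)
```

When the mean change is small relative to its SD, some resamples have means close to zero and the upper percentile grows very large, which mirrors the ∞ upper bounds reported for the formula-based intervals.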

Findings
ADNI subjects
Demographic summaries and baseline scores on clinical outcome measures for the included populations are found in Table 1. Data from the entire ADNI sample were used for this study. Of those who underwent lumbar puncture (LP), 38% met criteria for low CSF Aβ. Of those CN participants who had PET scans, 14.7% met criteria for temporal lobe hypometabolism (ratio of FDG uptake below 1.14).
More than half (53%) of the MCI population were ApoE ε4 carriers. Among biomarker enrichment strategies, CSF strategies included the most MCI patients (for example, 67.5% of participants who underwent LP met CSF Aβ criteria), while a smaller proportion of participants (37.9%) met FDG PET criteria for enrichment (Table 2).

Estimated trial sample sizes: CN population
At 12 months, the CN population demonstrated mean worsening only on the CDR-sb. At 24 months, mean decline was observed only on the CDR-sb and MMSE. Therefore, we focused on trial estimations for the CN population based on the 36-month longitudinal data. At 36 months, mean worsening was observed for the MMSE, CDR-sb, RAVLT total, and RAVLT delayed recall, but not the ADAS11 or ADAS12 (Table 3). Sample size calculations ranged from 1414 participants/arm for the RAVLT total to 50,790 participants/arm for the MMSE (Table 4).
With few exceptions, trial sample sizes based on enriched populations required fewer participants than did trials based on the entire CN population (Table 4). Whereas the CN population as a whole did not demonstrate a mean decline on the ADAS11 or ADAS12 at 36 months, enrichment for persons who met CSF Aβ, CSF pTau, CSF ratio of pTau/Aβ, FDG PET hypometabolism, and hippocampal volume criteria resulted in mean decline (and therefore made sample size calculations possible; Table 4) for these outcome measures. In each of these scenarios, trials using the ADAS12 required fewer participants than trials using the ADAS11. Enrichment for ApoE ε4 carriers, CSF tTau, and the ratio of tTau/Aβ resulted in decline on the ADAS12 but not the ADAS11. For six of the nine examined enrichment strategies, trials using the RAVLT total required the fewest participants. Among these, trials enriched for ApoE ε4 carriers required the fewest participants (n=499, CI: 243-1659). Trials enriched for FDG PET or hippocampal volume required the fewest participants when using the CDR-sb as an outcome. Trials enriched for lateral ventricle volume required the fewest participants when using the RAVLT delayed recall.

Estimated trial sample sizes: MCI population
Mean worsening was detected for the entire MCI population at all time points for each of the clinical outcome measures examined (see, for example, Table 5). Sample size requirements decreased substantially with increasing trial length (data not shown). We chose to focus on 24-month MCI trials (Tables 5 and 6), as this represents a likely scenario for these trials [1], but the results we present were similar at both 12 and 36 months (data not shown). Trials using the CDR-sb required fewer participants than trials using any other outcome measure. Trials using the ADAS12 required fewer participants than did trials using the ADAS11 or any other outcome besides the CDR-sb.
Within enrichment strategies, trials using the CDR-sb consistently produced the lowest required sample sizes (Table 6). The sole exception was trials enriched for FDG PET hypometabolism, which required the fewest subjects using the MMSE as an outcome (n=314, CI: 179-725). Enrichment using CSF criteria yielded the numerically lowest number of subjects required to detect changes in the CDR-sb for a 24-month trial. In every scenario we examined, biomarker enrichment of the MCI population produced lower necessary sample sizes for each outcome measure.

Implications
Few studies have explored the ability to perform AD prevention trials in populations enrolled with no cognitive complaint. Individuals with no demonstrable cognitive abnormality who meet criteria for AD biomarkers may be defined as having "preclinical AD" [61] or as being "asymptomatic at risk for AD" [19]. Trials in persons who meet these criteria are being planned [2,3]. Similarly, a variety of working groups have proposed inclusion of "prodromal AD," MCI patients who meet AD biomarker criteria, in trials [2,5,20] and trials implementing these guidelines are now underway (www.clinicaltrials.gov).
How to best design successful predementia clinical trials is controversial and requires guidance from research studies. This project sought to identify optimal clinical outcome measures and biological enrichment strategies for use in AD trials enrolling asymptomatic or mildly symptomatic participants. We chose to examine continuous outcome measures, rather than "conversion" outcomes, because such measures are likely to provide greater sensitivity and therefore reduced sample sizes for predementia trials. We found that for most outcome measures, 12- and 24-month trials of cognitively normal participants are not realistic. Decline in outcome measure scores for this population at 36 months, when present, was small, and trials to demonstrate a reduction in that decline required very large study populations. A relative exception to this was for trials using the RAVLT total score, which required 1404 participants per study arm. Even so, the CN population did not demonstrate a mean decline from baseline at 12 or 24 months for this outcome measure (data not shown), and longer-term follow-up and confirmation in independent samples of the decline in the RAVLT in CN participants are warranted.
When the CN population was enriched for persons meeting AD biomarker criteria at baseline, decline was observed for outcome measures that did not demonstrate detectable decline for the entire CN population, and the decline observed on the remaining outcome measures was greater. The numeric reductions in the outputs of sample size calculations exceeded 50% in several scenarios (Table 4). Thus, biomarker enrichment increases the efficiency of performing AD clinical trials in asymptomatic patients. The ideal means of enrichment, however, are not yet clear. Determining the specificities and sensitivities of methods to predict future cognitive decline for each biomarker criterion remains an important area of study. In the current exercise, for example, each of CSF Aβ, FDG PET hypometabolism, and hippocampal volume successfully reduced the necessary sample sizes of trials using the CDR-sb as an outcome. In contrast, hippocampal volume and FDG PET failed as enrichment strategies for trials using the RAVLT total, while enrichment for CSF Aβ still produced sample size requirements lower than those of the entire CN population. Determining which enrichment strategy is best for which outcome measure may depend on the specifics of the study population.
Not surprisingly, the overall MCI population produced more consistent decline on trial outcome measures, resulting in consistent calculation of trial sample sizes that were reduced, relative to the estimates based on the CN population. In the MCI population as a whole, the CDR-sb required substantially fewer participants than did all other clinical measures, in line with the observations of others [4,41]. Enrichment of the MCI population by any strategy reduced the needed sample sizes for all outcome measures, most likely resulting from the refinement of the total population to those who manifested prodromal AD. For the majority of enrichment strategies, the CDR-sb continued to require the fewest participants. Also consistent were the lower required sample sizes for the ADAS12, relative to the ADAS11 [4,54]. The single exception to this, and to the substantially lower requirements for the CDR-sb than every other outcome measure by every other enrichment strategy, was in the setting of enrichment for FDG PET hypometabolism. When the MCI population was enriched for FDG PET hypometabolism, the ADAS11 required fewer participants than did the ADAS12 and the MMSE required fewer participants than did the CDR-sb (Table 6).
This study has limitations. It is derived entirely from a single data set, which has been used in a large number of studies with similar objectives [38,41,55,56,58] (see also [8,65]). Within ADNI, subjects are well-educated, primarily Caucasian, and hold favorable attitudes toward research that may in part result from a high prevalence of a family history of AD. Further work modeling predementia trials based on alternate data sources is necessary. The conduct of ADNI-like studies on other continents may present such an opportunity.
We performed no formal comparisons of sample size outputs. Confidence intervals of the estimates are provided but, as has been seen in other studies [26,41], are wide. We also did not incorporate slope models into our study, focusing instead on change-from-baseline calculations. This decision was based on the ongoing debate regarding the appropriate means of incorporating slope analyses into sample size estimates [18,55] and the fact that mean change from baseline is the general practice in AD registration trials. Our methods are also in contrast to the often-used practice of choosing a minimal clinically significant difference (for example, 2 points on the ADAS-cog) and powering a trial to detect such a difference at a given time point. Importantly, in most of the trial scenarios in our results, the 25% drug effect would not achieve a clinically significant difference (for example, 25% of the mean decline for the ADAS11 at 24 months among the overall MCI population was 0.6 points). It is also true that the absolute size of the 25% drug effect would vary among the examined scenarios, as the overall rate of cognitive decline will vary among the different enriched populations. Thus, we do not propose that sample size decisions for predementia trials be based solely on the results of the current study. Rather, we believe that these results may be useful in considering predementia trial design choices, including primary and secondary outcome measures, enrichment strategies, and trial length.
Our analyses compared single biomarker modalities as enrichment strategies. Others have used more integrated methods of enrichment as predictors of cognitive decline in the nondemented ADNI population. For example, McEvoy and colleagues showed that enriching for a composite measure of atrophy based on multiple brain regions resulted in a greater reduction in necessary MCI trial sample sizes (for trials using the CDR-sb or the ADAS11) than did genetic enrichment for ApoE genotype [41]. Similarly, Vemuri and colleagues showed that a composite index of structural abnormalities on MRI better predicted clinical progression on the CDR-sb than did clinical or CSF measures in MCI patients [64]. Using clinically available assays, Heister and colleagues demonstrated that combined use of hippocampal volume measures and psychometric testing better predicted conversion from MCI to dementia than did either volumetrics or cognitive testing alone or in combination with CSF measures [25].
Finally, our results consider only the number of participants that must complete an AD prevention trial. We did not examine the important variables of screen failures, participant recruitment, or patient attrition, all of which have significant impact on trial efficiency, perhaps especially in the setting of asymptomatic AD trials. As is seen in Table 2, each enrichment strategy is associated with a high screen failure rate, ranging from 80% for CSF tTau to 51% for the ratio of pTau/Aβ in the CN population. Thus, in the scenarios that we examined, the total number of CN or MCI participants that would need to be recruited to undergo biomarker testing is frequently quite high. For example, a trial using the RAVLT as a single primary outcome enrolling CN participants who meet CSF Aβ criteria would need to screen 5,450 participants to achieve the necessary sample sizes per study arm (not factoring in study attrition). Were the same study to use the CDR-sb as a single primary outcome, 8,550 participants would need to be screened. Though CSF criteria were frequently met by the CN ADNI population and have been shown to predict disease progression [16,58], asymptomatic participants often cite lumbar puncture as a barrier to trial participation and rated it as the diagnostic modality that they were least likely to be willing to endure in the setting of an AD prevention trial (unpublished results), suggesting that achieving recruitment goals in predementia trials may be challenging. It is also unclear how participants will interpret the information of being eligible (or not) for preclinical AD trials. Thus, no matter what enrichment strategy might be chosen for use in preclinical trials, educational campaigns to facilitate recruitment and protect the welfare of participants will be critical to ensure the successful and ethical conduct of these trials.
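The screening burden discussed above follows from simple arithmetic, sketched here in Python under stated assumptions: a two-arm trial, no attrition, and a screen pass rate equal to the proportion of the population meeting the enrichment criterion. The function name `participants_to_screen` and the per-arm value of 1000 are hypothetical; the 38% pass rate is the proportion reported in the Findings for low CSF Aβ among those who underwent LP.

```python
import math


def participants_to_screen(n_per_arm, fraction_eligible, arms=2):
    """Number of people who must undergo biomarker screening.

    n_per_arm:         required completers per arm (from the sample size formula)
    fraction_eligible: proportion of screened people who meet the criterion
    arms:              number of trial arms (assumed to be 2 here)
    """
    if not 0 < fraction_eligible <= 1:
        raise ValueError("fraction_eligible must be in (0, 1]")
    return math.ceil(arms * n_per_arm / fraction_eligible)


# Hypothetical per-arm requirement of 1000 with a 38% screen pass rate.
screened = participants_to_screen(n_per_arm=1000, fraction_eligible=0.38)
```

A strategy that halves the per-arm requirement but is met by only a quarter as many candidates would double the number screened, so sample size and screen failure rate need to be weighed together.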
In conclusion, our data fall short of suggesting a specific biomarker enrichment strategy as optimal for the design of preclinical AD trials. The CDR-sb appeared to be the ideal outcome measure for trials of MCI populations, whether or not they are enriched for AD biomarkers. The ideal outcome measures for trials of asymptomatic participants remain open to debate, though these results suggest that the RAVLT total score and CDR-sb may be preferable to the ADAS11, ADAS12, MMSE, or RAVLT delayed recall. Replication of these results should be pursued in independent datasets, and a variety of ongoing studies, including the next phase of ADNI, will contribute to the overall understanding of AD biomarkers and their utility in the setting of AD clinical trials.

Table 5. Mean changes ± SD on clinical outcome measures at 24 months of the ADNI MCI population.
Table 6. Required sample sizes per arm for 24-month MCI trials.