The auditory event-related potential is a stable and reliable measure in elderly subjects over a 3 year period

Objectives: Valid markers of psychobiological processes, including changes over the lifespan, must be reliable. This study investigated the reliability of the auditory event-related potential (ERP) over a 3 year period. Methods: Predictable and unpredictable rare tones were embedded in common-to-rare sequences at 3 different ratios (2:3, 2:5 and 2:8). Forty-six older (mean age 72.3 years) volunteers pressed a key to the rare tones, and ERPs (Fz, Cz and Pz) and reaction time (RT) were measured. Reliability across years was assessed using 3 methods: (1) determination of the stability of waveform components (P1, N1, P2, N2 and P3); (2) cross-correlation of successive 15 ms epochs of within-subject ERPs; and (3) cross-correlation of 15 ms epochs of between-subject ERPs. Results: With all analyses, the ERP was stable. Analysis of the scored components indicated that P3 was especially stable in the unpredictable rare (2:8) condition. Earlier components were equally stable across all conditions. Analysis of 15 ms ERP epochs indicated significant ERP stability 60 ms after stimulation, lasting over 640 ms. Conclusions: Robust within-subject reliability of the ERP strengthens its potential use for detecting preclinical changes in at-risk elderly populations. © 2000 Elsevier Science Ireland Ltd. All rights reserved.

Auditory event-related potentials (ERPs) in the target detection or 'oddball' task in young, normal volunteers apparently are reliable within sessions (Sklare and Lynn, 1984; Polich, 1986; Fabiani et al., 1987; Kinoshita et al., 1995, 1996; Maeda et al., 1995), over days (Sklare and Lynn, 1984; Fabiani et al., 1987; Kinoshita et al., 1995, 1996; Maeda et al., 1995), months (Lewis, 1984; Karniski and Blair, 1989; Kinoshita et al., 1995, 1996; Maeda et al., 1995), and 1.8 years (Segalowitz and Barnes, 1993). Reliable P3 components of the ERP also have been reported among 36 schizophrenics after 3 months (r = 0.67–0.85; Hegerl et al., 1988) and, incidentally, in a group of about 25 elderly subjects across 4 tasks (0.44–0.73; Miller et al., 1987). Stable early ERP components to light flashes have been reported over 10 months (Schellberg et al., 1987) and 5 years (Rothenberger et al., 1987) in children. The apparent reliability of the ERP suggests that instances of instability may indicate that there are significant sources of error or reflect the presence of systemic influences (Fabiani et al., 1987; Kileny and Peters-Kripal, 1987; Karniski and Blair, 1989; Simons and Miles, 1990). For instance, in a multiyear study of ERP reliability (van der Wal and Sandman, 1992), sharp decreases in reliability between successive years predicted death for 7 otherwise healthy elderly subjects. Reliability of the ERP in these subjects was greater than 0.80 in consecutive years except for the year before death.
As reviewed by Simons and Miles (1990) and Segalowitz and Barnes (1993), insufficient attention has been focussed on issues of reliability and stability in ERP research. This is an important issue which has continuing relevance to the use of ERPs in clinical assessment and in the study of trait and state effects (Segalowitz and Barnes, 1993). Additionally, the research has focussed on relatively short time intervals, with the longest being around 14 months (e.g. Sinha et al., 1992), and with few exceptions (Pollack and Schneider, 1992) on younger age groups. At least two general procedures have been applied to test ERP reliability. The most common method compared the scored peaks (especially P3) of the ERP at two different times. This between-subject procedure produces a measure of the tendency for individuals to maintain their rank within a group across time (Simons and Miles, 1990). The second method cross-correlated individual ERPs recorded at different times to produce a measure of intra-subject reliability (Lewis, 1984; Fabiani et al., 1987; van der Wal and Sandman, 1992). The cross-correlation procedure has the advantage of being independent of inevitable sources of error related to detection of ERP components (Lewis, 1984). In the current study, both of these procedures, in addition to a within-subjects cross-correlation procedure, were used to compute measures of stability and reliability over a 3 year period in an elderly population.

Subjects
Forty-six subjects (29 women) were selected from 159 elderly individuals (entry age range 60–86 years, mean age 72.3 years) participating in a study on effective aging. All subjects accepted into the program were healthy and living independently. Subjects were administered extensive medical examinations prior to entry into the research program and on a yearly basis thereafter. Each subject was evaluated with the ERP at yearly intervals for 3 consecutive years.

Procedure
As described by van der Wal and Sandman (1992), subjects were tested in an electrically shielded, sound-attenuating chamber and reclined in a comfortable chair as electrodes were applied to the scalp. Binaural headphones were placed over the ears and white noise (72 dB) from floor speakers masked extraneous noise. Subjects were monitored continuously with video and audio equipment. In a dual rare-event procedure (Sandman et al., 1990), target tones were presented at predictable (fixed, 450 Hz) infrequent intervals, at identical infrequent, but unpredictable (random, 600 Hz) intervals, and at frequent (common, 550 Hz) intervals.
Three versions of the dual rare-event procedure were presented to all subjects in balanced order. The 3 versions varied the probability of the infrequent tone. Examples of the 3 versions are:
2:3 FCCFRCFRRFCRF…
2:5 FCRCRFCCCCFCCRCFCCCCFRCR…
2:8 FCRCCCCCFCCCCCCFRCRCCCFCRCRCCF…
In these examples, F is the predictable (fixed), R is the unpredictable (random) and C is the frequent (common) tone. The fixed target always was the third (2:3), fifth (2:5) or eighth (2:8) tone. The random target occurred at the same probability as the fixed target, but its position in the sequence was not predictable.
The pure tones had an 18 dB contrast with background (90 dB SPL against a white noise background of 72 dB SPL, measured at the headphone cone with a Bruel and Kjaer Model 2203 sound level meter) and were 100 ms in duration with a rise time of 50 ms. Pre-testing at yearly intervals determined that all subjects included in the analysis were able to discriminate the tones. Subjects practiced depressing a hand-held key each time they heard the random target (matched to the dominant hand) or the fixed target (matched to the non-dominant hand). They were instructed not to respond to common tones and to keep their eyes closed.

EEG procedures
Recordings were made with a Grass polygraph, Model 79, with amplifier settings at 0.30 and 100 Hz. Data were digitized and stored for off-line analysis on a MINC 11/23 computer (Digital Equipment Corp). Tones were presented by a microcomputer interfaced with a PDP system. Gold cup electrodes filled with EC-2 creme (Grass) were placed according to the International 10–20 System at Fz, Cz and Pz referenced to linked mastoids. Electrode impedances were matched within 1 kΩ and all were below 10 kΩ.

Event-related potential analysis
The EEG was sampled for 1280 ms at 200 Hz. Baseline was determined by a 280 ms pre-stimulus average. Waveforms for 44 sequences of random, fixed and common tones were collected and averaged. An analog filter conditioned the signal prior to digitization (12 dB/octave, 3 dB at 100 Hz), minimizing aliasing and phase errors.
The latencies and peak-to-peak amplitudes of the major components were located with an automated system that placed cursors on a minimum and maximum voltage for waveform troughs and peaks, respectively. The ERP waveforms were evaluated by trained technicians to identify the prominent peaks within specific latency windows (P1, 30–80 ms; N1, 70–200 ms; P2, 130–280 ms; N2, 180–360 ms; P3, 250–600 ms). Spectral interpolation was applied for measurement of the latency of waveform peaks. The waveform was approximated as a sum of sinusoids (the Fourier coefficients) resulting in unlimited temporal resolution and accuracy for frequencies up to 25 Hz (the Nyquist frequency).
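The sampling and baseline figures above can be illustrated with a minimal sketch. It assumes that the 280 ms pre-stimulus interval forms the first part of each 1280 ms sweep, which the text does not state explicitly; the function names are hypothetical and are not taken from the original acquisition software.

```python
SAMPLE_RATE_HZ = 200   # from the text
SWEEP_MS = 1280        # from the text
BASELINE_MS = 280      # from the text; assumed to be the start of the sweep


def samples_for(duration_ms, rate_hz=SAMPLE_RATE_HZ):
    """Number of samples covering duration_ms at rate_hz."""
    return duration_ms * rate_hz // 1000


def average_trials(trials):
    """Point-by-point average of equal-length single-trial sweeps."""
    n = len(trials)
    return [sum(column) / n for column in zip(*trials)]


def baseline_correct(waveform, baseline_ms=BASELINE_MS, rate_hz=SAMPLE_RATE_HZ):
    """Subtract the mean of the pre-stimulus samples from every sample."""
    n_base = samples_for(baseline_ms, rate_hz)
    baseline = sum(waveform[:n_base]) / n_base
    return [value - baseline for value in waveform]
```

Under these assumptions a sweep holds 256 samples, of which the first 56 define the baseline.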

Artifact rejection
Trials contaminated by eye movement were automatically rejected by computer software. The efficacy of the rejection system was determined by separately testing subjects asked to make lateral eye movements (eyes closed) to a series of 10 tones in each condition (van der Wal and Sandman, 1992). Eye movement was verified by electrodes attached to the outer canthus and suborbit referenced to linked mastoids. The software system correctly identified 98.4% of the lateral eye movements in the EEG. Eye blinks were detected on 100% of the trials but blinks were infrequent in the closed eyes procedure.
Trial sequences were repeated automatically if reaction times to fixed or random target rare tones exceeded 1500 ms, if subjects pressed a key to the common tone, or because of artifact in the EEG. An artifact was defined as a response contaminated with eye blinks or muscle movement or a response exceeding ±50 µV. No subject had more than 5 trial sequences repeated and fewer than 2% (less than one trial per subject) of the trials across all subjects were repeated.

Statistical analysis
Reliability over the 3 testing sessions was assessed using 3 methods: (1) determination of the stability of waveform components (P1, N1, P2, N2 and P3) identified as the maximum amplitude within a specified time window; (2) cross-correlation of successive 15 ms epochs of within-subject ERPs across years; and (3) cross-correlation of 15 ms epochs of between-subject ERPs across years.
The stability of waveform components was determined by calculating, for each measured peak of the ERP, Pearson Product Moment correlations between the pairwise comparisons of the 3 testing sessions (year 1 with year 2, year 2 with year 3, and year 1 with year 3) for both latency and amplitude measures.
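The pairwise comparison described above can be sketched as follows. This is an illustration only: the dictionary layout (year mapped to one score per subject, in a fixed subject order) and the function names are assumptions, not the authors' software.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_stability(scores_by_year):
    """Correlate one scored measure (e.g. P3 amplitude) between every pair of years.

    scores_by_year maps a year label to a list with one value per subject,
    with subjects in the same order for every year (hypothetical layout).
    """
    years = sorted(scores_by_year)
    return {
        (a, b): pearson_r(scores_by_year[a], scores_by_year[b])
        for a, b in combinations(years, 2)
    }
```

With three sessions this yields the three comparisons named in the text: year 1 with year 2, year 1 with year 3, and year 2 with year 3. The same call would be repeated for each component, for latency and for amplitude.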
The within-subjects correlation analysis measured the stability of the shape of the intra-individual wave over years. Pearson Product Moment correlations of intra-subject waveforms over years were compared separately for every placement, condition and phase. The correlation coefficient was computed by comparing corresponding points (15 ms epochs) on the waveform for each subject for each year. Correlations were generated for each subject for 3 placements (Fz, Cz and Pz) × 3 phases (2:3, 2:5 and 2:8) × 3 conditions (random, fixed and common), and for each comparison (year 1 with year 2, year 2 with year 3, and year 1 with year 3), resulting in 81 correlations per subject per 15 ms epoch. These data were summarized in two ways. First, the total number of significant correlations for each condition was tested against chance. Second, the correlations were converted to z scores, summed and averaged, and then reconverted to r values representing averaged correlations (see Table 2).
The between-subjects analysis measured the reliability of the test to produce changes over years. In this analysis, Pearson Product Moment correlations of ERP waveforms across subjects over years were compared separately for every placement and condition. This correlation coefficient was computed by comparing 15 ms epochs across subjects for each year (year 1 with year 2, year 2 with year 3, and year 1 with year 3). Correlations were generated for 3 placements, 3 conditions, and for each year comparison. These correlations were computed for the 2:8 phase only, resulting in 27 total correlations per 15 ms epoch.
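A minimal sketch of the within-subject procedure follows. It assumes that a 15 ms epoch corresponds to the mean of 3 consecutive samples at 200 Hz and that the conversion to z scores is the Fisher transformation; neither detail is stated in the text, and all names are hypothetical.

```python
import math
from itertools import product

SAMPLES_PER_EPOCH = 3  # assumption: 15 ms = 3 samples at 200 Hz (5 ms per sample)


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def to_epochs(waveform, k=SAMPLES_PER_EPOCH):
    """Average consecutive groups of k samples; a trailing partial group is dropped."""
    usable = len(waveform) - len(waveform) % k
    return [sum(waveform[i:i + k]) / k for i in range(0, usable, k)]


def waveform_stability(wave_a, wave_b):
    """Correlate one subject's epoched waveform from two different years."""
    return pearson_r(to_epochs(wave_a), to_epochs(wave_b))


def mean_r_fisher(rs):
    """Average correlations via Fisher z (atanh), then convert back to r (tanh).

    Each r must lie strictly between -1 and 1, otherwise atanh is undefined.
    """
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


# One correlation per placement x phase x condition x year comparison.
DESIGN = list(product(
    ("Fz", "Cz", "Pz"),
    ("2:3", "2:5", "2:8"),
    ("random", "fixed", "common"),
    ((1, 2), (2, 3), (1, 3)),
))
```

The length of `DESIGN` reproduces the 81 correlations per subject given in the text. Averaging in z rather than r space weights high correlations more heavily, which is the usual reason for the transformation.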
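The between-subjects analysis can be sketched in the same way. Here each epoch is treated separately, and the amplitudes of all subjects in one year are correlated with their amplitudes at the same epoch in another year. The nested-list layout and function names are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def epoch_reliability(year_a, year_b):
    """Return one across-subject correlation per epoch.

    year_a and year_b are lists of per-subject epoch amplitudes, with the
    same subject order and the same number of epochs in both years.
    """
    n_epochs = len(year_a[0])
    return [
        pearson_r([subj[e] for subj in year_a], [subj[e] for subj in year_b])
        for e in range(n_epochs)
    ]


# Toy data: 4 hypothetical subjects, 2 epochs each (values are invented).
toy_year_a = [[1, 5], [2, 3], [3, 8], [4, 1]]
toy_year_b = [[2, 1], [4, 8], [6, 3], [8, 5]]
toy_result = epoch_reliability(toy_year_a, toy_year_b)
```

In the toy data the subjects keep their rank order at the first epoch and not at the second, so only the first epoch gives a high positive coefficient. Repeating the call for 3 placements, 3 conditions and 3 year comparisons gives the 27 correlations per epoch described in the text.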

Results
Fig. 1 shows averaged ERPs for the fixed, random and common conditions at Fz, Cz and Pz grand averaged over subjects for the 2:8 phase. As illustrated in Fig. 1, the random target elicited the largest peak-to-peak amplitude of the P3 wave. The peak-to-peak P3 component was attenuated in the fixed condition and absent in the common condition.

Between-subject reliability of scored ERP components

Intra- and inter-rater reliability
In order to determine scorer reliability, two trained raters blindly scored the latency and amplitude of P3 waveforms in the random 2:8 phase. Each rater scored the same 46 records on two different occasions without knowledge of previous results, yielding both inter- and intra-rater coefficients. Both intra-rater (0.67–0.86, P < 0.0001) and inter-rater (0.67–0.77, P < 0.001) reliability estimates of P3 amplitude were highly significant. Rating of P3 latency was less consistent (r = 0.37–0.69, P < 0.01).
Fig. 2 presents reliability coefficients calculated for 1 and 2 year intervals for the amplitude of the ERP recorded from Fz, Cz and Pz during the fixed and random conditions. Highly reliable P3 amplitudes were apparent especially in the random condition for all 3 rare stimulus ratios for the 1 and 2 year intervals. For each interval, the reliability of P3 in the random condition increased as the probability of a rare target decreased (i.e. changed from 2:3 to 2:8). Inspection of the correlations revealed the opposite pattern in the fixed condition because the least frequent targets resulted in the lowest correlations (poorest reliability).
Fig. 2 also shows the reliability coefficients calculated for 1 and 2 year intervals for the latency of the ERP recorded from Fz, Cz and Pz during the fixed and random conditions. The latency of P3 was not reliable over 1 or 2 year intervals in any condition, stimulus probability or electrode location. Although several reliability coefficients for other components reached statistical significance for latency, the variance shared between test intervals for P3 latency rarely exceeded 25% and was universally much lower than for amplitude.

Reliability of early compared to late ERP components
Although the focus of the study was on the reliability of the P3 wave, the reliability of amplitude for the other major ERP components (P1, N1, P2 and N2) also was examined and is presented in Fig. 2. The reliability of N1 and P2 amplitudes was significant for both testing intervals, all placements, conditions and stimulus probabilities. The amplitude of N2 was less consistent, achieving the highest reliability in the 2:5 condition at Cz.
The differential reliability of N1 and P3 to the fixed and random rare targets can be seen in Fig. 3. The reliability of N1 is virtually identical for both targets, regardless of stimulus probability. Reliability of P3 is dependent on stimulus probability (it increases as stimulus probability decreases) and predictability (reliability is very high for unpredictable targets but low for predictable targets). The differential effect of the fixed and random target conditions on the N1 and P3 components was tested in an ANOVA (component × target type).

Stability of reaction time (RT)
The reliability of RT over both 1 and 2 year intervals was statistically significant (Table 1). However, the correlation coefficients were lower than for P3, especially for comparisons with the first year. For all 3 phases, in both target conditions, reliability between years 2 and 3 was highest. Unlike measures of P3, reliability of RT was comparable in both the fixed and random target conditions.

Years 1 and 2
Correlations of 15 ms epochs on the ERP waveform across subjects are plotted in Fig. 4 for each electrode site and for the fixed, random and common conditions for the 2:8 phase. Over Fz, the correlations for the random target were significant after 96 ms and reached highly significant levels (r > 0.80) between 430 and 450 ms. For the fixed target, significant reliability was reached by 60 ms and maximal reliability was evident around the N1 (140–190 ms) component. The common tone also was highly reliable but the values were slightly lower. The pattern was similar for Pz waveforms. However, significant values for the random target were not achieved until 128 ms and the correlations were lower than for Fz. The correlations reached maximal levels around the N1 and P3 complexes, dropped, and then increased (r > 0.50) after 700 ms. The pattern at Pz for the fixed tone was identical, but slightly lower than at Fz. Reliability of the common tone was somewhat less over Pz, especially around the N1-P2 transition. Similar patterns characterized the Cz placement.

Years 2 and 3
Virtually the same pattern of stability existed between years 2 and 3 as with years 1 and 2. The earliest evidence of stability was to the fixed target (60 ms) compared to stability by 90 ms for the random target and the common tone. For all 3 placements, but especially for Pz and Fz, early (90–160 ms) latency waves in the fixed condition were highly reliable. Response to the random target also was highly significant early (130–160 ms), middle (200–270 ms) and late (384–512 ms) at all placements. Although responses to the common tone at all placements were significant, the magnitude of correlation was lower than for the target tones (Fig. 4).

Years 1 and 3
Although the majority of correlations were highly significant (P < 0.01), the pattern was different and the degree of association was less than for correlations between ERP responses in contiguous years. Maximal stability for the response to the fixed target was apparent over the N1 complex, at all electrode locations, as with previous analyses. Responses to the random target remained stable, especially beyond 384 ms. Conversely, the reliability of the common tone fell below the P < 0.01 significance level after 384 ms at all 3 locations (Fig. 4).

Within-subject reliability of ERP epochs across years
For the analysis of the 2:8 phase, 1145 correlations of a total 1242 possible (91%) were significant at the P < 0.01 level. Binomial distribution indicated that this was a highly significant outcome. Responses to all tones were equally reliable (random, 378 of 414 (91%) significant at P < 0.01; fixed, 385 of 414 (93%); and common, 382 of 414 (92%)). Inspection of reliability between year 1 and years 2 and 3 showed that correlations were lower than estimates between years 2 and 3. In order to determine if the age of the subject was related to the magnitude of the reliability coefficients, pairwise correlations between age and the relationship between years 2 and 3 were computed for the 2:8 phase and the random condition for Pz. The correlation was not significant (r = 0.04).
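The binomial comparison against chance can be sketched as an upper-tail probability. The sketch assumes a chance rate of 0.01 per correlation (matching the P < 0.01 criterion) and independent tests; the text states neither, so both are labelled assumptions. The calculation is done in log space because the probability is far too small for ordinary floating point.

```python
import math

CHANCE_RATE = 0.01  # assumption: probability that one test is significant by chance


def log_binom_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def log10_binom_tail(k, n, p):
    """log10 of P(X >= k), summed with the log-sum-exp trick to avoid underflow."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    total = peak + math.log(sum(math.exp(v - peak) for v in logs))
    return total / math.log(10)


# Counts reported in the text: 1145 significant correlations of 1242 possible.
log10_p = log10_binom_tail(1145, 1242, CHANCE_RATE)
```

Under these assumptions `log10_p` is a very large negative number, in keeping with the description of the outcome as highly significant. Because correlations from the same subject are unlikely to be fully independent, the value is best read as an order-of-magnitude illustration.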

Years 1 and 2
For the 2:8 phase across electrode placements, 373 of 414 (91%) of the correlations for waveform shape were significant at P < 0.01. A high proportion of the correlations over years were significant for the random (121 of 138 (88%)), fixed (128 of 138 (93%)) and common (124 of 138 (90%)) tones. Lower reliability was associated with posterior placements for the fixed and random targets, but not the common tone. Inspection of the correlations for individual subjects showed that the source of ERP instability was a combination of 3 subjects with low reliability over all placements and 8 subjects with one unreliable electrode placement.

Years 2 and 3
In the last 2 years of the study, 402 of 414 (97%) correlations were significant at P < 0.01. Of the 12 non-significant correlations, 9 were shared by two subjects.

Years 1 and 3
Of the 414 possible correlations between years 1 and 3, 374 (90%) were significant at P < 0.01. Stability was comparable for the fixed (123 of 138 (89%)), random (125 of 138 (91%)) and common (126 of 138 (91%)) tones. As in the previous analysis, lower stability was associated with posterior placements. Table 2 presents the averaged correlations between years 1 and 2, 2 and 3, and 1 and 3 for the 2:8 target probability condition. The reliability at all placements and conditions for all tones was highly significant. Inspection of the correlations showed that reliability coefficients were highest between contiguous years, especially years 2 and 3. However, the overall stability of the waveforms was remarkably high, as was apparent from the shape of the averaged waveform over years (Figs. 1 and 4). In Fig. 5, individual ERPs over Pz for all 3 years are illustrated for one subject with highly reliable (r > 0.95) waveforms (Fig. 5A), one subject with moderately reliable (r = 0.68) waveforms (Fig. 5B), and one subject with unreliable (r > 0.04) waveforms (Fig. 5C).

Discussion
The purpose of this study was to explore the stability of the ERP, especially the P3 component, in an elderly group over a 3 year period. To our knowledge this is the first formal, multi-year evaluation of reliability of the ERP in an elderly population. This is surprising because the ERP, and especially P3, is often proposed as a valid index of information processing and brain maturity (Pfefferbaum et al., 1984a; Rosler et al., 1986; Hillyard and Picton, 1987; Donchin and Coles, 1988). The stability of the ERP was determined over a 3 year period with 3 different methods: (1) between-subject reliability of ERP components; (2) between-subject reliability of discrete 15 ms epochs of the waveform; and (3) within-subject reliability of discrete 15 ms epochs of the waveform. For each of the 3 methods, the ERP waveform, including P3, was highly reliable over years, especially for the amplitude measures. These results provide support for the use of P3 as a valid measure of cognitive activity in the elderly.

Component reliability
The reliability of ERP components is initially dependent upon their reliable identification. Kramer (1985) concluded that trained technicians were capable of reliably recovering the structure of simulated ERPs. In the current study, both inter- and intra-rater reliability estimates showed that two trained scorers agreed significantly in their scoring of P3 amplitude, and that each rater was consistent across two scoring sessions. Our finding that ratings of P3 latency were less consistent than amplitude agrees with previous findings (Sklare and Lynn, 1984; Polich, 1986; Fabiani et al., 1987; Segalowitz and Barnes, 1993; Kinoshita et al., 1995), and probably reflects the variability of P3 over the different experimental conditions, and the long duration of this component. The lower amplitude P3 elicited by the fixed condition and as a function of the probability of the target, for example, would make the selection of the appropriate point to determine its latency more difficult.
Although the reliability of each of the major components of the ERP (P1, N1, P2, N2 and P3) was evaluated in this study, the procedures were designed to elicit P3. All 3 rare-event phases resulted in stable P3 components over 3 years, but the 2:8 phase produced the largest and most reliable P3. The amplitude of the late P3 component in this phase was as stable over years as the earlier components, N1 and P2. Typically, the late components are more variable than early ones because they are controlled more by endogenous factors, such as evaluation of the environment and decision making, than by exogenous factors, such as the physical features of external stimulation (Donchin, 1979, 1981; Hillyard and Picton, 1987; Donchin and Coles, 1988). The stability of the P3 response observed in this study argues for its validity as a neurological marker of individual differences in cognition. Our finding that P3 was more reliable for rare events argues for the use of this methodology when P3 is used to study trait and state effects.
Similarly, relative to the random condition, the stability of P3 declined in the fixed condition for all phases, probably reflecting the smaller and less detectable P3 responses. Larger amplitude P3 responses have previously been observed to unpredictable as compared with predictable stimuli (Hillyard and Picton, 1987; Sandman et al., 1990), and the greater reliability of the ERP elicited by rare or target tones compared with predictable or frequent tones may result from the more favorable signal-to-noise ratio of the larger amplitude components (Fabiani et al., 1987). The fixed condition is certainly not inherently unreliable, however, because the exogenous early components remained very stable (e.g. Fig. 2). Another possibility is that subjects profited from repeated exposures to the predictable fixed condition, resulting in diminished P3 amplitudes across years. This possibility is supported by the results for RT because, despite increasing age through the course of the study, RT decreased over years. Measures of RT also were stable over years, but not as stable as ERP amplitudes. The stability of the ERP amplitude measures over 3 years provides support for their utility in studies of individual differences in attention and cognition.

Waveform stability
The advantage of waveform analysis was that it removed decision errors related to interpreting and scoring of components. In the first analysis, the stability of the ERP between subjects was examined every 15 ms. The results of this analysis indicated the extent that an amplitude occurring at a discrete 15 ms interval in 1 year predicted amplitude at the same interval in a subsequent year. Significant correlations determined that within the 3 year period, subjects retained their relative position at each 15 ms interval. Remarkable consistency was detected for conditions, phases and placements as early as 60 ms after stimulation and was sustained for 700–800 ms.
In the second analysis of waveform stability, within-subject correlations of waveforms were computed. A significant correlation in this analysis indicated that subjects retained the shape of their waveforms over years. This analysis was sensitive to variation over years in individual subjects, and as such may be most useful as a marker of change. The results indicated that most subjects had very significant correspondence between waveforms over years. Average correlations were greater than 0.80, supporting the shape of the ERP as a very reliable measure of individual differences. Because non-linear change over time, or low reliability over repeated testing, has been proposed as an index of cognitive decline, dementia, and even death in elderly subjects (Rosen et al., 1986; Storandt et al., 1986; Wilson and Kaszniak, 1986; van der Wal and Sandman, 1992), longitudinal assessment of waveform stability also may be a sensitive indicator of early stages of insidious processes.
The ability of the ERP to provide information about brain function independently of a motor response makes it especially useful in the assessment of dementing processes, since some individuals are not able to provide a behavioral response such as reaction time. The findings of this study that P3 is a reliable index of information processing in the target detection task support its use as a valid measure of brain function in aging and age-related disease processes.