The development of noninvasive brain imaging techniques over the last 20 years has led to rapid growth in our understanding of brain function and structure. A major challenge for developmental researchers has been to develop infant-friendly neuroimaging methods. In particular, the development of near-infrared spectroscopy (NIRS) for functional brain imaging (fNIRS) in infants has been a welcome addition to the very limited choice of methods currently suitable for use in awake infants. Over the last decade, fNIRS has become established as an easy-to-use, relatively transportable, and low-cost brain imaging technique. For many years, the primary choice for functional imaging in awake infants has been electroencephalography (EEG), a noninvasive technique with high temporal resolution but relatively poor spatial resolution. A major advantage of fNIRS compared with EEG is that it is less susceptible to data corruption by movement artifacts and offers more highly spatially resolved images of activation, allowing the localization of brain responses to specific cortical regions. fNIRS is similar to fMRI in that it can measure the hemodynamic response to neuronal activation. The spatial resolution and depth sensitivity are lower than those of fMRI;1,2 however, this has not prevented the technique from finding widespread use as a neuroimaging tool where other techniques are not practically applicable. Specifically, the use of fNIRS to study functional brain activation in infants is a rapidly growing research area.3,4 To date, the technique has been used to address developmental topics such as object processing,5 social communication,6–8 human action processing,9,10 and face processing,11 and it has recently been extended to research on atypical trajectories of brain development, such as in developmental disorders.12,13
A recent shift in the use of fNIRS has been toward the study of the infant brain on an individual level.14–16 This form of analysis is particularly important in prospective longitudinal studies of infants at risk, as it enables the comparison of brain activity with behavioral and demographic data across a variety of measures. Furthermore, the assessment of individual differences in infants’ responses is necessary for the discovery of early warning markers in infants at risk for compromised neurodevelopment17 and consequently for the development of prodromal interventions. However, in order for us to accurately measure individual differences in brain activation, it is essential to first identify the factors influencing reliability and then to quantify their contribution to measurement variability. Hence, reliability is a crucial issue in functional activation measurements, as the ability to detect individual differences will be compromised if the reliability of the method is questionable.
Studies of retest reliability in adults have been conducted with other imaging techniques such as fMRI18,19 with a wide range of reported values of reliability depending on the number of participants in the study, the number of task runs, and the tasks used to test reliability.20 Reliability studies have also been conducted with EEG21–23 in adults, showing strong reliability of imaging measurements. Test–retest studies on fNIRS have been published on adults in muscle24,25 and brain function26–30 studies. However, to our knowledge, there are no fNIRS reliability studies published with infants. Comparisons of group fNIRS data across different publications can be difficult because of variations in stimuli, testing designs, probe placements, criteria for data rejection, signal processing, and statistical analysis methods.31 Longitudinal studies in the same individuals can allow for standardization of some of these sources of variation and therefore provide more appropriate data from which to draw conclusions about the reliability of fNIRS data. Once known, these measures of reliability in young populations will allow us to establish whether the technique provides sufficiently robust measures of individual differences to establish longitudinal associations in human development. Given that the number of published infant fNIRS studies now exceeds 100, it is surprising that test–retest reliability analyses have thus far not been undertaken. However, this may in part be due to the fact that infants can rapidly habituate to repeated stimuli or task demands and, in contrast to adults, cannot be asked to attend on demand, making repeated sessions vulnerable to lost trials and poor compliance.32 Further, infants are capable of remembering events and retain memories of these from very early in life.
At 4 to 8 months, they retain the memory of a single task for a few weeks or longer with reminders.33 Thus, in a test–retest study of young infants, it is safer to increase the retest interval from a few weeks to a few months in order to ensure the best repeatability of the construct and improve participant compliance with the study. In support of this approach, previous test–retest data from adults show that stimulus-specific decreases in the cortical response with repeated exposure are evident when a short retest interval is used (3 weeks) but not with a long interval (up to 53 weeks).29 While there are also limitations to collecting data from more distant test sessions (several months apart), this approach has the critical advantage of better data quantity and quality at the second test session. Therefore, we investigated the test–retest reliability of measuring hemodynamic brain responses using fNIRS with a cohort of infants who were participating in a longitudinal study over a 9-month period.
The current work aimed to investigate the following questions. First, how replicable are the significant group effects across two data acquisition sessions? This will be assessed by how many fNIRS channels show significant hemodynamic responses during a functional paradigm, and how similar the spatial group maps of activation are across two time points. Second, how replicable are the significant hemodynamic responses within individual infant data across two time points? This will be assessed by measuring the similarity in spatial maps at the individual participant level. And third, how replicable are the measured signal changes (of the hemodynamic time course as a whole) at group and individual levels in repeated sessions? This will be assessed by comparing time courses and variability of the acquired data between the sessions.
Materials and Methods
The data for this analysis was retrospectively selected from a group of infants who were enrolled in a longitudinal fNIRS study in The Gambia.34 The number of participants recruited for the original study was 42, and the 13 infants included in the study were selected based on availability of valid data for two data acquisition sessions. From the original 42 infants recruited, 18 were excluded from Session 1 due to insufficient number of valid trials according to looking time measures (seven infants), experimenter error (seven infants), or tiredness/fussiness (four infants). Of the remaining 24, one infant died before Session 2 took place, one family moved away from the region, and a further nine participants were excluded from Session 2 due to an insufficient number of valid trials as assessed by looking time (four infants) or tiredness/fussiness (five infants). Session 1 was conducted when the infants were 4 to 8 months of age (), whereas Session 2 was conducted when the infants were 12 to 16 months of age (), and the average retest interval was 8.5 months ().
Participants were identified from the West Kiang Demographic Surveillance System.35 All infants were born full term (37 to 42 weeks’ gestation) and with normal birth weight. Ethical approval was given by the joint Gambia Government/MRC Unit The Gambia Ethics Committee, and written informed consent was obtained from all parents/carers prior to participation.
Details of the experimental design are described in previous publications.14,34,36 Infants wore custom-built fNIRS headgear consisting of an array over the right hemisphere (see Fig. 1), containing a total of 12 channels (source–detector separation: 2 cm), and were tested with the UCL optical topography37 system. Note that measurements were restricted to the right hemisphere as (1) our funding only allowed for a restricted number of sources and detectors with respect to the NIRS system used in the UK and (2) we localized the channels to one hemisphere to ensure we could measure the full extent of the temporal lobe. This system uses near-infrared light of two different wavelengths (780 and 850 nm). Before the infants began the study, head measurements were taken to align the headgear with 10-20 coordinates.14 Measurements from this group of infants showed that the average head circumference was 41.3 cm () in Session 1 and 44.2 cm () in Session 2. The headgear was placed over the right hemisphere with the source optode between channels 4 and 7 centered above the preauricular point (directly over T4 according to the 10-20 system). The angle of the positioned array was guided by the headband, which was placed on the head so that it touched the top of the ear (where the ear joins the head) and lay over the brow line of the infant (through Fp1 and Fp2). According to the head measurements of the 4- to 16-month-old infants in the current study, in this position the most anterior optode was positioned approximately over F8. Though the head circumference for this age range is smaller in this Gambian population compared with WHO standards, the relative increase in size between the two age points is similar.
The experimental protocol was identical in both sessions. This experimental design had been successfully used in previous studies in the UK, to investigate responses to auditory and visual social stimuli in typically developing infants and to compare responses with infants at risk for developmental disorders.12,14 Infants sat on a parent’s lap in front of a screen. The parent was instructed to refrain from interacting with the infant during the stimuli presentation unless the infant became fussy or sought their attention. The conditions alternated one after the other, with a period of baseline between each. Three types of conditions [visual-social (silent) V-S, auditory vocal V, auditory non-vocal N-V] were presented in the same order across infants in a repeating loop (V-S, N-V, V, V-S, V, N-V) of trials (single presentation of a condition). For the current work, we focused on one of the three experimental conditions—auditory vocal—which consisted of full-colour, life-size videos of human motion (i.e., “Peek-a-boo”) displayed for 9 to 12 s (average 10 s), accompanied by human vocal sounds (i.e., yawning, crying, laughing) for a duration of 8 s. Each trial consisted of four different sounds presented for 0.37 to 2.92 s each, interleaved by short silence periods of 0.16 to 0.24 s. Vocal stimuli were chosen from the Montreal Affective Voices (for more detail, see Ref. 38) and the stimuli of the voice functional localizer.39 We measured activation during presentation of this experimental condition compared to the baseline condition, which consisted of nonhuman still images (i.e., cars and houses) presented randomly for a pseudorandom duration (1 to 3 s) for 9 to 12 s (average 10 s) with silence. The trials were presented until the infants became bored or fussy as judged by the experimenter who was monitoring their behavior. 
On average, participants looked for 5.61 experimental auditory-vocal trials in Session 1 and 6.54 in Session 2 (no significant difference between the two sessions).
Behavioral Data Processing
Each session was videorecorded in order to code offline infant behavior and compliance with the study. A researcher unfamiliar with the study’s aims carried out behavior coding from these videos. Due to resource limitations at the time of testing, videos were recorded differently at Session 1 and Session 2. As a result, in Session 1, it was possible to synch them with the start and end of the study, but not with the start of each individual trial, as was done in Session 2. During Session 1, the whole session was coded (but without a record of trial) and data was considered valid when the infant watched for session (in addition, the experimenter noted invalid trials online during the study when the infant looked away), whereas in Session 2 in addition to this coding, the trial transitions were also videoed so the session could be coded trial by trial and data considered valid if the infant watched for of each individual trial (as used in previous work9). In Session 2, 12 out of 13 of the infants show the same validity coding for session coding versus trial by trial coding [we included 1 infant in Session 2 who had valid data in the experimental condition under consideration (vocal) but not in the other two experimental conditions]. Furthermore, online experimenter coding was highly reliable, with trial by trial experimenter coding matching the video coding in 10 out of 13 of the infants in Session 2, with 1 invalid trial not coded by the experimenter online in three infants. Therefore, we can be confident that the session video recording and experimenter coding in Session 1 were sufficient.
fNIRS Data Processing
Changes in HbO2 and HHb chromophore concentration () from baseline to experimental condition were calculated and used as hemodynamic indicators of neural activity.40 The same differential pathlength factor (DPF) was used across the two age points41 (), as the variability of DPF with age for each wavelength was minimal.
The data were low-pass filtered and divided into blocks that consisted of 4 s of prestimulus onset baseline, followed by the experimental trial and, after that, a whole trial of baseline (9 to 12 s in length). Each block was detrended by fitting a straight line between the average signal value in the prestimulus onset period and the average signal value over the last 4 s of the block, which correspond to the last part of the subsequent baseline trial. The detrending procedure brings the start and end points of each block to zero, so the HbO2 and HHb values reflect an increase or decrease from that reference value.3 Measurements for each infant were analyzed, and trials, channels, or participant data were rejected from further analysis in a two-step preprocessing protocol: first, by looking time measures, and second, by the quality of the signals as assessed by artifact-detection algorithms (which either excluded the data of whole channels per infant or data from individual trials within a channel, according to the magnitude of the artifact).3,42 Criteria for channel rejection included: (1) the coefficient of variation (CV) of the signal (channels were excluded if the CV of the attenuation measurement for either wavelength exceeded 10%, possibly due to movement of the arrays and hat) and/or (2) high-frequency noise beyond the limits of physiological effects, where the normalized high-frequency power is greater than 35% of the total power of the signal.43 For each infant, the channels that survived these rejection criteria were analyzed for trial selection. The trial selection analysis identified sharp changes in the signal caused by sudden movements. This was applied following data conversion from attenuation to concentration data.
Trials that contained changes in concentration that exceeded a predefined range ( during baseline and during the experimental trials where artifacts in the signal may occur in addition to activation) were removed from the data set. These thresholds were set according to experience with the current array design over the past 8 years. The minimum number of valid experimental trials for each channel was 3. At group level, the grand averaged hemodynamic responses () of all infants were calculated and the maximum change (or amplitude) in HbO2 (increase in chromophore concentration) and/or HHb (decrease in chromophore concentration) was assessed during the experimental condition relative to baseline within a time window selected between 8 and 16 s poststimulus onset for each trial. This period of time was selected to include the range of maximum concentration changes observed across infants for HbO2 and HHb. Two-tailed t-tests were used to test the statistical significance of the change. Either a significant increase in HbO2 concentration or a significant decrease in HHb is commonly accepted as an indicator of cortical activation in infant work.3 During the channel-by-channel t-tests and subsequent spatial reliability analyses, if HbO2 and HHb were either to increase or decrease significantly in unison, the signal was considered inconsistent with a hemodynamic response to functional activation40 and not reported in the analyses (for further discussion of physiological changes reported in infant fNIRS work, see Refs. 3 and 4). To identify these channels, the statistical analyses were reviewed and those channels with an increase or decrease in both chromophores were excluded. For the group level, no channels evidenced this pattern in either session.
For the individual level, in Session 1, three participants had one channel excluded and one participant had eight channels excluded from the activation maps; and in Session 2, two participants had channels excluded (one channel and four channels). This exclusion criterion was not applied during the signal reliability analysis. Throughout the text, the terms “significant increase of HbO2,” “significant decrease in HHb,” or “significant channel” will be used considering these criteria. To resolve statistical problems of multiple measurement sites for these group analyses, we applied the false discovery rate (FDR) test for multiple comparisons.44,45 The channels that did not survive the test are highlighted in Table 1 with an asterisk. HbO2 results were unaffected by FDR correction; however, none of the channels with significant HHb decrease survived the test.
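The block detrending and the coefficient-of-variation channel check described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 10 Hz sampling rate default, the choice to anchor the line at the centre of each 4-s reference period, and the use of the population standard deviation for the CV are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def detrend_block(signal, fs=10.0, edge_s=4.0):
    """Subtract the straight line joining the mean of the first and the
    mean of the last `edge_s` seconds of a block (hypothetical helper)."""
    n_edge = int(round(edge_s * fs))
    if n_edge < 1 or len(signal) < 2 * n_edge:
        raise ValueError("block is shorter than the two reference periods")
    start_level = mean(signal[:n_edge])
    end_level = mean(signal[-n_edge:])
    # Assumption: each mean is placed at the centre of its reference period.
    t_start = (n_edge - 1) / 2.0
    t_end = len(signal) - 1 - t_start
    slope = (end_level - start_level) / (t_end - t_start)
    return [x - (start_level + slope * (i - t_start))
            for i, x in enumerate(signal)]


def coefficient_of_variation(attenuation):
    """CV in percent (population standard deviation over absolute mean)."""
    return 100.0 * pstdev(attenuation) / abs(mean(attenuation))


def channel_is_valid(attenuation_by_wavelength, cv_limit=10.0):
    """Keep the channel only if the CV at every wavelength is within the limit."""
    return all(coefficient_of_variation(a) <= cv_limit
               for a in attenuation_by_wavelength)
```

With this construction, the mean of the first and of the last 4 s of a detrended block is zero, which matches the description that the start and end points of each block are brought to zero. The high-frequency power criterion and the trial-level thresholds are not sketched here because their exact parameters are not given in the text.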
The results from the channel-by-channel t-test (two-tailed) analysis for the contrast between the experimental condition and the baseline for Sessions 1 and 2. For each contrast, the results for the significant increase in HbO2 and/or decrease in HHb concentration are displayed. Significant signal change is highlighted in bold.
|Ch||Session 1||Session 2|
Channel tests that would not have survived false discovery rate correction for multiple comparisons.
At the single participant level, statistical significance of signal change within each channel was calculated by two-tailed t-test during the 8- to 16-s time window identified at group level. This analysis assessed the average hemodynamic change within a 6-s window centered on the observed maximum change per trial. By using the average within this secondary window, we aimed to reduce potential bias of artifacts in the data, as at this level the analysis considered single trial time courses instead of the average of several time courses. Significant activation was then defined using the same criteria as for the group analysis.
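A minimal sketch of the single-trial measure described above follows: locate the maximum change inside the 8- to 16-s search window, then average over a 6-s window centred on it. The function names, the 10 Hz default, the assumption that sample 0 is stimulus onset, and the use of a one-sample t statistic against zero are illustrative assumptions, not details confirmed by the text.

```python
from math import sqrt
from statistics import mean, stdev


def peak_window_mean(trial, fs=10.0, search=(8.0, 16.0), width_s=6.0, polarity=1):
    """Mean of a `width_s` window centred on the extreme value found inside
    the `search` window. Assumes index 0 is stimulus onset.
    polarity=+1 looks for an increase (HbO2), -1 for a decrease (HHb)."""
    lo = int(round(search[0] * fs))
    hi = min(int(round(search[1] * fs)), len(trial) - 1)
    peak = max(range(lo, hi + 1), key=lambda i: polarity * trial[i])
    half = int(round(width_s * fs / 2))
    # Clip the averaging window to the available samples.
    window = trial[max(0, peak - half):min(len(trial), peak + half + 1)]
    return sum(window) / len(window)


def one_sample_t(values, mu=0.0):
    """t statistic for the per-trial means against `mu` (n - 1 degrees of freedom)."""
    n = len(values)
    return (mean(values) - mu) / (stdev(values) / sqrt(n))
```

In use, `peak_window_mean` would be applied to each valid trial of a channel and the resulting per-trial values passed to `one_sample_t`; converting the statistic to a two-tailed p-value requires the t distribution, which is left to a statistics library.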
Alignment Measures of fNIRS Headgear Placement
As the precision of repositioning the fNIRS arrays may be subject to some error, it was essential that we made precise measures of the position of the fNIRS array on each individual at each data acquisition session. These were then analyzed with an objective alignment system, referenced to external landmarks on the infant’s skull (as recommended by Ref. 31), in order to record error in fNIRS array placement across the two sessions. To investigate the efficacy of headgear placement across sessions, the position of the arrays on the infants was photographed and head measurements were taken. Due to warping of the images, only linear displacement measurements of the center point of the reference optode (the middle optode on the lower row of the array) in the x and y directions were used to quantify error (see Fig. 2). The alignment grid was overlaid on each photograph (as shown in Fig. 2), and the position of the reference optode in relation to the overlaid axes was recorded. The “zero” error position was taken as the position when the center of the reference optode was aligned with the dorsal to ventral axis (defined by the position of the tragus and the place at which the ear curves up and away from the head; see Fig. 2) and the lower edge of the headband was aligned with the anterior to posterior axis (defined by the position of the ear where the top of the ear joins the head and the highest point of the eyebrows on the photo; see Fig. 2). The diameter of the optode is 10 mm. Therefore, using a scaling factor from actual size (of the optode) to the photo image, we were able to calculate how far the optode had deviated from the zero error position for each infant. One limitation of this approach is that errors were only measured in the x and y directions; therefore, errors in array rotation were not calculated.
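The scaling step can be sketched as follows, assuming displacements are read off the photograph in pixels and the optode's apparent diameter is measured in the same image. The function name and the pixel-based inputs are hypothetical; only the 10-mm optode diameter comes from the text.

```python
OPTODE_DIAMETER_MM = 10.0  # physical diameter of the optode, as stated in the text


def displacement_mm(offset_px, optode_diameter_px):
    """Convert an (x, y) offset measured on the photo, in pixels, to millimetres,
    using the optode's known diameter as the scale reference."""
    if optode_diameter_px <= 0:
        raise ValueError("optode diameter in pixels must be positive")
    scale = OPTODE_DIAMETER_MM / optode_diameter_px  # mm per pixel
    return (offset_px[0] * scale, offset_px[1] * scale)
```

For example, if the optode spans 50 pixels in a photograph, an offset of 11 pixels along one axis corresponds to roughly 2.2 mm. Because a single scale factor is applied per photograph, this sketch inherits the limitation noted above: it does not capture rotation of the array or image warping.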
Defining a Region of Interest
As the test–retest analysis was conducted with data from a functional brain activation study, we assessed reliability both over the whole array and within a region of interest (ROI) chosen to assess responses within specific brain regions known to be active during social stimulus paradigms.12,14,46 We used a standardized scalp surface map of fNIRS channel locators to reliably locate cortical ROI covered by our fNIRS array.47 This map has been designed to identify ROI within the frontal and temporal lobes for the study of the social brain network in 4- to 7-month olds (with a head circumference ranging from 38 to 45 cm). Though this standardized map may be more applicable for our infant dataset from Session 1 (when they are matched for age), given that the head circumference is smaller in the Gambian cohort compared with the UK infants tested at our lab, we believe that the map can also guide the ROI selection for Session 2. Using the measurement grid of scalp surface channel locations provided by this map,47 we defined an ROI of channels that were most likely over the superior temporal sulcus region (because the channels were located over either the middle or superior temporal gyri in 75% to 100% of the 55 infants with MRI-fNIRS individual coregistered data47). These channels were 7, 8, 9, and 11. We also included channel 12 in the ROI, although the standardized map did not include a channel with the position equivalent to it. However, extrapolation of the position of this channel on the map shows that it would be most likely positioned in the most posterior part of the temporal cortex, but still in the region of the STS. This would be particularly true for the participants at Session 2, when the infants are older. The ROI is shown in Fig. 1.
Infant compliance with the study was measured using the percentage of time spent looking at the screen (looking time) over the total duration of the session (see Methods). Paired t-tests were used to compare performance between the two sessions.26,29
Further to these significance-threshold-based analyses, signal reliability of the hemodynamic response was assessed in two additional steps. First, signal reliability was assessed with the Pearson correlation coefficient of the hemodynamic time course between the two sessions. At the group level, a Pearson correlation coefficient was calculated on the average hemodynamic time course (averaged across trials, 240 time points according to a time resolution of 10 Hz) to assess the reproducibility of the shape and timing of the signal across channels and participants. At the single participant level, Pearson correlation was calculated using the mean signal change averaged across trials (240 time points) and channels within the ROI per session for each participant. Second, for group-level analyses, signal reliability was also calculated with the intraclass correlation coefficient (ICC, one-way random effects49). In this work, ICCsingle at group level is a measure of the ratio of between participants’ variance over the total variance and informs about the reproducibility of a single measurement (i.e., for a single participant); and ICCaverage is a measure of between-session variance over total variance and represents the reproducibility of the mean of repeated measures (i.e., the replicability of session measurements50). ICC values are interpreted as follows: a value close to 1.0 would indicate nearly perfect agreement, a value of 0 would indicate there is no agreement, while a negative value should be treated with caution and is thought to be unreliable.51,52 Reliability measures were considered reasonable in previous adult fNIRS and fMRI test–retest studies.18,29,53
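The two signal reliability measures can be sketched in a few lines. The ICC below uses the standard one-way random-effects ANOVA estimators for a single measurement and for the mean of k measurements; whether these match the exact estimators of Ref. 49 is an assumption, and the function names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long time courses."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_one_way(scores):
    """One-way random-effects ICC.
    `scores` holds one row per participant and one column per session.
    Returns (icc_single, icc_average)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    # Between-participant and within-participant mean squares.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((v - m) ** 2
                    for row, m in zip(scores, row_means)
                    for v in row) / (n * (k - 1))
    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average
```

Under these estimators the two coefficients are linked by the Spearman-Brown relation, ICCaverage = k·ICCsingle / (1 + (k − 1)·ICCsingle). With two sessions, an ICCsingle of 0.299 gives an ICCaverage of about 0.46, which is in line with the ROI values reported in the Results.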
The 13 infants who participated in this study had valid fNIRS data from both sessions (see Methods for measures of validity). First, infant attentiveness and engagement with the study were evaluated, and the common measure of percentage of looking time over total duration of the whole study session (which includes all experimental and baseline conditions) was 77.78% () in Session 1 and 87.06% () in Session 2. The difference in percentage of looking time between the sessions was not significant (paired t-test, , ; see Fig. 3). If we focus these analyses on the auditory-vocal experimental condition, the number of trials obtained in Session 2 was on average higher (average number of auditory-vocal trials played in , ; , ; , ). However, the average number of trials per participant that achieved looking time criteria (as specified in the Methods) was similar in both sessions (; ; , ). Second, the artifact-detection algorithms revealed that the data were largely free of artifact. Across the data from both sessions, only one infant had two channels excluded; the remaining infants had a complete set of valid channels. Within the data that achieved looking time criteria, the average number of trials excluded per channel within individual infants in Session 1 was 1.23 () and in Session 2 was 1.00 (). Seven infants in Session 1 and 10 infants in Session 2 did not have any trials excluded by the automatic detection of artifacts in any channel; furthermore, 11 infants in both sessions had no more than one trial removed in either session. The two excluded channels were channels 1 and 2, in Session 1 only. Overall, automatic artifact detection and exclusion of corrupted trials affected both sessions similarly, as the average percentage of included trials in the analysis after automatic artifact detection for () and (; , ).
In an initial group analysis, the maximum hemodynamic changes were identified within the time window of interest (see Sec. 2) in response to the experimental condition (auditory-visual social stimuli) versus the baseline (silence with nonsocial visual stimuli) and analyzed channel-by-channel (t-test, two-tailed). This analysis revealed significant increases in HbO2 and significant decreases in HHb across a large number of channels (see Fig. 4 and Table 1).
In an initial individual infant analysis, trial-by-trial significant HbO2 increases in response to the experimental stimulus versus baseline (average responses per trial within the time window of interest; see Sec. 2) were detected in at least one channel, across the whole fNIRS array, in all 13 infants (100%) at Session 1 and 10 of the 13 infants (77%) at Session 2. Twelve of the 13 infants in Session 1, and all 10 of the infants in Session 2, showed a significant response in at least two channels. Significant HHb decreases were detected in six out of 13 (46%) infants at Session 1 (four of the six with at least two channels with significant responses), and in 11 of the 13 infants (85%) at Session 2 (six of the 11 with at least two channels with significant responses). Taking into account that the number of significant channels was higher for HbO2 than HHb across the group of infants (Session 1: average of 3.54 channels with HbO2 increase, 1.15 channels with HHb decrease; , ; Session 2: average of 4.77 channels with HbO2 increase, 2.38 channels with HHb decrease, , ), and that all channels with a significant HbO2 increase passed the FDR test for multiple comparisons, while none of the channels with HHb decrease did, we decided to base our reliability analysis on the most robust measure. Hence, in this work, we mainly focus on HbO2 changes. However, as it is strongly recommended that both HbO2 and HHb are included when reporting activation,3,40 we will also include some measures of HHb reliability where possible (i.e., when activation-related HHb signal changes were observed).
Reliability of fNIRS Headgear Placement
Placement of the fNIRS array on the individual infant’s head varied little across sessions. In Session 1, in relation to the reference zero position (see Fig. 2; further details in Methods), the reference optode was on average 2.2 mm () more anterior and 1.9 mm () more inferior; and in Session 2, it was 0.1 mm more anterior () and 1.2 mm () more superior. The position of the reference optode therefore differed on average by 2.2 mm along the anterior–posterior x-axis (n.s., , ) and 3.1 mm along the superior–inferior y-axis (significant difference, , ). Although the latter difference was significant, 3.1 mm is a comparatively small divergence in relation to the resolution of the fNIRS measures at source–detector separations of 20 mm.
Reliability at Group Level
The reliability of the significant changes in HbO2 and HHb concentration (in response to the experimental condition versus baseline) across the sessions was first assessed at the group level. Spatial replicability at the group level was high. For HbO2, seven channels were significant at Session 1, and nine channels were significant at Session 2. For HHb, one channel was significant in Session 1 and three channels in Session 2. Eight out of the nine channels with a significant hemodynamic response (in either HbO2 or HHb) at Session 2 also showed a significant response at Session 1. Intersession measures of the size (Rquantity) and the spatial overlap (Roverlap) of significant channels showed a high degree of replicability in detection of HbO2 increase (Rquantity = 0.875; Roverlap = 0.875); however, in terms of detection of HHb change, spatial replicability was much lower (Rquantity = 0.5; Roverlap = 0).
Replicability measures of size and spatial overlap increased further when significant changes in both HbO2 and HHb were taken into account (Rquantity = 0.941; Roverlap = 0.941, see Table 2).
Spatial reliability at group level. AS1 = number of significant channels at S1; AS2 = number of significant channels at S2; Aoverlap = number of the same channels significant at both sessions; Rquantity = an intersession measure of the size of the response (number of significant channels) at both sessions; Roverlap = an intersession measure of the spatial overlap of significant channels at both sessions. Results are given for the HbO2 response for all available channels in the sensor array [whole array (HbO2)], the HHb response in the sensor array [whole array (HHb)]; both the HbO2 and HHb response for all available channels in the sensor array [whole array (HbO2 and HHb)], and the HbO2 responses from channels within the ROI.
|Channels||AS1||AS2||Aoverlap||Rquantity||Roverlap|
|Whole array (HbO2)||7||9||7||0.875||0.875|
|Whole array (HHb)||1||3||0||0.5||0|
|Whole array (HbO2 and HHb)||8||9||8||0.941||0.941|
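The spatial reliability values in Table 2 can be reproduced from the channel counts with the following sketch. The formulas used here, Rquantity = 1 − |AS1 − AS2| / (AS1 + AS2) and Roverlap = 2·Aoverlap / (AS1 + AS2), are an assumption: they are the definitions commonly used in test–retest imaging work and they match every row of Table 2, but they are not stated explicitly in this section.

```python
def r_quantity(a_s1, a_s2):
    """Agreement in the number of significant channels across two sessions."""
    total = a_s1 + a_s2
    if total == 0:
        raise ValueError("no significant channels in either session")
    return 1 - abs(a_s1 - a_s2) / total


def r_overlap(a_s1, a_s2, a_overlap):
    """Proportion of significant channels shared by both sessions."""
    total = a_s1 + a_s2
    if total == 0:
        raise ValueError("no significant channels in either session")
    return 2 * a_overlap / total


def spatial_reliability(channels_s1, channels_s2):
    """Convenience wrapper taking the sets of significant channel numbers."""
    s1, s2 = set(channels_s1), set(channels_s2)
    shared = len(s1 & s2)
    return r_quantity(len(s1), len(s2)), r_overlap(len(s1), len(s2), shared)
```

For the whole-array HbO2 row (7, 9, and 7 channels) both measures evaluate to 0.875, and for the combined HbO2 and HHb row (8, 9, and 8 channels) both round to 0.941. Both measures are undefined when neither session has a significant channel, which is consistent with the single participant analysis being restricted to infants with at least one significant channel in both sessions.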
Following this, analyses were undertaken on those channels within the superior temporal sulcus region ROI (defined in Methods). All channels within the ROI showed significant activation on both sessions, therefore, size and spatial overlap measures in this region are 1. For , the intersession correlation coefficient of the group hemodynamic time course (averaged across infants and channels within the ROI) was 0.896 (see Fig. 5). Inspection of the correlation coefficients within each channel revealed a high degree of correlation in all channels of the ROI except for channel 7: correlation coefficient in , whereas the range of correlation coefficients for the remaining channels is 0.831 to 0.968. If the ROI correlation coefficient is reanalyzed with channel 7 excluded, it increases to 0.919. For HHb, the intersession correlation coefficient of the group hemodynamic time course was 0.777. Inspection of the correlation coefficients within each channel revealed a wider range from 0.152 to 0.907, and consistent with the results, the lowest correlation coefficient was found in channel 7.
Signal reliability was measured at the group level for the ROI with ICC measurements calculated using the maximum hemodynamic change (averaged across all ROI channels) per participant, for each session. ICCaverage represents a measure of intersession reproducibility and ICCsingle represents a measure of intrasession reproducibility. The ROI analysis revealed an ICCaverage of 0.461 and an ICCsingle of 0.299 (see Table 3). At the channel level, ICC was calculated using the average of the maximum HbO2 change per channel within the ROI for each participant and revealed reasonably similar ICCaverage and ICCsingle measures in four of the five channels. The output from channel 11 should be treated as unreliable, as a negative ICC value was found.51,52
Signal reliability at group level for the ROI and across the channels within the ROI. Here, ICCaverage is a measure of intersession reliability; ICCsingle is a measure of intrasession reliability (across participants). At ROI-level, ICCs were calculated using the average of the maximum HbO2 change across all ROI channels per participant. At channel level, ICCs were calculated using the average of the maximum HbO2 change per participant at each channel.
ICC measures were not calculated for HHb given the low number of channels with significant HHb change at group and individual levels.
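To make the two ICC measures concrete, the sketch below computes a single-measure and an average-measure ICC from a participants-by-sessions table using a one-way random-effects ANOVA. The one-way model is an assumption made for illustration (the ICC form actually used is specified in the Methods and the cited references), and the function names and data are hypothetical. Under this model the two values are linked by the Spearman-Brown relation, ICCaverage = k × ICCsingle / (1 + (k − 1) × ICCsingle) for k sessions, and the reported ROI values are consistent with it for k = 2: 2 × 0.299 / 1.299 ≈ 0.460.

```python
def icc_oneway(data):
    """One-way random-effects ICC (an assumed model, for illustration only).

    data: list of rows, one row per participant, one column per session.
    Returns (icc_single, icc_average).
    """
    n = len(data)
    k = len(data[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need >= 2 participants, >= 2 sessions, equal-length rows")
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    # Between-participant and within-participant mean squares.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((v - m) ** 2 for row, m in zip(data, row_means) for v in row) / (n * (k - 1))
    if msb == 0:
        raise ValueError("ICC is undefined when all participant means are equal")
    icc_single = (msb - msw) / (msb + (k - 1) * msw)
    icc_average = (msb - msw) / msb
    return icc_single, icc_average


def spearman_brown(icc_single, k):
    """Average-measure ICC implied by a single-measure ICC over k sessions."""
    return k * icc_single / (1 + (k - 1) * icc_single)


# Hypothetical maximum HbO2 changes (arbitrary units), sessions 1 and 2.
consistent = [[1.0, 1.2], [0.4, 0.9], [0.8, 0.5], [1.5, 1.1]]
inconsistent = [[1.0, 0.2], [0.3, 1.1], [0.9, 0.4]]

for label, table in (("consistent", consistent), ("inconsistent", inconsistent)):
    single, average = icc_oneway(table)
    print(f"{label}: ICCsingle = {single:.3f}, ICCaverage = {average:.3f}")
```

The second toy table illustrates how a negative ICC arises: when the within-participant variability across sessions exceeds the variability between participants, the numerator becomes negative, which is why such a channel is treated as unreliable.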
Reliability at Single Participant Level
Good spatial reliability was found at the single participant level for HbO2 change. Measures of spatial reliability were calculated using data from the 10 participants with at least one significantly active channel on both sessions, initially considering the whole array. Rquantity was 0.66 on average (ranging from 0.22 to 0.92) and was 0.5 or above for eight of these 10 infants. Roverlap across the whole array was, on average, 0.45, and individual values ranged between 0.22 and 0.77; Roverlap was 0.5 or above in four of the 10 infants. Within the ROI, average size reliability (Rquantity) was 0.78 (ranging from 0.40 to 1), and in nine out of the 10 participants was above 0.5; Roverlap in the ROI was on average 0.55 [ranging between 0 (one infant) and 0.8], and above 0.5 in six out of the 10 infants (see Table 4). Detection of significant HHb change in both Sessions 1 and 2 was achieved in four out of 13 infants (all four infants had significant HbO2 change in both sessions), and in three of them, the channels with significant HHb change were ROI channels.
HbO2 and HHb spatial reliability at single participant level. This includes the 10 participants who had significant HbO2 and/or HHb responses in at least one channel on both sessions. Results are shown for all channels (whole array) and for the five channels in the ROI.
Signal reliability of the hemodynamic time course across sessions at the participant level was measured using the Pearson correlation coefficient of the signal change averaged across trials and channels within the ROI. The coefficients reached a maximum of 0.91 (see Table 5). Six participants showed a correlation above 0.5 (and a further three above 0.4), indicating that their response across the channels within the ROI was consistent across the two sessions. By contrast, four participants revealed negative (or zero) correlation, indicating that their response across the channels within the ROI was not consistent across sessions.
Signal reliability of the time course at single participant level. Pearson correlation coefficient of mean HbO2 change within the window of interest (8- to 16-s postexperimental stimulus onset) between Session 1 and Session 2 of the channels within the ROI.
In this work, we have investigated the reliability of using fNIRS to study brain activation over repeated sessions with the same infants, in terms of both reproducibility and similarity in the response. These infants were part of a longitudinal study investigating brain responses to the presentation of auditory-visual social stimuli compared with a silent non-social baseline. Previous research has demonstrated that these types of auditory-visual social paradigms have been associated with activation in the superior temporal sulcus region in early infancy,9,42 childhood,54 and adulthood.55 In the current work, this paradigm was used to assess the reliability of finding similar patterns of significant changes in HbO2 and HHb over two sessions. The first session was conducted when the infants were 4 to 8 months of age, and the second, 8.5 months later when they were 12 to 16 months old. Clearly, there is the potential for developmental effects to confound our measure of reliability, as the shape, timing, location, or magnitude of the hemodynamic response may change with age during infancy. However, the choice of paradigm used for test–retest in these analyses was designed to minimize these effects, by focusing on a primary functional contrast: auditory stimuli versus silence. Age of participant was not thought to play a significant role, as recent functional imaging studies have revealed that activation patterns to human vocalizations (versus silence) are similar from 3 months of age into adulthood.46,56,57 Furthermore, though the paradigm included multimodal stimuli, the addition of visual stimuli alongside auditory was not thought to greatly impact the patterns of vocal auditory versus silence activation, as previous studies have found similar results with or without the inclusion of visual input.14,34,58
The significant hemodynamic group effects within the ROI were striking in their similarity across test sessions, as was reliability analyzed across the whole fNIRS array. The number of significantly active responses within channels and the spatial overlap of these channels were highly similar across sessions. Therefore, at group level, spatial localization and magnitude of the responses were similar at both test points, making us confident that the fNIRS measurements at group level are robust to potential between-sessions effects such as infant compliance with the study and fNIRS probe positioning on the infant head. These results are in line with long-term fNIRS reliability studies in adults with sessions spread a year apart, though of course the impact of development would be less of an issue there.29
As we anticipated, within individual infants, the test–retest results were more variable. Overall, the average individual infant measures of spatial reliability across the whole fNIRS array were at an acceptable level29 for HbO2 and improved substantially when we focused on our superior temporal sulcus region of interest. For 90% of the infants, the analysis of the number of channels with significant responses revealed values at or over 0.5. However, there were greater differences when the spatial overlap (Roverlap) of the significant responses was taken into account, with a wider range across the infants. Therefore, while the magnitude of the response (in terms of number of significant channels) can be seen to be reliable across time, the spatial overlap of the response is more difficult to assess. However, recall that considerable time elapsed between testing sessions, and changes in head size, brain morphology, and functional specialization of the response with age may have more impact within individuals than when averaged across a group. For comparison, adult fMRI studies have reported mean individual-level reliability in spatial overlap ranging from as low as 0.21 (from a delayed recognition study repeated 1 week apart including six participants) to as high as 0.856 (from a word-generation study repeated 1 week apart including eight participants, as reviewed by Bennett and Miller20). Careful consideration must be taken when comparing changes in signal amplitude across participants or for the same participant across sessions. Location of fNIRS source–detector pairs relative to the site of activation as well as anatomical characteristics such as scalp and skull thickness can have a considerable effect on the amplitude change detected due to partial volume effects.59 Improvement of single participant measurements can be achieved by using tomographic reconstruction together with anatomical information in models for data analysis.
In this work, our reliability results may have been improved had we used an optimal looking time scoring protocol in Session 1 (as we did in Session 2), which would have allowed an accurate exclusion of trials with poor signal (due to lack of attention to the screen) and high noise (with possibly subthreshold movement artifacts).
Our choice to primarily investigate HbO2 changes was based on its higher signal-to-noise ratio (SNR) compared to HHb.60 Furthermore, as the SNR of HHb is lower, the results will be more susceptible to data confounding, such as movement artifact in the data, discrepancies in array placement, and developmental change. While many infant fNIRS studies report significant HbO2 responses, far fewer report HHb responses, sometimes through choice, but often because they do not find significant responses.3 This is consistent with the low number of significant group hemodynamic HHb changes seen in the current work. Interestingly, in contrast to the analyses investigating the location and magnitude of the significant hemodynamic changes, the time-course correlation coefficients showed that both the HbO2 and HHb signals yielded highly reliable grand-averaged time course data across the two sessions. Future measurements of retest reliability that include HHb reliability within participants should seek to increase the SNR of the signal by increasing participant numbers, designing protocols that elicit strong differential activation in the region of interest, or reducing potential sources of variability in the signals. Furthermore, rather than using fairly basic level statistical tests, more sophisticated analysis techniques such as general linear modeling of the shape of the hemodynamic response may be more sensitive to smaller signal changes and enrich HHb data output in developmental fNIRS studies.
Challenges of Gathering Test–Retest Data in Infants
As we outlined earlier, an aspect of infant development which may impact on the measurement of significant activation at each session is head growth. In other work co-registering individual infants’ fNIRS to MRI, we found that age (and not head circumference) is a predictor of changes in fNIRS channel position over underlying anatomy within the range of 4 to 7 months.47 These findings suggest that growth in head volume (rather than circumference) and changes in the shape and complexity of underlying brain regions may be significant. For example, the shape of the STS may change over age, the depth of the sulci may increase, and therefore the size or shape of the ROI needed to investigate these areas may need to change according to the individual infant’s brain morphology. While (1) the co-registered fNIRS-MRI data48 shows that the location of the channels within our ROI (STG/MTG) is highly consistent across infants and (2) we have designed the ROI to be of sufficient size to accommodate some individual differences in morphology, we acknowledge that in lieu of individual MRI data, we treat the measures of individual reliability with more caution than those of group reliability.
Furthermore, in the current study, we assessed long-term reliability across several months of age. In future work, it would be important to investigate short-term reliability to determine whether the variability in reliability within infants is reduced when age is not a major factor. However, this approach in itself brings considerable challenges, as outlined above.
In conclusion, in this work we demonstrate that (1) spatial mapping and size of activation in infant fNIRS studies have a high degree of reliability and (2) there is strong time course signal reliability within channels of a predefined ROI for group analyses. This work also shows that spatial localization and size of activation in infant populations can be assessed at the single participant level with an acceptable degree of reliability when a specific region of interest is targeted. Signal reliability results at the single participant level suggest that statistical power may be diminished due to variability of the data at this level. Functional NIRS is, therefore, a highly suitable technique for infant studies, and its reliability at the single participant level can be improved further by adopting strategies that reduce signal variability such as accurate positioning of sensor arrays over regions of interest, regression techniques to examine residual signals at the surface of the head, improving resilience of the sensor arrays to signal artifacts, and accounting further for the changes in brain morphology in the developing brain.
We would like to thank the parents and infants who took part in this study as well as the field workers at the MRC Keneba Field Station without whom this work would not have been possible. We thank our collaborators Prof. Andrew Prentice and Dr. Sophie Moore (MRC International Nutrition Group, London School of Hygiene & Tropical Medicine); Dr. Momdou K. Darboe and Dr. Rita Wegmuller (MRC International Nutrition Group, MRC Keneba, MRC Unit, The Gambia); Dr. Maria Papademetriou and Mr. Drew Halliday (Department of Medical Physics and Bioengineering, UCL); and Ms. Katarina Begus (Centre for Brain and Cognitive Development, Birkbeck, University of London). This study was supported by a Bill & Melinda Gates Foundation Phase One Grand Challenges Exploration Grant OPP1061089, core funding MC-A760-5QX00 to the International Nutrition Group by the Medical Research Council UK and the UK Department for International Development (DfID) under the MRC/DfID Concordant agreement, a UK Medical Research Council (G0701484) grant, and a grant from The Simons Foundation (no. SFARI201287 to M. H. J.).
Anna Blasi is a research fellow at the Centre for Brain and Cognitive Development, Birkbeck, University of London. Her research interests are centered on functional aspects of human physiology. Her research career started with models of the cardiovascular system and the effects of disease. Through her work at UCL, KCL, and Birkbeck, her research interests have shifted toward the use of functional imaging (fNIRS, fMRI) to study brain function and neurocognitive development in early infancy.
Sarah Lloyd-Fox is a research fellow at the Centre for Brain and Cognitive Development, Birkbeck, University of London. Her work focuses on the use of fNIRS to investigate the developing brain in infancy. Her research projects focus on investigating social cognition, human action perception, autism, and most recently, the application of fNIRS in novel settings, such as resource-poor countries to be able to study the effects of compromised development, such as undernutrition.
Mark H. Johnson is a Medical Research Council scientific programme leader and Director of the Centre for Brain & Cognitive Development, Birkbeck (University of London). He is also a Fellow of the British Academy and the Cognitive Science Society. He has published over 250 papers and 10 books on brain and cognitive development in human infants and other species. His laboratory currently focuses on typical and atypical functional brain development during infancy and childhood.
Clare Elwell is a professor of medical physics in the Department of Medical Physics and Bioengineering at UCL. She leads the near infrared spectroscopy (NIRS) research group developing novel optical systems for monitoring and imaging the human body and brain. Her research projects include studies of autism, acute brain injury, sports performance, migraine, malaria, depression, and, most recently, the effects of malnutrition on brain development with the first infant functional brain imaging study in Africa.