Domain adaptation for robust workload level alignment between sessions and subjects using fNIRS

Abstract. Significance: We demonstrated the potential of using domain adaptation on functional near-infrared spectroscopy (fNIRS) data to classify different levels of n-back tasks that involve working memory. Aim: Domain shift in fNIRS data is a challenge in the workload level alignment across different experiment sessions and subjects. To address this problem, two domain adaptation approaches—Gromov–Wasserstein (G-W) and fused Gromov–Wasserstein (FG-W) were used. Approach: Specifically, we used labeled data from one session or one subject to classify trials in another session (within the same subject) or another subject. We applied G-W for session-by-session alignment and FG-W for subject-by-subject alignment to fNIRS data acquired during different n-back task levels. We compared these approaches with three supervised methods: multiclass support vector machine (SVM), convolutional neural network (CNN), and recurrent neural network (RNN). Results: In a sample of six subjects, G-W resulted in an alignment accuracy of 68%±4% (weighted mean ± standard error) for session-by-session alignment, FG-W resulted in an alignment accuracy of 55%±2% for subject-by-subject alignment. In each of these cases, 25% accuracy represents chance. Alignment accuracy results from both G-W and FG-W are significantly greater than those from SVM, CNN, and RNN. We also showed that removal of motion artifacts from the fNIRS data plays an important role in improving alignment performance. Conclusions: Domain adaptation has potential for session-by-session and subject-by-subject alignment of mental workload by using fNIRS data.

tissue with temporal sampling rate of on the order of 10 Hz. 1 Over the past three decades, fNIRS has been used in several brain imaging applications, including noninvasive imaging of cognitive tasks and brain functional activation, [1][2][3][4] and brain computer interface (BCI). 5 Memory-based workload classification using fNIRS measurements has been demonstrated to be an ideal approach for a realistic adaptive BCI to measure human workload level. 6 In this paper, we study the problem of classification of fNIRS corresponding to different conditions of an n-back task (i.e., subjects are required to continuously remember the last n ∈ f1;2; 3; : : : g of rapidly changing letters or numbers). We performed fNIRS measurements on prefrontal cortex (PFC), which has been found to be a relevant area for memory-related tasks by positron emission tomography and functional magnetic resonance imaging. 7,8 Most n-back classification studies in the literature are based on supervised methods on fNIRS signals in within-session and withinsubject basis (i.e., within single trial of data acquisition on a single subject). [9][10][11] While those studies showed promising results, subject-and session-dependent systems are not realistic for an interface system that can adapt to different users with a wide range of physiological conditions. With the aim of use in BCI, workload classifications based on fNIRS data across experiment sessions (session-by-session alignment) and across subjects (subject-by-subject alignment) are necessary.
There are several challenges that hamper accurate workload classification using fNIRS data. We outline them below and propose methods to mitigate them.
The first challenge, which is the main focus of this paper, is to deal with session-by-session and subject-by-subject variations in classification of n-back tasks. These problems are related to what is referred to as domain adaptation in machine learning. [12][13][14] More specifically, data from different sessions or different subjects are referred to as belonging to different domains, and the changes in data distributions across different domains (the session or subject that the data belongs to) are considered as a domain shift. 15 Due to this phenomenon, the knowledge we learned from one domain cannot be applied directly to another one. To address this problem, recent advances in the theory and methods of optimal transport (OT) 16 and metric measure space alignment [17][18][19] could be used to align data with a known labeled n-back condition from one session or one subject to the unlabeled data from a different session within the same subject or from a different subject. Though OT has been applied for domain adaptation with potential performance, 20,21 it has some limitations when two sets of data used for alignment do not share the same metric space in which case a meaningful notion of distance between two spaces does not exist. For example, for session-by-session alignment, data from some of the fNIRS channels are removed from one of the two sessions due to a poor signal-to-noise ratio (SNR). This will cause two sessions' data to be embedded in different dimensions in the two domains. A naïve solution is to remove the corresponding channels from the other session to guarantee that the two sessions have the same dimension. However, this has the disadvantage of causing loss of information. In this paper, we proposed that using Gromov-Wasserstein (G-W) 18,22 and fused Gromov-Wasserstein (FG-W) barycenter 23 would alleviate this problem and provide algorithms to align across domains for fNIRS n-back task classification.
The second challenge is motion artifacts in fNIRS signals. Motion artifacts in fNIRS are commonly due to the coupling changes of any source or detector from the scalp during the experiment. This causes sudden increases or decreases in measured light intensity and can affect the measured fNIRS signals. From a machine learning perspective, motion artifact detection and correction help remove any misleading correlation from the subject behavior (twitching, head movement, etc.) to what the classification model learns from fNIRS data. For example, a classification model may recognize when a subject presses a button as a requirement during the experiment by detecting spikes in the measured signals due to the subject's head movement, instead of detecting real hemodynamic responses from the brain signals. A number of approaches, inspired by statistical signal processing methods such as adaptive filtering, independent component analysis (ICA), and time-frequency analysis, have been proposed to remove or correct for motion artifacts in fNIRS signals. [24][25][26][27][28][29][30] Most of these techniques either depend on the use of auxiliary reference signals (e.g., accelerometry, etc.) or extraoptical channels or require certain assumptions on the characteristics of motion artifacts and cleaned fNIRS signals. In this paper, we used an off-the-shelf method based on sparse optimization for automatic detection and removal of spikes and steps anomalies, namely transient artifact reduction algorithm (TARA). 31 We will apply the method TARA in the hope of improving the classification accuracy of n-back tasks.
The main contributions of this paper to the classification of different n-back task conditions include: (1) applying G-W to align fNIRS data during each n-back task condition across different experimental sessions for every single subject (session-by-session alignment); (2) applying FG-W barycenter to align fNIRS data during each n-back condition between different subjects (subject-by-subject alignment); and (3) demonstrating that alignment accuracy could be improved by applying motion artifact removal with TARA as a preprocessing step on fNIRS data.

Subjects and Experiment Design
Six healthy human subjects (one female, five males; age range: 23 to 54 years) participated in this study. The Tufts University Institutional Review Board approved the experimental protocol, and the subjects provided written informed consent prior to the experiment.
During the n-back task, subjects were instructed to watch a series of rapidly flashing random one-digit numbers (from 0 to 9) shown on a computer screen placed ∼50 cm in front of the subject. Subjects must continuously remember the last n numbers (n ¼ 0, 1, 2, and 3) and were asked to press the space bar if the currently displayed number (target) matched the preceding n'th number. In the 0-back task, the subject pressed the space bar whenever numeral "0" appeared. With increasing n, the task difficulty is expected to increase, as the subjects must remember an increasing number of preceding digits and continuously shift the remembered sequence. The experiment was designed such that the targets appeared with 25% to 35% chance (i.e., 65% to 75% nontargets) in each task (chosen randomly). We measured the task performance by counting the number of missed targets (when the subject did not press the space bar for a target) and the number of wrong reactions (when the subject incorrectly identified a nontarget stimulus as a target).
Each subject performed a total of four separate experiment sessions in two days: two sessions per day, one in the morning shift (9 to 12 a.m.) and one in the afternoon shift (1 to 4 p.m.). The order of the n-back tasks was randomized among sessions, but the randomization order was kept the same among subjects (i.e., only four random sequences were used and each subject was shown each of the four after all of their sessions). A session started with 155 s of initial baseline with a countdown timer displayed on the screen. At the beginning of a task, an instruction was shown to inform the subject that the upcoming task was 0-, 1-, 2-, or 3-back. A task consisted of 100 displayed digits each lasting 2 s, during which the stimulus was displayed for 1.5 s and followed by a resting time of 0.5 s where a black screen was shown. Therefore, each task was a total of 200 s in length. Subsequently, the subject entered 30 s of baseline (rest) after finishing the task while the performance accuracy of the preceding task was displayed on the screen. This process was repeated for the four values of n. At the end of the fourth task, the subject rested for a 155-s baseline after which the experiment was completed. Figure 1

Data Acquisition
During the entire experiment session, optical data were collected continuously with a cw fNIRS device (NIRScout, NIRx Medical Technology, Berlin, Germany). Eight light-emitting diode source pairs (at two wavelengths of 760 and 850 nm) and seven detector fiber bundles connected to photodiode detectors were arranged on a conformable fabric headset. The fNIRS headset can be quickly fixed to the forehead to enable high quality measurements of the PFC within the range of several minutes. A total of 20 channels at 3-cm source-detector distances were collected. A schematic diagram of the arrangement is shown in Figs. 1(b) and 1(c). Light intensities were collected at a sampling rate of 7.81 Hz. Linear detrending was applied to the collected changes in light intensity with respect to baseline to remove slow temporal drifts. Then, the detrended normalized intensities were converted into Δ½HbO 2 and Δ½Hb using the modified Beer-Lambert law. 32 We assumed the wavelength-dependent differential pathlength factors (DPFs), which account for the increase in photon pathlength due to multiple scattering, equal to 9.1 and 8.0 for 760 and 850 nm, respectively. 33 During the experiment, continuous arterial blood pressure (ABP) was collected with a beatto-beat finger plethysmography system (NIBP100D, BIOPAC Systems, Inc., Goleta, California). ABP measurements were converted into mean arterial blood pressure (MAP, in units of mmHg) and heart rate (HR, in units of beats per minute, bpm).

fNIRS Data Preprocessing by TARA
Measured fNIRS data were checked manually to remove those noisy channels contaminated by high frequency noise (>1 Hz). Examples of removed and retained channels from two subjects are shown in Fig. 2, and the numbers of remaining channels are reported in Table 4 in the Appendix. The whole session will be removed if more than 60% of channels are identified as noisy. To further remove motion artifacts from the retained channels, we used the TARA algorithm 31 in which measured time series data are treated as a linear combination of a low-pass signal, motion artifacts, and white noise. The algorithm focuses on two types of motion artifacts: transient pulses (spike-like signals) and step discontinuities, and assumes both of them appear infrequently. A sparse optimization problem is then formulated to jointly estimate two types of motion artifacts. We refer the reader to Ref. 31 for more details. We used the code provided by the authors 34 and chose parameters for our fNIRS data as shown in Table 5 in the Appendix. Once the motion artifacts are detected, they can be removed from the original signal to obtain the cleaned data.

Domain Adaptation for fNIRS
After the removal of the channels with poor SNR and motion artifacts, a small time duration w is chosen as the window size to divide the remaining n-back data (Δ½HbO 2 and Δ½Hb) into N nonoverlapping small segments. Here, we use w ¼ 60 samples (∼8 s). To concretely describe the proposed method, next we will set some notations that are used throughout the paper.
Notation: We will use lower-case boldface letters x to denote vectors and upper case boldface letters X to denote matrices. Unless otherwise stated, unbolded lower case letters denote scalars. fðX s m;i ; y s m;i Þg N i¼1 stands for the collection of segmented data set of subject s in its m th session, where N is the number of segments, integer s ∈ ½1;6, and integer m ∈ ½1;4. The i'th segment is denoted as X s m;i ∈ R d×w , where d is the number of channels and w is the window length. y s m;i ∈ ½0;3 is the corresponding n-back task label for subject s in session m and segment i, y s m ¼ vecðy s m;i Þ is an N-dimensional vector of the label. The remaining notation will be introduced as needed.

Optimal transport theory and Gromov-Wasserstein matching
Consider two discrete sets of points fx i g i∈1 · · · n , x i ∈ R d in a metric space X with a metric d X , and fy j g j∈1 · · · m , y j ∈ R d in another metric space Y with the metric d Y . The main idea behind aligning two sets of points is by viewing them as two empirical distributions E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 1 1 6 ; 1 6 0 where δ x i and δ y j are Dirac functions at the position of x i and y j and a i and b j are the corresponding probabilities. Without further information, a i and b j will be set as 1 n and 1 m , respectively. The OT problem is proposed to find a plan T ∈ R n×m that is the solution to (2) with the i; j'th element C i;j being the cost of associating (moving) the point x i to the point y j . This is also known as the Kantorovich's relaxation 35 for the original Monge problem. 36 To reduce the computational cost of solving the linear program Eq. (2), an entropic regularization term is usually added to Eq. (2), leading to E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 1 1 6 ; 6 4 3 min T∈Uða;bÞ hC; Ti − λHðTÞ; (3) where This entropic OT problem 37 can be solved efficiently using the Sinkhorn Algorithm 38 or its variations such as the Greenkhorn algorithm, 39 both of which can achieve a near-linear time complexity. 40 This approach has been used in domain adaptation 20,21 for transfer of data in different domains.
Though widely used for domain adaptation, classic OT lacks the ability of mapping two different metric spaces. When the points have different dimensions, i.e., x i ∈ R d 1 and y j ∈ R d 2 , where d 1 ≠ d 2 , a distance between x i and y j may not be meaningfully defined. Thus, instead of seeking a distance matrix between elements in different domains, the G-W method compares the dissimilarity between the pairwise distances in each domain. It poses a weaker assumption that if x i should be aligned to y j and x i 0 should be aligned to y j 0 , then for two distance matrices C X ∈ R n×n and C Y ∈ R m×m in space X and Y, C X i;i 0 and C Y j;j 0 should be similar. 17 Formally, the G-W distance is defined as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 1 1 6 ; 4 5 2 where L is a cost function, which typically can be chosen as a quadratic function or Kullback-Leibler divergence. For our method, a squared loss function is applied. Equation (4) is a nonconvex problem related to a quadratic assignment problem. 18 The problem in Eq. (5) can be solved by projected gradient descent algorithm wherein each iteration solution is found by running Sinkhorn Algorithm. 22

Metric for G-W alignment for fNIRS data
For electroencephalogram (EEG) and fNIRS processing, mean and covariance of the time segments have been considered as useful features. 41,42 Here, we use these features to compute the inner metric matrix of each session. Specifically, for data fX s m;i g N i¼1 from the m'th session of subject s, we compute its covariance matrices fP s m;i g N i¼1 and mean vectors fh s ; t e m p : i n t r a l i n k -; e 0 0 6 ; 1 1 6 ; 1 7 7 where ρ hellinger ð·Þ is the matrix version of the Hellinger distance, 43 written as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 7 ; 1 1 6 ; 1 3 2 where A and B are positive definite matrices. Since the number of channels d selected for different sessions' data are not necessarily the same, we normalize by the number of channels in each session.

Domain adaptation for session-by-session alignment
We assume the label is given for one session's data and aim to infer the label for all other sessions belonging to the same subject. Using the metric defined in Eq. (6), we show the pseudocode for the session-by-session alignment in Algorithm 1. Since we only consider data within the same subject, the upper index for the subject will be dropped in the algorithm.

Subject-by-Subject Alignment
When targeting subject-by-subject alignment, we assume data and the corresponding labels for all sessions of one subject are given and denote this subject as the source subject. Then, we will use this information to predict the labels of fNIRS data for all four sessions of other subjects (target subjects). Transferring labels between different subjects is a bigger challenge since there is a larger shift in domain. Directly using the same G-W alignment as discussed above will lead to a large variance in alignment accuracy. More importantly, we will lose the advantage of knowing all the features and structural information from multiple sessions of the source subject. To address this problem, we consider a recently proposed method named FG-W. 23 By computing an FG-W barycenter, which is the Fréchet mean of the FG-W distance, we summarize all the given information into a new representation of the source subject and then follow the same routine as session-by-session alignment to achieve the label alignment.

Fused Gromov-Wasserstein barycenter
FG-W, unlike the G-W, combines both feature and structural information and shows its advantage in graph classification. 23,44 Consider two sets of tuples fðx i ; f i Þg i∈1 · · · n in space ðX ; ΣÞ and fðy j ; g j Þg j∈1 · · · m , in space ðY; ΣÞ, here, x i ∈ R d 1 and y j ∈ R d 2 are the data points, f i and g j are their corresponding features, which are both in space Σ and share the same dimension. With a slight abuse of notation, we will use the same symbol as Eq. (1) to denote their empirical distribution E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 8 ; 1 1 6 ; 2 2 9 The FG-W distance between such two distributions with both data and the corresponding feature information included is then defined as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 9 ; 1 1 6 ; 1 5 4 FGWða; bÞ ¼ min where α ∈ ½0;1 is a trade-off parameter, q ≥ 1, ρðf i ; g j Þ stands for the cost of matching feature f i to feature g j which in our case corresponds to the labels, i.e., scalar value n in the n-back task. For multiple distribution setting like those related to multiple sessions, a natural extension of FG-W distance is its barycenter, which inherits the advantages of FG-W that leverages both Algorithm 1 Alignment between session m and session n.
Input: Source data and label fðX m;i ; y m;i Þg N i¼1 , target data fX n;i g N i¼1 Output: Target label fy n;i g N i¼1 1: Calculate inner distance matrices C m and C n using Eq. (6) for fX m;i g N i¼1 and fX n;i g N i¼1 .
2: Solve Eq. (5) to get the transport plan T between session m and session n.
3: Choose the largest value of each column of T as 1 and set others to be 0 to get the coupling matrix T cp 4: Get target label fy n;i g N i¼1 by calculating T T cp vecðy m;i Þ structural and feature information. The FG-W barycenter can be obtained by minimizing the weighted sum of a set of FG-W distances. Let fC k g K k¼1 be a set of distance matrices, where C k ∈ R N×N , ff k g K k¼1 , f k ∈ R N is the corresponding feature vector. Here, K will correspond to the number of sessions for each subject in our case. We assume the base histograms fa k g K k¼1 and the histogram a associated with the barycenter are known and fixed as uniform distributions. By calculating the Fréchet mean of the FG-W distance, we aim to find a feature vector f and a distance matrix C that represents the structure information, such that where P k ζ k ¼ 1 are the weights for sessions and are chosen evenly for each session. q ≥ 1 for the loss of features and a squared loss between features is used for our method. This problem can be solved by block coordinate descent algorithm described in Ref. 23. Note that after solving Eq. (10), only the distance matrix C and feature vector f will be used to form the new representation of the provided K sessions.

Metric for FG-W barycenter alignment
Unlike the metric defined in Eq. (6) for session-by-session alignment, we removed the L2 norm of the mean difference from the distance when considering the metric for subject-by-subject alignment. This is because the differences of the mean values are usually the same within the same subject but vary across different subjects. It is worth mentioning that after removing the L2 norm of the mean difference, the covariance matrices themselves can be viewed as points in a Riemannian space. 45 Formally, for the m'th session of subject s, the distance matrix C s m is defined using its covariance matrices fP s m;i g N i¼1 , P s m;i ∈ R d×d , with the i, i 0 'th element ðC s m Þ ii 0 computed via

Domain adaptation for subject-by-subject alignment
The algorithm for subject-by-subject alignment is shown in Algorithm 2, where we only take two subjects (each with four sessions) as an example, but the algorithm can be easily adapted to all other subjects with different number of sessions.

Comparison with Supervised Machine Learning Methods
To further demonstrate the potential of domain adaptation methods, we compared our method with a convolutional neural network (CNN), a recurrent neural network (RNN), and a multi-Algorithm 2 Alignment between subject s and subject t .
2: Solve Eq. (10) using ½ðC s 1 ; y s 1 Þ; ðC s 2 ; y s 2 Þ; ðC s 3 ; y s 3 Þ; ðC s 4 ; y s 4 Þ to get the inner distance matrix and corresponding label vector of the barycenter for subject s, denoted as fC s bary ; y s bary g.

3:
Repeat steps 2 to 4 in Algorithm 1 using fC s bary ; y s bary g and C t 1 , C t 2 , C t 3 , C t 4 , respectively, as input to get the labels fy t f1: : : 4g;i g N i¼1 for target data.
class support vector machine (SVM)-based classifiers applied without any domain adaptation techniques. For the CNN model, we adapted the architecture structure of EEG-NET. 46 Due to paucity of the amount of data in our case compared to the original paper, 46 we simplified the structure to three convolutional layers followed by two dense layers. Details of the CNN structure can be found in Appendix, Table 3. To compare with the session-by-session alignment using the G-W method, Δ½HbO 2 and Δ½Hb data were first separated and then stacked along a new dimension as input to the CNN. Since the removal of some noisy channels will lead to different input data points being in different dimensions, thereby causing a mismatch between input data and the fixed model structure, we replaced the discarded channels with the average of data from the remaining channels (separately for Δ½HbO 2 and Δ½Hb). We used data from one session as input to train the model with Adam optimizer 47 using cross entropy loss and tested the remaining sessions. The model was trained until severe overfitting occurred (300 epochs in our case) to guarantee the best test accuracy can be achieved within the training process. Test accuracy was recorded during the whole training process (i.e., after each training epoch) and the best result was selected among them. The training and testing processes were conducted five times and the average test accuracy was reported. To compare with subject-by-subject alignment using the FG-W method, data from all four sessions of one subject were combined and used as input and the classification model was trained to predict the task labels for the other five subjects in the same manner as discussed above.
For the RNN model, we used a basic three-layer long short term memory 48 model with the hidden size set as 20. The training and testing data were prepared in the same way as the CNN method except that Δ½HbO 2 and Δ½Hb data were not separated, but were input together. For both session and subject prediction, the training and evaluation procedure followed the same routine as CNN.
Before applying SVM, a dimension reduction technique was applied to the segmented multichannel fNIRS data. Here, we used uniform manifold approximation and projection (UMAP) 49 to compress each piece of segmented data 50 into a 50-dimensional vector, with the distance matrix calculated using Eq. (6) for session-by-session alignment and Eq. (11) for subject-bysubject alignment. During the training procedure for session-by-session alignment, only one session's data was used for testing and all the remaining data were used for training. Hyperparameters were selected by leave-one-session-out cross-validation. This was similar for subject-by-subject alignment, where one subject's data (including all the sessions) was used for testing while all other data were used for training. Hyperparameters were again selected by leave-one-subject-out cross-validation.
The Student's t-test was used to investigate differences between alignment accuracy from G-W/FG-W methods with 25% chance levels (25% stands for the chance to assign any session as 0, 1, 2, or 3-back), between G-W/FG-W methods using raw and cleaned data from TARA, and between G-W/FG-W and comparison methods stated above. All values are reported as mean ± standard error weighted by the standard deviations of the alignment accuracy values from six subjects unless otherwise noted. Figure 3 shows summary of subject performance analysis with the average percentages of wrong and missed responses, respectively, across four sessions and six subjects for each n-back task condition. The difficulty level, in terms of the amounts of wrong and missed responses, increases significantly for the 3-back task as compared to other n-back tasks (p < 0.05, paired-sample t-test). Next, the numbers of wrong and missed responses for the 2-back task in the four experiment sessions are significantly higher than those for the 1-and the 0-back tasks (p < 0.05, paired-sample t-test). Finally, there was no significant difference in the difficulty level between the 0-and the 1-back task in terms of wrong and missed responses (p > 0.05, paired-sample t-test). Figure 4 shows the examples of average time courses of changes in MAP and HR from three subjects (1, 2, and 4) across different measurement sessions of the 2-back task. We observe a greater variability in task-evoked changes in HR and MAP across subjects than across sessions. In particular, subjects 1 and 2 show negligible changes in MAP and HR during the task with respect to the initial baseline, with individual measurements from different sessions following the same trend. On the other hand, all the measurements from different sessions from subject 4 show  totally different responses as compared to subjects 1 and 2. For subject 4, MAP increases during the middle of the task and then returns to the baseline in the last minute. The HR measurements from this subject feature an initial increase and immediate decrease at the onset of the task. Figure 5 shows the effects of TARA in removing motion artifacts in the fNIRS signals (Δ½Hb). As shown in the figure, the original signal is contaminated by the motion artifacts with spikes and steps. After applying the TARA algorithm, most of the motion artifacts have been removed. As compared to applying a low-pass filter to the original signal, TARA does not bring any further distortion to the cleaned signal and is more effective at removing step artifacts. The effect of this motion artifacts' removal as a preprocessing step before applying alignment algorithms is also shown in Table 1 and Fig. 8. An improvement for session-by-session alignment accuracy (by an average of 3 AE 3% across six subjects; p < 0.005, paired-sample t-test) and subject-by-subject alignment accuracy (by an average of 5 AE 2% across six subjects; p < 0.0005, paired-sample t-test) can be seen after applying TARA on fNIRS signals.

Session-by-Session Alignment
A low-dimensional UMAP visualization of the alignment for two sessions' data is shown in Fig. 6 for subject 4. In Fig. 6, the low-dimensional projection was generated individually from the distance matrix of each session's data. Therefore, the positions of the two groups of sessions' data are assigned randomly and their relative distances are not their true distances. Figure 7 shows the confusion matrices of session-by-session alignment for four n-back tasks (0, 1, 2, and 3) of six subjects. Numbers reported in the confusion matrix are the average alignment accuracies of all the possible combinations of two out of all four sessions for each subject. Values in the main diagonal of each confusion matrix represent correct alignment between predicted and true labels, while the other values represent the misalignment results. Correct alignment results are significantly greater than chance level of 25% (p < 0.0001, one-sample t-test).  Table 1 Average session-by-session and subject-by-subject alignment accuracy (%) using G-W and FG-W, respectively, as compared with SVM, CNN, and RNN. G-W and FG-W barycenter alignment methods applied to both original data (org) and data cleaned by TARA algorithm, using data including and excluding data from subject 3. For other methods, only cleaned data were used as input. Averages and standard errors across all subjects are reported (Avg. The averages and standard deviations of session-by-session alignment accuracy for six subjects are summarized in Table 1 and Fig. 8. For each subject, the value reported is calculated based on the alignment or prediction accuracy for all possible combination of session pairs. As compared to SVM, CNN, and RNN, alignment accuracy of G-W is greater by an average of 43% AE 5%, 7% AE 4%, and 5% AE 5%, respectively (p < 0.005, paired-sample t-test). Figure 7 shows the confusion matrices of subject-by-subject alignment for four n-back tasks of six subjects. Each number in the reported confusion matrix is the average of alignment accuracies of different tasks from the source subject to five other subjects as the targets. Correct alignment results are significantly greater than chance level of 25% (p < 0.0001, one-sample t-test). Average subject-by-subject alignment accuracy is shown in Table 1 and Fig. 8. Each reported average accuracy value is the average of the alignment accuracy when considering one subject as the source and five other subjects as the targets. For each subject pair, accuracies are calculated between source subject and all sessions within the target subject and averaged to obtain the accuracy between source and target subject. As compared to SVM, CNN, and RNN, alignment accuracy of FG-W is greater by an average of 22% AE 2%, 15% AE 5%, and 15% AE 5%, respectively (p < 0.0005, paired-sample t-test).

Subject-by-Subject Alignment
Since data from subject 3 have poor SNR and are severely affected by motion artifacts in half of the fNIRS data (see appendix Table 4), we could treat this subject as an outlier. Alignment accuracy without using data from subject 3 is reported in Table 1 and Fig. 8.

Combining n-Back Tasks in Session-by-Session and Subject-by-Subject Alignment
The analysis of subject performance (Sec. 4.1) showed significant differences in the number of missed targets and wrong reactions depending on the n-back task conditions. Particularly, the subject performance suggests that 0-and 1-back tasks could be combined together in the alignment since they show similar brain activation behaviors. In this section, we showed that by combining data from 0-back together with 1-back and 2-back together with 3-back tasks, the alignment accuracy increased abruptly for both session-by-session and subject-by-subject alignment, as shown in Table 2. As compared to the results reported in Table 1, the session-by-session alignment accuracy increased by an average of 22% AE 2%, and the subject-by-subject alignment accuracy increased by an average of 33% AE 3%.

Discussion
In this study of six subjects, we showed that fNIRS signals measured from 20 channels on the PFC can be used to robustly discriminate subjects' mental workload between different n-back task levels across sessions within one subject and across different subjects. One limitation of our study is a small number of subjects. However, with the current number of subjects (six subjects), this paper still achieved the goals of demonstrating: (1) an alignment accuracy greater than that of chance (25%) for the majority of session-session and subject-subject combinations; and (2) greater accuracies than that obtained from multiclass SVM-, CNN-, and RNN-based models.  We thereby showed the potential of fNIRS as a modality for BCI and user state monitoring that can adapt to different users with various physiological states. Future works will address the extension of this study with a larger sample of subjects to further investigate the variability between sessions and subjects. In regards to data preprocessing, we show that motion artifact removal in fNIRS signals is an important step for the following mental workload alignment. Specifically, we report that using TARA to remove motion artifacts from fNIRS signals increased alignment accuracy by an average of 3% AE 3% for session-by-session alignment and by 5% AE 2% for subject-by-subject alignment (p < 0.005). Future work could include addressing different types of artifacts that could arise in fNIRS time series, which were not considered by TARA, such as oscillatory transients. In addition, possible future improvements in TARA may be to investigate an automatic way for selection of regularization and nonconvexity parameters in TARA algorithm across subjects.
We introduced two approaches, G-W and FG-W barycenter, for session-by-session and subject-by-subject alignment of mental workload during n-back task. We proved that our methods could be generalized across different sessions and subjects' data. In particular, for sessionby-session alignment, we used labeled fNIRS data with known n-back task conditions from one session to align with other unlabeled sessions from the same subject using the G-W method. We showed that most of the unlabeled sessions' data could be mapped correctly to their true labels, with the alignment accuracy ranging from 54% to 77% (with 25% representing chance alignment). Meanwhile, with multi-class SVM and simplified CNN and RNN models, the n-back task classification accuracy was lower (by an average of 43% AE 5%, 7% AE 4%, and 5% AE 5%, respectively). Note that CNN and RNN required the same amount of data as the proposed methods for training, while SVM required more data (from more than one session) for training. Similarly, for subject-by-subject alignment, we used labeled data from one subject as the source data for alignment. Labels and structural information of the source data were combined to generate a new representation (i.e. the FG-W barycenter). Following the same routine as the session-by-session alignment, we were able to use the barycenter from the source subject to predict the labels for data from different sessions for other subjects, with the alignment accuracy from 50% to 71% (also with 25% representing chance alignment). From the corresponding SVM, RNN, and CNN methods, n-back task classification performance achieved lower accuracy than the FG-W method (by an average of 15% to 22%). Again, CNN and RNN were trained from data from one subject (source data), while SVM was trained from data from five subjects for classification. Moreover, our methods of G-W and FG-W do not require the two subsets of data used for alignment to have the same dimension. Thus, they do not require data interpolation due to removing noisy fNIRS signals as for CNN and RNN methods. However, we note that even though G-W and FG-W methods are free from the dimension requirement for data, they could not achieve satisfying results when a large amount of data is missing (e.g., in the case of subject 3 when around half of the channels were discarded in the preprocessing step).
We found relatively higher alignment results for session-by-session alignment (average of 68% AE 4%) than subject-by-subject alignment (average of 55% AE 2%). One source of variation in fNIRS data across experiment sessions and across different subjects could come from the variability in systemic physiology, as seen in the variability in the task-evoked changes in MAP and HR (see Sec. 4.2). We observed that the variability of these two physiological measurements are larger across subjects than across sessions. This may explain a greater accuracy results for session-by-session alignment than subject-by-subject alignment. Another source of variability across sessions and subjects may also come from the variation in fNIRS optode placement on the subject's head. We anticipate the optode placement variation to be greater across subjects than across sessions due to different head geometry from different human subjects. From our results, we found that the new representation of the barycenter of the source subject still aligned well to data from other subjects even though subject-by-subject alignment was a more challenging problem. This is indicative that representations of different subjects may still share similar underlying structures even from different domains. Future work will explore generating barycenter from source data from multiple subjects' information for subject-by-subject alignment to account for the across-subject variations in the barycenter. Based on our alignment results shown in confusion matrices in Fig. 7, the misalignment in session-by-session and subject-by-subject alignments are relatively high between 0-back and 1-back, and between 2-back and 3-back tasks. In particular, the misalignment is the highest between 2-and 3-back tasks (when the 2-back task is the true label and the 3-back is the predicted label and vice versa), ranging from 19.8% to 43.5%. The second highest misalignment is between 0-and 1-back tasks, ranging from 6.8% to 31.8%. Similarly, for subject-by-subject alignment, the highest misalignment came from 0-and 1-back tasks, ranging from 15.2% to 46.5%. The second highest misalignment is between 2-and 3-back tasks, ranging from 14.2% to 38.9%. This gave us an idea of combining 0-with 1-back tasks, and 2-with 3-back tasks in the alignment. Substantial increases in alignment performance (by an average of 22 AE 2% for session-by-session and 33 AE 3% for subject-by-subject alignment) suggests that future works could study workload classification between rest to low workload level (0-and 1-back tasks) versus high workload level (2-and 3-back tasks).
Finally, single-distance cw fNIRS measurements of intensity from source-detector pairs at 3-cm distance were used in this study. These measurements have been known to be more sensitive to hemodynamic changes in superficial tissues (i.e., scalp and skull) than in the brain. 51 Previous study 52 has shown that tasked-evoked superficial artifacts may arise during brain activation task due to systemic changes in peripheral physiology rather than the cerebral hemodynamics. This also confirms the claim that variations in our alignment results across sessions and subjects could be partially due to variability in systemic physiological origins. For the purpose of our aim, it is desirable to increase the sensitivity of our measurements to brain tissue to probe hemodynamic changes associated with brain activation. One approach, namely the dual-slope method, involves a simple implementation of a certain arrangement of sources and detectors to localize sensitivity of NIRS measurements to a deeper region, 53 thus suppressing confounding signals from superficial tissue. This approach could also help remove instrumental drifts and motion artifacts from measured signals as dual-slopes are unaffected by changes in optical coupling. Future extensions of this work may involve implementing the dual-slope configuration in such experiments as those described here. Another approach to correct for extracerebral contamination is to acquire measurements in a multidistance arrangement to incorporate short (<1 cm, sensitive to extracerebral tissue only) and long (>2.5 cm, sensitive to both extracerebral and brain tissues) source-detector separations 54 and apply a processing method such as adaptive filtering 55 to remove global interference from systemic physiology from fNIRS measurements.

Conclusions
To illustrate that fNIRS signals can be effectively used to identify subjects' mental workload between different n-back task levels across different sessions and subjects, we proposed two domain adaptation methods, G-W and FG-W, for session-by-session and subject-by-subject alignments, respectively. The proposed methods can achieve alignment accuracies greater than the chance level of 25%. At the same time, the proposed methods do not require the same subset of fNIRS channels or further data interpolation for classification across all subjects and sessions as opposed to some other supervised methods such as CNN and RNN. This will alleviate the pressure from having to exclude fNIRS channels that were noisy in one session but not in others, or from having to interpolate the signals to replace those noisy channels. Besides adapting the domain adaptation method, we explored the effect of using the TARA signal processing algorithm for removing motion artifacts and found an improvement in the alignment accuracy results. In the future, we plan to explore the effect of our method on a larger sample of subjects and make it applicable for multiple source subjects.

Appendix
Tables 3, 4, and 5 specify the structure of CNN, the number of retained channels for all subjects and the parameters used in TARA for motion artifact removal, respectively. " * " stands for multiplication operator and ReLU means the rectified linear activation function. Table 3 CNN architecture, where d = number of channels (20 in our case), w = number of time points (60 in our case), T 1 ; T 2 = length of time points after applying the filter and C = number of classes (four in our case).  Table 4 Number of retained channels for six subjects. The total number of channels is 20. "0" indicates when the particular session is removed.  Table 5 Values of TARA parameters (f c , cut-off frequency for the low-pass filter; d , order of the filter; θ and β, regularization parameters for TARA; σ, noise standard deviation). Parameter values were chosen differently for Δ½HbO 2 and Δ½Hb due to different noise level. For each subject, values of σ and the choice of β vary among sessions, as reported in square brackets "[]". "-" indicates when TARA is not applied or when the session is removed.