Deep learning in fNIRS: a review

Abstract. Significance Optical neuroimaging has become a well-established clinical and research tool to monitor cortical activations in the human brain. It is notable that outcomes of functional near-infrared spectroscopy (fNIRS) studies depend heavily on the data processing pipeline and classification model employed. Recently, deep learning (DL) methodologies have demonstrated fast and accurate performances in data processing and classification tasks across many biomedical fields. Aim We aim to review the emerging DL applications in fNIRS studies. Approach We first introduce some of the commonly used DL techniques. Then, the review summarizes current DL work in some of the most active areas of this field, including brain–computer interface, neuro-impairment diagnosis, and neuroscience discovery. Results Of the 63 papers considered in this review, 32 report a comparative study of DL techniques to traditional machine learning techniques where 26 have been shown outperforming the latter in terms of the classification accuracy. In addition, eight studies also utilize DL to reduce the amount of preprocessing typically done with fNIRS data or increase the amount of data via data augmentation. Conclusions The application of DL techniques to fNIRS studies has shown to mitigate many of the hurdles present in fNIRS studies such as lengthy data preprocessing or small sample sizes while achieving comparable or improved classification accuracy.


Introduction
Over the last two decades, functional near-infrared spectroscopy (fNIRS) has become a wellestablished neuroimaging modality to monitor brain activity. 1 The ability of fNIRS to quantify cortical tissue hemodynamics over a long time, with relative high-spatial sampling and temporal resolution, has enabled its adoption in numerous clinical settings. 2,3 fNIRS offers the unique advantage to be employed in freely mobile subjects with less restrictions than electroencephalography (EEG) or functional magnetic resonance imaging (fMRI). This permits the deployment of fNIRS in naturalistic scenarios and in patient populations that are typically not considered suitable for EEG or fMRI imaging. 4 Still, fNIRS faces numerous challenges for increased clinical adoption due to experimental settings, 5 variations in statistical results, 6 etc. Of importance, current trends in fNIRS aim to improve spatial resolution via increased spatial sampling, improve cortical sensitivity using data processing to remove unwanted physiological noise, improve quantification by anatomical coregistration, and increase robustness via artifact identification and removal. 7 Current algorithmic implementations, however, require a high level of expertise to set up parameters that can be system-and/or application-specific but also greatly impact the interpretability of the processed data. Moreover, the computational cost of these methods does not lend itself to bedside implementations. Following a ubiquitous trend in the field of data processing and analysis, new approaches leveraging developments in deep learning (DL) have been recently proposed to help overcome these caveats to a large extent.
The successes of DL methodologies across all biomedical engineering fields promise the development of dedicated data-driven, model-free data processing tools with robust performances, user-friendly employability, and real-time capabilities. DL methods are increasingly utilized across the biomedical imaging field, including biomedical optics 8 and neuroimaging modalities including fMRI, magnetoencephalography (MEG), and EEG. 9 Following this trend, DL methodologies have also been recently used for fNIRS applications. In this review, we provide a summary of these current efforts. First, we introduce the basic concepts of DL including training and design considerations. Second, we provide a synthetic summary of the different studies reporting DL models in fNIRS applications. This section is divided into subfield, namely brain-computer interface (BCI), clinical diagnostic, and analysis of cortical activity. We then provide a short discussion and future outlook section.

Deep Learning Methodology
DL can be viewed as black-box version of parametric machine learning techniques. Traditional machine learning techniques might make several assumptions about raw data distributions. Most notable is that data can be mapped to distinct classes (categorical data) or a regression line (continuous data) by a suitable transformation of input data. In parametric methods, weights are used to reduce multidimensional input data into separable space. These weights are learned by minimizing the objective function, trying to reduce error in prediction of score. With the introduction of hidden layers between input and output space, it is argued 10 that several abstractions can be learned that help in better distinction. These models are termed artificial neural networks (ANNs).
Inspired by biological neurons, ANNs generate a map between the input training data (x) and the output (y) using simple nonlinear operations performed at nodes, or "neurons," that form a computational graph. 11 The weights (Θ) of the edges of the graph are updated, i.e., "trained" by minimizing a loss function Lðy;ŷÞ that measures the difference between the model output (ŷ) and the true output (y). Network training is accomplished efficiently using the chain rule of differentiation in an algorithm known as backpropagation using gradient descent. 12 The number of nodes in each layer of the graph defines the width of the network, whereas the number of layers defines its depth.
A deep network is one with sufficient depth, though there is no consensus on how deep the network has to be considered a "deep neural network." (DNN) The entire network can be viewed as a differentiable function that learns a relationship between the input and the output using multiple levels of function composition, with the initial layers learning low-level features of the input, and the deeper layers extracting higher-level features. The advent of high-performance computing and the availability of large-scale data are fueling the current rapid advances in DL.

Deep Learning Architectures
DL derives its versatility from the available cell/nodal operations. These operations include linear transformations, filters, and gates that have been inspired by other domain-specific tools, which compute abstract features in the hidden layers. In general, the choice of these operations defines the application of a network. The three most commonly used types of networks used in fNIRS studies are shown in Fig. 1.
Dense networks such as multilayer perceptrons (MLPs) use linear transformation as cell operations and are synonymous to ANNs. Convolutional neural networks (CNNs) use convolution operations, where a fixed-size filter is used for convolution over the input image or feature map. These are shown to be highly capable of recognizing digits, 13 objects, 14 and images in general. Similarly, long-short-term-memory (LSTM) networks can be unfurled in the time domain to learn time series data, including handwriting 15 and semantics for language translation. 16 LSTM cells use gates to maintain a cell-state memory. 17 Appropriate data are passed through the cells in successive timesteps, avoiding the vanishing gradient problem (see Sec. 1.2.1) for long time series data. 18 A fundamental attribute of neural networks is introducing nonlinearities using so-called "activation functions"-based on the idea of the firing of biological neurons. The most common types of activation functions are listed in Table 1. The sigmoid function compresses values in the range of 0 and 1, activating large positive values while zeroing out large negative ones. The sigmoid function can also be used at the output for binary classifiers, for example, in Refs. 19 and 20. Dolmans et al. 21 used the sigmoid output for seven-class classification, although the softmax activation is widely used for multiclass classification. The softmax activation uses exponentiation followed by normalization to assign probabilities to the outputs. Higher softmax outputs at the output layer may be interpreted as confidence in the prediction. 22 The rectified linear unit (ReLU) activation is most widely used for the hidden nodes since it is computationally inexpensive and helps resolve the vanishing gradient problem. 22 It has proven to be efficient and effective for CNNs. It deactivates all nodes with negative outputs, which, although effective, deactivates those nodes throughout the training. This issue, also known as dead-ReLU problem, is solved by a recent activation function called exponential linear unit (ELU), which has also been used by Mirbagheri et al., 19 Saadati et al., 23 Ortega and Faisal, 24 whereas some also used another variant called leaky ReLU. 25,26 The final ingredient of neural networks is the loss function, which depends upon the output type. For classification tasks, the cross-entropy loss function 27 measures the difference in the probability distributions between ground truths and network predictions. Multiclass classifications [28][29][30][31] use categorical cross-entropy as the loss function, whereas binary classification problems use binary cross-entropy. For regression 32 or reconstruction 33 problems, the mean squared error loss function is used. These loss functions may be modified to implement constraints or regularization, e.g., a linear combination of MSE, variance, and two other metrics for a denoising autoencoder (DAE) in Ref. 34.

Training deep networks
After deciding on functions and architecture suitable to the problem, it is important to understand the underlying considerations involved to help the network learn the input-output map. It is common to normalize or scale the input to keep the parameters within tractable bounds. Two common methods include minmax scaling, where data are scaled between 0 and 1, and standardization, where data are scaled to have zero mean and unit variance. Though some networks have been shown to learn well without normalization 35 if the input data do not have a vast range, scaling is recommended to help convergence. Weight initialization also aids in convergence. 36 Techniques include the He and Glorot 36 initializations, where weights are drawn randomly from a normal or uniform distribution with predefined statistical moments. Moreover, dropout 37 is usually introduced between layers to randomly switch off a certain fraction of nodes so the network does not overfit the training data.
Gradient updates during backpropagation can either explode 38 or vanish if depth is too large; 22 ReLU and LSTM cells were developed primarily to overcome this problem. In addition, skip connections between deep layers help in propagating gradients backward, avoiding vanishing gradients. 39 These can be additive, e.g., ResNet, 39 or augmentative, e.g., UNet. 40 On the other hand, gradient clipping and use of the SELU 38 activation function have been proposed to solve the exploding gradient problem.
Deep networks have multiple hyperparameters that are not actively learned or updated during training but set based on the literature or using a systematic search process. Weights are successively updated backward from the output layer by computing the mean gradients from a batch of samples. This batch size is a hyperparameter, commonly set to 32, although it depends on Table 1 Activated outputs of input "x " from each of the activation functions is shown in the table. Also shown are the bounds/range of activated values.

Activation function
Output values Bounds the architecture and hardware capabilities. Larger batch sizes demand more computation and memory for a single iteration, whereas smaller sizes imply slower convergence, e.g., Fernandez Rojas et al. 41 used 64 batches in each update, whereas Wickramaratne and Mahmud 42 used a batch size of 8. Another critical hyperparameter that controls the convergence rate is the learning rate that multiplies the loss-gradient in the gradient descent algorithm. Learning rate is usually set to values of the order of 10 −2 at the start, and it can be decreased further for finetuning. For example, Ortega and Faisal 24 used an initial learning rate of 0.03, which decays at a factor of 0.9 after each epoch. An epoch is a point at which all nonoverlapping batches in the dataset have been exhausted for training. The number of epochs, also another hyperparameter, is decided strictly based on when the network begins to converge. The most common optimizer algorithm for gradient descent is Adam, 43 which uses an adaptive learning rate aided by momentum that helps network parameters to converge efficiently. Other algorithms also used in fNIRS applications are SGD 44 and RMSprop. 42 A notorious problem with training DNNs is the change in the distribution of inputs in every layer, which calls for careful initialization of weights, learning rate schedule, and dropout. Ioffe and Szegedy 45 introduced a technique called batch-normalization that helps mitigate this problem termed as "internal covariance shift." Batch normalization reduced training steps by a factor of 14 times in the original study while requiring less stringent conditions of initialization and learning rates. Hence, during batch training in CNN, it is often recommended to use batch normalization layers after each convolution. 19,24 Most biomedical applications have a limited number of subjects and limited training data (see Table 2 for a summary of the data set characteristics, including number of participants, number of channels, and cortical areas monitored for all studies summarized herein). Hence, it is essential to take proper measures to avoid using a large network with a small dataset to prevent overfitting. The network will reduce the training error but lose the ability to generalize to unseen datasets. To mitigate this, it is good to start with the simplest network possible, have fewer weights, and iteratively improve the networks by adding the number of layers/nodes. Furthermore, overfitting can be avoided by regularizing weights (which adds a regularization loss to overall loss function), adding dropout layers, etc.
Since fNIRS data are sequential time series data, studies where segments of obtained data, such as resting state, 46,47 are sufficient for analysis utilize a sliding window approach. Here, a fixed-length data segment is extracted at fixed intervals, allowing the overlap between subsequent windows. This is not applicable if the entire trial duration has to be analyzed, in which case each trial will have to be a single sample.

Model evaluation
Although various modeling tools are available for the analysis of collected data in psychological and behavioral science, there is a growing concern about reproducibility of results using the settings reported in studies. 91 Although direct replication or even conceptual replication 92 might not be possible for many neuroimaging studies involving human subjects, studies now rely on simulated replication as the next best approach. 93 This entails partitioning the data into a number of subsets and assessing model performance on each of the subsets after it is trained with the rest of the data. Performance metrics on n-subsets, a.k.a test sets, are averaged to get the mean performance metric, which is representative of the expected model performance on any unseen subset of data. Since the test set is "held out" from training, this method, also known as "cross-validation" (CV), can be deemed as a crude measure of the generalizability of the model. 93 The type of subsets chosen depends on experimental settings, research question, or the limitations of the dataset. The subsets selected can either be a shuffled subset of trials, one single trial, or, if available, trials of each participant subject. These CV schemes are called k-fold CV, leave-one-user-out (LOUO) CV, and leave-one-subject-out (LOSO) CV. Leave-one-supertrial-out is another available rigorous CV technique. 94 Many of the reviewed papers carry out a 10-fold CV, whereas a few execute LOSO CV. 48,49 The most common metrics used for evaluation are accuracy, specificity, and sensitivity.

Inputs to deep networks
After the preprocessing steps (see Sec. 2.1), changes in oxy (ΔHbO) and deoxy (ΔHbR) hemoglobin, as well as total change (ΔHbT) are obtained in time-series format. From these, data samples are segmented based on task settings. Data are segmented trialwise for task-based experiments, and for resting state data or long trials, data are segmented using sliding windows as mentioned previously. Many studies opt for the discrete probability distribution of concentration changes and extract statistical features such as mean, slope, variance, skewness, kurtosis, max, and range in the form of manual features. In other cases, where the network is allowed to learn and extract features itself, the data are fed in the forms of either spatial maps 20,28,50 or time series themselves. In some studies, segments of data are converted to other forms such as Gramian angular fields 32 or spectrogram maps. 30 A schematic in Fig. 2 summarizes the sample extraction procedure employed in the studies. While the general trends and techniques used for DL fNIRS studies have been examined, the applications and some application-dependent techniques are further discussed in the next section.

DL Applications in fNIRS
In this section of the review paper, we summarize and discuss the key findings of the literature search as well as how the techniques discussed above are currently being used. To properly understand the scope of this review, the literature search methodology is first described. The primary method of searching for relevant articles was via the PubMed search engine. Papers published between 2015 and May 2022 in which DL was used in conjunction with fNIRS data were considered for this review. Due to very few fNIRS papers being published using DL prior to 2015, the search was restricted to this time frame. Using the terms "deep learning" and "fNIRS" resulted in 35 results, 29 of which were relevant, while the terms "deep learning" and "NIRS" only produced 12 articles, none of which were both relevant and absent from the previous search. When searching the terms "neural network" and "fNIRS," 146 results were found, with many using the term neural network to refer to the neuroscience phenomenon being studied through the use of DL. As a result only three additional relevant papers were found via this search. In addition to this, using the same search terms in Google Scholar produced tens of thousands of results, many of which were not relevant. Due to the impracticality of a comprehensive search of these results, the search was truncated after multiple pages of results yielded no relevant articles. From this Google Scholar search, an additional 25 articles were found, yielding a total of 63 articles that were considered in this review.
In recent years, the use of DL techniques in fNIRS studies has increased, and due to the versatility of fNIRS, DL has been applied to many different applications of fNIRS. While some of the studies that used DL used it for feature extraction or data augmentation, in most of the papers considered, DL was used as a classifier. As a result, those studies in which DL were used as a classifier are further subdivided into categories based on the application of fNIRS in the study. A comprehensive summary of the applications and DL architecture employed is Table 3, while the details of the experimental setups for the fNIRS studies are summarized in Table 2. Because fNIRS is of interest in studies on BCI, many of the studies found used DL classifiers for the classification of tasks for BCI applications. Other studies have used DL techniques as diagnostic tools, to detect various physiological and mental pathologies based on cortical activity. Finally, some studies, such as those using DL techniques to assess skill level and functional connectivity, were not common enough to be placed into a category of their own; however, these papers all focused on the analyses of cortical activity using DL techniques, and as such are grouped together. Figure 3 shows provided for visualization of the number of papers collected from each category for the given year. While some of the papers mentioned may fall within multiple categories, additional context from the paper such as the primary focus of the paper assisted in determining how the paper was categorized. This did not stop relevant papers from being discussed in more than one subcategory.

Preprocessing
Like most noninvasive neuroimaging modalities, raw fNIRS signals typically contain confounding physiological signals and other noise that originate from outside of the cerebral cortex, such as the hemodynamics of the scalp and changes in blood pressure and heart rate. 95 Most studies use some method of preprocessing to try to address this. While many of the papers presented here use band pass filtering or butterworth filtering, various other methods are employed throughout the literature. Independent component analysis (ICA) denoising, 12 wavelet filtering, 29 and correlation-based signal improvement filtering 51 are all methods used to remove some of the undesired physiological trends in fNIRS signals. Another recommended method is short separation regression, a method in which signals with only confounding physiological signals are simultaneously collected alongside fNIRS signals and the trends in these signals are removed from the fNIRS data. 96 Other algorithms such as Savitsky-Golay filtering 97 and temporal derivative distribution repair 98 are methods used to correct motion artifacts and baseline drifts in fNIRS signals. Many of these techniques are commonly used in fNIRS studies to improve the quality of the signal as well as ensure that the signal being assessed originates from the brain. Other recommended techniques involve prewhitening the data or decorrelating via PCA to remove any correlation between fNIRS signals prior to analysis, 96 further facilitating the analysis of signals originating in a region of interest.

Feature Extraction and Data Augmentation
One of the most notable problems with fNIRS data is the extensive manual feature extraction and artifact removal typically required for data to be analyzed, preventing many fNIRS studies from being applied in real time. As a result, more effective methods of preprocessing fNIRS data are being explored. One of the most promising benefits of DL is the ability to quickly and automatically learn and extract relevant features in fNIRS data. Some studies have already explored this benefit of DL. Tanveer et al. 20 reported the use of a DNN to extract the features that were fed to a K-nearest neighbors (KNN) classifier to detect the drowsiness of subjects during a virtual driving task. Using the features extracted from the DNN, the KNN was able to achieve a classification accuracy of 83.3%. Despite feature extraction typically being computationally expensive, even taking hours with a powerful GPU, the DNN exhibited a mean computation time of 0.024 s for 10 s time windows, a speed that would allow for feature extraction of a 30-min signal        While there have been other papers interested in using DL to extract features to feed into another classifier, there have also been papers that take raw fNIRS data and use the same neural network for feature extraction and classification. Despite end-to-end neural networks being seen as a more ideal solution than manual feature extraction, difficulties with low generalizability make them less commonly used. One study by Dargazany et al. 25 used an MLP (with two hidden layers) with raw EEG, body motion and fNIRS data to achieve a reported classification accuracy of 78% to 80% on a motor task with four classes, despite no denoising or preprocessing of data being done. Another study by Fernandez Rojas et al. 41 used raw fNIRS data as the input for an LSTM network, achieving a classification accuracy of 90.6%. To assess generalizability, a 10-fold CV was used, with the classifier achieving an accuracy of 93.1%. Both studies have provided evidence toward the claim that DL techniques are capable of removing the need for manually extracted features from fNIRS data, further progressing toward the real-time end-toend BCI. Despite the fact that the previously mentioned studies were able to achieve high accuracies without denoising or motion artifact removal, some studies are performed when a subject's head is in motion. In such studies, fNIRS data can be heavily compromised by specific artifacts in the raw data that could bias any classification task. While many algorithms are used to try to remove motion artifacts, Lee et al. attempted to use a CNN to recognize and remove motion artifacts without relying on the parameters that must be defined to use many of the popular motion artifact removal algorithms. 53 In this study, the raw fNIRS time series and the estimated canonical response were used as inputs to the network and the resulting CNR of the DL output was 0.63, outperforming wavelet denoising, which achieved a mean CNR of 0.36. Another study done by Gao et al. 34 used subjects who were performing a precision cutting surgical task based on the fundamentals of laparoscopic surgery (FLS) program, which required a large range of motion. With a DAE and the process shown in Fig. 4, 93% motion artifact removal in simulated data and 100% artifact removal in real data were reported, outperforming all comparable artifact removal techniques, including wavelet filtering and principal component analysis. Another study to look at DL for the removal of motion artifacts, Kim et al. 54 compared a CNN to the performance of wavelet denoising and an autoregressive denoising method. Using simulated data in a manner similar to Gao et al. to determine the ground truth, Kim et al. found that the CNN resulted in a mean square error (MSE) of ∼0.004 to 0.005, while the next best method, the combination of wavelet and autoregressive denoising, resulted in an MSE of ∼0.009. Despite these studies all using different metrics to measure the effectiveness of the network, there is clearly an interest in finding an effective method of removing motion artifacts from fNIRS data. It is clear that DL techniques have shown promise for the ability to process and extract features from fNIRS data, but due to a lack of large variety of open-source fNIRS datasets, there has been a recent interest in using DL to generate more fNIRS data for training models. Data augmentation is a technique in which new data are generated to reduce the need for large labeled datasets for many machine learning and DL algorithms that require a lot of labeled data. To be useful, this generated data must not be identical to any of the training data, but must also be realistic, i.e., it must remain within the distribution of the original dataset. 55 While this is a challenge, some DL techniques such as generative adversarial networks (GANs), have been used to accomplish this. Wickramaratne and Mahmud 55 have used a GAN to augment their fNIRS dataset to increase the classification accuracy of finger-and foot-tapping tasks. Without augmented data, a classification of 70.4% was achieved with an SVM classifier, and a CNN classifier achieved an accuracy of 80% when trained only on real data. Using a GAN to generate training data, the accuracy of the CNN classifier increased to 96.67% when trained on real data as well as 110% generated data. Similarly, Woo et al. 56 used a GAN to produce activation t-maps, which, when used to augment the training dataset of a CNN, increased the classification accuracy of a finger tapping task from 92% to 97%, showing that a network that is already performing well may benefit from data augmentation. While the previous studies have used GANs to generate an image representation of fNIRS data, only one study directly used a GAN to augment the dataset with raw time series fNIRS data. Nagasawa et al. 26 used a GAN to generate fNIRS time series data to increase the classification accuracy of motor tasks, as shown in Fig. 5. They reported that when augmenting the 16 original datasets with 100 generated datasets, the accuracy of the SVM classifier increased from around 0.4 to 0.733 while the accuracy of the neural network classifier increased from around 0.4 to 0.746. While still a relatively recent development in fNIRS studies, generated datasets using GANS have demonstrated the ability to increase the classification accuracy of commonly used classifiers such as CNN or SVM classifiers, once again demonstrating the versatility of DL techniques in fNIRS research.

Brain-Computer Interface
One of the most promising applications for fNIRS research is BCI. Many studies on BCI use traditional machine learning or DL techniques to identify a certain task based on cortical activation. Hennrich et al. 57 used a DNN to classify when subjects were performing mental arithmetic, word generation, mental rotation of an object and relaxation. The reported accuracy of the DNN with two hidden layers was 64.1%, which is comparable to the 65.7% accuracy of the shrinkage LDA despite the LDA using custom-built features while the input for the DNN was denoised and normalized fNIRS data. Kwon and Im 58 used similar inputs for their CNN, denoised and baseline corrected ΔHbR and ΔHbO 2 data, to classify mental arithmetic from an idle fixation task. This CNN achieved a classification accuracy of 71.20%, which surpassed the 65.74% accuracy of the shrinkage LDA classifier that used feature vectors as inputs. Wickramaratne and Mahmud 42 also used a CNN to classify mental arithmetic from an idle fixation task in 2021. Like Kwon and Im, a shrinkage LDA with feature vectors was used as the input. Unlike the previous study however, the inputs for the CNN were Gramian angular summation fields (GASF), which are a type of image that is constructed from time series data that maintains some temporal correlation between points. With this CNN and GASF inputs, a classification accuracy of 87.14% was achieved, once again outperforming the shrinkage LDA, which achieved an accuracy of 66.08%. Similarly, Ho et al. 59 also used a two-dimensional (2D) representation of fNIRS data to try and discriminate between differing levels of mental workload. In this study, Ho et al. found that using a CNN with spectrograms generated from fNIRS data achieved a classification accuracy of 82.77%. This was outperformed by a deep belief network, which is a type of network similar to an MLP. The deep belief network, using manually extracted features from the HbR and HbO 2 signals achieved an accuracy of 84.26%. In a similar mental workload task, Asgher et al. 60 found that using an LSTM with similarly extracted features from HbR and HbO 2 time signals resulted in a mental workload classification accuracy of 89.31%. While many of the studies presented compared the performance of DL techniques to that of traditional machine learning or other algorithms, Naseer et al. 61 compared the classification accuracy of an MLP to that of a kNN, Naive Bayes, SVM, LDA, and QDA algorithm. On a two-class mental workload task, the MLP achieved an accuracy of 96% while the QDA, Naive Bayes, and SVM classifiers all achieved similar accuracies of about 95% and the LDA and kNN algorithms performed much worse, with accuracies of 80% and 65%, respectively. All classifiers in this study used manually extracted features such as skewness and kurtosis of the fNIRS as inputs Fig. 5 Framework of GAN and data augmentation. (a) The generator creates the data from the random variables z, and the critic evaluates the generated and original (measured) data. (b) After the training process, the data generated by the generator (referred to as generated data) are combined with the original fNIRS data as augmented data. (c) Trial-averaged waveforms for the four tasks considered in a CV hold. The red lines denote measured original data and the blue lines denote generated data using WGANs. The shaded area represents 95% confidence intervals. Adapted from Ref. 26. into the algorithms. Hakimi et al. 62 also used manually extracted features from fNIRS time series to perform two-class classification between stress and relaxation states and with a CNN, achieved an accuracy of 98.69%, further reinforcing that DL can give very high classification accuracies with mental workload tasks.
While the classification of mental tasks such as arithmetic or word generation is commonly used, many forms of BCI are designed to help those who have restricted or reduced motor function. Because of this, another popular type of experiment found in BCI studies involves executing motor tasks. Trakoolwilaiwan et al. 29 were able to achieve an accuracy of 92.68% with a CNN in a three-class test to distinguish the finger tapping of the left hand, right hand, and both hands at rest despite the CNN being used as both a feature extractor and classifier. The CNN outperformed the SVM and ANN classifiers, which reported accuracies of 86.19% and 89.35%, respectively, which were given the extracted feature inputs commonly used in BCI fNIRS studies (signal mean, variance, kurtosis, skewness, peak, and slope). Since much of the interest in BCI applications involve aiding those who lack the ability to move, some studies focus not on the cortical activations of movement but rather on the cortical activations of imagining movement. In one of these motor imagery studies, Janani et al. 30 had subjects perform a hand clenching or foot tapping task, and shortly after, imagine themselves performing the same task. Two different types of input images were used to see from which method a CNN classifier would more effectively extract features. The first type of input image turned all of the data points within a 20-s window into an M × N matrix, where M is the number of data points and N is the number of channels. The second type of input used a short-time Fourier transform to turn the one-dimensional data into 2D time-frequency maps of each channel that were stacked on top of each other to form an input image. One study by Erdoĝan et al. 63 similarly performed classification between motor imagery versus motor execution using an MLP with a classification accuracy of 96.3% between finger tapping and rest and 80.1% accuracy between finger tapping and imagined finger tapping when using manually extracted features from fNIRS data. Hamid et al. 64 attempted to distinguish between a treadmill walking task and rest using bandpass filtered fNIRS data and an LSTM. The LSTM achieved an accuracy of 78.97%. When compared with traditional classifiers that had statistical features manually extracted, the kNN achieved the next best performance of 68.38% accuracy. The SVM and LDA were also outperformed by DL, achieving accuracies of 66.63% and 65.96%, respectively, despite using manually extracted features as inputs. This demonstrates the ability for DL to perform well on fNIRS data without requiring the extensive processing or feature extraction typically used in conjunction with other classifiers for fNIRS data. For many of these BCI motor tasks, being used for prosthetics would be more applicable. In these situations, more fine motor control using fNIRS would need to be assessed. Khan et al. 65 addressed this by performing six-class classification between rest and each finger on the right hand of the subjects, achieving an accuracy of 60%. Ortega and Faisal 24 attempted to distinguish between a left-and right-hand gripping task using a PCA to reduce dimensionality of the denoised time series data before feeding the segmented time series into a CNN-based architecture. The resulting accuracy of this study was 77%. To further investigate this, Ortega et al. 33 used a CNN with attention and simultaneously recorded EEG signals to try to reconstruct the grip force of each hand during the task. This resulted in an average fraction of variance accounted for of 55% when reconstructing the discrete grip force profiles, demonstrating that not only can the DL techniques distinguish which hand was performing a motor task, they also display progress toward using fNIRS and EEG signals to determine the amount of force exerted during that motor task. Ortega and Faisal 66 then attempted to use this architecture to determine force onset and which hand was providing more force. This resulted in a force onset detection of 85% but only a hand disentanglement accuracy of 53%, showing that there is still progress to be made toward the complex decoding and reconstruction of motor activities.
In the motor imagery tasks, the first input image method achieved an accuracy of 77.58% using HbO 2 data, while the second method achieved an accuracy of 80.49% using HbO 2 data. It could be noted that HbR and HbT were also tested but for both methods, HbO 2 showed consistently higher results. While HbO 2 is commonly used in fNIRS studies due to higher SNR, Yücel et al. 95 and Herold et al. 96 reported that trends found in HbO 2 signals but not in HbR signals may be due to higher sensitivity of HbO 2 to systemic signals not originating in the brain. For this reason, it is generally recommended that both HbR and HbO 2 signals are assessed in fNIRS studies. Other traditional classifiers were also used and the closest results were achieved by the metacognitive radial basis function network, which achieved a classification accuracy of 80.83%. Another study that focused on motor imagery, Ma et al. 35 used a type of time series data that included a one-hot label as the input for the neural networks. Of the DL techniques used, the fully connected network (FCN) and residual network (ResNet) achieved the highest average accuracy of 98.6%. Of the traditional machine learning techniques tested, the SVM had the highest classification accuracy of 94.7%. Interestingly enough, for the DL networks, the mean classification accuracy achieved with HbO 2 þ HbR þ HbT data, 98.3%, was the same as the accuracy when only HbR+HbT data were input, which was attributed to the feature extraction capabilities of the DL techniques. This study also evaluated the accuracy of individual channels, with single-channel classification accuracy ranging from 61.0% to 80.1% with the three highest accuracy channels being found in the somatosensory motor cortex and primary motor cortex.
While many studies have found considerable success in distinguishing between certain tasks using cortical activations, most studies use simple tasks in controlled environments and focus on distinguishing tasks from each other and resting state. For a practical BCI, more complex tasks and classification methods will need to be considered. Zhao et al. 67 addressed this by having participants perform a task in which they would pick up a table tennis ball with chopsticks and lift it about 20 cm in the air, using their nondominant hand. An LSTM was used to try to determine when the task was completed, and ΔHbO 2 data were used as the input, resulting in an accuracy of 71.70%, outperforming the 66.6% accuracy of the SVM that was given mean, variance, kurtosis, and skew features of the fNIRS data as inputs.
One commonly used technology for BCI studies is EEG, due to the high temporal resolution and portability and noninvasiveness. Because it has a high temporal resolution but low spatial resolution, it is common to combine EEG measurements with fNIRS, due to both being portable and noninvasive. In a motor imagery study, Ghonchi et al. 68 used fNIRS to augment the EEG data being collected. Three types of DL networks were used as classifiers, a CNN for its capability to extract special features, an LSTM for its ability to extract temporal features, and a recurrent CNN (RCNN) for its ability to extract both temporal and spatial features. The RCNN achieved the highest classification accuracy of 99.6% when both EEG and fNIRS data were used. Interestingly, the accuracy of both the CNN and LSTM increased when fNIRS data were added, jumping from 85% and 81% to 98.2% and 95.8%, respectively. This indicates that despite EEG data having higher temporal resolution, fNIRS data still contribute both spatial and temporal information. A 2016 study by Chiarelli et al. 69 also found an increase in classification accuracy when combining EEG and fNIRS data. When performing two-class classification on a motor imagery task with an MLP, the average accuracies of EEG and fNIRS data alone were 73.38% and 71.92%, respectively, but when using both modalities, accuracy increased to 83.28%, further reinforcing that the simultaneous acquisition of EEG and fNIRS can provide more relevant information than either modality on their own. Cooney et al. 70 found that when combining fNIRS and EEG data, they were able to distinguish between multiple combinations of overt speech with a CNN classifier, achieving an accuracy of 46.31%. When tested on imagined speech, the classifier achieved an accuracy of 34.29%, which is also higher than the random chance value of 6.25% for 16 possible combinations, which shows promise in the use of EEG and fNIRS for assisting patients who may be unable to verbally communicate. Sun et al. 71 used a CNN to try to distinguish between rest mental arithmetic and motor imagery tasks using both EEG and fNIRS data by generating tensors of the fused EEG and fNIRS data. For the motor imagery tasks, this resulted in an accuracy of 77.53% and for the mental arithmetic tasks, this resulted in an accuracy of 91.83%. Using the same dataset, Kwak et al. 72 applied a branched CNN architecture that used the fNIRS data to generate spatial feature maps, which were then fed to the EEG maps to try to obtain higher spatial resolution than ordinary EEG and higher temporal resolution than fNIRS. The resulting classification accuracies were 78.97% for motor imagery and 91.96% for mental arithmetic tasks, which are only slight improvements over the tensor fusion methods of Sun et al. Khalil et al. 73 also used a fusion of fNIRS and EEG data to distinguish between rest and a mental workload tasks. With a CNN, an accuracy of 68.94% was achieved when trained on data from 5 of the 26 participants. When training on 16 participants then performing transfer learning to an additional five participants, accuracy increased to 94.52%, exemplifying not only how important larger datasets are for DL but also how transfer learning can be used to address this. It may be of interest to explore the use of transfer learning from other fNIRS datasets, which use different hardware systems, different tasks, or different fNIRS channel arrangements, since many studies rely on collecting fNIRS data specific to their own applications. As a result, it would be important to see if transfer learning can help extract meaningful features from fNIRS data independently of the part of the brain being recorded or the hardware being used. While many studies have only recently begun looking to use fNIRS for real-time BCI applications, current studies in the field have found use in DL techniques for increased classification accuracy and automatic feature extraction.

Diagnostic Tools
One promising use of DL techniques with fNIRS is in clinical applications, most notably as a diagnostic tool. Xu et al. 46 recorded resting state fNIRS data from the bilateral frontal gyrus and bilateral temporal lobe of children. Using a single channel in the left temporal lobe, a CGRNN classifier was able to successfully classify autism spectrum disorder (ASD) in children with 92.2% accuracy, 85.0% sensitivity, and 99.4% specificity for 7 s of resting-state HbR data. As shown in Fig. 6, multiple HbO 2 and HbR channels showed statistically significant differences between the group with ASD and the control group. Xu et al. 74 used a CNN classifier with the attention layers and achieved an accuracy sensitivity and specificity of 93.3%, 90.6%, and 97.5%, respectively, using similar resting state data to classify between ASD and typically developing (TD) subjects. Xu et al. 75 further explored using fNIRS data to detect autism is subjects and found that using a CNN+LSTM classifier and HbO 2 data, they were able to achieve a singlechannel accuracy, sensitivity and specificity of 95.7%, 97.1%, and 94.3%, respectively, for the detection of ASD. Aside from autism, the use of fNIRS to diagnose other psychological disorders has been studied. With an average classification accuracy of 96.2%, Ma et al. 76 were able to distinguish bipolar depression from major depressive disorder in adults during a verbal fluency task using an LSTM. Wang et al. 77 managed to distinguish between healthy subjects and those diagnosed with major depressive disorder with an accuracy of 83.3%. This was accomplished using long recording times of 150 min each from a relatively large sample size of 96 subjects. As can be seen in Table 2, this is a larger sample size than any of the other fNIRS studies presented, which lends to confidence in the generalizability of this neural network to new subjects. Chao et al. 78 used a cascade forward neural network (a network similar to an MLP) to perform and achieved an average classification accuracy of 99.94% between depressed and healthy subjects when a fear stimulus was presented across 32 subjects. Chou et al. 79 used an MLP network to classify between subjects with first episode schizophrenia and healthy subjects, achieving a classification accuracy of 79.7% with only about 160 s of recorded data from each subject.
EEG technology is commonly used in clinical settings, however, some studies have used EEG alongside fNIRS for diagnostic studies. Sirpal et al. 44 used an LSTM architecture and fNIRS data to detect seizures with 97.0% accuracy, and with EEG data, achieved 97.6% accuracy, but with combined EEG and fNIRS data, accuracy increased to 98.3%, once again displaying how hybrid EEG-fNIRS recordings can increase classification accuracy, even when accuracy is already high. Rosas-Romero et al. 80 also combined fNIRS and EEG signals to detect epilepsy. Despite only having recordings from five subjects, a CNN was able to detect pre-ictal segments fNIRS and EEG data with an average accuracy of 99.67% with fivefold CV.
Another use for fNIRS is to help detect mild cognitive impairment (MCI), the prodromal stage of Alzheimer's disease. Yang et al. 81 employed three different strategies using CNNs to detect MCI. Using an N-back, Stroop, and verbal fluency task (VFT), a CNN trained on concentrations changes of HbO 2 achieved accuracies ranging from 64.21% in the N-back task to 78.94% in the VFT. When using activation maps as the inputs for the CNN, the accuracy ranged from 71.59% in the VFT to 90.62% in the N-back task. The final strategy employed, using correlation maps as inputs showed lower accuracies than the activation maps, with the highest accuracy being 85.58% for the N-back task. Yang et al. 47 used temporal feature maps as inputs for the CNN classifier, resulting in average accuracies of 89.46%, 87.80%, and 90.37% with the N-back, Stroop, and VFT, respectively. In another study done in 2021 by Yang and Hong,82 pretrained networks were used to distinguish between subjects with MCI and the healthy control group. Using resting state fNIRS data, the network with the highest accuracy, VGG19, achieved an accuracy of 97.01% when connectivity maps were used as the input, outperforming the conventional machine learning techniques, with LDA classifier reporting the highest accuracy of 67.00%. Ho et al. 83 attempted to use fNIRS and DL techniques to distinguish not only between healthy and prodromal Alzheimer's afflicted subjects but also subjects with asymptomatic Alzheimer's disease and dementia due to Alzheimer's disease. Not only did this study try to distinguish between different stages of Alzheimer's disease, the study used a notably large sample size of 140 participants, which was larger than any other study reported in this review as shown in Table 2. The 86.8% accuracy of the CNN-LSTM network when fivefold cross-validated not only shows the ability of the network to distinguish between a wide range of subjects with and without Alzheimer's disease but also shows the ability of this network to distinguish between different stages of Alzheimer's disease, making this a very promising tool for clinical use.
Yet another clinical application for fNIRS measurements with DL techniques was explored by Fernandez Rojas et al., 41 where raw fNIRS data and an LSTM were used to distinguish between high and low levels of pain as well as whether the pain was caused by a hot or cold stimulus with an achieved accuracy of 90.6%. Being able to assess the intensity of pain as well as the cause of it could be exceptionally useful in instances where patients are unable to communicate, such as with nonverbal patients. This further displays the usefulness of fNIRS with DL techniques as a robust and accurate diagnostic tool.

Analysis of Cortical Activations
Outside of BCI and diagnostic tools, fNIRS data still have many uses. Understanding the functional connectivity of the brain is an essential part of understanding the mechanisms behind numerous neurological phenomena. As a result, there is an interest in using neuroimaging to understand the functional connectivity of the brain. Behboodi et al. 84 used fNIRS to record the resting-state functional connectivity (RSFC) of the sensorimotor and motor regions of the brain. In this study, four methods were used, seed-based, ICA, ANN, and CNN. Unlike the first two methods, very limited preprocessing was used for the ANN and CNN, with both using filtered HbO 2 data to form the connectivity maps. Each connectivity map was then compared with the expected activation based on the physiological location of each detector, which was used as the ground truth, using an ROC curve, as shown in Fig. 7, with the CNN achieving an area under the curve (AUC) of the receiver operating characteristic curve (ROC curve) of 0.92, which outperformed the ANN, ICA, and seed-based methods with each reporting an AUC of 0.89, 0.88, and 0.79, respectively. Another study that investigated RSFC, Sirpal et al., 85 collected EEG and fNIRS data and attempted to use the EEG data and an LSTM to recreate the fNIRS signals. Using only the gamma bands of the EEG signal was found to have the lowest reconstruction error, below 0.25. The reconstructed fNIRS signals were further validated by comparing the functional connectivity of the signals constructed using only gamma bands and those using the full spectrum EEG signals with the functional connectivity of the experimental fNIRS data. Despite the signals formed using gamma bands only having a lower reconstruction error, the root MSE of the full spectrum EEG signals was consistently lower.
Other types of fNIRS studies have also been done that utilized DL. Some studies have analyzed the cortical activations of subjects to predict emotions. Bandara et al. 86 used music videos from the DEAP database 99 to classify the emotional valence and arousal of subjects using a CNN +LSTM architecture and fNIRS data recorded from the prefrontal cortex. Using the subjects' self-assessments as ground-truth, a classification accuracy of 77% was reported. Another study collecting fNIRS signals from the prefrontal cortex, Qing et al. 87 used a CNN to determine the preference levels of subjects toward various Pepsi and Coca-Cola ads, achieving an average three-class classification accuracy of 87.9% for 30 s videos. 15 and 60 s videos showed similar accuracies of 84.3% and 86.4%, respectively. Similarly, Ramirez et al. 90 attempted to decode consumer preference toward 14 different products. With a CNN, Ramirez et al. were able to distinguish between a strong like and strong dislike of the presented product with an accuracy of 68.6% with fNIRS data, 77.98% with just EEG data, and 91.83% with combined fNIRS and EEG data. Hiwa et al. 88 studied the use of fNIRS and CNNs for the identification of a subject's gender, achieving an accuracy of ∼60% when using the filtered fNIRS data from only five channels.
While most studies have focused on using cortical activations to classify when a specified task is being completed, a few studies have begun looking into predicting the skill level of the subject at a given task. Andreu-Perez et al. 89 classified the expertise of subjects watching 30 s clips of the video game League of Legends using fNIRS data and facial expressions. The fully connected deep neural network (FCDNN) and deep classifier autoencoder (DCAE) were compared against many traditional machine learning techniques, including SVM and kNN in a threeclass test to determine the skill level of the subject watching. Using only fNIRS data, the DL classifiers had the two highest accuracies of 89.84% and 90.70% for the FCDNN and DCAE, respectively, while the most accurate machine learning technique, SVM, only achieved an accuracy of around 58.23%. When the predicted emotion scores were also included, the accuracy of the FCDNN and DCAE improved to 91.44% and 91.43%, respectively. Most of the machine learning techniques also saw minor or no improvements, with the SVM still achieving an accuracy of 58.69%. The XGBoost classifier saw a large increase in accuracy when emotion scores were added, increasing from 50.79% to 71.55%, which was still about 20% lower than the accuracies of the DL techniques used. Another study by Gao et al. 32 looked to predict the surgical skill level of medical school students who were being assessed on tasks based on the FLS protocols. In this study, subjects would perform an FLS precision cutting task while fNIRS data were recorded from the prefrontal cortex. The subjects would be assessed and scored in accordance with FLS procedures. A brain-NET architecture was used to predict the FLS scores of each participant based on the features extracted from the recorded fNIRS data. The brain-NET model reported an ROC AUC of 0.91, outperforming the kernel partial least squares, random forest, and support vector regression methods that were also tested when the dataset was sufficiently large. Figure 8 shows the R 2 value of each methodology as the size of the dataset increases. The ROC curve can also be seen along with a graphic of the experimental setup in Fig. 8. From the wide range of studies published, there are many different applications of fNIRS currently being researched, and the use of DL techniques has become of interest in recent years.

Discussion and Future Outlook
Over the last two decades, machine learning has become increasingly popular for processing neuroimaging data due to its benefit over traditional analysis methods. 101 Of importance, ML methods enable fully processing spatiotemporal data sets and allow for inferencing at the single subject/trial level. Among all ML approaches, DL is becoming increasingly utilized over the last half decade with great promise. 102,103 Following similar trends, DL models have found increased utility in fNIRS applications, ranging from simplifying the data processing pipeline to performing classification or prediction tasks. DL models are expected to outperform ML methods due to their potential to directly extract features from raw data (no need to perform prior feature extraction) and learn complex features in a hierarchical manner. This seems to be further supported by the findings of this review in which, out of the 32 papers reporting on a comparative study of DL techniques to traditional machine learning techniques, 26 have been shown outperforming the latter in terms of classification accuracy. Such trends have also been reported over a large range of biomedical applications, but we cannot exclude a publication bias due to the specific nature of the review topic. Still, the application of DL to fNIRS is in its very early stages and faces many challenges.
First, the implementation of DL models is an expert field. The selection of the main architecture as well as the number of layers, the activation function, are still dependent on the user expertise but greatly influence the performance and applicability of any DL model. If this design flexibility enables powerful implementations, it leads to a wide range of architecture and hyperparameters employed in the set of work reviewed herein. Hence, there is still not a consensus on which architecture and hyperparameters are optimal for a specific problem and building on current work requires some level of technical expertise to assess which implementation would be optimal. This may be circumvented in the future with the advent of network automated designed via neural architecture search methods, 104 but these have not been yet applied to the field of fNIRS. Moreover, following the principles of the no-free-lunch theorem, it is expected that prior knowledge on the problem at hand should guide in the selection of the ML/DL algorithm. Hence, beyond technical expertise in DL, ones need also to have a good grasp of the neurophysiology as well as instrumentation characteristics used in the application to design optimal models. This leads to another significant challenge, which is associated with the data-driven nature of DL model training and validation. The lack of sufficient training data is a common challenge in the application of DL methods in neuroimaging. This is even more challenging for fNIRS applications that are less ubiquitous than MRI or EEG, which benefit from publicly available repository. As we are still far from being able to model the complexity of brain functions and dynamics, the DL models can be trained only on experimental data conversely to many other fields in which efficient in silico data generators are available. 105 This is highlighted in Table 3, which shows that almost all reviewed work depended on proprietary data and with relatively small number of subjects. Moreover, in numerous scenarios, the data quality can be poor such that a subset of the spatiotemporal data is missing or inadequate (for instance, compromised by motion artifacts, shallow physiological variations). As previously mentioned, data augmentation approaches have been implemented successfully to alleviate this challenge. Another approach is to leverage new developments in transfer learning that optimally refine well-trained networks on large data sets to smaller one. But still, these methods are expected to work well within homogeneous settings. As the field benefit from an increased number of fNIRS systems with varying optode characteristics and associated electronics, the raw characteristics of the acquired signals can greatly vary (SNR, CNR, sampling rate) and hence, limit generalizability. Another issue for generalizability is that overfitting may be especially prevalent in fNIRS studies where fNIRS data tend to be highly correlated. 95 Moreover, in many instances, the data set is imbalanced for available classes. Hence, it is crucial for eliciting confidence in the results to report on CV results. Herein, most reviewed work used k-fold CV, typically 5-or 10-fold CV. Still, in many applications, the data set is comprised of multitrials per subject. Hence, it is important to assess the potential bias associated with each subject. This can be performed using LOUO CV and LOSO CV, respectively. Still, such well-established methods were used only in 8 of the 63 papers considered in this review.
Last, despite demonstrating high performances, the DL implementations reported herein are suffering from the black-box issue. In other words, the extracted high-level features from the data inputs that lead to high task performances during training and validation are not accessible and hence, cannot be interpreted. However, in the last few years, various eXplainable AI (XAI) tools such as class activation maps, 106 Grad-CAMS, 107 and saliency maps 108 have been proposed to impart understandability and comprehensibility. 109 Such tools can, for instance, provide visual map(s) that highlight the main data features leveraged for the model decision. Such a map can then correlate the extracted features with known neurophysiology, specific application characteristics (for instance, hand switching during surgery), and/or correlate with other biomarkers, such as videos, gaze measurements, and motion tracking devices. For this reason, XAI tools are wellpoised to lead to the discovery of new spatiotemporal features that will advance our neuroscience knowledge at large. This is exemplified by the recent report of differences in activation maps between expert and novice surgeons while executing a certification task, with activation maps obtained via a dot-attention method in a DL classifier model. 110

Conclusion
We reviewed the most recent published work relevant to the application of DL techniques to fNIRS. This literature review indicated that DL models were mainly sued for classification tasks based on fNIRS data and that in most of the cases, DL model prediction accuracy outperformed traditional techniques, including established ML methods. Another subset of work reported on developing DL models to reduce the amount of preprocessing typically done with fNIRS data or increase the amount of data via data augmentation. In all cases, DL models provided very fast inference computational times. These characteristics have a transformative power for the field of fNIRS at large as they pave the way to fast and accurate data processing and/or classification tasks. Of note, DL models, when validated and established, offer the unique potential for real-time processing on the bedside at minimal computational cost. While the deployment of DL models that are widely accepted by the community face numerous challenges, the findings reviewed here provide evidence that DL will play an increased role in fNIRS data processing and use for a wide range of bedside applications. Moreover, as an ever-increased number of studies are made available to the community, it is expected that the next generation of DL models will have the possibility to be tested and validated in various scenarios.

Disclosures
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.