Bundle geodesic convolutional neural network for diffusion-weighted imaging segmentation

Abstract. Purpose: Applying machine learning techniques to magnetic resonance diffusion-weighted imaging (DWI) data is challenging due to the size of individual data samples and the lack of labeled data. It is possible, though, to learn general patterns from a very limited amount of training data if we take advantage of the geometry of the DWI data. Therefore, we present a tissue classifier based on a Riemannian deep learning framework for single-shell DWI data. Approach: The framework consists of three layers: a lifting layer that locally represents and convolves data on tangent spaces to produce a family of functions defined on the rotation groups of the tangent spaces, i.e., a (not necessarily continuous) function on a bundle of rotational functions on the manifold; a group convolution layer that convolves this function with rotation kernels to produce a family of local functions over each of the rotation groups; and a projection layer that uses maximization to collapse this local data to form manifold-based functions. Results: Experiments show that our method matches state-of-the-art performance while using fewer than 10% of the parameters. We also conducted a model sensitivity analysis, training on 69.2%, 53.3%, and 29.4% of the original training set to determine how much data the model needs for the task. The results show that reducing the training data mildly lowers overall classification accuracy but boosts the accuracy for minority classes. Conclusions: This work extends convolutional neural networks to Riemannian manifolds and shows potential for understanding structural patterns in the brain, as well as for aiding manual data annotation.

manifolds with some simple form of orientation invariance, and we take DWI as the main application. There is a series of proposals trying to generalize R^2 convolutional neural networks to curved spaces, yet in our case, rotational invariance is a desirable design property: our goal is to understand spherical patterns up to rotations. We propose a general architecture for extracting and filtering local orientation information of data defined on a manifold, which allows us to learn similar orientation structures that can appear at different locations on the manifold. Reasonable manifolds have local orientation structures, namely rotations on tangent spaces. Our architecture lifts data to these structures and performs local filtering on them, after which it collapses them back to obtain filtered features on the manifold. This provides both rotational invariance and flexibility in design, without having to resort to complex embeddings in Euclidean spaces. Our contributions in this work are as follows:
• Instead of using Fourier-type methods such as irreducible representations as is done in the literature, we directly perform convolution numerically on the surface, as is done in classical CNNs in image analysis, which is far more light-weight;
• We lift the spherical function locally with SO(2) actions instead of lifting it to the full SO(3) group as is usually done in the literature, which makes our method a more general construction that is applicable to manifolds that are not spheres;
• We provide an explicit construction of the architecture for DWI data and show very promising results for this case, including learning and generalizing patterns of a dataset from only one scan.
This work is an extension of our previous publication. 1

Related Work
The importance of extracting rotationally invariant features beyond fractional anisotropy2 has been recognized in a series of DWI works. Caruyer and Verma3 developed invariant polynomials of spherical harmonic (SH) expansion coefficients and discussed their application in population studies. Schwab et al.4 proposed a related construction using eigenvalue decomposition of SH operators. Novikov et al.5 and Zucchelli et al.6 argued their usefulness for understanding microstructures in relation to DWI. There is, meanwhile, a vast and growing literature on deep learning (DL) for non-flat data or for group actions more complex than translations. Masci et al.7 proposed an NN on surfaces that extracts local rotationally invariant features. A non-rotationally invariant modification was proposed by Boscaini et al.8 On the other hand, convolution generalizes to group actions beyond translation, and this has led to group-convolution neural networks for structures where these operations are supported, especially Lie groups themselves and their homogeneous spaces.9-15 Global equivariance is often sought but has proved complicated or even elusive in many cases where the underlying geometry is nontrivial.16 An elementary construction on a general manifold was proposed by Schonsheck et al.17 via a fixed choice of geodesic paths used to transport filters between points on the manifold, ignoring the effects of path dependency (holonomy when the paths are geodesics). This dependency can be removed by summarizing local responses over local orientations, which is what was done by Masci et al.7 On the other hand, Cohen et al.18 lifted spherical functions to the 3D rotation group SO(3) and used a generalization of the Fourier transform on it to perform convolution.
To explicitly deal with holonomy, Sommer and Bronstein 19 proposed a convolution construction on manifolds based on stochastic processes via the frame bundle, but it is, at this point, still very theoretical.
A number of works have applied DL to DWI as well, due to the unique structure of the data as orientation responses. Golkov et al.20 built multilayer perceptrons in q-space for kurtosis and NODDI mappings. Wasserthal et al.21 proposed a U-net-inspired structure for tract segmentation, while Sedlar et al.22 proposed a spherical U-net for neurite orientation. To take into account the spherical structure of the DWI data and the homogeneous structure of the sphere, Chakraborty et al.23 proposed a rotation-equivariant construction inspired by Cohen et al.18 for disease classification. Müller et al.24 proposed 6D, 3D-space, and q-space NNs with roto-translation/rotation equivariance properties.
In this work, we are interested in rotationally invariant features; thus we take a path closer to Schonsheck et al.17 and Masci et al.:7 we lift functions to functions on the bundle of tangent-space rotations of our manifold, a two-dimensional manifold, as opposed to Cohen et al.,18 where the lifting results in functions on SO(3), a three-dimensional manifold. We then add one or more extra local group convolution layers before summarizing the data and eliminating the path dependency. The proposed construction thus applies to oriented Riemannian manifolds, and no other structure (e.g., homogeneous or symmetric space) is used.

Method
Throughout this section, our references for Riemannian geometry can be found in do Carmo's classical book Riemannian Geometry.25 CNNs are generally described and implemented in terms of correlation rather than convolution, and we follow this convention in this section as well. Bekkers et al.14 used the fact that SE(2) acts on R^2 to lift 2D (vector-valued) images to R^2 × S^1 via correlation kernels. This is not, in general, possible when R^2 is replaced by a Riemannian manifold, where there is no obvious way to define these operations. One can, however, overcome this situation via a somewhat more complex construction. Therefore, we assume in the sequel that we are given a complete orientable Riemannian manifold M of dimension n; this will be the sphere S^2 in our case. We assume that the injectivity radius i(M) of M is strictly positive. As usual, the tangent space at a point x ∈ M is T_x M. An image is a function f = (f_1, ..., f_{N_c}) ∈ L^2(M; R^{N_c}), where N_c is the number of channels.
Operations will be performed by lifting the function to tangent spaces, and kernels are defined on tangent spaces. The exponential map Exp_x: T_x M → M sends a tangent vector v ∈ T_x M to the endpoint of the geodesic of length |v| starting at x in the direction of v; it is used to pull functions on M back to tangent spaces.
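For M = S^2, the exponential map and its inverse (the log map) have simple closed forms. The following is a minimal numpy sketch; the function names are ours, not from the original implementation.

```python
import numpy as np

def exp_map_s2(x, v, eps=1e-12):
    """Exponential map Exp_x(v) on the unit sphere S^2.

    x : (3,) unit base point; v : (3,) tangent vector at x, with <x, v> = 0.
    Follows the geodesic from x in direction v for arc length |v|.
    """
    norm = np.linalg.norm(v)
    if norm < eps:
        return x.copy()
    return np.cos(norm) * x + np.sin(norm) * v / norm

def log_map_s2(x, y, eps=1e-12):
    """Inverse of Exp_x: the tangent vector at x pointing to y (y != -x)."""
    cos_t = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < eps:
        return np.zeros(3)
    u = y - cos_t * x           # component of y orthogonal to x
    return theta * u / np.linalg.norm(u)
```

The log map is what pulls spherical signal values back onto the tangent plane of a vertex before convolution.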

Lifting layer
We first define transportable filters on tangent spaces to replace the kernels of CNNs. These filters will also be called kernels. To start with, a "pointed kernel" will be a function k = (k_1, ..., k_{N_c}) ∈ L^2(T_{x_0}M; R^{N_c}) at a "base point" x_0. We assume that Supp(k) ⊂ B_{x_0}(0, r), 0 < r ≤ i(M), the ball of center 0 and radius r in T_{x_0}M. A piecewise smooth path γ: [0,1] → M joining x_0 to x defines, via the Levi-Civita connection of M, a parallel transport P_γ: T_{x_0}M → T_x M, and this is an isometry. We set k_γ ≡ k ∘ P_γ^{-1}. In general, another smooth path δ: [0,1] → M joining x_0 and x defines another parallel transport P_δ: T_{x_0}M → T_x M, and P_γ ∘ P_δ^{-1} is a rotation R of T_x M, i.e., an element of SO(T_x M). It follows that k_δ = k_γ ∘ R. The γ-lift of f by k is the function

  F_γ(x)(R) = ∫_{T_x M} f(Exp_x(v)) k_γ(R^{-1} v) dv,  R ∈ SO(T_x M).  (1)

Note that because Supp(k) ⊂ B_{x_0}(0, r), this integral is defined on B_x(0, r), and Exp_x is a diffeomorphism from this domain to the geodesic ball B(x, r) ⊂ M. Now we choose, for each x in M, a smooth path γ_x that joins x_0 and x. As M is complete, we can, for instance, choose a family Γ = (γ_x)_x of minimizing geodesics. The mapping

  f ↦ F,  F(x)(R) = F_{γ_x}(x)(R),  R ∈ SO(T_x M),  (2)

lifts an M-image to the bundle of rotations of M (we refer to Gallier et al.,26 chap. 9, for a definition of bundles in differential geometry), denoted by SO(TM) in the sequel.
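In the discrete setting used later, the tangent-plane function and the kernel are both sampled on a polar grid of rays, so a rotation of the kernel becomes a cyclic shift of its rays. A minimal sketch of one lifting step under that assumption (function name and grid layout are ours):

```python
import numpy as np

def lift(f_tangent, kernel, n_rot):
    """Discrete gamma-lift at one vertex x (a sketch of Eq. (1)).

    f_tangent : (n_rays, n_samples) values of f pulled back to T_xM via Exp_x,
                sampled on a polar grid inside the kernel support.
    kernel    : (n_rays, n_samples) parallel-transported kernel k_{gamma_x},
                sampled on the same polar grid.
    n_rot     : number of discrete rotations (n_rot == n_rays here, so a
                rotation by 2*pi/n_rot is a cyclic shift along the ray axis).
    Returns F(x): one correlation response per discrete rotation R.
    """
    out = np.empty(n_rot)
    for r in range(n_rot):
        # rotating the kernel by 2*pi*r/n_rot = rolling it along the ray axis
        out[r] = np.sum(np.roll(kernel, r, axis=0) * f_tangent)
    return out
```

Rotating the input signal cyclically shifts the lifted response, which is exactly the equivariance the projection layer later exploits.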

Group correlation layer
The object F defined in Eq. (2) is a function on the total space of the bundle SO(TM) (Gallier et al.,26 chap. 9), supposed square-integrable (F ∈ L^2(SO(TM))). The situation is more complex than the one described in Bekkers et al.,14 as there is actually no reason that one can find a "continuous family" of paths x_0 ↝ x, ∀ x ∈ M. An example important to us: if M is the sphere S^2, one can take γ_x to be a minimizing geodesic between x_0 and x. It is unique, except when x = -x_0, where there are infinitely many of them.
Let K be an element of L^2(SO(T_{x_0}M)). The parallel transport P_γ induces a map between the rotation groups, and the transport of K along the path γ is

  K_γ(S) = K(P_γ^{-1} ∘ S ∘ P_γ),  S ∈ SO(T_x M).  (3)

The group correlation of F(x) with K_γ is

  (F(x) ⋆ K_γ)(S) = ∫_{SO(T_x M)} F(x)(R) K_γ(S^{-1} R) dR,  (4)

with dR the bi-invariant Haar measure on SO(T_x M). In general, we consider objects that are a bit more complicated. Instead of F being a section of L^2(SO(TM)), it is taken as a section of L^2(SO(TM))^{N_l}, meaning we have N_l channels, F(x) = (F(x)_1, ..., F(x)_{N_l}) ∈ L^2(SO(T_x M); R^{N_l}), and K also has N_l channels, K = (K_1, ..., K_{N_l}) ∈ L^2(SO(T_{x_0}M); R^{N_l}), and we replace Eq. (4) with

  (F(x) ⋆ K_{γ_x})(S) = Σ_{c=1}^{N_l} ∫_{SO(T_x M)} F(x)_c(R) (K_c)_{γ_x}(S^{-1} R) dR.  (5)

The group correlation layer at level l takes a section F of L^2(SO(TM))^{N_l} and uses N_{l+1} kernels to produce a section with N_{l+1} channels.

Fig. 1 In the figure on the left: the function is first mapped onto the tangent space of the point of interest via the exponential map, and κ^(2) is convolved with the mapped function to get F_2. Group correlation is then performed on the resulting image, followed by the projection layer, from which we get rotationally invariant responses. The bottom row shows the same process but with a different kernel parallel transport, illustrating that the responses of the convolutional layers are simply rotated. In the figure on the right: the bottom row shows S^2 with a regular icosahedral tessellation, a tangent plane at one of the vertices, and five sampled directions. The disk represents the kernel support. The middle row shows the actual discrete kernel used, with the 2π/5 rotations, and the top row represents the lifted function on the discrete rotation group.
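Discretely, with SO(2) sampled at n_rot angles, the group correlation of Eqs. (4) and (5) reduces to a circular cross-correlation along the rotation axis, summed over channels. A sketch under that assumption (names are ours):

```python
import numpy as np

def group_correlate(F_x, K):
    """Discrete SO(2) group correlation at one vertex (a sketch of Eq. (5)).

    F_x : (n_ch, n_rot) lifted multi-channel function on the discrete
          rotation group at x.
    K   : (n_ch, n_rot) kernel on the rotation group.
    Returns (F(x) * K)(S) for each discrete rotation S, summed over channels.
    """
    n_rot = F_x.shape[1]
    out = np.empty(n_rot)
    for s in range(n_rot):
        # K(S^{-1} R) for S = s: the kernel shifted by s along the rotation axis
        out[s] = np.sum(F_x * np.roll(K, s, axis=1))
    return out
```

As with the lifting layer, rotating the input only shifts the response along the rotation axis.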

Projection layer
The base point and path dependency in the lifting and group correlation layer definitions appear problematic. We can, however, reproject the results from these layers to standard functions on M, eliminating this dependency. The only condition is that the same family of paths is used both in the lifting and group correlation layers to parallel transport the kernels.
Indeed, from what precedes, the γ- and δ-lifts, though in general distinct, obey the simple relation

  F_δ(x)(S) = F_γ(x)(S R^{-1}),  where R = P_γ ∘ P_δ^{-1} ∈ SO(T_x M).  (6)

A direct computation shows that the outputs of the group correlation layer obey the same relation,

  (F_δ(x) ⋆ K_{δ_x})(S) = (F_γ(x) ⋆ K_{γ_x})(S R^{-1}),  (7)

where we used the fact that the normalized Haar measure on SO(T_x M) is bi-invariant, thus in particular right-invariant. Thus the following projection layer is well-defined and removes the base point and path dependency:

  (Π F)(x) = max_{S ∈ SO(T_x M)} F(x)(S).  (8)

Biases are added per kernel. Nonlinear transformations of ReLU type are applied after each of these layers. Note that without them, a lifting followed by a group correlation would actually factor into a new lifting transformation.
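In the discrete implementation, the projection is simply a maximum over the sampled rotations; because a change of transport path only translates each response along the rotation axis, the maximum is invariant to it:

```python
import numpy as np

def project(F):
    """Projection layer: collapse the rotation axis by maximization.

    F : (..., n_rot) responses on the discrete rotation group at each vertex.
    Changing the transport path only translates each F(x) along the rotation
    axis, so the maximum is unaffected.
    """
    return F.max(axis=-1)
```
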

Discretization and Implementation in the Case M = S^2
In this work, the manifold of interest is S^2. Spherical functions f: S^2 → R^N are typically given at a number of points and interpolated using a Watson kernel,27 which is also our choice. We use a very simple discretization of S^2 via the vertices of a regular icosahedron. Tangent kernels are defined over these vertices, sampled along the rays of a polar coordinate system respecting the vertices of the icosahedron. The radius of the circular kernels is chosen such that when a kernel is moved from one vertex to any of its five neighbors, the kernel supports before and after the move overlap. This is illustrated in Fig. 1. We use a single-shell setup (one value at each point on the sphere) in all our experiments since it is the most common case (and it is the case for our spinal cord data). However, a multishell setup is possible if we interpolate the functions for each shell at the same locations on S^2; the spherical function can then be treated as a multichannel function.
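The interpolation step can be sketched as follows; the kernel exp(κ⟨u, v⟩²) is antipodally symmetric, matching the symmetry of diffusion signals, and κ sets the smoothing/peak-preservation trade-off discussed below. Function name and normalization are our assumptions.

```python
import numpy as np

def watson_interpolate(query_dirs, sample_dirs, values, kappa=10.0):
    """Interpolate a spherical function with a Watson kernel (a sketch).

    query_dirs  : (m, 3) unit vectors where we want function values
                  (e.g., icosahedron vertices).
    sample_dirs : (n, 3) unit gradient directions of the acquisition.
    values      : (n,) measured signal values.
    Returns the normalized kernel-weighted average at each query direction.
    """
    w = np.exp(kappa * (query_dirs @ sample_dirs.T) ** 2)   # (m, n) weights
    return (w @ values) / w.sum(axis=1)
```
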

Experiments and Results
We evaluate our method on three datasets: a DWI scan of a spinal cord that had been dissected out post mortem from a deceased human female, a synthetic dataset that we generated, and the DWI brain scan dataset from the Human Connectome Project (HCP).28 The human spinal cord DWI data is single-shell, with a b-value of 4000 s/mm^2 and 80 directions per voxel. The HCP DWI data has three shells, b = 1000, 2000, and 3000 s/mm^2, with 90 directions per voxel. We train single-shell models, thus three separate models for the HCP data. In the model hyperparameter search for all experiments, we choose the hyperparameters that give us models with the lowest capacity that does not worsen performance. By doing this, we get models with proper capacity that are efficient to train while preventing overfitting. It is worth noting that as we increase model capacity, the stability of the models increases as well; that is, there is less fluctuation of the loss during training and there are fewer bad initializations. However, we choose the least complex models possible, since more complexity does not bring better performance in this case. As for the data smoothing/interpolation parameter κ of the Watson kernel, in all experiments we choose values that provide a trade-off between data smoothing and peak preservation. In that sense, the chosen hyperparameters are the ones that give us the best model performance.

Experimental Setup
After getting the responses from our proposed layers, we feed them into a small feedforward neural network (a single-layer perceptron) to perform our classification task. To validate our method, we compare the proposed framework with two experimental setups: (a) a baseline that feeds the smoothed signal values of each voxel directly into a feedforward neural network without our three-layer convolution; (b) S2CNN,18 which performs convolution on spheres by transforming the signals into the spectral domain. For all the experiments, we use the smallest model possible for both our method and S2CNN.18

Data description
The study was conducted on a deceased individual who had bequeathed her body to science and education at the Department of Cellular and Molecular Medicine (ICMM) of the University of Copenhagen according to Danish legislation (Health Law No. 546, Section 188). The study was approved by the head of the Body Donation Program at ICMM. Part of the data used here has been published in a previous report.29 Briefly, the spinal cord was dissected out from a 91-year-old Caucasian female without known diseases post mortem within 24 h after her death. The spinal cord was fixed by immersion in paraformaldehyde (4%), where it was kept for 2 weeks, after which it was transferred to and stored in phosphate-buffered saline until the MRI scanning was conducted. The spinal cord was placed in a Plexiglas tube and immersed in Fluorinert (FC-40, Sigma-Aldrich) to eliminate any background signal. The scanning was accomplished using a 9.4 T preclinical system (BioSpec 94/30; Bruker Biospin, Ettlingen, Germany) equipped with a 1.5 T/m gradient coil. The scanning was done in 29 sections of length 1.6 cm, thus covering the whole length of the spinal cord of approximately 40 cm. Between each section scan, the tissue was advanced 1.4 cm by a custom-built stepping motor system, resulting in a 0.2-cm section overlap. For each section, a T2-weighted 2D RARE structural scan was performed. Scan parameters were repetition time (TR) = 7 s, echo time (TE) = 30 ms, 20 averages, field of view 1.92 × 1.92 × 1.6 cm^3, and a matrix size of 384 × 384 × 80, resulting in a 50 × 50 μm^2 in-plane resolution and a slice thickness of 200 μm, i.e., a voxel size of 500,000 μm^3. The scanning time for the structural scan was 30 h.
We take individual voxels containing signals defined on S^2 as the input of the networks and achieve segmentation via voxel classification. Since the numbers of white matter and gray matter samples are not balanced, we use Focal Loss30 to counter the imbalance. We used 14 slices from the longest dimension for testing and the rest of the scan for training.
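Focal loss down-weights well-classified examples so that the abundant, easy majority-class voxels do not dominate training. A hedged numpy sketch of the multi-class form of Lin et al.'s loss30 (the signature is our assumption):

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=None):
    """Multi-class focal loss (a sketch).

    probs  : (n, C) predicted class probabilities (rows sum to 1).
    labels : (n,) integer class labels.
    gamma  : focusing parameter; gamma = 0 recovers plain cross-entropy.
    alpha  : optional (C,) per-class weights.
    """
    p_t = probs[np.arange(len(labels)), labels]
    # (1 - p_t)^gamma shrinks the loss of confidently correct predictions
    loss = -((1.0 - p_t) ** gamma) * np.log(np.clip(p_t, 1e-12, None))
    if alpha is not None:
        loss = loss * np.asarray(alpha)[labels]
    return loss.mean()
```
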

Results
We can see from Table 1 that all methods perform quite well on this simple task. Examples of predictions from our model and the ground truth can be found in Fig. 2. We observe that classifying white matter and gray matter is not a challenging task, considering that the baseline model works well on it. This is because there is already a significant difference between white matter and gray matter in the scale of the intensity values of the two tissues. However, our method and S2CNN18 achieve a better balance between the accuracies of the two classes than the baseline, which shows the importance of geometric information for recognizing minority classes. To test the rotational invariance of our method and its independence of signal scaling, we experiment further on the synthetic dataset and the HCP dataset.28

Dataset generation
To validate the robustness of our method against rotations, we create and classify functions defined on the sphere. We first uniformly sample 90 fixed directions on a hemisphere, and the spherical functions of all classes are defined on the same 90 directions. For each class, we sample 90 values from a Gaussian distribution as the function values for the 90 directions. Thus, the only difference among classes is the function values on the given 90 directions, and we sample the function values for each class from the same Gaussian distribution to keep the scales of the values identical. In addition, we rotate the sphere of each class and use these rotated spherical functions as the elements of that class. Therefore, each class of the dataset contains just rotations of one spherical function. As explained above, we interpolate the function values at the icosahedron vertices using a Watson kernel27 from the rotated 90 directions, each assigned a function value. For the baseline, we interpolate the function values at the same 90 directions that were sampled on the sphere, using the same scheme. We generate synthetic datasets with different numbers of classes (n ∈ {2, 4, 6}) to test the robustness of the model under different task difficulties. For each class, we generate 50 samples for the training set and 1000 samples for the test set.

Fig. 2 Examples of ground truth and predictions from the test data. (a)-(d) The same slices from the ground truth, the prediction from our method, the prediction from S2CNN, and the prediction from the baseline.
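The generation procedure above can be sketched in pure numpy as follows; the function names and the QR-based rotation sampler are our assumptions.

```python
import numpy as np

def random_rotation(rng):
    """Sample a uniform random 3D rotation via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # fix column signs (Haar uniformity)
    if np.linalg.det(q) < 0:
        q[:, [0, 1]] = q[:, [1, 0]]   # force det = +1 (a proper rotation)
    return q

def make_synthetic_dataset(n_classes, n_per_class, n_dirs=90, seed=0):
    """One random spherical function per class; samples are its rotations."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    dirs[dirs[:, 2] < 0] *= -1.0          # keep directions on one hemisphere
    samples, labels = [], []
    for c in range(n_classes):
        values = rng.normal(size=n_dirs)  # class-defining function values
        for _ in range(n_per_class):
            R = random_rotation(rng)
            samples.append((dirs @ R.T, values))  # rotated support, same values
            labels.append(c)
    return samples, np.array(labels)
```

Each sample is then interpolated at the icosahedron vertices with the Watson kernel before being fed to the networks.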
Architecture and hyperparameters. As in the experiment above, we use a lift-ReLU-conv-ReLU-projection-FC-softmax architecture for the network. We use k = 1, 5, n channels for the lift, conv, and FC layers, 0.6 as the kernel radius, and 5 rays with 2 samples per ray as the kernel resolution. For S2CNN,18 we use the architecture provided by the authors.

Results
See Table 2 for a comparison of the results from the different models. We can see that the baseline model is barely learning anything from the data, while our method and S2CNN18 capture the differences among the classes. Moreover, our method achieves more robust performance while having fewer degrees of freedom.

Human Connectome Brain Scans
As in the spinal cord data experiments, we train networks on individual voxels containing signals on S^2. Our goal is a voxel-wise classification of four brain regions: cerebrospinal fluid (CSF), subcortical regions, white matter, and gray matter.
We used the preprocessed DWI data31 and normalized each DWI scan for the b = 1000, b = 2000, and b = 3000 images, respectively, with the voxel-wise average of the b = 0 image.
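A sketch of this per-shell normalization step; the function name, shell tolerance, and epsilon are our assumptions.

```python
import numpy as np

def normalize_by_b0(dwi, bvals, shell, tol=50.0, eps=1e-8):
    """Voxel-wise normalization of one shell by the mean b = 0 image.

    dwi   : (..., n_volumes) DWI data; the last axis indexes gradient volumes.
    bvals : (n_volumes,) b-values of the volumes.
    shell : target b-value (e.g., 1000, 2000, or 3000).
    Returns the shell volumes divided voxel-wise by the average b = 0 signal.
    """
    b0_mean = dwi[..., np.abs(bvals) < tol].mean(axis=-1, keepdims=True)
    shell_vols = dwi[..., np.abs(bvals - shell) < tol]
    return shell_vols / (b0_mean + eps)
```
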
The labels provided with the T1-image were transformed to the DWI using nearest neighbor interpolation (Fig. 3). Since the four brain regions we are classifying have imbalanced numbers of voxels, we use Focal Loss 30 to counter the class imbalance of the dataset just as in the spine data experiments.
Architecture and hyperparameters. As in the experiments above, we use the icosahedron structure as kernel locations for our method and lift-ReLU-conv-ReLU-projection-FC-softmax as the network architecture, with k = 1, 5, 4 channels, r = 0.6 as the kernel radius, and 5 rays with 2 samples per ray as the kernel resolution. For S2CNN,18 we again use the architecture provided by the authors, S^2conv-ReLU-SO(3)conv-ReLU-FC-softmax, with bandwidths b = 30, 10, 6 and k = 3, 6, 4 channels. For the baseline, we again use an FC(90)-ReLU-FC(50)-ReLU-FC(30)-ReLU-FC(4) architecture. We use κ = 10 for the Watson kernel and 0.001 as the learning rate for all models.

Table 2 Test accuracy for models evaluated on the generated datasets. Numbers in brackets are the numbers of parameters for each model. The baseline model produces prediction accuracies at the level of random guessing, while our method and S2CNN18 recognize rotations of the same spherical functions quite accurately, and our method achieves higher accuracy using fewer parameters than S2CNN.18

Results
We used 1 scan for training, 1 scan for validation, and 50 scans for testing. We chose this split of the dataset because it is the most light-weight for training. We also tried including more scans in the training set, but it did not noticeably improve the results; therefore, we used only one scan for training in the end. A comparison of the experimental results of the different methods can be found in Table 3. We can see that the baseline does not generalize as well as our method and S2CNN.18 Across different b-values, we observe that with increased b, over all experiments, it becomes harder to recognize CSF. Higher b does not reduce the accuracies for the majority classes for our method and S2CNN,18 so the overall accuracies of these methods do not drop much with increased b. Compared with S2CNN,18 we achieve very similar results, yet our model has far fewer degrees of freedom while reaching the same level of performance, as shown in Table 3. Examples of predictions from all models can be found in Fig. 4.

Model sensitivity analysis. We reduce the amount of training data for our method in order to test how sensitive our model is. As mentioned above, there is only one scan in the training set. That scan contains 7227 CSF voxels, 35,648 subcortical voxels, 276,191 white matter voxels, and 309,496 gray matter voxels. We therefore reduce the number of samples from all classes by randomly sampling a fraction of voxels from each class in the training scan and test how that impacts the performance of the model. The results can be found in Table 4.
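The per-class subsampling used in the sensitivity analysis can be sketched as follows (the signature is our assumption):

```python
import numpy as np

def subsample_per_class(labels, fractions, rng):
    """Randomly keep a fraction of training voxels per class.

    labels    : (n,) integer class labels of the training voxels.
    fractions : dict mapping class label -> fraction in (0, 1] to keep.
    rng       : numpy random Generator.
    Returns indices of the retained training voxels.
    """
    keep = []
    for c, frac in fractions.items():
        idx = np.flatnonzero(labels == c)
        # sample without replacement; keep at least one voxel per class
        keep.append(rng.choice(idx, size=max(1, int(frac * len(idx))),
                               replace=False))
    return np.concatenate(keep)
```
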
We see that reducing the number of samples in each class reduces performance. On the other hand, it also boosts the accuracy for the subcortical region, since the class imbalance is eased by the reduction. We observe that gray matter and white matter are so overrepresented in a scan that even when we discard most of the voxels from these two classes in the training set, the test results remain at a relatively high level of accuracy. This suggests an important application in automating DWI data annotation.

Discussion
This work shows how geometric information in DWI can be significantly useful for understanding general patterns in image analysis. In the future, we expect improvements in performance from adding spatial correlations, for example, convolutions in the product space R^3 × S^2 instead of merely S^2. This is ongoing work. The correlation of our model with fractional anisotropy (FA) and NODDI is worth investigating as well. Moreover, testing our model on scans in the HCP dataset28 with different numbers of diffusion gradients would also be desirable in later work. Our model generalizes well to the test set while trained with very little data (one scan), but this generalization is limited to data with the same distribution, i.e., acquisitions from the same scanner using the same protocol. A new dataset with a new acquisition protocol would require new training. It would be desirable to apply our model to datasets that consist of irregular scans, such as brains with tumors, and to unprocessed scans, unlike the HCP dataset, which contains only preprocessed healthy brains. Obtaining that kind of data is another challenge. Additionally, we have so far only tested our construction on S^2, yet an extension to other surfaces appears feasible with a smart choice of discrete representation. An extension to dimension 3 is worthwhile as well, which will require efficient SO(3) convolutions using, for instance, spectral theory for compact Lie groups.

Fig. 3 (a)-(c) The original diffusion data, the ground-truth segmentation, and the processed ground-truth label image. The label colors for CSF, subcortical, white matter, and gray matter are red, blue, white, and gray, respectively. The figures are only illustrations of the data; they are not necessarily from the same scan.

Table 3 Results from the HCP brain dataset. We can see that our method has the same level of performance as S2CNN,18 but uses far fewer parameters. The baseline model produces higher accuracy in recognizing the subcortical region, but at a high cost to the accuracies of the other classes.

Table 4 Results of the sensitivity analysis. The numbers in the first row are the numbers of samples in each experiment for CSF, subcortical, WM, and GM, respectively. We see that while reducing the size of the training set, the overall accuracies decrease to some extent, but the accuracies for the subcortical region are higher since the class imbalance is lower.

Conclusion
We proposed a simple extension of CNNs to Riemannian manifolds that learns rotationally invariant features. Our method allows us to learn general patterns from very limited data while having far fewer degrees of freedom than existing methods.18 This is significant because, in machine learning-based DWI analysis, we can now reduce the size of individual data samples from a whole volumetric image to a single voxel, as well as reduce the training dataset to a single scan, or a fraction of a scan. For the HCP dataset28 with a single-shell setup, our method, while taking the subcortical region into account, compares well with existing methods that have multishell input32,33 and that do not classify the subcortical region. We also achieved similar or better results compared to image registration-based methods.34 The results of this simple task show the great potential of this method for understanding structural patterns in brains. Moreover, the results of the model sensitivity analysis show that our method has potential for aiding manual data annotation. For example, a clinician needs to label only a fraction of a scan, and the rest can be automated by the model.

Disclosures
The authors declare no conflicts of interest of any kind.