Deep learning-based hyperspectral image reconstruction from emulated and real computed tomography imaging spectrometer data

Abstract. The computed tomography imaging spectrometer (CTIS) is a hyperspectral imaging (HSI) approach where spectral and spatial information of a scene is mixed during the imaging process onto a monochromatic sensor. This mixing is due to a diffractive optical element integrated into the underlying optics and creates a set of diffraction orders. To reconstruct a three-dimensional hyperspectral cube from the CTIS sensor image, iterative algorithms were applied. Unfortunately, such methods are highly sensitive to noise and require high computational time for reconstruction thus hindering their applicability in real-time and high frame-rate applications. To overcome such limitations, we propose a lightweight and efficient deep convolutional neural network for hyperspectral image reconstruction from CTIS sensor images. Compared with classical approaches our model delivers considerably better reconstruction results on synthetic as well as real CTIS images in under 0.17 s, which is over 60 times faster compared with the standard iterative approach. In addition, the reshaping method we have developed enables a lightweight network architecture with over 100 times fewer parameters than previously reported.


Introduction
Hyperspectral imaging (HSI) refers to the acquisition of multiple images corresponding to more spectral bands than just the typical three color channels red, green, and blue in RGB imaging. Early applications of HSI included determining the spectra of stars and their composition 1 or the identification of clay minerals from airborne measurements. 2 During the last decade HSI has been becoming more important in areas such as medicine, 3 food inspection, 4 environmental monitoring, 5 and many others. Multiple approaches for hyperspectral data acquisition have been proposed in the literature and the reader might refer to an extended historical overview of HSI in Ref. 6 and different HSI systems in Ref. 7.
Computed tomography imaging spectrometer (CTIS) has been first proposed by Okamoto and Yamaguchi 8 and independently by Bulygin and Vishnyakov. 9 A three-dimensional (3D) data cube (x, y, and wavelength) is determined on the basis of projections taken from different "directions." The advantages of CTIS are its hyperspectral snapshot capability and simple setup. 10 The disadvantages are a large unused sensor area and a time-consuming iterative algorithm that is required to reconstruct the hyperspectral image. *Address all correspondence to Markus Zimmermann, markus.zimmermann@ito.uni-stuttgart. de 2 Computed Tomography Imaging Spectrometer A standard CTIS design is shown in Fig. 1. The object is illuminated by a broadband light source. A field stop is used to block the outer part of the intermediate image and thereby determines the systems' field of view. For our work, we use an aperture with a quadratic shape which results in a quadratic hyperspectral image. Two further lenses first collimate the light coming from the field stop and then image it onto the sensor. In between is a dispersive element that separates the light depending on the wavelength. Commonly, a diffractive optical element (DOE) (e.g., made of several crossed linear gratings) is employed. 12 We use a binary computer-generated hologram (CGH) that leads to an image shown in Fig. 2(a). In the central part of the image, the original broadband scene is located. Outside of this central area, we have 12 dispersed copies due to the first diffraction order of the CGH. It can be seen that the longer wavelengths, i.e., red region, are diffracted more than the shorter ones, i.e., blue region. This way the spectral information is spatially encoded on the sensor. Since we use a monochromatic sensor we obtain an image as shown in Fig. 2(b). The CGH is calculated using the direct binary search algorithm. 13 Research on CTIS has focused on reducing the unused sensor space with different diffraction patterns, [14][15][16][17] accelerating the reconstruction process, [18][19][20][21][22][23] and finding applications for CTIS in, e.g., fluorescence microscopy, 24 infrared spectroscopy, 25,26 and astronomy. 27,28

Hyperspectral Image Reconstruction from CTIS Data
Spectral information encoded on the sensor can be retrieved using iterative reconstruction algorithms. To reconstruct the hyperspectral cube, usually, the expectation-maximization (EM) algorithm is used. 18 EM uses calibration data in the form of recorded or simulated point spread functions (PSFs) of each object point and each spectral band to be reconstructed. Iterative reconstruction usually requires extremely high memory requirements and computation times. There are methods to reduce those requirements. However, despite increasing computing power, these algorithms still fail to achieve sufficient reconstruction speed for real-time applications. Our implementation of the EM algorithm as utilized for benchmarking purposes assumes a spatially invariant PSF and uses GPU acceleration which significantly reduces the computation time per iteration. Recently, White et al. 23 published their iterative reconstruction algorithm with GPU acceleration. Depending on the number of iterations, computation times from 0.5 (10 iterations) to 153 s (3100 iterations) were achieved for 24-spectral channels.
The reconstruction of hyperspectral images from CTIS images has lately been demonstrated by Huang et al. 29 using a convolutional neural network. Also Douarre et al. 30 introduced CTIS-Net for image classification using CTIS images. In this paper, we present an applicationindependent approach using precise reshaping inside a neural network architecture for the reconstruction of hyperspectral images from CTIS images. Our method demonstrates a clear superiority over EM with respect to processing time and robustness under noisy conditions. By applying prior knowledge about the system's PSF the complexity of the network can be significantly reduced.

Neural Networks for CTIS
In CTIS, the spatial information (zeroth diffraction order) and the spectral information (higher diffraction orders) are projected to different locations on the image sensor whereas large areas remain unused. These circumstances should be considered when designing a neural network for processing CTIS images.
Douarre et al. 30 took into account the characteristics of a CTIS image and separated the zeroth diffraction order from the higher diffraction orders. The higher diffraction orders are cropped and rotated to align them next to each other. The realigned diffractions are then processed by rectangular convolution kernels. The zeroth diffraction order is processed independently from the higher ones and is concatenated later. However, by rotating the higher diffraction orders, the underlying spatial information is also rotated and the neural network had the burden to learn to connect the right areas from each diffraction order.
Huang et al. 29 cropped the different diffraction orders and stacked them on top of each other as a new dimension. Since they only worked with horizontal and vertical diffractions, five subimages were cropped. The cropping eliminates most of the empty sensor space and no spatial information is altered due to rotation. However, the corresponding spectral information for one pixel is still far apart on the different layers, which requires additional and unnecessary work for the neural network and could be avoided by applying prior system knowledge to the CTIS image.

Proposed Architecture
In our proposed projection to cube (p2cube) architecture, the higher diffraction orders are reshaped into cubes before convolutions are applied. The reshaping is done in a way that classic convolution can connect areas where the corresponding spectral information is distributed. The idea is adapted from deformable convolutions where the kernels can connect different regions of the input. 31 While the deformation in deformable convolutions is learned during backpropagation we apply a fixed deformation since the pattern is known from the PSF. Even though this reshaping is rather simple, it plays a pivotal role in a robust reconstruction of the hyperspectral cube from the CTIS image.
The p2cube architecture is shown in Fig. 3 with a brief overview of the applied reshaping in the red box. Here, the input image is reshaped into 13 cubes. One cube for every higher diffraction order and one cube, where the zeroth diffraction order is replicated to fit the spectral dimensions of the other cubes and it provides the spatial information. The reshaped cubes are then processed first by a 3D deconvolution layer and then four convolutional layers which together form the baseline architecture of our approach. We added a U-Net 32 like extension shown in the gray box and the combined architecture will hereafter be referred to as p2cubeU. The number of trainable parameters from the two architectures is shown in Table 1. In comparison, the architecture from Huang et al. 29 has over 85 million parameters used to reconstruct hyperspectral images with a comparable size and the architecture from Douarre et al. 30 uses around 16 million parameters for a classification task on CTIS images.
The reshaping is visualized in more detail in Fig. 4 for two higher diffraction orders. Several squares are cropped and stacked on top of each other. The step size between the squares can be controlled in the model. A step size of five pixels shows good results in combination with a fiveby-five kernel size. On the right side of Fig. 4, the reshaped cubes are shown. The white squares visualize the area which is connected by a convolution kernel. The connected area from the cube is projected back to the original diffraction order. It can be seen that the classic convolution, which is applied to the cube, connects exactly the area where the spectral information from a pixel in the object space is distributed along. This way even the diagonal diffraction orders can be processed by rectangular convolution kernels. By applying a 3D deconvolution to all thirteen cubes, all areas where the spectral information is distributed are connected.   Fig. 3 Schematic visualization of the p2cube architecture with the reshaping layer visualized in more detail in the red box, and the U-Net like extension in the gray box, the combination of both architectures is referred to as p2cubeU.

Data Generation
The training data consists of sensor images as they are captured with a CTIS system together with the hyperspectral representation of the scenes (ground truth). In the following, we describe the synthetic data generation pipeline and our CTIS system that was used to acquire a real dataset with sensor images and corresponding hyperspectral ground truth.

CTIS-Emulation for synthetic data generation
We created a synthetic data set by emulating the CTIS system using Fourier optics to calculate the PSF for every wavelength (color) channel as described in Ref. 33. A monochromatic point source is considered as a scene in the far-field, which results in a uniform wavefront in the Fourier space. The CGH introduces a spatially dependent phase shift according to the optical path length. Our binary CGH creates a phase shift of π at 550 nm. The image in the Fourier space is cropped differently for every wavelength to keep the pixel size in the image plane constant and, therefore, allows the simulation of the CTIS image without interpolation in the spatial dimension. We use a sensor resolution of 1200 × 1200 pixels. An inverse Fourier transformation leads to the wavefront in the image plane. By taking the square of the complex magnitude from the wavefront, the intensity at the sensor (PSF) is obtained. This has to be done for every wavelength. Afterward, each wavelength-dependent PSF is convolved with the corresponding spectral band from the hyperspectral cube. By summing up all spectral channels the final grayscale sensor image is obtained. We use a spectral resolution of one nanometer to create the PSF. The hyperspectral images are interpolated in the spectral dimension to fit the calculated PSF. It has to be noted that this method does not take optical aberrations into account and assumes a spatially invariant system. Only diffraction due to the finite lens diameter is considered.
Hyperspectral images from three different publicly available datasets are used: BGU iCVL Hyperspectral Image Dataset, 34 CAVE Multispectral Image Database, 35 and TokyoTech 31-band Hyperspectral Image Dataset. 11 We extracted the bands from 450 to 690 nm with a spectral resolution of 10 nm, which leads to 25 spectral channels. The original images are first randomly split into training (76.5%), validation (19%), and test (4.5%) sets. Then, using a sliding window, nine subimages were extracted from a given image and then resized to a spatial resolution of 100 × 100 pixels. A tenth image is the whole original one but resized to 100 × 100 pixels also. This way the amount of samples used to train the network is increased by a factor of ten to a total number of 2585 hyperspectral images.

Real world CTIS data
Even though the emulation of the CTIS system using Fourier optics takes into account effects such as wavelength dependent diffraction efficiency, it does not include all optical aberrations from the lenses and internal reflections in the setup. To test the performance of the proposed neural network for real-world data, a real dataset of CTIS images paired with hyperspectral ground truth cubes was captured. A schematic of the expanded optical setup is shown in Fig. 5. The tunable color filter allows to record the spectral channels from 455 nm to 695 nm with a 10-nm step size. 36 A second optical path is added by inserting a beam splitter after the field stop. There, the intermediate image is again collimated and spectrally filtered by a tunable color filter. 36 A second sensor records the spectral channels sequentially. All images were recorded from 455 to 695 nm with a step size of 10 nm resulting in 25 spectral channels. The CTIS images have a resolution of 1088 × 1088 pixels and the hyperspectral ground truth images were downsampled to match the resolution of the zeroth diffraction order of 139 × 139 pixels. The recorded real-world dataset consists of 495 images that were split into 400 images for training, 80 for validation, and 15 images for testing.

Training Details
Keras with TensorFlow 2 and GPU support is used to build, train, and test the neural network. 37 The employed loss function for training and validation is the mean squared error (MSE). Adam optimizer 38 is used for training. The last convolutional layer uses a rectified linear unit as an activation function whereas the other layers use the default configuration of TensorFlow. The images are normalized to [0,1].
After splitting the images as described in Sec. 3.2, no further data augmentation is applied. However, a rotation by 90 deg or a multiple of 90 deg would be possible for the synthetic data set. Since the real-world data set contains aberrations that are not rotationally symmetric, the training would probably not benefit from such augmentation.
To simulate realistic camera noise for the synthetic dataset, a dynamic shot noise layer is added. 39 A quantum full-well capacity of 1000 is set. To do this, the normalized image is scaled to 1000. Then the shot noise is applied and the image is then divided by 1000 so that the values are again between zero and one.

Results
In the following, only results that are part of the test data set and thus not utilized during training or validation are discussed. In addition, to assess the influence of noise, we use one data set without noise and another data set consisting of the same images with shot noise added to the detector images. The neural network is trained with and without noise accordingly. For the EM algorithm, a noise-free calibration (PSF image) is used for both sets. The number of iterations for the EM algorithm was fixed to 200 since the performance stagnated afterward. Hyperspectral reconstruction from real CTIS images with the EM algorithm was not successful and is therefore not presented. The failure of the EM algorithm is due to a higher noise level in the sensor images and the requirement of a spatially invariant PSF, which is not fulfilled by the real CTIS setup.
To evaluate the reconstruction performance, we used three different quality criteria as presented in Tables 2 and 3. An individually optimized scaling factor is applied to the reconstructed images from the EM algorithm to deliver the best result. This is necessary since the sensor images are normalized and the total energy is lost. This is valid because only relative and not absolute values are of interest anyway. As the first quantitative criterion, we used the root mean squared error (RMSE). Since the data range is between 0 and 1, the error can be directly Table 2 Quantitative comparison between EM, p2cube, and p2cubeU architectures for synthetic data. The values labeled with noise are derived from the noisy sensor images. For RMSE and MAE a small value close to 0 shows good agreement of the reconstruction with the ground truth data. For MeanSSIM, in contrast, a value close to 1 shows a good structural similarity of the data. interpreted. Similar to the RMSE is the mean absolute error (MAE), which, in contrast to the RMSE, takes individual strong deviations (peaks) less into account. The mean structural similarity index measure (MeanSSIM) on the other hand is a completely different metric that is usually used to determine the quality of color images. In our case, we use the mean values over all channels. This, therefore, does not account for spectral characteristics anymore. 40 However, we consider it is still a valuable metric since it is different than the others. Respective results are shown in Table 2.
It can be seen that p2cubeU delivers the best results for all metrics for noise-free and noisy images. The EM algorithm is only slightly worse for the noise-free case. For noisy images, however, the quality of the images reconstructed with the EM algorithm decreases significantly. Here, the p2cube and p2cubeU architectures show their strength and can deliver results that are close to the noise-free ground truth images.
Qualitative comparisons are shown in Figs. 6 and 7. The RGB representation of the different images is obtained via the CIE 1931 filter response functions. Notice that the reconstruction from noise-free CTIS images works pretty well with all three methods. However, all images show some imperfections. For the EM reconstruction, one can see straight artifacts that correspond to the direction of diffraction of the DOE. This is typical and can often be observed with such kind of algorithms. With the neural network, it is sometimes noticeable that the color of small details is not reproduced correctly, e.g., the small stripes in the lower left part of the butterfly wing. For the emulated images with noise, it can be seen that the neural network-based approach delivers better results since the images from the EM algorithm become very noisy in the spectral dimension.   In Fig. 8, reconstructed samples from the real CTIS data are shown. Notice that the images from p2cube and p2cubeU show very few artifacts in the spatial dimension. Furthermore, p2cubeU achieves better performance and produces well-reconstructed images inline with the corresponding ground truth.

Conclusion and Outlook
In this paper, we presented a deep convolutional neural network for hyperspectral image reconstruction from synthetic as well as real CTIS data. Since spectral and spatial information overlap on the sensor image, it is rearranged in a first step so that the subsequent convolutional layers can better link spatially related areas. This reshaping organizes the CTIS image in a more efficient way for the network than others have demonstrated so far and it ensures a robust training/ reconstruction with over 30 times less parameters than Douarre et al. 30 for their classification task and 100 times less parameters than Huang et al. 29 for their hyperspectral image reconstruction. Our approach produces high-quality reconstruction results on synthetic and real CTIS data.
The advantages of the proposed neural network are the reduced computation time for inference by a factor >60 compared with the conventional EM algorithm and the reduced complexity compared with other networks dealing with CTIS data. 29,30 With a computation time between 0.14 and 0.17 s per image on a GTX 1060 6 GB, it enables real-time performance with common hardware.
In our experiments, the spatial resolution of the hyperspectral images is the same as the spatial resolution of the zeroth diffraction order. However, it should be possible to generate higher spatial resolution images from CTIS images since the spatial information is also multiplexed in the higher diffraction orders. This will be the focus of future work.