Special-purpose computer for digital holographic high-speed three-dimensional imaging

Yota Yamamoto
Shintaro Namba
Takashi Kakue
Tomoyoshi Shimobaba
Tomoyoshi Ito
Nobuyuki Masuda

Special-purpose computer for digital holographic high-speed three-dimensional imaging

Yota Yamamoto,a,* Shintaro Namba,b Takashi Kakue,a Tomoyoshi Shimobaba,a Tomoyoshi Ito,a and Nobuyuki Masuda,b

aChiba University, Graduate School of Engineering, Chiba, Japan
bTokyo University of Science, Faculty of Industrial Science and Technology, Tokyo, Japan

Abstract. Digital holography is attracting attention because it can record instantaneous three-dimensional (3D) information and record dynamic phenomena. However, when recording high-speed phenomena, the frame rate ranges from tens of thousands to hundreds of millions, and the calculation time of the reconstructed images is a problem. We have developed a special-purpose computer for high-speed 3D imaging using digital holography. The developed special-purpose computer has four calculation modules and has achieved a calculation time 68 times faster than that of a personal computer with 48 cores. With the developed computer, a total of 32 reconstructed images can be calculated in 0.69 ms from four holograms of $128 \times 128$ pixels with eight varying depths. © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI. [DOI: 10.1117/1.OE.59.5.054105]

Keywords: optics; digital holography; high-speed three-dimensional imaging; field-programmable gate array.

Paper 20200299 received Mar. 13, 2020; accepted for publication May 15, 2020; published online May 28, 2020.

1 Introduction

The digital holography field began with electronic recording of holograms in the 1960s.1,2 Digital holography can dynamically capture information about the amplitude and phase of light. In addition, it is possible to record an instantaneous three-dimensional (3D) velocity field,3–7 which is attracting attention. Hologram resolution and image quality have also improved significantly. Gigapixel digital holography8,9 and large-scale holograms10,11 have been manufactured. Large-scale holograms enable deep 3D measurement and simultaneous imaging of a large number of particle images. However, a huge amount of time is required to calculate the reconstructed image. Research about optical methods is also important; however, computation time remains a problem, but algorithm improvements and hardware approaches are very effective.

Our research group has been developing a special-purpose computer for digital holographic particle tracking velocimetry (DHPTV) named fast Fourier transform (FFT)-HORN12,13 to realize fluid visualization. In DHPTV, the number of pixels is generally large ($>1024 \times 1024$ pixels) and the number of frames can be tens to thousands of frames.

In this study, we deal with high-speed 3D imaging using digital holography. High-speed 3D imaging can be realized by combining it with high-speed imaging.14–22 It is noteworthy that a faster camera produces a smaller number of pixels of the image sensor. When calculating reconstructed images, computational load is reduced using FFT.14 In FFT-based reconstructed image calculation, the calculation efficiency improves as the number of pixels increases, and the calculation load can be reduced. However, efficiency is reduced when the number of pixels is small. In other words, when the frame rate is improved and the number of pixels per hologram is reduced, calculation efficiency is reduced and more calculation time is required. When recording high-speed phenomena, the frame rate ranges from several thousand to several million, and a huge amount of time is required to calculate the reconstructed image.

*Address all correspondence to Yota Yamamoto, E-mail: y-yamamoto@chiba-u.jp

Optical Engineering 054105-1 May 2020 • Vol. 59(5)
To reduce computation time, we have developed a special-purpose computer for high-speed 3D imaging using a field-programmable gate array (FPGA). With a special-purpose calculation circuit, efficient calculations are possible even with a small number of pixels. In this study, 128 × 128-pixel hologram data were input. We developed a system that can calculate 32 reconstructed images in parallel with eight depths from four holograms.

2 Convolution Diffraction Integration Method

The intensity of the reconstructed image \( \phi(x_i, y_i) \) at a certain reconstructed distance \( z_i \) is calculated based on the Fresnel–Kirchhoff diffraction:

\[
\phi(x_i, y_i) = \int_{-N/2}^{N/2} \int_{-N/2}^{N/2} I(x_{\alpha}, y_{\alpha}) g(x_i - x_{\alpha}, y_i - y_{\alpha}) \, dx_{\alpha} \, dy_{\alpha}, \tag{1}
\]

\[
g(x_i - x_{\alpha}, y_i - y_{\alpha}) = \frac{\exp(ikz_i)}{ikz_i} \exp \left\{ \frac{ik}{2z_i} \left[ (x_i - x_{\alpha})^2 + (y_i - y_{\alpha})^2 \right] \right\}, \tag{2}
\]

where \( \phi(x_i, y_i) \) is the intensity of the reconstructed space at point \((x_i, y_i, z_i)\), \( I(x_{\alpha}, y_{\alpha}) \) shows the intensity pattern of the hologram, \( \lambda \) is the wavelength of light, \( k \) is the wavenumber of light expressed as \( k = \frac{2\pi}{\lambda} \), \( x_{\alpha} \) and \( y_{\alpha} \) are the coordinates on the hologram plane, and \( N \) is the number of pixels in the vertical and horizontal directions of the hologram surface and the reconstructed surface.

Equation (1) is a two-dimensional (2D) convolution integral about the \( x \) and \( y \) axes. It can be transformed using 2D Fourier transform; thus, Eq. (1) can be rewritten as follows:

\[
\Phi(n, m) = \hat{I}(n, m) G(n, m), \tag{3}
\]

where \( \Phi(n, m) \) is the 2D Fourier transform of \( \phi(x_i, y_i) \), \( \hat{I}(n, m) \) represents the Fourier transform of \( I(x_{\alpha}, y_{\alpha}) \), and \( G(n, m) \) represents that of \( g(x_i - x_{\alpha}, y_i - y_{\alpha}) \). The convolution integral of \( I(x_{\alpha}, y_{\alpha}) \) and \( g(x_i - x_{\alpha}, y_i - y_{\alpha}) \) can be expressed as the product of \( \hat{I}(n, m) \) and \( G(n, m) \) in the area after the Fourier transform by each Fourier transform. The intensity of the reconstructed surface can be obtained by performing an inverse 2D Fourier transform on the product calculation result and returning it to the original region. Here, FFT can be used to reduce the computational load.\(^3\)

The pixel interval between the hologram and reconstructed surfaces is defined as \( \Delta P \) in Eq. (2). The Fourier transformed \( G \), which is \( \Delta f = 1/(N\Delta P) \) as a minimum unit, is expressed as follows:

\[
G(n\Delta f, m\Delta f) = \exp \left\{ ikz_i - i\pi\lambda z_i \left[ \frac{n}{n\Delta P} \right]^2 + \left[ \frac{m}{m\Delta P} \right]^2 \right\} = \exp(ikz_i) \exp \left\{ 2\pi i \left( \frac{-\lambda z_i}{2N^2(\Delta P)^2} \right) (n^2 + m^2) \right\}. \tag{4}
\]

The 3D information of the measurement target is reconstructed by superimposing reconstructed images with different reconstructed distances (i.e., depths). Therefore, \( \exp(ikz_i) \) in Eq. (4) can be considered a constant when reconstructing a single surface. In addition, 256 gradations are generated based on the minimum and maximum values of \( \phi(x_i, y_i) \); thus, \( \exp(ikz_i) \) can be omitted. Therefore, the reconstructed image is calculated as follows:

\[
G'(n, m) = \exp \left\{ 2\pi i \left( \frac{-\lambda z_i}{2N^2(\Delta P)^2} \right) (n^2 + m^2) \right\}. \tag{5}
\]

In the following, \( G'(n, m) \) in Eq. (5) is referred to as \( G(n, m) \). Here, the depth-related coefficient \( z_{\text{param}} \) is defined as follows:
Equation (5) is expressed by Eq. (6) as follows:

\[ G(n, m) = \exp[2\pi i z_{\text{param}}(n^2 + m^2)]. \]  

(7)

3 Implementation of Special-Purpose Computer

3.1 Hardware

In the calculation of the reconstructed image, the calculation load during FFT is very large. FFT is a simple calculation that repeats the butterfly operation and is suitable for hardware implementation. In this study, the Virtex-7 FPGA VC707 Evaluation Kit (VC707) provided by Xilinx was used in the initial research stage. It is an evaluation board with a Virtex-7 XC7VX485T-2FFG1761C. Table 1 shows the specifications of the logical FPGA installed in the VC707.

3.2 Implementation of Two-Dimensional Convolution Pipeline

Figure 1 shows a block diagram of the special-purpose calculation pipeline based on the convolution diffraction integration method. The numbers on the diagonal lines in the figure indicate the bit width of the data.

In this circuit, the intellectual property core (FFT v7.1) provided by Xilinx is used to perform the Fourier and inverse transforms of the hologram.23 This core can select one-dimensional (1D)

Table 1  FPGA specifications: register is a small, high-speed data storage resource on the FPGA, and Block RAM is a large-capacity, low-speed data storage resource. The LUT is a resource for configuring a circuit.

<table>
<thead>
<tr>
<th>Evaluation board</th>
<th>VC707</th>
</tr>
</thead>
<tbody>
<tr>
<td>Family</td>
<td>Virtex-7</td>
</tr>
<tr>
<td>Device</td>
<td>XC7VX485T-2FFG1761C</td>
</tr>
<tr>
<td>Number of registers</td>
<td>607,200</td>
</tr>
<tr>
<td>Block RAM</td>
<td>37,080 Kb</td>
</tr>
<tr>
<td>Number of LUTs</td>
<td>303,600</td>
</tr>
</tbody>
</table>

Fig. 1 Block diagram of special-purpose calculation pipeline for convolutional diffraction integration method.
FFT and inverse FFT according to the input. It is noteworthy that it is necessary to perform both 2D and inverse FFT. When performing 2D FFT, we first perform 1D FFT in the $x$ direction and then 1D FFT in the $y$ direction. The same applies to the inverse FFT.

The distance $z_i$ between the hologram and reconstructed surfaces only affects $G(n, m)$. Therefore, by multiplying $G(n, m)$ by distance $z_i$ corresponding to each reconstructed plane as a parameter by Fourier transform $\hat{I}(n, m)$ of a single hologram, it is possible to obtain multiple reconstructed surface intensities in the depth direction. In this system, the initial value $z_0$ of the reconstructed distance $z_i$ and the reconstructed distance interval $dz$ are stored in the FPGA's Block RAM, and the values are handled as parameters.

The special-purpose pipeline is a state machine that uses four stages, where eight reconstructed images with different reconstructed distances are calculated from a single hologram.

- **Stage 1**: Hologram data are read from hologram memory, 1D FFT is performed ($x$ direction), and the result is saved to xFFT RAM (Fig. 1). This process is repeated for 128 lines in the $y$ direction. Simultaneously, the initial value $z_0$ of the reconstructed distance $z_i$ is read.
- **Stage 2**: The FFT calculation result from stage 1 is read in the $y$ direction from the xFFT RAM, 1D FFT is performed, and it is saved in yFFT RAM (Fig. 1). This process is repeated for 128 lines in the $x$ direction. Simultaneously, the interval $dz$ of the reconstructed distance $z_i$ is read.
- **Stage 3**: The FFT calculation result from stage 2 is read from yFFT RAM and multiplied by the value of $G$. Then, 1D inverse FFT is performed. The result is stored in xFFT RAM (Fig. 1). This process is repeated for 128 lines in the $y$ direction.
- **Stage 4**: The inverse FFT result from stage 3 is read from xFFT RAM in the $y$ direction, 1D inverse FFT is performed, and it is saved in calculate memory (Fig. 1). This process is repeated for 128 lines in the $x$ direction.

It is noteworthy that stages 3 and 4 are repeated while adding the interval $dz$ to the reconstructed distance $z_i$. This repetition continues until the calculation of eight reconstructed planes is complete.

### 3.3 Implementation of Calc G Unit

Figure 2 shows a block diagram of the Calc G unit shown in Fig. 1. The Calc G unit calculates $G(n, m)$ represented by Eq. (7). Here, $\rho$ is defined as follows:

$$
\rho = -\frac{\pi \lambda}{2N^2(\Delta P)^2},
$$

(8)

where $\rho$ is a constant that can be calculated in advance on the host personal computer (PC). Multiply the $\rho$ calculated by the host PC and $z_i$ then redefine Eq. (6) by Eq. (8) as follows:

$$
z_{\text{param}} = \rho z_i.
$$

(9)

From Euler’s formula, $\exp(i\theta) = \cos \theta + i \sin \theta$ holds. Here, the lookup table (LUT) method is used for the cos and sin calculations, where precalculated values are stored in memory, and the values are obtained quickly by referencing read-only memory (ROM) addresses. In Fig. 2, the real and imaginary parts of $G(n, m)$ are calculated using cos ROM and sin ROM.

![Fig. 2 Block diagram of Calc G unit.](https://www.spiedigitallibrary.org/journals/Optical-Engineering)
4 Results

4.1 Resource Usage

Table 2 shows the circuit scale of the special-purpose computer developed with the VC707. The operating frequency is 250 MHz. It was possible to install four units that calculate eight reconstructed images from a single hologram.

<table>
<thead>
<tr>
<th>Resource name</th>
<th>Number of usages</th>
<th>Usage (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Register</td>
<td>16,460</td>
<td>2</td>
</tr>
<tr>
<td>Block RAM</td>
<td>24,844 Kb</td>
<td>67</td>
</tr>
<tr>
<td>LUT</td>
<td>13,283</td>
<td>4</td>
</tr>
</tbody>
</table>

Table 3 Calculation time.

<table>
<thead>
<tr>
<th>Hardware</th>
<th>Calculation time (ms)</th>
<th>Acceleration ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>FPGA: VC707</td>
<td>0.69</td>
<td>68</td>
</tr>
<tr>
<td>GPU: NVIDIA GeForce GTX 1080</td>
<td>15.90</td>
<td>2.97</td>
</tr>
<tr>
<td>CPU: Intel Xeon E5-2697</td>
<td>47.16</td>
<td>1.00</td>
</tr>
</tbody>
</table>

4.2 Calculation Time

To compare calculation time, a program that calculates 32 reconstructed images from four holograms (128 x 128 pixels) was created on the central processing unit (CPU) and graphics processing unit (GPU). The comparison of calculation times is given in Table 3.

For the CPU comparison environment, an Intel Xeon E5-2697 2.70 GHz (48 cores) CPU was used. The program was created using CentOS 7.1 and the Intel C++ Compiler 16.0.1.150. In addition, FFTW 3.3.8 was used for the FFT calculation. It is noteworthy that all cores were used and executed in parallel.

For the GPU comparison environment, an NVIDIA GeForce GTX 1080 (2560 CUDA cores) was installed in the CPU comparison environment. The program was created using the NVIDIA CUDA compiler 9.1. In addition, the cuFFT library in NVIDIA CUDA compiler 9.1 was used for the FFT calculation.

As can be seen in Table 3, the developed special-purpose computer is 68 times faster than the CPU and 23 times faster than the GPU.

4.3 Image Quality

The optical system shown in Fig. 3 was constructed and a hologram was captured. Here, a 532-nm laser was used as the light source. The linearly polarized light wave from the laser was propagated to the beam expander and split into two waves. One was propagated into an object. We have set falling granulated sugar particles as an object. Holograms were captured with an exposure time of \( \sim 1/300,000 \) s using a high-speed polarization camera. This camera has a polarization-detection function of a set of 2 x 2 pixels. Even though the function can select four polarization axes, we only used one axis in this experiment. Since captured holograms had vacant pixels, we interpolated the values of the vacant pixels before the reconstruction calculation. Figure 4 compares a reconstructed image calculated by the CPU with double precision.
and a reconstructed image calculated by the developed special-purpose computer. Table 4 shows the reconstructed conditions. The average peak signal-to-noise ratio and structural similarity with 32 reconstructed images were 43.8 and 0.993, respectively. As shown in Fig. 4, the special-purpose computer can calculate with image quality equivalent to the result calculated by the CPU.

Fig. 3 Optical setup (HM stands half mirror).

Fig. 4 Comparison of images reconstructed by (a) CPU and (b) FPGA (Video 1, MPEG, 302 KB [URL: https://doi.org/10.1117/1.OE.59.5.054105.1]).

and a reconstructed image calculated by the developed special-purpose computer. Table 4 shows the reconstructed conditions. The average peak signal-to-noise ratio and structural similarity with 32 reconstructed images were 43.8 and 0.993, respectively. As shown in Fig. 4, the special-purpose computer can calculate with image quality equivalent to the result calculated by the CPU.

### Table 4  Reconstructed conditions.

<table>
<thead>
<tr>
<th>Condition</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reconstructed distance</td>
<td>210 mm</td>
</tr>
<tr>
<td>Wavelength</td>
<td>532 nm</td>
</tr>
<tr>
<td>Pixel pitch</td>
<td>20 μm</td>
</tr>
<tr>
<td>Number of pixels</td>
<td>128 × 128</td>
</tr>
</tbody>
</table>
5 Discussion and Future Work

We have developed a special-purpose computer for high-speed 3D imaging using digital holography. We have successfully developed a system that can calculate 32 reconstructed images in 0.69 ms from four holograms of 128 × 128 pixels. This special-purpose computer comprises four units and is 68 times faster than a 48-core CPU and 23 times faster than a GPU with 2560 CUDA cores.

Generally, a multicore system involves overhead, such as preprocessing and control in parallel execution. Even if the number of parallel processes is increased, increased speed cannot be obtained. In this study, we handled a 128 × 128-pixel hologram for high-speed imaging. Since the number of pixels is small, one hologram of processing is immediately terminated. The overhead of parallel execution was apparent and speedup reached its peak. However, such overhead does not exist in a special-purpose computer. If the conditions are met, a special-purpose computer can be sped up as the number of parallel processes increases. The results of this study suggest the usefulness of special-purpose computers in high-speed 3D imaging with a small number of pixels in tens of thousands to hundreds of millions of frames.

In Eq. (4), $z_i$ should satisfy the following sampling condition:

$$z_i = \frac{N \times \Delta p^2}{\lambda}.$$  \hfill (10)

Owing to the physical constraints of the experimental environment, we set $z_i$ to 0.21 m, so the sampling condition is not satisfied. Zero padding is generally required to satisfy this sampling condition. However, it needs more memory and is not suitable for FPGA implementations, and it takes extra calculation time. Using the simulation results, we confirmed that zero padding had no significant effect on the quality of the reproduced image.

In the future, we plan to increase the density and integration of computing units. In the current study, the capacity and operating frequency of the Block RAM that temporarily stores calculation data was a bottleneck, and the installation of four units was a significant limitation. Our system uses 600-Kb memory per unit. In terms of memory capacity, the number of units can be increased by one or two. On the other hand, the distance between the computing units and memory block will increase, resulting in a decrease in the operating frequency. In this work, based on the trade-off between the number of computing units, memory capacity, and operating frequency, we chose four units operating at 250 MHz to obtain better computational performance. However, the latest FPGAs have large, high-speed on-chip UltraRAM storage with a maximum capacity of 500 Mb. The latest FPGA board (Xilinx ALVEO U250) has 100 times more resources in terms of LUT; thus, it is expected to be nearly 100 times faster.

In addition, in this study, we focused on the calculation time of the special-purpose computer, and the communication of the special-purpose computer used a low-speed I/O port. The total communication time was 29.15 ms. However, the communication time can be reduced using direct memory access (DMA) transfer. The measured value of DMA communication time using the Xilinx ALVEO U250 is 0.017 ms to transfer four images and 0.078 ms to receive 32 images. Data reception can be ignored because it can be performed in parallel with the calculation. The actual communication time is only 0.017 ms at the time of data transmission, and, in future, it will be possible to develop a high-speed special-purpose computer with this communication time.

Acknowledgments

This work was partially supported by the Japan Society for the Promotion of Science KAKENHI Grant Nos. 19H01097 and 18K11328. The authors declare no conflicts of interest.

References

Yota Yamamoto is a PhD student at Graduate School of Engineering, Chiba University. He received his BE and ME degrees from Tokyo University of Science in 2016 and 2018, respectively. His current research interests include high-performance computing and its applications for digital holography and electroholography. He is a member of the Optical Society of Japan (OSJ), the Information Processing Society of Japan (IPSJ), and the Institute of Electronics, Information and Communication Engineers (IEICE).

Shintaro Namba is a graduate student at Graduate School of Industrial Science and Technology, Tokyo University of Science. He received his BE and ME degrees from Tokyo University of Science in 2015 and 2017, respectively. His current research interests include high-performance computing and its applications for digital holography.

Takashi Kakue is an assistant professor at Chiba University. He received his BE, ME, and DE degrees from Kyoto Institute of Technology in 2006, 2008, and 2012, respectively. His current research interests include holography, 3D display, 3D measurement, and high-speed imaging. He is a member of SPIE, the Optical Society (OSA), the Institute of Electrical and Electronics Engineers (IEEE), OSJ, and the Institute of Image Information and Television Engineers (ITE).

Tomoyoshi Shimobaba is a professor at Chiba University. He received his PhD in fast calculation of holography using special-purpose computers from Chiba University in 2002. His current research interests include computer holography and its applications. He is a member of SPIE, OSA, IEICE, OSJ, and ITE.

Tomoyoshi Ito is a professor at Chiba University. He received his PhD in special-purpose computers for many-body systems in astrophysics and molecular dynamics from the University of Tokyo in 1994. His current research interests include high-performance computing and its applications for electroholography. He is a member of ACM, OSA, OSJ, ITE, IEICE, IPSJ, and the Astronomical Society of Japan.

Nobuyuki Masuda is a professor at Tokyo University of Science. He received his PhD in simulations of dynamical instability of accretion disks from the University of Tokyo in 1998. His current research interests include high-performance computing and its applications for digital holography. He is a member of IPSJ and IEICE.