Special-purpose computer for digital holographic high-speed three-dimensional imaging

. Digital holography is attracting attention because it can record instantaneous three-dimensional (3D) information and record dynamic phenomena. However, when recording high-speed phenomena, the frame rate ranges from tens of thousands to hundreds of millions, and the calculation time of the reconstructed images is a problem. We have developed a special-purpose computer for high-speed 3D imaging using digital holography. The developed special-purpose computer has four calculation modules and has achieved a calculation time 68 times faster than that of a personal computer with 48 cores. With the developed computer, a total of 32 reconstructed images can be calculated in 0.69 ms from four holograms of 128 × 128 pixels with eight varying depths. © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI. [DOI: 10.1117/1.OE.59.5.054105]


Introduction
The digital holography field began with electronic recording of holograms in the 1960s. 1,2igital holography can dynamically capture information about the amplitude and phase of light.In addition, it is possible to record an instantaneous three-dimensional (3D) velocity field, [3][4][5][6][7] which is attracting attention.Hologram resolution and image quality have also improved significantly.Gigapixel digital holography 8,9 and large-scale holograms 10,11 have been manufactured.Large-scale holograms enable deep 3D measurement and simultaneous imaging of a large number of particle images.However, a huge amount of time is required to calculate the reconstructed image.Research about optical methods is also important; however, computation time remains a problem, but algorithm improvements and hardware approaches are very effective.
Our research group has been developing a special-purpose computer for digital holographic particle tracking velocimetry (DHPTV) named fast Fourier transform (FFT)-HORN 12,13 to realize fluid visualization.In DHPTV, the number of pixels is generally large (>1024 × 1024 pixels) and the number of frames can be tens to thousands of frames.
In this study, we deal with high-speed 3D imaging using digital holography.5][16][17][18][19][20][21][22] It is noteworthy that a faster camera produces a smaller number of pixels of the image sensor.When calculating reconstructed images, computational load is reduced using FFT. 4 In FFT-based reconstructed image calculation, the calculation efficiency improves as the number of pixels increases, and the calculation load can be reduced.However, efficiency is reduced when the number of pixels is small.In other words, when the frame rate is improved and the number of pixels per hologram is reduced, calculation efficiency is reduced and more calculation time is required.When recording high-speed phenomena, the frame rate ranges from several thousand to several million, and a huge amount of time is required to calculate the reconstructed image.
To reduce computation time, we have developed a special-purpose computer for high-speed 3D imaging using a field-programmable gate array (FPGA).With a special-purpose calculation circuit, efficient calculations are possible even with a small number of pixels.In this study, 128 × 128-pixel hologram data were input.We developed a system that can calculate 32 reconstructed images in parallel with eight depths from four holograms.

Convolution Diffraction Integration Method
The intensity of the reconstructed image ϕðx i ; y i Þ at a certain reconstructed distance z i is calculated based on the Fresnel-Kirchhoff diffraction: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 1 1 6 ; 6 0 9 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 1 1 6 ; 5 5 2 where ϕðx i ; y i Þ is the intensity of the reconstructed space at point ðx i ; y i ; z i Þ, Iðx α ; y α Þ shows the intensity pattern of the hologram, λ is the wavelength of light, k is the wavenumber of light expressed as k ¼ 2π λ , x α and y α are the coordinates on the hologram plane, and N is the number of pixels in the vertical and horizontal directions of the hologram surface and the reconstructed surface.
Equation ( 1) is a two-dimensional (2D) convolution integral about the x and y axes.It can be transformed using 2D Fourier transform; thus, Eq. ( 1) can be rewritten as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 1 1 6 ; 4 4 2 Φðn; mÞ ¼ Îðn; mÞGðn; mÞ; ( where Φðn; mÞ is the 2D Fourier transform of ϕðx i ; y i Þ, Îðn; mÞ represents the Fourier transform of Iðx α ; y α Þ, and Gðn; mÞ represents that of gðx i − x α ; y i − y α Þ.The convolution integral of Iðx α ; y α Þ and gðx i − x α ; y i − y α Þ can be expressed as the product of Îðn; mÞ and Gðn; mÞ in the area after the Fourier transform by each Fourier transform.The intensity of the reconstructed surface can be obtained by performing an inverse 2D Fourier transform on the product calculation result and returning it to the original region.Here, FFT can be used to reduce the computational load. 4he pixel interval between the hologram and reconstructed surfaces is defined as ΔP in Eq. (2).The Fourier transformed G, which is Δf ¼ 1∕ðNΔPÞ as a minimum unit, is expressed as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 1 1 6 ; 2 8 5   GðnΔf; mΔfÞ ¼ exp The 3D information of the measurement target is reconstructed by superimposing reconstructed images with different reconstructed distances (i.e., depths).Therefore, expðikz i Þ in Eq. ( 4) can be considered a constant when reconstructing a single surface.In addition, 256 gradations are generated based on the minimum and maximum values of ϕðx i ; y i Þ; thus, expðikz i Þ can be omitted.Therefore, the reconstructed image is calculated as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 5 ; 1 1 6 ; 1 5 2 In the following, G 0 ðn; mÞ in Eq. ( 5) is referred to as Gðn; mÞ.Here, the depth-related coefficient z param is defined as follows: Equation ( 5) is expressed by Eq. ( 6) as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 7 ; 1 1 6 ; 6 8 8 Gðn; mÞ 3 Implementation of Special-Purpose Computer

Hardware
In the calculation of the reconstructed image, the calculation load during FFT is very large.FFT is a simple calculation that repeats the butterfly operation and is suitable for hardware implementation.In this study, the Virtex-7 FPGA VC707 Evaluation Kit (VC707) provided by Xilinx was used in the initial research stage.It is an evaluation board with a Virtex-7 XC7VX485T-2FFG1761C. Table 1 shows the specifications of the logical FPGA installed in the VC707.

Implementation of Two-Dimensional Convolution Pipeline
Figure 1 shows a block diagram of the special-purpose calculation pipeline based on the convolution diffraction integration method.The numbers on the diagonal lines in the figure indicate the bit width of the data.In this circuit, the intellectual property core (FFT v7.1) provided by Xilinx is used to perform the Fourier and inverse transforms of the hologram. 23This core can select one-dimensional (1D) FFT and inverse FFT according to the input.It is noteworthy that it is necessary to perform both 2D and inverse FFT.When performing 2D FFT, we first perform 1D FFT in the x direction and then 1D FFT in the y direction.The same applies to the inverse FFT.
The distance z i between the hologram and reconstructed surfaces only affects Gðn; mÞ.Therefore, by multiplying Gðn; mÞ by distance z i corresponding to each reconstructed plane as a parameter by Fourier transform Îðn; mÞ of a single hologram, it is possible to obtain multiple reconstructed surface intensities in the depth direction.In this system, the initial value z 0 of the reconstructed distance z i and the reconstructed distance interval dz are stored in the FPGA's Block RAM, and the values are handled as parameters.
The special-purpose pipeline is a state machine that uses four stages, where eight reconstructed images with different reconstructed distances are calculated from a single hologram.
• Stage 1: Hologram data are read from hologram memory, 1D FFT is performed (x direction), and the result is saved to xFFT RAM (Fig. 1).This process is repeated for 128 lines in the y direction.Simultaneously, the initial value z 0 of the reconstructed distance z i is read.• Stage 2: The FFT calculation result from stage 1 is read in the y direction from the xFFT RAM, 1D FFT is performed, and it is saved in yFFT RAM (Fig. 1).This process is repeated for 128 lines in the x direction.Simultaneously, the interval dz of the reconstructed distance z i is read.• Stage 3: The FFT calculation result from stage 2 is read from yFFT RAM and multiplied by the value of G.Then, 1D inverse FFT is performed.The result is stored in xFFT RAM (Fig. 1).This process is repeated for 128 lines in the y direction.• Stage 4: The inverse FFT result from stage 3 is read from xFFT RAM in the y direction, 1D inverse FFT is performed, and it is saved in calculate memory (Fig. 1).This process is repeated for 128 lines in the x direction.
It is noteworthy that stages 3 and 4 are repeated while adding the interval dz to the reconstructed distance z i .This repetition continues until the calculation of eight reconstructed planes is complete.

Implementation of Calc G Unit
Figure 2 shows a block diagram of the Calc G unit shown in Fig. 1.The Calc G unit calculates Gðn; mÞ represented by Eq. ( 7).Here, ρ is defined as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 8 ; 1 1 6 ; 3 1 5 where ρ is a constant that can be calculated in advance on the host personal computer (PC).
Multiply the ρ calculated by the host PC and z i then redefine Eq. ( 6) by Eq. ( 8) as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 9 ; 1 1 6 ; 2 4 7 From Euler's formula, expðiθÞ ¼ cos θ þ i sin θ holds.Here, the lookup table (LUT) method is used for the cos and sin calculations, where precalculated values are stored in memory, and the values are obtained quickly by referencing read-only memory (ROM) addresses.In Fig. 2, the real and imaginary parts of Gðn; mÞ are calculated using cos ROM and sin ROM.

Calculation Time
To compare calculation time, a program that calculates 32 reconstructed images from four holograms (128 × 128 pixels) was created on the central processing unit (CPU) and graphics processing unit (GPU).The comparison of calculation times is given in Table 3.
For the CPU comparison environment, an Intel Xeon E5-2697 2.70 GHz (48 cores) CPU was used.The program was created using CentOS 7.1 and the Intel C++ Compiler 16.0.1.150.In addition, FFTW 3.3.8 24was used for the FFT calculation.It is noteworthy that all cores were used and executed in parallel.
For the GPU comparison environment, an NVIDIA GeForce GTX 1080 (2560 CUDA cores) was installed in the CPU comparison environment.The program was created using the NVIDIA CUDA compiler 9.1.In addition, the cuFFT library 25 in NVIDIA CUDA compiler 9.1 was used for the FFT calculation.
As can be seen in Table 3, the developed special-purpose computer is 68 times faster than the CPU and 23 times faster than the GPU.

Image Quality
The optical system shown in Fig. 3 was constructed and a hologram was captured.Here, a 532-nm laser was used as the light source.The linearly polarized light wave from the laser was propagated to the beam expander and split into two waves.One was propagated into an object.We have set falling granulated sugar particles as an object.Holograms were captured with an exposure time of ∼1∕300;000 s using a high-speed polarization camera. 22This camera has a polarization-detection function of a set of 2 × 2 pixels.Even though the function can select four polarization axes, we only used one axis in this experiment.Since captured holograms had vacant pixels, we interpolated the values of the vacant pixels before the reconstruction calculation.Figure 4 compares a reconstructed image calculated by the CPU with double precision  and a reconstructed image calculated by the developed special-purpose computer.Table 4 shows the reconstructed conditions.The average peak signal-to-noise ratio and structural similarity with 32 reconstructed images were 43.8 and 0.993, respectively.As shown in Fig. 4, the special-purpose computer can calculate with image quality equivalent to the result calculated by the CPU.We have developed a special-purpose computer for high-speed 3D imaging using digital holography.We have successfully developed a system that can calculate 32 reconstructed images in 0.69 ms from four holograms of 128 × 128 pixels.This special-purpose computer comprises four units and is 68 times faster than a 48-core CPU and 23 times faster than a GPU with 2560 CUDA cores.Generally, a multicore system involves overhead, such as preprocessing and control in parallel execution.Even if the number of parallel processes is increased, increased speed cannot be obtained.In this study, we handled a 128 × 128-pixel hologram for high-speed imaging.Since the number of pixels is small, one hologram of processing is immediately terminated.The overhead of parallel execution was apparent and speedup reached its peak.However, such overhead does not exist in a special-purpose computer.If the conditions are met, a special-purpose computer can be sped up as the number of parallel processes increases. 26The results of this study suggest the usefulness of special-purpose computers in high-speed 3D imaging with a small number of pixels in tens of thousands to hundreds of millions of frames.

Laser
In Eq. ( 4), z i should satisfy the following sampling condition: 27 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 0 ; 1 1 6 ; 5 3 7 Owing to the physical constraints of the experimental environment, we set z i to 0.21 m, so the sampling condition is not satisfied.Zero padding is generally required to satisfy this sampling condition.However, it needs more memory and is not suitable for FPGA implementations, and it takes extra calculation time.Using the simulation results, we confirmed that zero padding had no significant effect on the quality of the reproduced image.
In the future, we plan to increase the density and integration of computing units.In the current study, the capacity and operating frequency of the Block RAM that temporarily stores calculation data was a bottleneck, and the installation of four units was a significant limitation.Our system uses 600-Kb memory per unit.In terms of memory capacity, the number of units can be increased by one or two.On the other hand, the distance between the computing units and memory block will increase, resulting in a decrease in the operating frequency.In this work, based on the trade-off between the number of computing units, memory capacity, and operating frequency, we chose four units operating at 250 MHz to obtain better computational performance.However, the latest FPGAs have large, high-speed on-chip UltraRAM storage with a maximum capacity of 500 Mb. 28The latest FPGA board (Xilinx ALVEO U250) has 100 times more resources in terms of LUT; thus, it is expected to be nearly 100 times faster.
In addition, in this study, we focused on the calculation time of the special-purpose computer, and the communication of the special-purpose computer used a low-speed I/O port.The total communication time was 29.15 ms.However, the communication time can be reduced using direct memory access (DMA) transfer.The measured value of DMA communication time using the Xilinx ALVEO U250 is 0.017 ms to transfer four images and 0.078 ms to receive 32 images.Data reception can be ignored because it can be performed in parallel with the calculation.The actual communication time is only 0.017 ms at the time of data transmission, and, in future, it will be possible to develop a high-speed special-purpose computer with this communication time.

Fig. 2
Fig. 2 Block diagram of Calc G unit.

Table 1
FPGA specifications: register is a small, high-speed data storage resource on the FPGA, and Block RAM is a large-capacity, low-speed data storage resource.The LUT is a resource for configuring a circuit.
Fig.1Block diagram of special-purpose calculation pipeline for convolutional diffraction integration method.
Yamamoto et al.: Special-purpose computer for digital holographic. . .

Table 2
shows the circuit scale of the special-purpose computer developed with the VC707.The operating frequency is 250 MHz.It was possible to install four units that calculate eight reconstructed images from a single hologram.

Table 2
Circuit scale.

Table 3
Calculation time.