Two schemes for rapid generation of digital video holograms using PC cluster

Abstract. Computer-generated holography (CGH), which is a process of generating digital holograms, is computationally expensive. Recently, several methods/systems of parallelizing the process using graphic processing units (GPUs) have been proposed. Indeed, use of multiple GPUs or a personal computer (PC) cluster (each PC with GPUs) enabled great improvements in the process speed. However, extant literature has less often explored systems involving rapid generation of multiple digital holograms and specialized systems for rapid generation of a digital video hologram. This study proposes a system that uses a PC cluster and is able to more efficiently generate a video hologram. The proposed system is designed to simultaneously generate multiple frames and accelerate the generation by parallelizing the CGH computations across a number of frames, as opposed to separately generating each individual frame while parallelizing the CGH computations within each frame. The proposed system also enables the subprocesses for generating each frame to execute in parallel through multithreading. With these two schemes, the proposed system significantly reduced the data communication time for generating a digital hologram when compared with that of the state-of-the-art system.


Introduction
Holography is a technology that enables people to view three-dimensional (3-D) images (called holographic images or simply holograms) displayed in real space with the naked eye. Although a hologram was originally generated using optical apparatuses [1], it can be digitally implemented on computers with many advantages [2,3]. Computer-generated holography (CGH) is a method that computes the digital holographic interference patterns required for generating holograms in a holographic 3-D display. [5][6] Both generally involve a huge amount of computation; thus, computational reduction has been a main research topic in this field. However, the point-based method further suffers from high computational complexity, as shown in Eq. (1), wherein the computational complexity increases rapidly in proportion to the hologram resolution and the number of light sources (referring to pixels with a nonzero intensity value in a depth image) of a 3-D object.
$$I(x_h, y_h) = \sum_{l=1}^{L} A_l \cos\left[\frac{2\pi}{\lambda}\sqrt{(p x_h - x_l)^2 + (p y_h - y_l)^2 + z_l^2}\right], \quad \text{where } 0 \le x_h < W_h,\ 0 \le y_h < H_h. \tag{1}$$
Here, I and A denote the light intensities of a hologram and a 3-D object (or a set of 3-D light sources), respectively. x_l, y_l, and z_l are the 3-D coordinates of the light sources. In addition, λ and p denote the wavelength of the reference wave and the pixel pitch, respectively, and L denotes the number of 3-D object light sources. W_h and H_h are the width and height of the hologram. Several software-based [5-12] and hardware-based [13-19] methods have been proposed to reduce the computational complexity. Software-based methods store the CGH computation results in a look-up table in advance [7,8], recursively generate the intensities of the remaining pixels using the precalculated values of neighboring or particular CGH pixels [9,10], or reduce the CGH computation using a cosine approximation algorithm [11], an effective diffraction area recording method [12], a layered model [5], or a patch model [6]. However, these methods could not achieve enough speedup to generate high-resolution holograms in real time, and some of them degraded the quality of the holograms. Conversely, hardware-based methods have generated high-resolution holograms in near real time without any quality change by parallelizing the CGH computation using a field-programmable gate array [13], one or more graphic processing units (GPUs) [14-18], and even a personal computer (PC) cluster system [19,20] composed of multiple PCs, each of which has multiple GPUs.
As a state-of-the-art method, a scalable and flexible PC cluster system was proposed [21] to generate higher-resolution holograms [called high-quality (HQ) holograms] with a considerably larger number of object light sources. The system was a server-client system and could be flexibly composed of PCs and GPUs that differ in number and performance. One PC acted as a server, periodically investigated the computing power of each client PC, and optimally distributed the computations. Consequently, the system generated an HQ hologram (1536 × 1536 resolution and more than 2.1 million light sources) in 10 s. This was highly efficient when compared with the previous systems. However, generating an HQ hologram still took a significantly long time even with the cluster system. Hence, it is important to further improve the performance of the cluster system. In particular, the server-client system spent a considerable amount of time communicating data between the server and the clients [21]. Therefore, a method for reducing the communication time is necessary, and this is the main focus of this study.
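To make the cost of the point-based method concrete, the following Python sketch accumulates the contribution of every light source at every hologram pixel. It assumes the widely used point-source form, in which each source adds its amplitude times the cosine of 2π/λ times its distance to the pixel. The function names, the tuple layout of a light source, and the use of metric coordinates are illustrative assumptions, not the implementation used in this study.

```python
import math


def point_based_cgh(sources, width, height, wavelength, pitch):
    """Accumulate a point-based interference pattern (assumed cosine form).

    sources: iterable of (x_l, y_l, z_l, amplitude), coordinates in meters.
    Returns a height x width list of lists of intensities.
    """
    k = 2.0 * math.pi / wavelength  # wavenumber of the reference wave
    hologram = [[0.0] * width for _ in range(height)]
    for y_h in range(height):
        for x_h in range(width):
            total = 0.0
            for x_l, y_l, z_l, amplitude in sources:
                dx = pitch * x_h - x_l  # pixel index scaled by the pixel pitch
                dy = pitch * y_h - y_l
                distance = math.sqrt(dx * dx + dy * dy + z_l * z_l)
                total += amplitude * math.cos(k * distance)
            hologram[y_h][x_h] = total
    return hologram


def operation_count(width, height, num_sources):
    """Number of cosine evaluations: one per (pixel, light source) pair."""
    return width * height * num_sources


if __name__ == "__main__":
    # Two hypothetical light sources; 532 nm wavelength and 8 um pixel pitch.
    sources = [(0.0, 0.0, 0.1, 1.0), (16e-6, 8e-6, 0.12, 0.5)]
    pattern = point_based_cgh(sources, width=4, height=4,
                              wavelength=532e-9, pitch=8e-6)
    print(len(pattern), len(pattern[0]))
    print(operation_count(1536, 1536, 2_100_000))
```

The triple loop performs W_h × H_h × L cosine evaluations, which is the product that the hardware-based methods parallelize.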
Digital video holograms are composed of a number of frames, and each frame can be generated separately and quickly using the aforementioned existing methods or systems, as in Refs. 20 and 21. Indeed, this has been a common way to generate digital video holograms. Strictly speaking, there has been no specialized approach for generating video holograms in the literature. However, it is possible to further reduce the hologram generation time (more precisely, the data communication time between the server and the clients) by considering and generating the frames together. In this context, instead of distributing/parallelizing the CGH computations for generating each individual frame, this study proposes assigning all the computations of a single frame to a single PC and determining the number of frames assigned to each PC on the basis of its performance. This implies that the parallelization is achieved on a frame-to-frame basis (Scheme 1). In addition, the previous studies that focused on fast generation of a single hologram paid no attention to the parallelization of the subprocesses (i.e., distribution, CGH computation, and collection, which will be specified later) for hologram generation because the subprocesses must be executed sequentially for the generation of a single hologram. However, they can be parallelized in video hologram generation, and this parallelization can reduce the data communication time. Therefore, this study proposes parallelizing the subprocesses and provides a practical solution based on multithreading (Scheme 2). With these two schemes, the data communication time between the server and the clients can be minimized.
The first scheme is similar to that in a previous study [20] in that a client PC is fully in charge of the CGH computations of a frame. However, in that study [20], the framewise generation was not specifically designed for quick generation of video holograms, and the system required all the clients to have the same performance (i.e., identical GPUs), which is usually not the case in real computing environments. In addition, the data transmission time between the server and the client PCs was ignored because an extremely high-speed network was used. Our second scheme provides a practical solution to reduce the data transmission time in real network environments.

Proposed System: A Digital Video Hologram Generation System with Two Speedup Schemes
The proposed system is very similar to that used in a previous study [21]. Both systems are based on the server-client architecture, where the client PCs have different performance; thus, the server PC periodically investigates the time-varying computing power (s) of each client PC by sending a small and identical amount of CGH computations to each client and receiving the computation time (T_ct) measured by each client. The computing power is computed as follows:
$$s = \kappa / T_{ct}. \tag{2}$$
Here, κ is a predefined constant. For the actual generation of digital holograms, the server PC assigns a certain amount of CGH computations in proportion to the computing power of each client PC (called the distribution subprocess hereafter). Each client PC performs the assigned CGH computations (called the CGH computation subprocess hereafter). Then, the server collects the results from each client and generates the final holograms by accumulating/arranging them (called the collection subprocess hereafter).
However, in the previous study [21], the generation of each frame (W_h × H_h) of a video hologram was parallelized separately. That is, the light sources for generating a single frame were distributed to C clients, and the partial CGH computations with the distributed light sources were performed on each client. Then, the intermediate interference patterns (with the same resolution as the final hologram, i.e., W_h × H_h) computed on each client were sent back to the server and accumulated. Therefore, given that the mean data communication time between the server and each client was T_t, the total communication time for collecting the results from the clients was C T_t (the distribution time could be ignored when compared with the collection time). With a high hologram resolution and a large number of PCs, the communication time was too long, and this presented a significant challenge for the rapid generation of each frame. In the generation of a video hologram, the same process was repeated for each frame. The total generation time increased linearly in proportion to the number of frames F; hence, the total communication time was C T_t F. The proposed system tries to reduce the data communication time in two ways. The overview of the proposed system is shown in Fig. 1.
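The bookkeeping described above can be sketched in Python as follows. The sketch assumes that a client reports the time it needed for a fixed probe workload. The names `measure_client` and `probe` and the default value of `kappa` are hypothetical and chosen for illustration only; they are not part of the system in Ref. 21.

```python
import time


def computing_power(kappa, t_ct):
    """Eq. (2): computing power s = kappa / T_ct."""
    if t_ct <= 0:
        raise ValueError("computation time must be positive")
    return kappa / t_ct


def measure_client(probe, kappa=1.0):
    """Time a fixed probe workload and convert the time to a computing power."""
    start = time.perf_counter()
    probe()
    elapsed = time.perf_counter() - start
    return computing_power(kappa, max(elapsed, 1e-9))  # guard against a zero reading


def collection_time_light_source_basis(num_clients, t_t, num_frames):
    """Total collection time C * T_t * F when every client returns a
    full-resolution partial pattern for every frame."""
    return num_clients * t_t * num_frames


if __name__ == "__main__":
    power = measure_client(lambda: sum(i * i for i in range(100_000)))
    print(power > 0)
    print(collection_time_light_source_basis(num_clients=5, t_t=0.1, num_frames=100))
```

Because the probe workload is identical for all clients, only the ratio between the measured powers matters, which is why κ can be any fixed constant.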

Distribution of Computations on a Frame-to-Frame Basis
The proposed system distributes a certain number of frames to each PC on the basis of the performance of each PC, as described in Eq. (3) (also see Fig. 2). It assigns all of the CGH computations for generating a frame to a single client; thus, the data communication time to generate each frame is T_t and not C T_t (the fully generated hologram is sent to the server once per frame). In other words, the proposed system can reduce the data communication time by a factor of C. The total communication time is T_t F during the generation of a video hologram.
$$\Psi_c = \Psi \, \frac{s_c}{\sum_{i=1}^{C} s_i}. \tag{3}$$
Here, Ψ denotes the number of frames to be distributed, which is a controllable throughput (≤ F), and Ψ_c denotes the number of frames allocated to the c'th client. s_c denotes the computing power of the c'th client. Conversely, the CGH computation time (denoted by T_cp) of each frame in the proposed system can be long since all the computations are performed on a single PC. This is in contrast with the method used in the previous study, in which the computations were distributed to multiple PCs [21]. However, for a video hologram with a large number of frames, multiple frames can be generated simultaneously via parallel processing with multiple PCs. Additionally, the computation time is minimized by determining the number of frames generated on each PC based on the computing power of each PC. Specifically, in the previous study [21], if T_cc denotes the CGH computation time for each frame, the total CGH computation time for F frames simply becomes T_cc F. In contrast, in a simple case in which all the PCs have the same computing power, F/C frames are assigned to each PC, and the total CGH computation time in the proposed system is T_cp F/C. Although T_cp is considerably larger than T_cc, T_cp/C is approximately equal to T_cc for a large C. This is also applicable when the client PCs have different computing power because a smaller number of frames is assigned to a client with lower computing power. Consequently, the proposed system specializes in generating a video hologram with a large number of frames (at least F ≥ C).
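A minimal Python sketch of the frame allocation follows, assuming that Eq. (3) allocates frames in proportion to the computing power s_c. Proportional shares are fractional in general, so the sketch rounds them to integers with the largest-remainder rule. This rounding rule and the names `allocate_frames` and `makespan` are assumptions made for illustration, since the text does not specify how fractional frames are handled.

```python
def allocate_frames(throughput, powers):
    """Split `throughput` frames among clients in proportion to `powers`."""
    total_power = sum(powers)
    if total_power <= 0:
        raise ValueError("at least one client needs positive computing power")
    ideal = [throughput * s / total_power for s in powers]
    counts = [int(share) for share in ideal]  # floor of each ideal share
    leftover = throughput - sum(counts)
    # Hand the remaining frames to the clients with the largest fractional parts.
    order = sorted(range(len(powers)),
                   key=lambda c: ideal[c] - counts[c], reverse=True)
    for c in order[:leftover]:
        counts[c] += 1
    return counts


def makespan(counts, powers, work_per_frame=1.0):
    """Time until the slowest client finishes, assuming the time per frame
    on client c is work_per_frame / s_c."""
    return max(n * work_per_frame / s for n, s in zip(counts, powers))


if __name__ == "__main__":
    powers = [3.0, 2.0, 1.0]          # hypothetical computing powers
    counts = allocate_frames(100, powers)
    print(counts, sum(counts))
    print(makespan(counts, powers))
```

With identical powers, the allocation reduces to F/C frames per client, which is the simple case discussed above.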
Notice that, in the proposed method where the parallelization is achieved on a frame-to-frame basis, each PC has a residual computational capacity, as shown in Fig. 2. To resolve the problem, one can consider an approach that splits a frame into two (or more) subframes (i.e., distributing the light sources of a frame to different clients, which is similar to the previous study [21]) and assigns them to the residual space, as shown in the lower figure of Fig. 2. However, since the CGH images (i.e., intermediate interference patterns) computed from the split frames have the same resolution as that of the CGH images (i.e., fully generated interference patterns) computed from the nonsplit frames, the data communication time is doubled. In turn, the benefit from minimizing the residual capacity by splitting the frame is larger than the loss associated with the increase in the data communication time.

Parallelization of the Subprocesses through Multithreading
By distributing the CGH computation on a frame-to-frame basis, the number of transmissions of the computation results from the client PCs to the server can be reduced. However, when the number of client PCs is large or the hologram resolution is high, the time taken for the reduced number of transmissions is still long. To resolve this problem, the proposed system executes the subprocesses (distribution, CGH computation, and collection) in parallel by multithreading. In other words, each client can receive the light source information for the next frames or send the computation results for the previous frames to the server while performing the CGH computation for the current frame. With this scheme, if the time taken for both the distribution and collection subprocesses is shorter than that taken for the CGH computation subprocess (in practice, this is very common), the total hologram generation time is fully determined by the CGH computation time, and the data transmission time can effectively become zero.
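The effect of overlapping the subprocesses can be estimated with a simple timing model. The Python sketch below compares back-to-back execution with a three-stage pipeline on one client, under the assumptions that each stage handles one frame at a time and that buffering between the stages is unlimited. The function names and the example times are hypothetical.

```python
def sequential_time(recv, compute, send):
    """Total time when the three subprocesses of each frame run back to back."""
    return sum(r + c + s for r, c, s in zip(recv, compute, send))


def pipelined_time(recv, compute, send):
    """Total time when reception, computation, and transmission overlap.

    Each argument is a list with one duration per frame, in frame order.
    """
    recv_done = comp_done = send_done = 0.0
    for r, c, s in zip(recv, compute, send):
        recv_done = recv_done + r                   # receiving never waits
        comp_done = max(comp_done, recv_done) + c   # needs data and a free compute stage
        send_done = max(send_done, comp_done) + s   # needs a result and a free send stage
    return send_done


if __name__ == "__main__":
    frames = 4
    recv, compute, send = [1.0] * frames, [5.0] * frames, [2.0] * frames
    print(sequential_time(recv, compute, send))
    print(pipelined_time(recv, compute, send))
```

With per-frame times of 1, 5, and 2 units for four frames, the sequential total is 32 units, whereas the pipelined total is 23 units, i.e., four computations plus one initial reception and one final transmission.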
To make the subprocesses run in parallel, all the operations in each client PC and the server are implemented as thread functions that communicate with each other using the message passing method [22] (see Fig. 3). On the server side, the control thread decides how many frames to distribute to each client PC, and the collect thread collects the computation results (i.e., the fully generated interference pattern for each frame) from the client PCs and arranges them. On the client side, the compute thread computes the interference patterns for the assigned frames. The send and receive threads on both sides exchange the light source information of the frames and the resulting interference patterns. In each client PC, the receive thread sends a message to the compute thread after receiving the light source information from the server and then waits for the light source information for the next frame. The compute thread sends a message to the send thread after completing the CGH computation for the current frame and then waits for the message from the receive thread. The send thread sends the resulting interference pattern for the current frame to the server and then waits for the message from the compute thread. Consequently, the receive thread can receive the light source information for the next frames while the compute thread is performing the CGH computation for the current frame. The compute thread can perform the CGH computation for the next frames while the send thread is sending the interference pattern for the current frame to the server.
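The client-side message passing can be sketched with Python's standard `threading` and `queue` modules. This is only an illustration of the receive/compute/send structure: the network is replaced by in-memory lists, and the sentinel `STOP`, the function names, and `compute_fn` are hypothetical and do not reproduce the actual implementation.

```python
import queue
import threading

STOP = None  # sentinel message that tells the next thread to finish


def receive_thread(incoming, to_compute):
    """Stand-in for receiving light source data from the server."""
    for frame_id, light_sources in incoming:
        to_compute.put((frame_id, light_sources))  # notify the compute thread
    to_compute.put(STOP)


def compute_thread(to_compute, to_send, compute_fn):
    """Compute one frame at a time as soon as its data is available."""
    while True:
        item = to_compute.get()  # wait for a message from the receive thread
        if item is STOP:
            to_send.put(STOP)
            break
        frame_id, light_sources = item
        to_send.put((frame_id, compute_fn(light_sources)))


def send_thread(to_send, results):
    """Stand-in for sending interference patterns back to the server."""
    while True:
        item = to_send.get()  # wait for a message from the compute thread
        if item is STOP:
            break
        results.append(item)


def run_client(frames, compute_fn):
    to_compute, to_send = queue.Queue(), queue.Queue()
    results = []
    threads = [
        threading.Thread(target=receive_thread, args=(frames, to_compute)),
        threading.Thread(target=compute_thread,
                         args=(to_compute, to_send, compute_fn)),
        threading.Thread(target=send_thread, args=(to_send, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    frames = [(0, [1, 2, 3]), (1, [4, 5]), (2, [6])]
    print(run_client(frames, compute_fn=sum))
```

Because each stage blocks only on its own input queue, the receive stage can fetch the next frame and the send stage can return the previous one while a computation is in progress, which mirrors the behavior described above.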
Notice that parallelizing the subprocesses causes no memory problem; thus, no elaborate memory management is required. In the distribution and computation subprocesses, the amount of light source data is very small, and each frame is computed/generated sequentially (not in parallel) in the clients. This ensures that the clients need only a small amount of memory. In the collection subprocess, all the frames can arrive at the server at the same time in the worst case. However, this situation rarely happens, and the required amount of memory is still not significant.
Fig. 3 Thread functions in each client PC and the server PC.

Experimental Results and Discussion
The performance of the two proposed schemes, namely, frame distribution and multithreading (abbreviated to FD and MT hereafter), for reducing the data communication time in generating video holograms is evaluated in this section.

Effect of Changing the Way for Distributing CGH Computations
A PC cluster was composed of six PCs (a server and five clients) that were connected to each other through a gigabit Ethernet hub (Cisco SG300-28 [23]) and Winsock TCP/IP [24]. No network performance optimization was considered. Each client PC had one or two CUDA-enabled GPUs, as shown in Table 1. In a manner similar to the previous study [21], the "windmill" video was used as a 3-D object (see Fig. 4). The OpenCV [25] library and the CUDA API [26] were used for image processing and parallel processing, respectively.
In Eq. (1), the reference wavelength was 532 nm and the pixel pitch was 8 μm.
Four experiments were performed to analyze how the hologram generation time varies under various conditions (number of client PCs, number of light sources, hologram resolution, and number of frames). Each experiment was repeated 10 times, and the results were averaged.
First, the CGH computations were performed for 100 frames with a hologram resolution of 2048 × 2048 and ∼23,000 light sources. The computation times of two systems, namely, the proposed system (with FD only) and the system used in the previous study [21], were measured while increasing the number of client PCs (see Table 2). The total time of the proposed system decreased continuously, but that of the previous system [21] did not, even though the core computation times (i.e., the total time minus the data communication time) of both systems were similar and decreased continuously as the number of client PCs increased. This was because the data communication time of the previous system [21] rapidly increased whereas that of the proposed system decreased. Consequently, the difference between the total computation times of the two systems was attributable to the difference between their data communication times. With five clients, the proposed system assigned 30%, 21%, 8%, 39%, and 2% of the frames to the clients, in order, and was ∼2.1 times faster than the previous system [21]. Its data communication time was ∼10.0 times shorter.
Notice that the total computation time of the previous system [21] increased when C > 3. This is because the increase in the data communication time was larger than the time saved by distributing the computation to multiple PCs. This indicates that the performance of the previous system [21] is strictly limited unless the increase in the data communication time is resolved. Consequently, it is expected that the difference between the total computation times of the proposed system and the previous system [21] would be larger when C > 5.
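The existence of such a turning point is consistent with a simple per-frame model. Assuming, for illustration only, identical clients and an ideal division of the computation (so that the core computation time is T_cp/C, where T_cp is the single-PC computation time of a frame), the per-frame time of the previous system and its minimizer can be written as

```latex
\begin{align}
T_{\mathrm{prev}}(C) &\approx \frac{T_{cp}}{C} + C\,T_t, \\
\frac{dT_{\mathrm{prev}}}{dC} &= -\frac{T_{cp}}{C^{2}} + T_t = 0
\;\Longrightarrow\; C^{*} = \sqrt{\frac{T_{cp}}{T_t}}, \\
T_{\mathrm{prop}}(C) &\approx \frac{T_{cp}}{C} + T_t .
\end{align}
```

Beyond C*, the collection term C T_t grows faster than the computation term shrinks. Under the same assumptions, the communication term of the proposed system does not grow with C, so no such turning point arises. The real cluster is heterogeneous, so this model indicates the trend rather than the measured values.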
Second, the CGH computations were performed for 100 frames with 1536 × 1536 hologram resolution and five client PCs. The computation times of the same two systems were measured while increasing the number of light sources (see Table 3). The data communication time was only slightly influenced by the number of light sources. The total time of the system used in the previous study [21] increased gradually because that system parallelized the computation on a light source basis. In contrast, the proposed system, with a fixed number of frames, slowed down in proportion to the number of light sources. Consequently, the ratio between the total times of the two systems continuously decreased as the number of light sources increased. This implies that the proposed system may not be suitable for cases with a small number of frames and a huge number of light sources. In our experiments, the total time of the proposed system was still shorter than that of the previous system [21], but this can be reversed with more than 60,000 light sources, as shown in Fig. 5.
Third, the CGH computations were performed for 100 frames with ∼23,000 light sources and five client PCs. The computation times of the same two systems were measured while increasing the hologram resolution (see Table 4). The core computation times of both systems were similar and increased equally as the hologram resolution increased. However, the data communication time of the previous system [21] increased much more rapidly [this was because the difference between C T_t (C = 5) and T_t increased as T_t increased]; thus, the ratio between the total times of the two systems was maintained at ≈2.
Fourth, the CGH computations were performed with 1536 × 1536 hologram resolution, ∼23,000 light sources, and five client PCs. The computation times of three systems, namely, the proposed system, a system that parallelized the CGH computation on a frame-to-frame basis (as in the proposed system) but assigned the same number of frames to each client PC (the uniform-distribution system), and the system in the previous study [21], were measured while increasing the number of frames (see Table 5). The results indicated that the previous system [21] was faster than the other systems in the single-frame case. However, since its data communication time was long compared with that of the other systems (the ratio between the data communication time of the previous system [21] and those of the other systems increased as the number of frames increased), the previous system [21] was slower than the other systems for two or more frames. The difference between the total times of the previous system [21] and the proposed system increased as the number of frames increased. In Fig. 6, while the core computation time of the previous system [21] was almost constant with respect to the number of frames, the core computation times of both the uniform-distribution and the proposed systems decreased as the number of frames increased. With more than 30 frames, the core computation time of the proposed system became similar to or slightly shorter than that of the previous system [21] (as mentioned before, the efficiency of the proposed system comes from the reduction in the data communication time). Note that the core computation time of the uniform-distribution system could not be reduced below 400 ms. This led to the difference between the total times of the proposed adaptive-distribution system and the uniform-distribution system. In the experiments where each frame had the same number of light sources, the total time of the proposed system saturated at 100 frames or more. The proposed system could be more advantageous if the frames had different numbers of light sources. [The uniform-distribution system distributes the frames to the clients evenly, regardless of the number of light sources in each frame. This risks assigning frames that have a large number of light sources to a low-performance client PC. In contrast, the proposed system can readily handle this problem by modifying Eq. (3) to adaptively distribute the frames while taking into consideration the number of light sources assigned to each client PC.]
Fig. 5 Plot of the total computation time in Table 3 and its second-order polynomial extrapolation.
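One possible way to take the number of light sources into account is sketched below in Python. It is not the modification of Eq. (3) used or proposed by the authors, which the text does not spell out. The sketch assumes that the computation time of a frame on client c is proportional to its number of light sources divided by the computing power s_c, and it greedily gives the heaviest remaining frame to the client that would finish it earliest. The function name and example values are hypothetical.

```python
def assign_frames_by_load(light_source_counts, powers):
    """Greedy, load-aware assignment of frames to clients.

    light_source_counts[f]: number of light sources in frame f.
    powers[c]: computing power of client c.
    Returns (assignment, finish), where assignment[c] lists the frame
    indices given to client c and finish[c] is its estimated busy time.
    """
    assignment = [[] for _ in powers]
    finish = [0.0] * len(powers)
    # Handle the heaviest frames first so that light frames fill the gaps.
    order = sorted(range(len(light_source_counts)),
                   key=lambda f: light_source_counts[f], reverse=True)
    for f in order:
        cost = light_source_counts[f]
        best = min(range(len(powers)),
                   key=lambda c: finish[c] + cost / powers[c])
        finish[best] += cost / powers[best]
        assignment[best].append(f)
    return assignment, finish


if __name__ == "__main__":
    counts = [50_000, 5_300, 23_000, 23_000, 5_300]  # hypothetical frames
    powers = [4.0, 1.0]                               # hypothetical clients
    assignment, finish = assign_frames_by_load(counts, powers)
    print(assignment)
    print(max(finish))
```

Compared with an even split of the frame count, this kind of assignment keeps frames with many light sources away from low-performance clients, which is the risk noted above for the uniform-distribution system.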

Effect of Using Multithreading
A slightly different PC cluster was used (see Table 6), but the other experimental conditions were almost the same as in the previous experiments. First, 150 frames were generated with a hologram resolution of 1024 × 1024 and ∼23,000 light sources. The generation times of two systems, namely, the proposed system with FD only and the proposed system with both FD and MT, were measured while increasing the number of client PCs (see Table 7). The separate CGH computation time and data transmission time of both systems were similar, and the total generation times of both systems decreased continuously. However, by using multithreading, the data communication time could be further reduced (because the collection subprocess runs in the background), and the advantage of the proposed system with both schemes grew as the number of client PCs increased. With four client PCs, the proposed system with both FD and MT was ∼1.2 times faster than with FD only.
Second, 150 frames were generated with 1024 × 1024 hologram resolution and four client PCs. The generation times of the same two systems were measured while increasing the number of light sources (see Table 8). As expected, both systems, with the fixed number of frames, slowed down in proportion to the number of light sources. In particular, the CGH computation time became much longer than the data transmission time. This gradually reduced the benefit of multithreading. Consequently, although the total generation time of the proposed system was always shorter with multithreading, the speedup index decreased from 1.56 with 5300 light sources to 1.10 with 50,000 light sources.
Fig. 6 Core computation time (ms) in Table 5.
*The system that uniformly assigns the number of frames to each client PC.
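The decline of the speedup index with the number of light sources follows from an idealized steady-state model, stated here as an assumption rather than as a result of the experiments. If T_cp is the per-frame CGH computation time and T_t is the per-frame transmission time (distribution plus collection), and if startup and drain effects are ignored, overlapping the two gives

```latex
\begin{align}
S = \frac{T_{cp} + T_t}{\max(T_{cp},\, T_t)}
  = 1 + \frac{T_t}{T_{cp}} \qquad (T_{cp} \ge T_t).
\end{align}
```

Under this model, a speedup index of 1.56 corresponds to T_t/T_cp ≈ 0.56 and an index of 1.10 to T_t/T_cp ≈ 0.10. Since T_cp grows with the number of light sources while T_t is almost unaffected by it, the index approaches 1.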
Third, 150 frames were generated with ∼23,000 light sources and four client PCs. The generation times of the same two systems were measured while increasing the hologram resolution (see Table 9). As already observed in Table 4, the CGH computation time and the data communication time increased at the same rate as the hologram resolution increased. Consequently, regardless of the hologram resolution, the proposed system with both FD and MT was ∼1.17 times faster than with FD only.
Fourth, the MT scheme was applied to the previous system [21] in the experiment of Table 8. For each frame, the three subprocesses, namely, distribution of the light sources, CGH computation on the client side, and collection of the partial interference patterns from the clients, were parallelized through multithreading. As shown in Table 10, the previous system was also greatly improved, although the improvement gradually diminished as the number of light sources increased. This indicates that the MT scheme is useful for the previous system as well. In fact, the MT scheme was more effective for the previous system because the data transmission time accounts for a higher percentage of its total time. Compared with the results in Table 8, the previous system with MT could be faster than the proposed system with FD only when using a large number of light sources. This demonstrates the impact of the MT scheme. However, with a small number of light sources, the proposed system with FD only was faster. More importantly, the proposed system with both FD and MT was always faster (up to 4.3 times) than the previous system with or without MT. Therefore, both the FD and MT schemes are necessary for fast generation of a video hologram.
Finally, the MT scheme was applied to the previous system [21] in the experiment of Table 9. As shown in Table 11, the improvement by MT was significant and consistent regardless of the hologram resolution. When the hologram resolution was high, the previous system with MT could be faster than the proposed system with FD only (see the results for L ≈ 50,000 in Tables 9 and 11). However, the proposed system with both FD and MT was always faster (up to 2.1 times) than the previous system with or without MT. This again confirms that both the FD and MT schemes are necessary for fast generation of a video hologram.

Conclusion
This study proposed a PC cluster system that efficiently generated a video hologram. The system first parallelized the hologram generation on a frame-to-frame basis to reduce the data communication time between the client PCs and the server and was thus specialized for generating a video hologram with a large number of frames. In addition, the system could optimally distribute the computations to each PC according to its computing power. The efficiency of the proposed system was evident in the experiments. For a video hologram with 100 frames, 1536 × 1536 hologram resolution, and ∼23,000 light sources, the proposed system (composed of five client PCs) generated each frame in 242 ms. This time was 1.9 times shorter than that of the system that parallelized the computations for generating each individual frame and 1.8 times shorter than that of the system that equally distributed the computations to each PC.
The proposed system also enabled the subprocesses for generating each frame of a video hologram to execute in parallel through multithreading. This made the data communication time close to zero and thus enabled the proposed system (composed of four client PCs) to be an additional 1.2 times faster in the experiment where a video hologram with 150 frames and ∼23,000 light sources was generated.
With the proposed schemes for reducing the data communication time, the hologram generation time is expected to decrease further as the number of client PCs increases. Therefore, it would be interesting to analyze the performance of the proposed system with many more PCs. In addition, the performance of the proposed system will depend on other aspects of the system configuration (specifications or topology of the client PCs). Therefore, in the near future, we are going to explore how to configure a cluster more optimally.

Fig. 1 Overview of the proposed system.

Fig. 2 Parallelizing the CGH computation on a frame-to-frame basis (upper one) and on a light source basis (lower one). The height of the blue boxes indicates the amount of computations that may be processed at a time on each PC.

Fig. 4 3-D object video used in our experiments: (a) 46th frame, (b) its CGH image (1536 × 1536), (c) enlargement of a region (black square) in the CGH image, and (d) the optical reconstruction image. The main purpose of this study is the rapid generation of the CGH image for each frame.

Table 1 Specifications of each PC in the first PC cluster.

Table 2 CGH computation time (ms) per frame according to the number of cluster PCs. The value within parentheses represents the core computation time, excluding the data communication time.

Table 3 CGH computation time (ms) per frame according to the number of light sources.

Table 4 CGH computation time (ms) per frame according to the hologram resolution. The value within parentheses represents the core computation time, excluding the data communication time.

Table 5 CGH computation time (ms) per frame according to the number of frames [Ψ in Eq. (3)].

Table 6 Specifications of each PC in the second PC cluster.

Table 7 Video hologram generation time (s) according to the number of cluster PCs.

Table 8 Video hologram generation time (s) according to the number of light sources.

Table 9 Video hologram generation time (s) according to the hologram resolution.

Table 10 Video hologram generation time (s) of the previous study [21] according to the number of light sources.

Table 11 Video hologram generation time (s) of the previous study [21] according to the hologram resolution.