Fast digital image correlation using parallel temporal sequence correlation method

Abstract. Digital image correlation (DIC) is a noncontact technique that is widely used for deformation measurement, but improving the calculation efficiency to achieve real-time DIC calculation has always been a big concern. A parallel temporal sequence DIC method is proposed, which chooses seed points to determine the integer-pixel displacement and applies the moving least-squares fitting technique to acquire the subpixel displacement. This method avoids traditional complex iterations and takes full advantage of the GPU parallel computing. Results of a simulation experiment and an actual experiment demonstrate the accuracy and efficiency of the proposed algorithm. The calculation speed in the simulation experiment of the proposed method achieved 463,320 POI/s, whereas the speed in the actual experiment was 432,866 POI/s, when the speed of the ICGN method was 2700 POI/s and 2074 POI/s under the same accuracy, respectively. Also, the subpixel displacement calculation made up less than 1% of the entire calculation. The computational efficiency could be further enhanced if a faster integer-pixel displacement calculation method is discovered or a parallel algorithm is used.


Introduction
Due to its incomparable advantages, such as low environment requirement and easy data processing, digital image correlation (DIC) 1,2 has been applied in many noncontact measurement areas. [3][4][5][6] Now with the development of highspeed cameras and computer performance, improving the computation efficiency of the DIC method has become a major problem. Continuous improvements have been made to the DIC algorithm.
Originally, the Newton-Raphson 7,8 algorithm was widely accepted and used to determine the subpixel displacement due to its high accuracy. This method requires several iterations to optimize coefficients and to get the correlational function maximum. Later, the inverse compositional Gauss-Newton 9,10 algorithm was proposed, omitting the repeated calculation of Hessian matrix and greatly improving the computation efficiency. Also, a reliability-guided displacement scanning strategy and a precomputed global interpolation coefficient look-up table are employed to accelerate the integer-pixel displacement searching and subpixel displacement calculation, respectively. 11 The calculation of a reliability-guided digital image correlation (RGDIC) method begins with a seed point and is then guided by ZNCC coefficients of the computed points. 9 This ensures that the calculation path is always along the most reliable direction, and error propagation is avoided. Then, a seed point-based parallel method was proposed to improve the calculation speed, which can maximize the computing speed using an improved initial guess. 12 In addition, the parallel computing based on graphics processing units (GPU) 13,14 has been found suitable for image processing, including in the DIC field. The compute unified device architecture (CUDA) as a parallel computing platform, developed by NVIDIA, has been widely used due to its flexibility. Through programming, GPU can be assigned different computational tasks. Points of interest can be divided into many groups, and points in a single group can be calculated at the same time. Considering that the pathdependent tracking strategy wastes the parallel computing power of modern computers with multicore processors, the RGDIC method is improved by using a two-section tracking scheme. 15 These calculated points with correlation coefficients higher than a preset threshold are taken as reliable computed points and are given the same priority to extend the correlation analysis to their neighbors. Thus, DIC calculation can be initially executed in parallel at multiple points by separate independent threads. A parallel DIC (paDIC) method powered by GPU parallel computing is proposed. 16 This method combines the IC-GN algorithm for subpixel registration with a path-independent fast Fourier transformbased cross correlation algorithm for integer-pixel initial guess estimation, achieving a superior computation efficiency over the DIC method, purely running on the CPU. Based on paDIC method, a pipelined system framework unifying five variants of combinations of CPU and GPU was proposed, which can be flexibly applied to various practical applications with different requirements of measurement scales and speeds. 17 A fast method was introduced for estimation of dense two-and three-dimensional displacement fields from image correlation. It is based on a local, or window-based, optical flow algorithm, which is ideally suited for parallel processors. This method adopted a coarse-to-fine strategy on a multiresolution image pyramid helping to propagate a good initialization for the optimization throughout the scales; hence, convergence could be reached even for *Address all correspondence to Ming Cai, E-mail: caiming@mail.sysu.edu.cn displacements significantly larger than the subset size. 18 However, the repeated iteration is not avoided and is only simplified through these methods.
A DIC algorithm combining the spatial correlation with temporal continuity is proposed. 19 The time-consuming iteration operation during subpixel displacement calculation is replaced by the fitting method based on the moving leastsquares technique. Based on this method, a parallel temporal sequence DIC method is proposed. A fast integer-pixel searching algorithm based on the seed points is used to calculate the integer-pixel displacement. The integer-pixel displacement results of seed points are calculated along the time axis, and these results are taken as the initial displacement guess of other points. Thus, the searching field of other points can be greatly reduced and calculated together by separate independent threads. The fitting method based on the moving least-squares technique 20 is applied to determine the subpixel displacement, avoiding complex iterations and improving computation efficiency. Also, the fitting process of every point is fully independent, without any hinderance from the neighboring points and the overlapping subsets, such that the advantage of GPU parallel computing can be exploited.
In this paper, GPU parallel computing is used to speed up a temporal sequence DIC method, which chose seed points to determine the integer-pixel displacement and applied the moving least-squares fitting technique to acquire the subpixel displacement. The result shows a super-fast calculation speed under the equivalent precision.
2 Principle of the Parallel Temporal Sequence DIC Method The parallel temporal sequence DIC algorithm is divided into three steps, and it mainly considers a continuous deformation process. The first step is to calculate the integer-pixel displacement of seed points along the time axis. The second step is to use the results of seed points as the initial values to determine the integer-pixel displacement of other points, and the calculation of other points takes the advantage of parallel computing to accelerate. At last, the moving leastsquares technique is applied to determine the subpixel displacement, also using parallel computing.

Integer-Pixel Displacement Calculation
In order to decrease the computation time, the integer-pixel displacement of seed point at every deformation moment is calculated in advance. Along the time axis, the result of the last moment is also taken as the initial guess of the current moment considering the temporal continuity. As Fig. 1 shows, the point P is chosen as the seed point. In each speckle image, the rectangles in different colors stand by the search field, and the search of every moment is centered on the result of the last moment. Since the deformation is continuous in space, once the integer-pixel displacement of one point pðx; yÞ is acquired, the searching scale of its neighboring points p n ðx n ; y n Þ can be reduced greatly. The searching scales of the neighboring points are depended on the complexity of the deformation field. Considering the large deformation and deformation discontinuity in space, more seed points are needed, so the searching scale of neighboring points can be much smaller.
It is remarkable that the method is not suitable when there is a jump during the deformation process. Large deformation gradient will cause a large computation error, but this kind of computation error can be effectively decreased or avoided by increasing the search scale or by improving the exposure speed of the camera.
Other calculation points usually are chosen in a checkerboard pattern (if the surface of the specimen is irregular such as a hole, the calculation points can then be chosen according to the real situation). The whole points in different locations and in different deformation moments are calculated together. However, if the grid is too tight, the data conflict will still exist when reading gray data from the same speckle image. The strategy is to move the calculation grid. At first, a suitable vertical spacing and horizontal spacing are chosen to achieve a reasonable grid layout to avoid the data conflict, and then the grid is separately moved vertically and horizontally. Thus, the grid can now become tighter within several iterations and all the points of interest can be calculated.
Once the calculation task was determined, the CUDA programming is used to finish the job. As Fig. 1 shows, a multicore GPU usually consists of multiple steaming multiprocessors (SMs), and each SM includes a certain number of steaming processors (SPs). In the CUDA programming, the calculation task was assigned to grids of blocks, and each block contains a number of threads. In order to ensure maximum utilization of the GPU resource and hiding the latency effectively, the number of threads usually required is more than 64 (in this case around 512), and the number of blocks cannot be too small either, 9 which is according to the quantity of the speckle image analyzed once (in this case around 100). During the calculation process, a kernel consisting of a series of data and instructions is assigned to blocks and then distributed to threads, which run on every SP independently within each block.  A cubic polynomial is adopted as the basic function. However, the integer-pixel displacements happen to be stair-stepping due to the neighboring results being the same. In order to overcome the stairstepping effect and improve the fitting accuracy, the integerpixel displacement is weighted according to the correlation coefficient in advance. Previous studies have generally considered all integer-pixel displacements to have equal precision, hence the least squares algorithm is employed. However, accuracy differs among points in practice. The correlation coefficient is a simple representation of reliability, where larger correlation coefficient indicates that the integer-pixel displacement is a better approximation of the actual displacement. Therefore, a weighted function is generated based on the correlation coefficient, such that a point's weight is larger for larger correlation coefficient, and vice versa. The weighted fitting principle is described as follows:

Subpixel Displacement Calculation
E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 6 3 ; 2 7 9 Jða 0 ; a 1 ; : : : ; a n Þ ¼ where Jða 0 ; a 1 ; : : : ; a n Þ represents the distance function, y represents the integer-pixel displacement results at different moments, and k represents the weight coefficients. The function fða 0 ; a 1 ; : : : ; a n Þ is the adopted polynomial, and a 0 ; a 1 ; : : : ; a n represents the unknown parameters. m represents the number of integer-pixel displacement results involved in fitting. In order to get the minimum of Jða 0 ; a 1 ; : : : ; a n Þ, a 0 ; a 1 ; : : : ; a n is substituted into Eq. (2): E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 6 3 ; 1 1 6 ∂J ∂a j ¼ 2 X m i¼1 k i ðx i Þ j ða 0 þ a 1 x i þ : : : þ a n x n i − y i Þ ¼ 0; : The parameters can be obtained according to Eq. (3), then the subpixel displacement of the central point can be calculated. It is remarkable that this method may not work well at the few first moments because the location in the fitting curve of these moments is at the beginning, which may cause large computation error. For example, when the fitting data number is 33, the central number is 16 (t ¼ 16), but the first to the fifth moment must be calculated through the same curve (t ¼ 0;1; 2; : : : ; 15) ( Fig. 3).
Moreover, the fitting process of different points is fully independent, without any hinderance from the neighboring points and the overlapping subsets, such that the advantage of GPU parallel computing can be exploited. Similar to the integer-pixel displacement calculation, the computation task was divided into many groups and was distributed to every thread within every block. Also, the number of threads required is more than 64 (in this case around 512), and the number of blocks cannot be too small either (in this case around 100).

Experimental Verification
The parallel fitting DIC method was programmed using the C++ language based on CUDA 8.0 and was run on a desktop computer equipped with Intel i7-7700 CPU and a GeForce GTX 1080Ti graphics card. A simulation duralumin fatigue tensile experiment was used to illustrate the accuracy and efficiency of the method. As Fig. 4 shows, the speckle image was obtained as the reference image from a real experiment. Other speckle images were generated with preset subpixel  vðn; hÞ ¼ 5.0 þ 5.0 · sin 2π 75 · n − π 2 where h is the row number in the pixel matrix for each image, nðn ¼ 1;2; : : : ; 375Þ is the time series number for the image, and the unit for v is pixels. The coefficient in this equation was acquired from the ANSYS simulation result. A model of the duralumin specimen under load was established and simulated in the ANSYS software. The displacement curve was then obtained to calculate the coefficient. Since the plastic stage was relatively complex in the tensile experiment, the results were only calculated for the elastic stage in this study.
The computation points were arranged as a checkerboard. Note that the errors of integer-pixel displacements are all within 1 pixel. In order to evaluate the calculation accuracy and speed of the proposed algorithm, the ICGN method was used to compare. As shown in Fig. 5, the green line was the real displacement curve of a random calculation point along the deformation time axis, and the blue and the red points stood by the calculation results of the proposed algorithm and the ICGN method at each deformation time, respectively. These points agreed well with the actual sine curve, thereby exhibiting that the calculation of the proposed algorithm was correct and highly accurate.
The average error 7 and standard error 7 at every deformation time were used to illustrate the computation accuracy. Figure 6 shows the average error of the proposed algorithm and the ICGN method at different deformation times, and the result showed that the average error of the proposed algorithm is almost at the same level with the ICGN method.
Tables 1 and 2 show that the calculation speed of the proposed algorithm increased when the calculation image number or the calculation point number at a single deformation time increased. With the calculation requirement increasing, the computing units of GPU device can be fully unitized and the calculation speed increases. At last, it will reach the limit, and the speed becomes stable. According to Table 2, the maximum calculation speed achieved 463,320 POI/s (points of interest per second) when the calculation speed of the ICGN method was 2700 POI/s under the same precision, which verified that the proposed algorithm could greatly improve the computation efficiency compared with the ICGN method. Also, the result showed that the subpixel fitting method was suitable for parallel computing, and the subpixel displacement calculation made up less than 1% of the entire calculation.
In order to verify the accuracy and efficiency of the proposed method in the actual experiment, a real low carbon steel tensile experiment was used. The tensile sample size is shown in Fig. 7(a). The loading rate was 0.02 mm∕s, and   200 speckle images were obtained from the experiment, one of which is shown in Fig. 7(b). Also, the average error 7 and standard error 7 at every deformation time were used to illustrate the computation accuracy. (The real displacement of every point at every deformation time was obtained and calculated from the extensometer.) Figure 8 shows the average error and standard error of the proposed algorithm and the ICGN method at different deformation times. The result showed that the average error and standard error of the proposed algorithm are little more than those of the ICGN method. But the average error was within 0.04 pixels, whereas the standard error was within 0.03 pixels, which shows that the computation accuracy of the proposed algorithm was at the same level with the accuracy of present algorithms (which were usually within 0.005 to 0.1 pixels). 14 According to Table 3, in the actual tensile experiment, the calculation speed achieved 432,866 POI/s when the calculation speed of the ICGN method was 2052 POI/s under the same precision, which verified that the proposed algorithm could greatly improve the computation efficiency compared with the ICGN method. Also, in the simulation experiment, the subpixel displacement calculation made up less than 1% of the entire calculation. The computational efficiency could be further enhanced if a faster integer-pixel displacement calculation method is discovered. Moreover, it can be found that the calculation speed of the ICGN method clearly reduced in the actual experiment, whereas the speed of the proposed method almost remained unchanged. In the stage of subpixel displacement calculation using the ICGN method, more iterations are always needed in the actual experiment due to the lower quality speckle image compared with the simulation  experiment, but its effect on the speed of the proposed method can be ignored because there were no iteration progress during the proposed calculation method.

Conclusion
In this study, a new parallel temporal sequence DIC method was described. This method chose seed points to determine the integer-pixel displacement and applied the weighted moving least-squares fitting technique to acquire the subpixel displacement. The complex iterations were completely avoided, and the computation efficiency was further improved by parallel computing. The computational accuracy and efficiency of this proposed method were analyzed, and the conclusions are as follows.
• In the actual experiment, the calculation speed achieved 432,866 POI/s when the calculation speed of the ICGN method was 2052 POI/s under the same precision, which greatly improved the computation efficiency and did help in achieving real-time DIC. • The subpixel fitting method was suitable for parallel computing, and the subpixel displacement calculation made up less than 1% of the entire calculation. If a faster integer-pixel displacement is discovered, the computation speed can be further improved. • Under the restriction of data reading conflict, the grid cannot be too dense. If too many interest points were to be calculated, the grid can be separately moved vertically and horizontally, and the calculation grid can then be converged within several iterations.