An image mosaic combines a series of images or video frames to form a comprehensive view of the scene. It has wide applications in a variety of areas such as video compression,^{1} global environment understanding,^{2} video editing and indexing,^{3} and panoramic image generation.^{4}

In order to combine a series of images, correspondence among images has to be established. This can be achieved by making use of common areas in the images, or the relationship between two images, which is usually represented by a mathematical transform. Due to the overwhelmingly large number of image pixels, it is prohibiting to find correspondence between images by doing a pixel by pixel search. Instead, only those pixels that convey critical information about the images are chosen, and they are referred to as interest points. The geometric and optical properties of these interest points are then evaluated to form the so-called local descriptors. The name comes from the fact that local descriptors are calculated from a neighborhood of individual interest points. Interest points from different images can then be matched by comparing their local descriptors. By using the matching points, correspondence between images can then be established. Finally, the images are mapped onto a common grid to form a comprehensive view.

The selection of interest points is critical for mosaicking. A significant amount of research has gone into developing criteria for choosing interest points. For example, some criteria are based upon target corners or junction of edges, to name a few.^{5} The interest points are chosen from those pixels with the largest distinguishing value according to their respective criteria. However, little has been done to determine how many interest points should be chosen for an image series. This is not considered an issue if one takes into account the fact that the more interest points are selected, the better the matching. However, the increase in computational complexity may prohibit the image mosaicking system from realistic applications. To illustrate, let the number of interest points be
$n$
. Then there are
${n}^{2}$
pairs of points to be matched given two images. Thus the computation time is quadratic to the number of interest points.

In this paper, we propose a dynamic point selection procedure to automatically choose the number of points according to similarities among images. Specifically, fewer interest points are chosen if the images are similar to each other; while more are chosen if the images are more disparate. Simulations show that by this mechanism, computation time is significantly reduced without compromising mosaicking accuracy.

Figure 1 shows a novel automatic, dynamic point selection procedure for image mosaicking.

First of all, the saliency map for each image frame is calculated. The saliency information of a pixel corresponds to its likelihood of being a corner point, and is calculated by the algorithm developed by Harris and Stephens.^{6} Once the salience for every pixel is calculated, those pixels with locally maximal saliency values are detected and sorted. From the sorted list, a number
$\left({N}_{1}\right)$
of the most salient points are chosen as interest points.

Local descriptors are calculated from a neighborhood of the interest points. Among a variety of local descriptors, scale-invariant feature transform (SIFT)^{7} has been shown to be robust^{8} and thus used in this paper. Refer to Ref. 7 for implementation details. The descriptors are then normalized to eliminate the influence of luminance change. The similarity of two interest points is determined by the inner product of their descriptors: the larger the inner product, the stronger the similarity.

The interest points are used to establish the correspondence between two images. According to the fundamental principle of camera geometry, if camera lens distortion and target occlusion are not considered, two images, ${\mathbf{I}}_{1}$ and ${\mathbf{I}}_{2}$ , generated for the same target are related by the following projective transform:

where $({x}_{1},{y}_{1})$ and $({x}_{t},{y}_{t})$ are the coordinates of the same target point in ${\mathbf{I}}_{1}$ and ${\mathbf{I}}_{2}$ , respectively.Note that due to quantization error in digital images, Eq. 1 cannot be satisfied strictly for two points corresponding to the same target point. Hence, correspondence between two interest points is defined as follows: if there exists an interest point at $({x}_{2},{y}_{2})$ in ${\mathbf{I}}_{2}$ such that the distance between $({x}_{t},{y}_{t})$ and $({x}_{2},{y}_{2})$ is no greater than $\sqrt{2}$ pixels, then $({x}_{1},{y}_{1})$ and $({x}_{2},{y}_{2})$ correspond to each other. The choice of $\sqrt{2}$ pixels as the distance threshold is because $\sqrt{2}$ is the maximum distance two connecting pixels can have, if 8-connectness is considered.

Once the similarity between every interest point in ${\mathbf{I}}_{1}$ and that in ${\mathbf{I}}_{2}$ is calculated, a number of ${N}_{2}$ $({N}_{2}\u2a7d{N}_{1})$ pairs of the most similar interest points are found. Based on these ${N}_{2}$ pairs of interest points, the projective transform coefficients $\left[{a}_{ij}\right]$ are obtained.

To register a pair of images, a certain number of corresponding interest points have to be found. For example, at least 4 pairs of corresponding interest points are necessary if projective transform as shown in Eq. 1 is used. To ensure a sufficient number of corresponding points generated, more interest points have to be found when the images are more disparate. On the other hand, fewer interest points are preferred for computation efficiency.

The proposed dynamic point selection procedure automatically increases the number of interest point, ${N}_{1}$ , when the selected interest points do not have enough corresponding ones. This is carried out by applying the estimated projective model on all the ${N}_{1}$ interest points. If ${N}_{3}({N}_{2}\u2a7d{N}_{3}\u2a7d{N}_{1})$ pairs of corresponding points are found, and the descriptors for these interest points are similar, then the projective model $\left[{a}_{ij}\right]$ is validated. Otherwise, ${N}_{1}$ is increased and the aforementioned procedure is reiterated.

On the contrary, if the selected interest points consist of more than enough corresponding ones, fewer interest points are chosen for the next pair of images. To be exact, if the previous ${N}_{4}$ or more pairs of images have been successfully registered without increasing ${N}_{1}$ , then ${N}_{1}$ is reduced.

Typical settings for ${N}_{2}$ , ${N}_{3}$ , and ${N}_{4}$ are 6, 12, and 4, respectively. In the experiments, ${N}_{1}$ is initialized to 50, and ${N}_{1}$ is increased by 10 each time for a failure of point registration, or reduced by 10 each time ${N}_{4}$ or more successive pairs of images have been registered without increasing ${N}_{1}$ . The choice of using 10 as the step size for adjusting ${N}_{1}$ is not critical: any step size, as long as it is not too large, can reduce computation.

Upon validation of the matching points, a local registration is carried out to accommodate changes in target appearance. Moving targets are removed and luminance variance is corrected. Finally, the images are warped to a common image grid to form a comprehensive view.

The proposed dynamic point selection is evaluated on the dataset in Ref. 8. This dataset includes the transformation parameters in addition to the images. Figure 2 shows three of the images and 100 interest points detected on each of them, respectively.

As images become more disparate, e.g., when the common area of the image is less, the corresponding points are fewer. Refer to Figs. 2 and 3 for examples. However, to ensure correctly identifying correspondence between images, the number of corresponding points should be greater than a certain number. For example, if the projective model is used to represent the transformation between images, at least 4 pairs of corresponding points have to be identified. Therefore, more interest points should be used for a pair of more disparate images. As shown in Table 1, only 4 interest points are needed from images 1 and 2, to result in 4 pairs of corresponding points between them; while 14 interest points are needed for images 1 and 3, for the same number of corresponding points.

## Table 1

Number of interest point necessary for 4,8,12,16,20,24,28 pairs of corresponding points. The amount of computation saved is shown in the last row.

# of corresponding point | 4 | 8 | 12 | 16 | 20 | 24 | 28 | |
---|---|---|---|---|---|---|---|---|

1 to 2 | 4 | 14 | 18 | 23 | 27 | 39 | 44 | |

# of | 1 to 3 | 14 | 23 | 29 | 41 | 53 | 56 | 62 |

interest | 1 to 4 | 20 | 33 | 53 | 66 | 78 | 87 | 97 |

point | 1 to 5 | 18 | 33 | 50 | 56 | 70 | 87 | 97 |

1 to 6 | 155 | 191 | 247 | 273 | 334 | 429 | 474 | |

Computation saved (%) | 79.22 | 78.41 | 77.88 | 77.40 | 77.40 | 77.85 | 77.81 |

To register a series of images without using dynamic point selection, the number of interest points has to be the maximum for matching any pair of images. As an example, refer to Table 1. To ensure 4 pairs of corresponding interest points between image 1 and image 2, 3, 4, 5, or 6, at least 4, 14, 20, 18, or 155 interest points have to be found, respectively. Without using dynamic point selection, at least 155 interest points have to be found from each of the images. However, by using dynamic point selection, only enough interest points are required. Because the computation complexity of point matching is quadratic to the number of points, the computation time can be reduced as much as 79.22%. Experiments are carried out to find 4, 8, 12, 16, 20, 24, or 28 pairs of corresponding interest points. As shown in Fig. 3 and Table 1, computation time can be reduced around 78%.

The proposed mosaicking algorithm has been well tested on real-world, monocular video sequences. It is shown to be accurate and robust, and runs in real time on a P4 CPU of $2.4\phantom{\rule{0.3em}{0ex}}\mathrm{GHz}$ . One of the sequences is shown in Fig. 4 and the mosaic is shown in Fig. 5.

In conclusion, an image mosaicking algorithm is developed by using a novel dynamic point selection procedure. It automatically selects a sufficient number of interest points. Simulations show that the proposed algorithm generates mosaic accurately and efficiently. Simulations also show that the dynamic point selection procedure can reduce computation time for point matching by about 78%.

## Acknowledgment

This work was supported by General Dynamics C4 Systems, and in part by the National Science Foundation under grant ECS-0002098.

## References

*Proc. 4th Alvey Vision Conf.*, Manchester, NH, pp. 147–151 (1988). Google Scholar