Statistical approach to time-to-impact estimation suitable for real-time near-sensor implementation

Abstract. We present a method to estimate the time-to-impact (TTI) from a sequence of images. The method is based on detecting and tracking local extremal points. Their endurance within and between pixels is measured, accumulated, and used to obtain the TTI. This method, which improves on an earlier proposal, is entirely different from the ordinary optical flow technique and allows for fast, low-complexity processing. The method is inspired by insects, which have some TTI capability without the possibility to compute a complex optical flow. The method is further suitable for near-sensor image processing architectures.


Background
There are numerous applications for a system that can estimate time-to-impact (TTI) from a sequence of images generated by a video camera. The applications range from vehicle collision warning sensors to robotics and safety systems in industry. TTI estimation is a special case of general motion estimation. The aim is to estimate when a possible collision may occur between the camera and an object seen by the camera.
The image processing needed to perform real-time TTI estimation usually requires a fair amount of hardware and processing resources. Spatial motion within the image is typically estimated from the optical flow. 1 To do this in real time requires fast computing hardware and data storage that can hold one or more frames. The camera itself needs to produce low-noise images of good quality. 2 If used in outdoor applications, the dynamic range of the camera needs to be high. 3 Inspired by the vision system of insects, our ambition has been to find a simpler and faster method for TTI estimation. One characteristic of insect vision is the use of extremely localized spatial and temporal contrasts. 4 In two previous papers, we presented a method based on estimating the "inverse" of the motion (how long an image feature stays at the same pixel position). 5,6 This approach drastically reduces the computational load. We also showed that the method lends itself naturally to a smart sensor architecture denoted near-sensor image processing (NSIP). 7 In this paper, we extend our search for a method which, like our previous proposal, lends itself to a low-complexity hardware implementation. The two methods detect local extremal points (LEPs) in the same way but use completely different methods to compute TTI. The method presented here is much more stable in the presence of noise, particularly in the 2D case. The main reason is that the previous method required runs of 10 to 20 consecutive uninterrupted LEPs, which made it highly noise sensitive. In contrast, the current method requires no more than two frames, which, besides being less sensitive to noise, also allows for a faster decision.
As in our previous papers, we investigate the case where the relative motion is "head-on," meaning that the motion of the camera is co-linear with the line-of-sight. Furthermore, the object is assumed to fill the entire field-of-view and to have a depth that is essentially constant over the image.
In Sec. 3, we give a short background to TTI estimation and present our alternate way to compute optical flow and to estimate TTI for the 1D case. Sections 4 and 5 extend the method to 2D and give simulation and experimental results. In Sec. 6, we discuss implementation issues based on dedicated architectures and performance estimates. Section 7 finally summarizes the results and discusses conclusions that can be drawn from them.

TTI Estimation, 1D
It is well-known that TTI estimation from a camera input does not require absolute measurements of velocity or distance. Instead, motion in the image plane, such as the optical flow, 8 is sufficient data for the purpose. An object that contains contrast variations will generate an image sequence where such image motions (displacements) expand from a central point, denoted the focus-of-expansion (FOE). The rate of expansion is small close to this point and increases further out in the image. Assuming a pinhole camera model, the displacement at an image point will be proportional to the distance between this point and the FOE. The proportionality factor k will change as the camera moves closer to the object. However, it is sufficient to get just one measurement of k in order to estimate TTI, provided that the relative velocity between the camera and the object stays constant. As shown earlier, TTI is inversely proportional to k. 5 Naturally, with several estimates at different points in time, the accuracy can be improved, and tracking of nonconstant camera and/or object velocity also becomes possible.
We next describe our proposed method to estimate k.

Statistical Approach to Estimate k
Our earlier approach for the TTI application utilized two main principles, namely (i) that motion vectors are only computed at specific feature points, which we denote local extremal points (LEPs), and (ii) that time, rather than displacement, is measured. Based on how long a feature point stays within one pixel position, we obtained the optical flow data needed for our purpose by taking the inverse of this time interval. The new algorithm is based on the same principle, i.e., using the stability of LEPs in the image to compute TTI. However, the previous algorithm required a large number of frames before producing a result. The new algorithm, being based on statistics, can achieve an answer from just two consecutive frames, similar to a traditional optical-flow method.
We will describe the method by starting with the 1D case. We assume that there are N pixels altogether along a line, that the frame rate is high enough that consecutive frames are available at approximately the same TTI measurement instance (same k), and that the FOE is located at the center of the line. We further index the pixels with i = 0, 1, 2, ..., N/2 in both directions starting from the FOE. The displacement in the image plane due to the motion is denoted D(i) and is expressed in pixels with reference to the previous frame.
By f(i), we denote the existence of an LEP at location i at the time of measurement, i.e., f(i) = 1 if there is an LEP, otherwise f(i) = 0. Likewise, f'(i) denotes the existence of a new LEP [meaning that f(i) = 0 in the previous frame]. The frame rate is assumed to be high enough that all new LEPs are captured. This requires that the maximum displacement, appearing at i = N/2, is less than the interpixel distance, i.e., D_{N/2} < 1.
However, as we will see later, the method fails gracefully for larger displacements. Given that an LEP at position i moves at a speed of D_i pixels/frame, this LEP will, on average, stay at the same pixel location for a duration of 1/D_i frames, which we denote the LEP run length. It will then have moved to a neighboring pixel. In our previous method, we measured the run lengths of the LEPs, which required collecting many frames, particularly to get correct values for LEPs close to the FOE.
However, as there is a close relationship between the appearance of a new LEP and the ending of a run length, it is possible to estimate k from this relationship in a statistical sense.
Based on the (optical flow) relation D_i = k · i and the fact that a new LEP appears whenever the total displacement reaches the distance of one pixel (after some number of frames), a proportion equal to D_i of the LEPs at distance i will generate new LEPs at the next frame. In our 1D setup, there can be up to two LEPs located at the distance i from the FOE; in 2D, there can be several of them. Alternatively, we can view the proportion as the probability that a new LEP will appear at a neighboring location, given an LEP at location i. It should be observed that, due to the definition of LEPs, two of them cannot exist at neighboring locations.
The expected number of new LEPs at the next frame can thus be written as

$$E\Big[\sum_i f'(i)\Big] = \sum_i D_i\, f(i) = k \sum_i i\, f(i), \qquad (1)$$

where the summation is taken over the whole image line (both instances of i). From this, we see that

$$k \approx \frac{\sum_i f'(i)}{\sum_i i\, f(i)}. \qquad (2)$$

If the LEPs are sufficiently spread out across the image, which is the case for typical natural images, we can approximate (see the Appendix):

$$\sum_i i\, f(i) \approx \frac{1}{N} \sum_i i \cdot \sum_i f(i). \qquad (3)$$

This leads to the following estimate of k:

$$k \approx \mathrm{const} \cdot \frac{\sum_i f'(i)}{\sum_i f(i)}. \qquad (4)$$

With the FOE located at the midpoint of the sensor, it can be shown that const ≈ 4/N.
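As a concrete sketch of Eq. (4), the estimator reduces to counting LEPs and new LEPs in the current frame. The function name and the boolean-array representation below are ours, not from the paper:

```python
import numpy as np

def estimate_k_1d(lep_prev, lep_curr, N):
    """Estimate the expansion factor k from two consecutive 1-D LEP maps.

    lep_prev, lep_curr: boolean arrays of length N marking LEP positions,
    with the FOE assumed at the center of the line.  A "new" LEP is one
    present in the current frame but absent in the previous one.
    Implements k ~ (4/N) * sum(f'(i)) / sum(f(i)).
    """
    new_leps = lep_curr & ~lep_prev      # f'(i)
    n_leps = lep_curr.sum()              # sum of f(i)
    if n_leps == 0:
        return None                      # no LEPs -> no estimate possible
    return (4.0 / N) * new_leps.sum() / n_leps
```

Averaging the estimate over several frame pairs reduces its statistical fluctuation.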

Finding Local Extremal Points
LEPs are defined as local maximum or minimum points in the image. In 1D, we compare each pixel with its two neighbors. If the pixel has a higher/lower value than both of its neighbors, it is considered a local maximum/minimum. Although LEP detection is susceptible to noise, our proposed algorithm is statistical, and a few erroneous LEPs will not distort the TTI estimation severely. An LEP normally needs to be visible in at least two successive frames to allow for the computation of its displacement. Since the LEPs are based on the relative intensities between neighboring pixels, they can be extracted readily by cameras that are able to form such information dynamically, for instance, NSIP cameras. 9,10
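In code, the 1-D LEP test is a strict comparison against both neighbors. A minimal sketch (border pixels are simply skipped here, a choice the paper does not specify):

```python
import numpy as np

def find_leps_1d(line):
    """Boolean map of local extremal points along a 1-D intensity line.

    A pixel is an LEP if it is strictly greater (local maximum) or strictly
    smaller (local minimum) than both of its neighbors.
    """
    line = np.asarray(line, dtype=float)
    lep = np.zeros(line.shape, dtype=bool)
    left, mid, right = line[:-2], line[1:-1], line[2:]
    lep[1:-1] = ((mid > left) & (mid > right)) | ((mid < left) & (mid < right))
    return lep
```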

Estimating k and TTI
The basis of the algorithm is thus the following: detect the LEPs in each frame, count the total number of LEPs and the number of new LEPs, and form the estimate of k according to Eq. (4). The TTI can now be computed as

$$\mathrm{TTI} = \frac{T_d}{k}, \qquad (5)$$

where T_d is the time between two frames and k is given by Eq. (4).
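Putting the pieces together, a two-frame TTI estimate can be sketched as follows. This is illustrative only; the names are ours, and the LEP detector is the simple strict-extremum test described earlier:

```python
import math
import numpy as np

def find_leps(line):
    """Strict local maxima/minima along a 1-D line (borders skipped)."""
    line = np.asarray(line, dtype=float)
    lep = np.zeros(line.shape, dtype=bool)
    l, m, r = line[:-2], line[1:-1], line[2:]
    lep[1:-1] = ((m > l) & (m > r)) | ((m < l) & (m < r))
    return lep

def tti_two_frames(frame_prev, frame_curr, T_d):
    """TTI (in units of T_d) from two consecutive 1-D frames: Eq. (4) then TTI = T_d / k."""
    prev, curr = find_leps(frame_prev), find_leps(frame_curr)
    new = curr & ~prev
    n = curr.sum()
    if n == 0 or new.sum() == 0:
        return math.inf                  # no measurable expansion yet
    k = (4.0 / len(frame_curr)) * new.sum() / n
    return T_d / k
```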

Simulation
We simulate the algorithm using a 1D image of size N = 512. We zoomed into the image in 150 steps. For the first 15 steps, the zoom factor was 0.1 pixels per frame (for the outermost pixel). For the next 15 steps, the zoom factor was 0.2, and so on. For each step, we compute all LEPs. Plotting the LEP images for all 150 1D frames results in the image shown in Fig. 1. It is seen that the LEP positions are quite stable in the beginning (top of the image) while they change their positions toward the end (bottom of the image).
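A zoom experiment of this kind can be reproduced in a few lines. This is a sketch under simplifying assumptions of our own: a smoothed random 1-D scene, linear interpolation for the zoom, and a single constant zoom factor instead of the stepped schedule above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k_true, frames = 512, 0.002, 40       # max edge displacement ~0.5 px/frame

# Smoothed random scene on a finer grid so frames can be resampled from it
base = np.convolve(rng.standard_normal(4 * N), np.ones(9) / 9, mode="same")

def find_leps(line):
    lep = np.zeros(line.shape, dtype=bool)
    l, m, r = line[:-2], line[1:-1], line[2:]
    lep[1:-1] = ((m > l) & (m > r)) | ((m < l) & (m < r))
    return lep

def frame(t):
    # Pixel coordinates (FOE at the center) map to scene coordinates that
    # shrink by (1 + k_true) per frame as the camera approaches.
    x = (np.arange(N) - N / 2) / (1 + k_true) ** t + 2 * N
    return np.interp(x, np.arange(4 * N), base)

estimates = []
prev = find_leps(frame(0))
for t in range(1, frames):
    curr = find_leps(frame(t))
    if curr.sum():
        estimates.append((4 / N) * (curr & ~prev).sum() / curr.sum())
    prev = curr
k_hat = float(np.mean(estimates))        # per-frame estimates of k, averaged
```

Per-frame estimates fluctuate noticeably; averaging them, as in the stepped experiment above, stabilizes the result.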
Computing the sums of f(i) and f'(i) as a time sequence gives the data shown in Fig. 2. The blue curve corresponds to f(i) and the red curve to f'(i).
Defining, for convenience, R = k · N, which we denote the "zoom factor," it is seen that R corresponds to twice the maximum displacement. Applying Eq. (4) and multiplying by N, we thus get an estimate of the applied zoom factor. Figure 3 shows, for each frame, the computed value of R (in green) and the corresponding true value in blue. The black dots represent the average over the 15 steps with the same step length. Here we can see that the more data we average over, the closer the estimate comes to the true value.

Graceful Degradation
From Fig. 3, it is seen that as the zoom factor increases, the algorithm starts to deviate from the correct result. The reason is that the method requires all displacements to be below 1 pixel. However, the problem is not that severe. As seen in Fig. 4, for zoom factors up to 3 (maximum displacements up to 1.5 pixels), the error is relatively small. For larger zoom factors, the error increases sharply. However, it is possible to detect when this is the case. In Fig. 4, the green dots represent sample R values, the black dots represent the average within each group of 15 frames, and the blue line represents the actual zoom factor at each frame. It is seen that the R values are slightly below the correct values when the zoom factor is between 2 and 3. Above 3, the R values level out.
The reason that R levels out is that f'(i) will approach f(i), as shown in Fig. 5, limiting the R value. This will be the case when essentially all LEPs are new LEPs.
The difference between the true zoom factor and the estimated value is plotted against the true zoom factor in Fig. 6. When the zoom factor exceeds 3, the estimate will be lower than the true value. Since we can estimate R, we know when it is out of range. The corresponding impact time is then shorter than the predicted time.

Dependence on the Number of LEPs
The algorithm is relatively insensitive to the number of LEPs. If we take the same 1D image and perform low-pass filtering before zooming, we get the LEPs shown in Fig. 7.
The corresponding graphs of the sums of f(i) and f'(i) are shown in Fig. 8. When compared with Fig. 2, it is seen that the number of LEPs has been reduced by a factor of 2.5. The R values are noisier, but the average value is stable, as seen in Fig. 9.

Limitations
If the distance to the object is d and the field of view is N pixels, the zoom factor between two frames is given by

$$R = \frac{N\,\Delta d}{d}. \qquad (6)$$

The movement toward the camera between two frames differing in time by Δt, given the velocity v, is

$$\Delta d = v\,\Delta t. \qquad (7)$$

The zoom factor must be smaller than some constant C. We have shown previously that C can be around 3:

$$R = \frac{N v\,\Delta t}{d} < C. \qquad (8)$$

This means that the velocity v must be smaller than

$$v < \frac{C\,d}{N\,\Delta t}. \qquad (9)$$

Extension to 2D

Estimating k
The estimation of k in the 2D case follows the same principle as in the 1D case. The proportion of LEPs that will generate new LEPs in the next frame depends on the distance from the FOE. The displacements in off-angled directions (not along a horizontal or vertical line through the FOE) are larger than their horizontal or vertical components, but so are the distances that the LEPs need to move before reaching a new pixel and thus generating new LEPs. For this reason, it is necessary to compensate for those larger distances between the pixels. Thus, an LEP located at position (i, j) with respect to the FOE is subjected to a displacement of

$$D_{i,j} = k \sqrt{i^2 + j^2}, \qquad (10)$$

but the step required to give rise to a new LEP at the next frame is simultaneously increased due to the longer distance between off-angled neighboring pixels. For example, this extended distance is a factor of √2 of the interpixel distance for diagonally located pixels and in general depends on the angle arctan(j/i).
This function is not trivial, as it depends on the pixel shape and fill factor of the specific sensor in use. For the moment, we will simply assume the existence of such a function s(i, j). In the following, the 2D image is assumed to have a size of N × N pixels.
The expected number of new LEPs is then given by

$$E\Big[\sum_{i,j} f'(i,j)\Big] = k \sum_{i,j} \frac{\sqrt{i^2+j^2}}{s(i,j)}\, f(i,j), \qquad (11)$$

from which

$$k \approx \frac{\sum_{i,j} f'(i,j)}{\sum_{i,j} \big(\sqrt{i^2+j^2}/s(i,j)\big)\, f(i,j)}. \qquad (12)$$

As in the 1D case, if the LEPs are sufficiently spread out, the two factors within the summation sign can be separated,

$$\sum_{i,j} \frac{\sqrt{i^2+j^2}}{s(i,j)}\, f(i,j) \approx \frac{1}{N^2} \sum_{i,j} \frac{\sqrt{i^2+j^2}}{s(i,j)} \cdot \sum_{i,j} f(i,j), \qquad (13)$$

which gives

$$k \approx \mathrm{const} \cdot \frac{\sum_{i,j} f'(i,j)}{\sum_{i,j} f(i,j)}, \quad \text{where } \mathrm{const} = \frac{N^2}{\sum_{i,j} \sqrt{i^2+j^2}/s(i,j)}. \qquad (14)$$

As shown in the Appendix, taking s(i, j) to be a constant s, this evaluates to

$$\mathrm{const} \approx \frac{3\sqrt{\pi}\, s}{2N}. \qquad (15)$$

Finding Local Extremal Points in 2D
In the 2D case, there are several different ways to define an LEP. The two most common are based on the chessboard distance and the city-block distance. In the chessboard case, the pixel is compared to its eight neighboring pixels; in the city-block case, to its four neighboring pixels. The image shown in Fig. 10 was used as the object. Camera motion toward this object is simulated by zooming in on the image. Figure 11 shows the LEPs in a 2D image based on the chessboard distance. As in the 1D case, an LEP is defined as a pixel that has a higher (or lower) value than all of its neighbors.
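Both neighborhood definitions are easy to express directly. A sketch (the helper name and the choice to skip border pixels are ours):

```python
import numpy as np

def find_leps_2d(img, connectivity=8):
    """Mark 2-D LEPs: pixels strictly greater (or smaller) than all neighbors.

    connectivity=8 uses the chessboard neighborhood, 4 the city-block one.
    Border pixels are excluded.
    """
    img = np.asarray(img, dtype=float)
    c = img[1:-1, 1:-1]
    if connectivity == 8:
        offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                   if (di, dj) != (0, 0)]
    else:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    # Stack the shifted neighbor views so each can be compared to the center
    neigh = np.stack([img[1 + di:img.shape[0] - 1 + di,
                          1 + dj:img.shape[1] - 1 + dj]
                      for di, dj in offsets])
    lep = np.zeros(img.shape, dtype=bool)
    lep[1:-1, 1:-1] = (c > neigh).all(axis=0) | (c < neigh).all(axis=0)
    return lep
```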
In the same way as in the 1D case, we define f'(i, j) as the new LEPs appearing in a consecutive frame. The image in Fig. 12 shows f'(i, j) when zooming into the image in Fig. 10.

Estimating k
For the 2D simulation, we generated an image sequence of 15 frames. The first six frames were generated using a zoom factor of 0.2 pixels/row per frame, where each row consists of 512 pixels. The following five frames were generated using a zoom factor of 0.4 pixels/row per frame, and the last four frames were generated using a zoom factor of 0.6 pixels/row per frame. As in the 1D case, the zoom factor corresponds to twice the largest horizontal/vertical displacement in the frames appearing at the four midpoints of the frame borders since the FOE is located at the center of the frames.
Summing up all f(i, j) and f'(i, j) as a function of time, we get the graphs shown in Fig. 13. Dividing the two sums by each other and applying the constant factor given in Eq. (15), we get the points shown in Fig. 14.
It is seen that we get three clusters whose levels have ratios of 1, 2, and 3, matching the true zoom factors well.
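The 2-D estimator is the same counting statistic with the constant factor from Eq. (15). A sketch of our own, taking s(i, j) as a constant s (the function name is not from the paper):

```python
import math
import numpy as np

def estimate_zoom_2d(lep_prev, lep_curr, N, s=1.0):
    """Per-frame zoom factor R = k * N from two consecutive 2-D LEP maps.

    Uses k ~ const * (#new LEPs) / (#LEPs) with const = 3*sqrt(pi)*s / (2*N),
    i.e., s(i, j) approximated by a constant s.
    """
    new = lep_curr & ~lep_prev
    n = lep_curr.sum()
    if n == 0:
        return None
    k = 3 * math.sqrt(math.pi) * s / (2 * N) * new.sum() / n
    return k * N
```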

Experiments
The experimental setup consists of a moving platform that moves toward the camera in a controlled fashion (see Fig. 15). The platform carries a picture which is the input to the algorithm. The distance between the camera and the image is initially 180 cm, becoming 171 cm after 50 frames. The camera has a frame rate of 1 frame/s. The speed of the platform is 0.12 cm/s for the first 25 frames and then 0.24 cm/s. It should be noted that this setup is many orders of magnitude slower than the expected performance of a dedicated sensor. The reason for this was to simplify the setup using a standard consumer camera. In principle, this amounts to merely a temporal scale factor with no other expected influence on the measurements.
From the high-resolution color camera image, we extracted a 512 × 512 grayscale image, which was at the center of the original image (see Fig. 16).
The TTI for the first 25 samples should be between 1550 and 1475 s, depending on whether the first or last frames are used for the estimation. Likewise, for the last 25 samples, the TTI should be between 738 and 712 s. Figure 17 shows the results from the experiment. The blue line is the actual value for each sample and the red line is the 3 × 1 average of the blue line. We can see that there is good agreement between the theoretical values and the experimental results.
In Eq. (9), it is seen that the bound on v is inversely proportional to the resolution, i.e., if we downsample by a factor of 4, we increase the maximum allowed value of v by a factor of 4.
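Eq. (9) gives a simple operating-envelope check. A sketch (the function name and unit conventions are ours; d in meters, dt in seconds):

```python
def max_velocity(C, d, N, dt):
    """Largest approach speed for which the zoom factor stays below C.

    From R = N * v * dt / d < C it follows that v < C * d / (N * dt);
    reducing the resolution N by a factor of 4 raises the bound by 4.
    """
    return C * d / (N * dt)
```

For example, with C = 3, d = 1 m, N = 512, and dt = 1 ms, the bound is about 5.9 m/s; downsampling to N = 128 raises it to about 23.4 m/s.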
In Fig. 18, the frame was further downsampled to 128 × 128 before the algorithm was applied. Here we can see that for the higher speed (frames 26 to 50), we get almost the same results as for the original image. For the first 25 samples, the difference varies from 0% to 100%.  The reason for the difference between the two resolutions can be found in Fig. 19 where we have plotted the number of LEPs and the number of new LEPs for each frame. In the first part, there are very few new LEPs per frame which makes it sensitive to noise.

Implementation Issues
Our aim has been to develop a method that lends itself to a low-complexity implementation, ultimately a single-chip solution. In this section, we discuss the basic ingredients of such an implementation. The philosophy is to use the NSIP paradigm. According to this principle, some of the early feature extraction is performed in the analog domain, utilizing the behavior of the photosensing circuitry itself. 9

Sensor-Pixel Processor and the NSIP Paradigm
NSIP is based on a tight integration between the photosensitive element and a low-complexity digital pixel processor. A/D conversion is not supported in hardware and is typically not used as part of the required signal processing. Instead, the fact that a photodiode discharges linearly in time when exposed to light is combined with a comparator/threshold circuit to turn the light intensity into a corresponding timing event, as shown in Fig. 20.
The output from the comparator is fed into a digital processor element. To implement the TTI algorithm, this processor consists of a combinatorial circuit, which connects to the eight neighboring comparator outputs, and two one-bit memories. The combinatorial circuit detects asynchronously when its own comparator returns a binary zero while the neighboring comparators still return binary ones, thus indicating the occurrence of an LEP. One of the bit memories stores this information. The second bit memory indicates whether the LEP is new or existed in the previous exposure. The size of the chip and the number of transistors are estimated as follows.

Single-Chip System
Assuming the use of a 100-nm CMOS technology and a sensor size of 256 × 256, the pixel pitch is estimated to be 14 × 14 μm², out of which 50% consists of the photodiode. The summation network is purely combinatorial and consists of cascaded one-bit adders. To make it faster, it can be implemented as a two-layer structure (e.g., eight groups of 32 one-bit adders in the first layer and seven 8-bit adders in the second layer). Based on previous implementations of such networks made by one of the authors, its size is estimated to occupy a depth of 50 μm. Depending on the desired output (separate LEP sums, k value, or TTI), a fixed-point multiplier/divider unit may be included as well. Since it only needs to compute one value per frame, it can be sequential in nature, e.g., computing a reciprocal iteratively to perform division through the multiplier unit. Altogether, including pads and some decoding circuits, the chip size is estimated to be <20 mm². Naturally, using a more advanced VLSI technology, the size of the chip would shrink accordingly.
The summation network as well as the available light intensity will define the limiting speed at which the circuit can operate. It is estimated that a two-layer summation network requires <0.5 μs. Since available LEPs as well as new LEPs are summed, each row computation will take 1 μs, leading to a frame time Δt of 0.25 ms. According to Eq. (9), this corresponds to the following minimum value of the TTI:

$$\mathrm{TTI}_{\min} = \frac{d}{v_{\max}} = \frac{N\,\Delta t}{C} \approx 21\ \mathrm{ms}.$$

Appendix

Let us assume that f(i) is a binary random vector with probability q that f(i) = 1 and probability (1 − q) that f(i) = 0.
Taking the expectation of the left-hand side of Eq. (3) gives:

$$E\Big[\sum_i i\, f(i)\Big] = q \sum_i i \approx q\,\frac{N^2}{4}.$$

Similarly, for the right-hand side:

$$E\Big[\frac{1}{N} \sum_i i \cdot \sum_i f(i)\Big] = \frac{1}{N}\sum_i i \cdot qN \approx q\,\frac{N^2}{4}.$$

The approximation is thus exact in the statistical mean sense. Naturally, this does not guarantee that the approximation is "good" in specific cases. In our case, we use 1D images of size 512 pixels. Table 1 shows the correct value (rounded over 100 trials) and four statistical measures for the approximation error, for 100 outcomes at three different values of the probability q for this vector size. It is seen that the error increases when there are few LEPs and can reach more than 10% in such cases.

In the 2D case, the constant in Eq. (14) is given by

$$\mathrm{const} = \frac{N^2}{\sum_{i,j} \sqrt{i^2+j^2}/s(i,j)}.$$

We will break this down into a substantially simpler computation through a series of approximations.
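The mean-sense argument for the 1-D approximation can also be checked numerically, in the spirit of the Table 1 experiment. A sketch (like the analysis above, the random model ignores the constraint that LEPs cannot be adjacent):

```python
import numpy as np

rng = np.random.default_rng(1)
N, q, trials = 512, 0.2, 100
i = np.abs(np.arange(N) - N // 2)        # distance of each pixel from the FOE

errs = []
for _ in range(trials):
    f = rng.random(N) < q                # binary random LEP vector, P(f=1) = q
    lhs = (i * f).sum()                  # sum_i i * f(i)
    rhs = (N / 4) * f.sum()              # its spread-out approximation
    errs.append((lhs - rhs) / rhs)
mean_err = float(np.mean(errs))          # should be close to zero
```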
First, as noted in Eq. (13), the two factors within the summation sign are separated. Second, we choose s(i, j) to be a constant function, leaving

$$\sum_{i,j} \sqrt{i^2+j^2}$$

to be computed.
For this purpose, we approximate the discrete sum by a continuous integral, where yet one more approximation is introduced by taking the integral over a circular area of radius R instead of the quadratic area of size N × N. Setting πR² = N² gives R = N/√π, so that

$$\sum_{i,j} \sqrt{i^2+j^2} \approx \int_0^{2\pi}\!\!\int_0^{R} r \cdot r\, \mathrm{d}r\, \mathrm{d}\varphi = \frac{2\pi}{3} R^3 = \frac{2}{3\sqrt{\pi}}\, N^3.$$

For example, when N = 512, the above approximation gives 50,482,829, whereas the discrete sum is 51,521,231. Thus, the approximation error is <2%.
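The quality of the circular-integral approximation is easy to verify numerically. A sketch (the exact discrete value depends slightly on how pixel indices are centered on the FOE, which is why we only check the order of the error):

```python
import math
import numpy as np

N = 512
c = np.arange(N) - N // 2                # pixel offsets from the FOE
ii, jj = np.meshgrid(c, c)
discrete = np.sqrt(ii**2 + jj**2).sum()  # sum over the N x N grid

approx = 2 / (3 * math.sqrt(math.pi)) * N**3   # circular-area approximation
rel_err = abs(discrete - approx) / discrete
```

For N = 512 the relative error comes out around 2%, in line with the values quoted above.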
Next, we set s(i, j) = 1 + (√2 − 1)/2 ≈ 1.21, i.e., the mean of its extreme values. A more detailed analysis shows that s(i, j) is not constant but grows approximately linearly from 1 to √2 for directions between 0 deg and 45 deg. Taking this into consideration when evaluating Eq. (14) would increase the above value of const by about 1.5%.