## 1.

## Introduction

The global village enjoys a data-rich environment but sometimes suffers from an explosive growth of useless data, known in the digital age as the curse of dimensionality. Thus, the scientific community has developed various pre- and post-processing techniques to deal with the curse of dimensionality, for example, a smart organization principle called compressive sensing.^{1, 2} Specifically, when one does not need the frequency information of the sinusoidal components, one should sense not with the Fourier transform (FT) but with a physical group wave packet called the wavelet transform (WT). Furthermore, the wavelet transform in compressive sensing can be viewed as a preprocessing technique replacing the FT with the adaptive WT,^{3} which allows several super-mother wavelet kernels that each match the time-evolving signal content, and therefore creates a natural spatiotemporal preprocessing. For post-processing, conventional data compression removes the redundancy in data, e.g., JPEG, MPEG, face detection, etc., while dimensionality reduction (DR) algorithms, i.e., ISOMAP,^{4} local linear embedding,^{5} Laplacian eigenmap,^{6, 7} and diffusion map,^{8, 9} have been developed to discover the nonlinear manifold underlying complex datasets that traditional dimensionality reduction methods, e.g., principal component analysis^{10} and multidimensional scaling,^{11} could not. In addition, more DR algorithms^{12, 13, 14, 15} have been developed by encoding objects as matrices or tensors of arbitrary order. A generalized framework for dimensionality reduction has been presented in Ref. 16.

The generic DR for video datasets is defined as follows. Since the lexicographical order of the pixels of the *i*'th video frame, *L*×*M* pixels, forms a long vector $\vec X_t \in O(L \times M)$, subscript *t* = 1, …, *N*, with its own intensity values, the video itself becomes an exceedingly high-dimensional set of light spots *O*(*L*×*M*×*N* time-frames). All these light spots contribute to Haken's cooperative and competitive self-organization phenomena,^{17, 18} yielding the collective or master degrees of freedom, e.g., facial pose angles, which drive all the underlying slaved degrees of freedom.

Szu^{19} has derived asymmetric weights *W*_{i, j} ≠ *W*_{j, i}, which have a positive diagonal *W*_{i, i} ⩾ 0, and a Kirchhoff-like graph-theory energy function *K* as the weighted least mean square (LMS)

[TeX:]
$\min K = \min \sum\nolimits_{i=1}^{N}\sum\nolimits_{j=1}^{N} W_{i,j}\,|\vec X_i - \vec X_j|^2 \ge 0,$

where an in-bound risk $R_j = \sum\nolimits_i W_{i,j} \ge 0$ is different from the out-bound popularity $P_i = \sum\nolimits_j W_{i,j} \ge 0$, so that $\sum\nolimits_i |\vec X_i|^2 P_i \ne \sum\nolimits_j |\vec X_j|^2 R_j$, and the off-diagonal terms are $-2\sum\nolimits_i\sum\nolimits_j W_{i,j}\,\vec X_i \cdot \vec X_j$.^{19} Thus, the Kirchhoff energy $K \equiv (\vec X_i, K_{i,j}, \vec X_j) \ge 0$ becomes the inner product of the Kirchhoff matrix operator *K*_{i, j} ≡ [(*R*_{j} + *P*_{i})δ_{i, j}/2 − *W*_{i, j}] with $\vec X_i$ and $\vec X_j$. In the traditional video-pose nonlinear DR, the Gaussian distance among pixels of video images is symmetric, and the Kirchhoff matrix operator *K*_{i,j} reduces to the graph-Laplacian matrix *L*_{ij} ≡ [*d*_{i}δ_{ij} − *W*_{ij}], where *d*_{i} ≡ (*R*_{j} + *P*_{i})/2. The eigenvector of the ground-state zero eigenvalue gives the facial membership functions indicating the same person^{7} (a degenerate ground state indicates more than one person). The eigenvectors of the next eigenvalue become the master order parameters whose inner products with each facial image are the directional cosine angles representing the turning-pose ordering angles. Such a pose-sorting similarity measure refers to the heading angles, such as left-to-right and up-to-down.
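As a concrete illustration of the symmetric reduction above, the following sketch (our own, under the assumption of a Gaussian affinity kernel; not the authors' code) builds the graph Laplacian *L* = *D* − *W* from vectorized images and reads the pose ordering off the eigenvector of the second-smallest eigenvalue:

```python
import numpy as np

def pose_order_parameters(X, sigma=1.0):
    """Sketch of graph-Laplacian pose sorting (our illustration, not the
    paper's code). X holds N vectorized face images; we build a Gaussian
    affinity W, form L = D - W, and use the eigenvector of the
    second-smallest eigenvalue as the master order parameter."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    W = np.exp(-D2 / (2.0 * sigma))                      # symmetric weights
    L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian
    vals, vecs = np.linalg.eigh(L)                       # ascending eigenvalues
    return vecs[:, 1]                                    # Fiedler vector

# Toy data: points along a 1-D "pose angle" curve, presented shuffled.
t = np.array([0.4, 0.0, 0.8, 0.2, 0.6])
X = np.stack([np.cos(t), np.sin(t)], axis=1)
order = np.argsort(pose_order_parameters(X, sigma=0.05))
print(order)  # indices sorted monotonically in t, forward or reversed
```

Sorting by the Fiedler vector recovers the ordering along the underlying one-dimensional manifold up to an overall sign, which is exactly the pose-angle ambiguity noted above.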

To deal with the curse of dimensionality in persistent surveillance, the different strategies mentioned above can be applied, but in the end a potential loss of association among different sensory tracks, e.g., video and audio, may occur: the observer cannot answer "who speaks what, when, and where." In this paper, a video organization principle is given and demonstrated in a controlled persistent surveillance environment, which can be generalized to other applications such as i. smart grid infrastructure, ii. biomedical wellness web, and iii. hyperspectral data for precision farming.

In persistent surveillance, video cameras are remotely set up for security reasons while low-cost audio recorders are hidden locally within the persistent surveillance field, as shown in Fig. 1. To reduce the video storage requirement, one can extract facial poses from each video frame and store those facial boxes instead of the video, which reduces the storage size by six orders of magnitude (see Table 1). Face detection, by facial color hue, and extraction in video are implemented by a face detection (FD)^{20} system on a chip. Relying on time stamps added to facial boxes is not recommended because they may be insufficient to reconstruct the time-order and take extra time. For example, when there is a crowd in a single camera view, the faces in the crowd may be blocked by each other or by obstacles, or detected at insufficient size; such faces are ignored by face detection techniques. Moreover, when passengers move past each other, the face of an unwanted person momentarily fills the location of the face of the person of interest. As a consequence, facial boxes cut from each video frame are labeled by time stamps, but the order of facial boxes within each frame is still random. For the reasons mentioned above, the collection of facial poses in video over 10 s is stored in a randomized order for further processing. Such a loss of temporal correlation requires a re-establishment of the association across different sensor modalities that may help to ascertain "who said what, when, and where."

## Table 1

Persistent surveillance is commonly used in applications ranging from military to civil. Its storage requirement is huge and its content organization is complex due to the huge amount of video data; it is a challenge for security personnel to sort and search for specific intruders among enormous video data. Since human faces are the most typical surveillance targets, saving the faces of passengers/intruders without redundant background information saves up to six orders of magnitude in storage volume.

| Size of video frames (pixels), 30 Hz | 640×480, 8-bit gray value | 1024×768, 8-bit gray value |
|---|---|---|
| Storage requirement of video (week) | 5.58×10^{12} bytes | 1.43×10^{13} bytes |
| Size of facial boxes (pixels), 30 Hz | 60×60 | 90×90 |
| Total time of intruders revealed in video (α) | 10% | 10% |
| Average number of intruders recorded in video (β) | 5 | 5 |
| Storage requirement of saving facial boxes (week) | 3.27×10^{10} bytes | 7.3×10^{10} bytes |
| Frontal facial boxes recorded in different resolutions (γ) | 5 | 5 |
| Total number of intruders recorded in video (week) | 1000 | 1000 |
| Storage requirement of saving frontal facial boxes (week) | 1.8×10^{7} bytes | 4.05×10^{7} bytes |
| Size of recorded audio (week), 8 kHz MP3 at 8 kbit/s | 6×10^{8} bytes | 6×10^{8} bytes |
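The weekly storage figures in Table 1 follow from straightforward arithmetic. The sketch below (our own bookkeeping, with the table's α, β, γ values passed in as parameters) assumes raw 8-bit gray pixels at one byte each:

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800 s

def video_bytes(w, h, fps=30):
    """Raw 8-bit gray video storage for one week."""
    return w * h * fps * SECONDS_PER_WEEK

def facial_box_bytes(box, alpha, beta, fps=30):
    """Facial boxes only: intruders visible a fraction alpha of the week,
    beta intruders on average."""
    return box * box * fps * alpha * beta * SECONDS_PER_WEEK

def frontal_bytes(box, gamma, intruders):
    """Gamma frontal mug-shots per intruder, all intruders in a week."""
    return box * box * gamma * intruders

print(f"{video_bytes(640, 480):.2e}")          # ~5.57e12 bytes
print(f"{facial_box_bytes(60, 0.10, 5):.2e}")  # ~3.27e10 bytes
print(f"{frontal_bytes(60, 5, 1000):.2e}")     # 1.80e7 bytes
```

The roughly six-orders-of-magnitude reduction between the first and last figures is the saving claimed in the text.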

For example, a passenger walking through an airport transit area looked left, right, and left again for the exit sign while speaking to her son about "where is the exit?" Both of their facial boxes were collected in Fig. 2. In Fig. 2, facial poses were sorted from left to right and top to bottom by graph-embedded dimensionality reduction algorithms such as the Laplacian eigenmap; such sorting is good for picking out mug-shots of each person, but it cannot reconstruct the time-ordering of walking passengers, which is necessary to associate with an acoustic manifold to find out what may have been spoken. In contrast, in Fig. 2, video and audio were correlated by the reconstructed time-ordering of passengers in a video. Furthermore, the walking speed of a passenger could be determined from a set of time-ordered facial boxes alone, assuming only a few facial boxes were missed by the FD system on a chip. In recent studies,^{21, 22} walking speed is correlated with personality.

A time-order reconstruction requires the careful differentiation between the continuous notion of time and the discrete sampling of time flow, known as the time-ordering in this paper. Thus, we begin with the concept of a clock, Newton's concept of time, which can mark a 1 s time unit with a "tick and tock." Then, the unit of time must be followed by another unit tick and tock to indicate the 2 s time-order. This fact of directional flow may be described by a product of triple correlation functions, $W_{t_i,t_j,t_k}\, W_{t_k,t_l,t_m}$, where the common node *t*_{k} represents a discrete sampling point of time.

In Sec. 2, we approximate the triple correlation products with a product of pair correlation functions for a computationally efficient sorting algorithm, which we call the zipper chain (ZC) algorithm. It works like a zipper: two rows of multiple "teeth" mounted across each other alternately. The underlying topology may be visualized as the time flow along a geodesic line, where two intertwined least mean square (LMS) minima intersect. The ZC algorithm describes the product of paired correlations verified by a zip-chain event: an anchor face, say frontal β, has two nearest neighbors, A and C, while A and C each have two nearest neighbors of their own. The time-order is determined if the anchor face β happens to also be the nearest neighbor of A and of C. To choose an anchor face, the traditionally supervised desired output of the LMS Wiener matched filter becomes unsupervised, self-determined by the facial box dataset itself in two passes: the so-called local instances of good seeing determined by a self-referencing matched filter (SRMF) by Szu and Blodgett^{23} in 1982. The first pass takes a uniform average of all faces, producing a blurred but statistically correct face at a specific pose with respect to the setup between multiple camera angles and the walking corridor [emulating astronomical long exposure (LE) imaging of a double star in order to catch an instance of good seeing for a turbulence-free snapshot]. For our digital video post-processing, the second pass chooses a sharp short exposure (SE) to be the anchor face (which could be the frontal face for a narrow corridor and a head-on camera setup).

Another phenomenon of facial boxes extracted by face detection techniques is that they come in different sizes and coordinates. A preprocessing of viewing transformation and registration for each facial box is therefore necessary before time-ordering reconstruction. This preprocessing, introduced in Sec. 3, handles the variation of facial boxes by taking the maximum overlapping central region via morphological image processing, focusing on the binarized center of each face box to achieve the scale-invariant peripheral vision.^{24} The ZC algorithm for time-ordered reconstruction of facial poses, for association across different sensor modalities, is also in Sec. 3. In Sec. 4, the impact of leak-through background information in maximum-overlapped face boxes on the neighborhood determination is discussed. In addition, when the product of triple correlations is replaced by a pair-wise correlation, we derive a fast approximation of embedding in a line for facial pose similarity sorting over a short duration. The simulation of ZC is compared step-by-step against simpler similarity sorting results without the iterative re-check necessary for the time-order sorting in Sec. 5. In Sec. 6, several possible applications of the algorithmic framework developed in this paper are briefly introduced before the conclusion.

## 2.

## Theory of Time-Order Reconstruction of Pictures

Understanding time through the simple observation of a clock is the key. One tick-tock of a clock is a record of time, but it may take more than one set of tick-tocks to tell the time-order. This time-order is used to track the flow of time by its changes; that is to say, tick-tock and tick-tock. On the other hand, a static face can be regarded as a collection of pixels (*L*×*M*) with a light intensity at each pixel. This collection of pixels specifies a high-dimensional vector space $\vec x_j \in O(L \times M)$ with respect to a set of *L*×*M* Cartesian coordinates, one per pixel. Given any specific face, $\vec x_j$, there are two nearest-neighbor frames of facial poses, $\vec x_i, \vec x_k$, except for the first and last frames. In other words, we may need to introduce more than one pair correlation $W_{i,j}(\vec x_i, \vec x_j)$ of the *i*'th tick and the *j*'th tock time subscripts; e.g., $W_{i,j}(\vec x_i, \vec x_j)\, W_{l,m}(\vec x_l, \vec x_m)$, where the *l*'th tick must follow the *j*'th tock as the necessary condition, with two more faces $(\vec x_l, \vec x_m)$. To make this restrictive condition explicit, time-order reconstruction is an iterative sorting that finds the maximum chain product of triple image correlations *W*_{i, j, k}*W*_{k, l, m} over the time-ordered facial image space. The pair and triple correlations are defined as

## Eq. 1

[TeX:]
$W_{i,j} = R(\vec x_i, \vec x_j) = \iint \vec x_i \vec x_j\, f_{\vec x_i \vec x_j}(\vec x_i, \vec x_j)\, d\vec x_i\, d\vec x_j \equiv \sum\nolimits_{x=1}^{L}\sum\nolimits_{y=1}^{M} I_{x_i}(x,y)\, I_{x_j}(x,y),$

## Eq. 2

[TeX:]
$W_{i,j,k} \equiv \sum\nolimits_{x=1}^{L}\sum\nolimits_{y=1}^{M} I_{x_i}(x,y)\, I_{x_j}(x,y)\, I_{x_k}(x,y),$

where each facial box is a function of *x*(*t*) at a specific time *t* = 1, …, *n*; $I_{x_i}$, $I_{x_j}$, and $I_{x_k}$ are facial boxes.
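In discrete form, Eqs. 1 and 2 are simply pixel-wise products summed over a registered box. A minimal sketch (our own illustration with toy arrays, not the paper's code):

```python
import numpy as np

def pair_correlation(I_i, I_j):
    """Eq. 1 (discrete form): W_ij = sum over pixels of I_i * I_j."""
    return float(np.sum(I_i * I_j))

def triple_correlation(I_i, I_j, I_k):
    """Eq. 2: W_ijk = pixel-wise product of three registered facial boxes."""
    return float(np.sum(I_i * I_j * I_k))

# Toy 2x2 "facial boxes" (hypothetical data, just to show the bookkeeping).
A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([[1.0, 1.0], [1.0, 1.0]])
C = np.array([[2.0, 0.0], [1.0, 3.0]])

print(pair_correlation(A, B))       # 1 + 2 + 0 + 1 = 4.0
print(triple_correlation(A, B, C))  # 2 + 0 + 0 + 3 = 5.0
```

Both quantities assume the boxes are already registered to a common grid, which is exactly what the binarized centroid preprocessing of Sec. 3 provides.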

## 2.1.

### Theory of Time-Order Reconstruction

One of our working assumptions about time-order is that video cameras are fixed and continuously image moving people, causing changes in facial sizes, norm distances, etc. We observed that the tick-tock of a clock follows another tick-tock. Similarly, the time-order in face images takes a minimum of three time points among frames. Two similar faces becoming smaller, or vice versa bigger, do not imply a correct time-ordering of whether the person is walking closer or further away, unless a third similar face becomes even smaller or, vice versa, even bigger. In other words, in a set of *N* facial image vectors where $\vec x_i \in O(L \times M)$, *i* = 1, …, *N*, a tick-tock $(\vec x_{t_i}, \vec x_{t_j}, \vec x_{t_k})$ is determined by the triple of face image vectors $(\vec x_i, \vec x_j, \vec x_k)$ and $W_{i,j,k}(\vec x_i, \vec x_j, \vec x_k) \ne 0$ in the directional order (*t*_{i} ⩾ *t*_{j} ⩾ *t*_{k}). The time-order in tandem is determined by a tick-tock, $(\vec x_{t_i}, \vec x_{t_j}, \vec x_{t_k})$, with another tick-tock, $(\vec x_{t_k}, \vec x_{t_l}, \vec x_{t_m})$, that satisfies $W_{i,j,k}(\vec x_i, \vec x_j, \vec x_k)\, W_{k,l,m}(\vec x_k, \vec x_l, \vec x_m) \ne 0$. This may lead to a higher-order graph theory that is not addressed herein but in a George Washington University PhD thesis, which requires higher-order correlations to automate the cross-association among different sensor-modality tracks, achieved by time-ordered reconstruction for each sensor.

## 2.2.

### Self-Reference Matched Filter

For simplicity, the square (Euclidean) norm distance, also called the LMS distance, in Eq. 3 represents a measure of similarity (or a criterion for neighborhood determination) between two images,

## Eq. 3

[TeX:]
$\min e = \min ||\vec x_i - \vec x_j||^2,$

where $\vec x_i \in O(L \times M)$ is a vector of image size *L*×*M* consisting of all gray pixel values of the whole image. Although in general the norm distance may not be an accurate metric for similarity (other pertinent features matter, e.g., color of clothes, background of the person), in our applications Eq. 3 seems adequate for image sorting of facial surveillance videos, provided that we apply a preprocessing called binarized centroid registration to each short exposure in a facial box, inspired by the graceful degradation of scaling in the human visual system (HVS) (see the Appendix).

If we explicitly combine the maximum likelihood cost function with the graph-Laplacian diffusion distance, the similarity metric becomes identical to our self-reference matched filter (SRMF) least mean square energy as follows:

## Eq. 4

[TeX:]
$e = \min ||\vec x_D - \vec x_{t_i}||^2 / 2\sigma = -\log[G_{ij}(\vec x_i, \vec x_j)] = -\log\left[\exp\left(-\frac{||\vec x_i - \vec x_j||^2}{2\sigma}\right)\right].$

Coifman has adopted the graph-Laplacian spectral theory $G_{ij}(\vec x_i, \vec x_j) \equiv \exp(-||\vec x_i - \vec x_j||^2 / 2\sigma)$ to sort facial poses as an example of nonlinear dimensionality reduction.^{8, 9} Without creating any confusion for the two readerships familiar with video processing and with nonlinear dimensionality reduction (NDR) graph theory, we associate a snapshot time index $\vec x_{t_i}$ freely with the vertex node $\vec x_i$ of a face image. The minimization process, Min in Eq. 4, is due to the opposite algebraic signs. The first term $\vec x_{t_i}$ represents an individual face image, and $\vec x_D$ is the first-pass average over all faces, the statistically centroid-correct but blurred LE,^{23, 25} defined as:

## Eq. 5

[TeX:]
$\vec x_D \cong \frac{1}{N}\sum\nolimits_{t=1}^{N} \vec x_t.$

An example is illustrated in Fig. 3.
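The two SRMF passes can be sketched as follows (our reading of Eqs. 4 and 5; the function name and toy data are ours): the first pass forms the blurred long exposure of Eq. 5, and the second pass selects the short exposure nearest to it as the anchor face.

```python
import numpy as np

def anchor_face(faces):
    """Two-pass SRMF sketch (our illustration of Eqs. 4-5, not the authors'
    code): pass 1 averages all registered face boxes into the blurred long
    exposure x_D; pass 2 returns the index of the short exposure closest to
    x_D in LMS distance, to serve as the anchor face."""
    faces = np.asarray(faces, dtype=float)        # shape (N, L, M)
    x_D = faces.mean(axis=0)                      # Eq. 5: long exposure
    errs = ((faces - x_D) ** 2).sum(axis=(1, 2))  # Eq. 4 numerator per face
    return int(np.argmin(errs)), x_D

# Hypothetical toy data: noisy copies of one 8x8 "face".
rng = np.random.default_rng(0)
base = rng.random((8, 8))
faces = [base + 0.3 * rng.standard_normal((8, 8)) for _ in range(5)]
idx, x_D = anchor_face(faces)
print("anchor index:", idx)
```

The anchor is thus self-determined by the dataset, with no externally supplied desired output, which is the "unsupervised" aspect emphasized in the text.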

## 3.

## Implementations

In remote video surveillance, facial data acquired from video generally have a lower resolution than photographic biometric resolution, which is typically defined as at least 12 pixels between the eyes.^{26} Our time-order reconstruction algorithm must respect this constraint as well. A static face is regarded as a collection of pixels with a light intensity at each pixel. This collection of pixels also specifies the Cartesian coordinates of a point with respect to a set of axes; therefore, a point can represent a face in an abstract image space.^{4, 5, 27} Facial images of walking passengers can be visualized as lying on a three-dimensional manifold parameterized by two pose angles and changes of resolution, assuming the light intensity is constant over a short duration, perhaps lasting 10 s.

## 3.1.

### Solving Variable Faces by Focusing at the Binarized Centroid for a Graceful Degradation of Facial Scaling

Our goal is to sort the time-order of changing faces among backgrounds. We are motivated by the HVS to derive a morphologically efficient preprocessing technique. It allows us to track the maximum overlapping common foreground faces among changing perspectives and size distances. The outputs of a face detection system on a chip, in Fig. 9, are boxes of different sizes; some boxes contain a lot of background information and some do not. Before comparison between facial boxes, image preprocessing is needed to deal with the different box sizes and any unwanted background information. Mathematically speaking, our approach is equivalent to a video version of a gray-scaled self-reference Wiener matched filter. Physiologically speaking, we have 150 million night-vision rods distributed nonuniformly on polar-exponential grids at the fovea, densely packed in the middle of the fovea and dropping out exponentially in the periphery. The location of each rod is denoted by a vector *r*_{i}. About 100 rods share an integrator called a ganglion, firing through a single neurofiber at the fovea output, denoted as a vector *u*_{j}, to cortex 17. Altogether, millions of neurofibers are closely packed into a uniform output bundle toward the lateral geniculate nucleus.^{24} For example, the input *i*'th pixel of the facial boundary is *r*_{i} = exp(*u*_{j}), which is equivalent to the output *u*_{j} = log(*r*_{i}), achieving a massively parallel flow-through logarithmic scaling transform without explicit computation. In other words, when a face changes its size by a scale factor *s*, $r'_i \equiv r_i s = \exp(u'_i)$, implying $u'_i = \log(r'_i) = \log(r_i s) = \log(r_i) + \log(s) \cong u_j$. For example, a change of size by a factor *s* = 2 gives log_{e}(2) = ln(2) = 0.69, which is a small fraction spread over many pixels. This fan-in polar exponential grid (PEG) architecture^{24} facilitates the logarithmic computation of millions of rods' coordinates in a parallel flow through the PEG. This real-time "algo-tecture" is one of the wonders of the human visual system. This is the reason why a hunter running after prey in the moonlight can gracefully tolerate the rapid changes of the prey, allowing the hunter to integrate continuously for a better signal-to-noise ratio (SNR) over the several hundred photons falling upon a bundle of 100 rods via another fan-in architecture connecting hundreds of rods to a single ganglion toward the lateral geniculate nucleus (LGN) and cortex 17. Incidentally, a rod detects a single photon at SNR = 40, since the wavelength at night is about 1 μm, equivalent to 1 eV, while the body temperature at 37°C is equivalent to 1/40 eV. The spatial spreading of a single 1 μm photon can cover a bundle of 100 rods at 100 nm each.
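The scale tolerance claimed above follows from the algebra of logarithms alone; a short numeric check (our own illustration) confirms that scaling every radius *r* by *s* only shifts the log-domain coordinate *u* by the constant log(*s*), independent of *r*:

```python
import math

def to_log_domain(r):
    """Log-polar radial coordinate u = log(r), as in the PEG argument."""
    return math.log(r)

r, s = 5.0, 2.0
u = to_log_domain(r)
u_scaled = to_log_domain(r * s)
print(u_scaled - u)  # log(2) ~ 0.693, the same for any radius r
```

Because the shift is uniform across the whole grid, a registered template survives a modest zoom almost unchanged, which is the graceful degradation exploited by the binarized centroid registration below.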

To achieve a morphologic registration of the facial invariant centroid, we approximate the aforementioned color hue for face detection by gray-scale intensity thresholding. For example, we choose the Heaviside step function of the gray scale *g* with an arbitrary threshold θ: $u = \mathrm{Heaviside}(g/\theta - 1) = 1$ if $g \ge \theta$, and zero otherwise. Instead of the logarithmic transform of a million coordinates for scale invariance, a real-time morphologic preprocessing is a simple binarized centroid determination at the middle of the unity counts, as shown in Fig. 4.
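The binarized centroid determination can be sketched in a few lines (our illustration; the function name and threshold value are our choices, anticipating the thresholding and centroid steps of Eqs. 7 and 8):

```python
import numpy as np

def binarized_centroid(box, theta):
    """Threshold a gray-scale face box (cf. Eq. 7) and return the centroid
    of the foreground pixels (cf. Eq. 8). `theta` plays the role of the
    gray-level threshold T."""
    g = (np.asarray(box, dtype=float) >= theta).astype(int)  # binary image
    rows, cols = np.nonzero(g)
    if rows.size == 0:
        raise ValueError("threshold leaves no foreground pixels")
    return rows.mean(), cols.mean()

# Toy 4x4 box: a bright 2x2 patch in the lower-right corner.
box = np.zeros((4, 4))
box[2:4, 2:4] = 200
print(binarized_centroid(box, theta=100))  # (2.5, 2.5)
```

Translating each box so this centroid lands on a common reference point is the registration used in the two passes below.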

Binarized centroid registrations: For *N* face boxes (SE) of different sizes, a complete segmentation of a face box *R* is a finite set of regions *R*_{1}, …, *R*_{s},

## Eq. 6

[TeX:]
$R = \cup_{i=1}^{s} R_i, \quad R_i \cap R_j = \phi, \quad i \ne j.$

Thresholding transforms an input image *f* to an output binary image *g* as follows:

## Eq. 7

[TeX:]
$g(i,j) = 1 \quad \text{for} \quad f(i,j) \ge T, \qquad g(i,j) = 0 \quad \text{for} \quad f(i,j) < T,$

where *T* is the threshold, *g*(*i*, *j*) = 1 for the face, and *g*(*i*, *j*) = 0 for the background.

## Eq. 8

[TeX:]
$\overrightarrow{bc}_i = \frac{1}{n}\sum\nolimits_{j,k}^{n} g(j,k), \quad \text{where} \quad g(j,k) = 1,$

## Eq. 9

[TeX:]
$(\overrightarrow{bc}_i - \overrightarrow{gc}_i) = 0, \quad \text{for} \quad i = 1, \ldots, N,$

## Eq. 10

[TeX:]
$\overrightarrow{bc}_i^{\,*} - \overrightarrow{bc}_{LE} = 0, \quad \text{for} \quad i = 1, \ldots, N,$

where $\overrightarrow{bc}_i$ is the binarized centroid of the *i*'th face box (SE), $\overrightarrow{gc}_i$ is the geometric centroid of the *i*'th face box, the asterisk in $\overrightarrow{bc}_i^{\,*}$ indicates the translation from the *i*'th coordinate system to the laboratory coordinate system that covers every face box, and $\overrightarrow{bc}_{LE}$ is the binarized centroid of the long exposure, Eq. 5, determined by Eq. 8. Eq. 9 is the first-pass centroid registration and Eq. 10 is the second-pass centroid registration. The determination of the gray-scale value of *T* is critical in this algorithm; for optimization of the threshold, different strategies have been developed for different scenarios.^{28} The robustness of the binarized centroid of facial boxes is discussed in the Appendix.

## 3.2.

### Binarized Centroid Registration Algorithm

*Step 1:* Find the binarized centroid of every face and translate it to the geometric centroid of each face box.

*Step 2:* Translate every face box to the laboratory coordinate system.

*Step 3:* Register every face box to the centroid of the long exposure from Eq. 5.

*Step 4:* Remove unwanted background with a rectangular template.

In the two-pass binarized centroid registrations, we use a property that benchmarks face detection techniques: the eye(s), nose, and mouth have to be inside the facial boxes if they appear in the video; otherwise, the detection result is incorrect. When we developed and tested our algorithms, we assumed that every cut-out facial box was correct; in other words, every facial box should contain eye(s), a nose, and a mouth. This also implies that, no matter what size the facial boxes are, they partially overlap in content, the area of the eye(s), nose, and mouth, even when they differ in size and resolution. Under this condition, in most facial poses the eyes, nose, and mouth are always near the face centroid within the facial boxes, except for the leftmost- and rightmost-heading poses.

## 3.3.

### ZC Algorithm

The theory of time-order from Sec. 2 is implemented herein in an iterative search fashion. The video sequence captured by the camera satisfies the following inequality, Eq. 11, in an abstract manifold:

## Eq. 11

[TeX:]
$\left\| O_t - O_{t \pm 1} \right\| < \left\| O_t - O_{t \pm s} \right\|, \quad t > s > 1 \ \text{and} \ t, s \in \mathbb{Z},$

where *O*_{t} is a node representing a facial box, and the subscripts *t* and *s* are the time indices of the time-ordering of nodes on the manifold. The ∥·∥ denotes the geodesic distance between nodes of the original time-ordering on the manifold. If Eq. 11 holds, it implies that a trajectory of time-ordering (of a passenger) cannot intersect itself or the time-ordering trajectories of other passengers on the manifold. It also indicates that finding the time-order is equivalent to finding the geodesic order with the lowest total neighborhood geodesic distance. We plot the geodesic distance (approximated by the LMS distance) as a contour map over the *N*^{2}-dimensional surface in Fig. 5.
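As a sanity check, the inequality of Eq. 11 can be exercised numerically. The sketch below is illustrative only: synthetic vectors with a strong temporal drift stand in for facial boxes, and the Euclidean norm stands in for the LMS distance; none of these stand-ins come from the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for facial boxes: N frames of D "pixels" that drift
# smoothly in time, so temporal neighbors stay closest in pixel space.
N, D = 20, 64
drift = np.cumsum(rng.normal(0.0, 0.1, (N, D)), axis=0)
frames = drift + np.linspace(0.0, 10.0, N)[:, None]

def lms(a, b):
    """Euclidean stand-in for the LMS distance between two boxes."""
    return np.linalg.norm(a - b)

# Forward check of Eq. 11: ||O_t - O_{t+1}|| < ||O_t - O_{t+s}|| for s > 1.
holds = all(
    lms(frames[t], frames[t + 1]) < lms(frames[t], frames[t + s])
    for t in range(N - 1)
    for s in range(2, N - t)
)
print("Eq. 11 holds on this trajectory:", holds)
```

On a smooth trajectory the inequality holds everywhere, which is exactly the condition under which the time-order is recoverable from neighborhood distances alone.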

In the implementation, we adopt the LMS distance between facial boxes to approximate the geodesic distance on the manifold. Admittedly, this approximation may occasionally be skewed: if a video sequence does not satisfy Eq. 11 at all times, some frames may end up in an incorrect time-order. The impact of incorrect alignment, of partial frames in the sequence, and of incorrect time-order is discussed in Sec. 7. The meta-algorithm is stated as follows:

## 3.4.

### Steps of the ZC Algorithm

*Step 1:* Apply the two-pass binarized centroid registrations introduced in Sec. 3.1 to approximate the area of the eye(s), nose, and mouth in every face box. The approximation result has to satisfy Eq. 11 to ensure that the result of the ZC algorithm is correct.

*Step 2:* Determine the anchor pose: The long exposure is calculated by Eq. 5. The anchor pose is the nearest neighbor of the long exposure, selected from the input facial boxes.

*Step 3:* Construct a two-nearest-neighbor matrix over the nodes, and build an adjacency graph from that matrix.

*Step 4:* Determine the weights of the connections in the adjacency graph. The weight of a nearest-neighbor connection is 3 and the weight of a next-nearest-neighbor connection is 1. Every node is expected to connect to exactly two neighbors, except the head and tail nodes, which are not known in advance. When a node has more than two relations with other nodes, the two connections with the highest weights are selected.
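Steps 3 and 4 above can be sketched in a few lines. This is not the authors' MATLAB implementation; it is a minimal Python illustration in which synthetic drifting vectors stand in for registered facial boxes, the Euclidean norm stands in for the LMS distance, and a largest-summed-distance heuristic (our assumption, since head/tail detection is left open above) picks an endpoint to start the chain walk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for registered facial boxes: a 32-"pixel" pattern
# drifting steadily in time, delivered in scrambled order.
N = 8
frames = np.linspace(0.0, 8.0, N)[:, None] + rng.normal(0.0, 0.3, (N, 32))
shuffled = rng.permutation(N)
boxes = frames[shuffled]

D = np.linalg.norm(boxes[:, None, :] - boxes[None, :, :], axis=2)
np.fill_diagonal(D, np.inf)

# Steps 3 and 4: each node votes weight 3 for its nearest neighbor and 1
# for its next nearest; mutual votes are summed, so true temporal
# neighbors accumulate the heaviest links.
W = np.zeros((N, N))
for i in range(N):
    nn1, nn2 = np.argsort(D[i])[:2]
    W[i, nn1] += 3
    W[i, nn2] += 1
W = W + W.T

# Walk the zipper chain from an endpoint (largest summed distance is our
# assumed head/tail heuristic), always taking the heaviest unvisited link.
dist_sum = np.where(np.isinf(D), 0.0, D).sum(axis=1)
chain = [int(np.argmax(dist_sum))]
while len(chain) < N:
    w = W[chain[-1]].copy()
    w[chain] = -1.0                   # mask visited nodes
    chain.append(int(np.argmax(w)))

recovered = [int(shuffled[c]) for c in chain]
print(recovered)   # the original time-order, up to reversal
```

The reversal ambiguity is inherent: neighborhood distances alone cannot distinguish a sequence from its time-reverse, which is why the paper treats the arrow of time separately (Appendix A).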

## 4.

## Discussion

In this section, two phenomena are discussed. The first is the impact of leaked-through background information in face boxes on the neighborhood determination. The second is the consequence of replacing the iterative two-neighborhood comparison step of the ZC algorithm with a direct neighborhood determination by LMS distance.

## 4.1.

### Impact of Background Information of Face Boxes for Neighborhood Determination

Many spectral clustering methods^{4, 5, 6, 7, 8, 9, 27, 29} have been developed for sorting facial poses captured at a fixed distance against clean backgrounds. As mentioned in Sec. 2, the norm distance or LMS distance has been used for the neighborhood (or similarity) determination of facial poses in the adjacency matrix.^{4, 5, 6, 7, 8, 9, 16, 27, 29} However, in video surveillance, as passengers move, their facial resolutions change over time with respect to a fixed video camera. Moreover, the backgrounds of the face boxes are not clean. In this case, the pixel-by-pixel LMS distance cannot determine the correct neighborhoods because of the leaked-through background information.

Under-sampled real-world surveillance images of faces, shown as 14 facial poses in a 4-s window, appear in Fig. 6. The LMS distance may not be sufficient to determine the similarity between faces when the background changes drastically from one face box to the next, as in Fig. 6. However, with the binarized centroid registration applied as preprocessing before sorting, the result became much more robust, tolerating a factor of 2 in size over about 10 s of walking time.

## 4.2.

### Consequence of Replacing the Product of Two Triplet Correlation Checks in ZC Algorithm

As mentioned in Sec. 2, we assumed that the time-ordering is a product of triple image correlations *W*_{i, j, k}*W*_{k, l, m} over the time-ordered facial image space. In the implementation of time-ordering reconstruction, a product of triple image correlations is equivalent to the iterative two-neighborhood comparisons of the ZC algorithm presented in Sec. 3. What is the consequence of replacing the product of triple image correlations with a pairwise correlation? In other words, what is the sorting result if we replace the iterative two-neighborhood comparisons of the ZC algorithm with a neighborhood determination by LMS distance? The answer is a fast, in-line approximation of the embedding for facial-pose similarity sorting over a short duration of several seconds. This fast similarity sorting algorithm is stated as follows:

## 4.3.

### Steps of the Similarity Sorting Algorithm

*Step 1:* Apply the two-pass binarized centroid registrations introduced in Sec. 3.1 to approximate the area of the eye(s), nose, and mouth in every face box.

*Step 2:* Determine the anchor pose: The long exposure is calculated by Eq. 5. The anchor pose is the nearest neighbor of the long exposure, selected from the input facial boxes.

*Step 3:* Determine the neighborhoods by the LMS (or norm) distance. The connectivity between the anchor pose and the other poses is determined by the LMS (or norm) distance. Other choices of distance, such as a face-feature distance, can be considered if the LMS distance is inadequate to represent the similarity. The result is a fast, in-line approximation of the embedding for similarity sorting over a short duration of several seconds.
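The similarity sorting steps above can be sketched as follows, under stated assumptions: synthetic pose vectors replace real face boxes, a plain mean replaces the long-exposure computation of Eq. 5, and the Euclidean norm replaces the LMS distance.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical pose vectors: a face turning smoothly from left to right,
# flattened to 16 "pixels" per box; the input arrives unordered.
poses = np.linspace(-1.0, 1.0, 9)
faces = poses[:, None] * np.ones(16) + rng.normal(0.0, 0.05, (9, 16))
order = rng.permutation(9)
boxes = faces[order]

# Step 2 stand-in: the "long exposure" of Eq. 5 is approximated here by a
# plain mean over the boxes; the anchor pose is its nearest input box.
long_exposure = boxes.mean(axis=0)
anchor = int(np.argmin(np.linalg.norm(boxes - long_exposure, axis=1)))

# Step 3: rank every box by its LMS (Euclidean) distance to the anchor.
ranking = np.argsort(np.linalg.norm(boxes - boxes[anchor], axis=1))
print([int(order[i]) for i in ranking])   # most anchor-like poses first
```

Because a pairwise distance to a single anchor replaces the tandem triplet check, the output is a similarity ranking rather than a full time-order, matching the discussion in Sec. 4.2.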

## 5.

## Simulations

The ZC algorithm was coded in MATLAB on a PC with an Intel i7-920 processor and 12 GB of memory. The time for loading facial images is excluded from the execution time. The first simulation is a similarity sorting test of 60 facial poses of one person presented in different orders. The face boxes vary in size and resolution over time. This simulation demonstrates the robustness of our similarity sorting algorithm: no matter in what order the facial poses arrive, the sorting result is the same, as illustrated in Fig. 7. It is a robust, real-time sorting approximation for moving objects in a short time window, based on the two-pass centroid registrations and the SRMF proposed in this paper.

For a simulation of time-ordered reconstruction, the detailed steps are illustrated in Fig. 8. Seven facial boxes are given in a random order. Following the zipper chain algorithm introduced in Sec. 3, the norm distance matrix is built from the input facial boxes. After that, the two-nearest-neighbor matrix is constructed, as shown in Fig. 8. The adjacency graph is determined by the two-nearest-neighbor matrix, and the result is also shown in Fig. 8.

## 6.

## Applications

In this section, we briefly consider several possible video surveillance applications for the algorithmic framework developed in this paper.

## 6.1.

### Video Organization Principle

Electro-optical/infrared video for persistent surveillance, with the help of a sleep-wake acoustic/seismic trigger, can run 24 hours per day for about 7 days. The video database captured by several surveillance cameras can easily exceed petabytes of data, and inspectors must spend a great deal of time sorting useful information out of such an enormous database. An efficient strategy for reducing video database storage, together with an efficient algorithm for sorting the content of the video database, has been the theme of this paper.

In general, surveillance looks for the most-wanted passengers at traveler transit locations. The facial boxes of all passengers are extracted in parallel per frame, reducing the video storage from megabytes per frame to kilobytes of facial boxes, as shown in Fig. 9.^{30} In addition, an automatic selection of frontal views for each person can save another 3 orders of magnitude, e.g., for walking through an inspection corridor with a 10-min transit time per person, as shown in Table 1.

## 6.2.

### Aided Target Recognition for Video Surveillance with Mug Shots

In an effort to strengthen national security, spotting people of interest in airport surveillance is critical. Currently, airport surveillance relies on security personnel to tediously match a wanted person in a crowd against a mug shot. An automated video surveillance and recognition system is desirable to boost efficiency. However, passengers' facial biometric data captured on video can differ from mug shots in resolution and pose, and may even be disguised.^{14, 26, 31, 32} This is why, to date, no automated video surveillance system has been adopted for airport surveillance; a man in the loop is still necessary. Thus, a database-cueing aided-target recognition (AiTR) video surveillance system is recommended to aid security personnel for better recognition performance.^{33}

The fundamental concept is to maximize the overlapping on-target pixels between faces and mug shots at matched poses, so that inspectors can be prompted for a closer look at a specific passenger. To achieve this goal, the AiTR system has to detect and sort passengers' poses in real time through surveillance cameras in a corridor, as illustrated in Fig. 9. The robust, real-time SRMF algorithm proposed in this paper is well suited to such an AiTR surveillance system (Fig. 9).

## 6.3.

### Association Across Sensory Modalities

The cross-track association of archival video and audio becomes challenging once each modality has been compressed by a different algorithm under current technologies. Without a basic spatiotemporal-order reconstruction, close-up lip reading and gesture motion cannot unambiguously associate a remotely imaged passenger with an *in situ* audio recording from local microphones. The facial boxes of a crowd can be detected efficiently, in random order, by the face detection system-on-chip mentioned in this paper. The next step is to sort those facial boxes into a time-ordering, so that lips, gestures, and other body language can be read in the right order. For example, in Fig. 10, "Where is the exit?" and "I am tired." are recorded by dual microphones together with airport music and background noise. In Fig. 10, the two speeches are separated, e.g., by a fast independent component analysis (ICA) algorithm based on their different positive kurtosis values. A passenger who is looking for an exit is more likely to look around, turning to face right and left, than to look down at the floor. The time-ordering sorting algorithm proposed in this paper is therefore useful for the cross-track association of video and audio.
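The kurtosis-based separation mentioned above can be illustrated on synthetic data. This is not the fast ICA algorithm used in the paper; it is a toy two-channel sketch in which Laplacian noise stands in for the two speech tracks and a hypothetical 2×2 matrix stands in for the microphone mixing, with the post-whitening rotation found by a brute-force scan for maximum total kurtosis.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two super-Gaussian (positive-kurtosis) stand-ins for the speech tracks,
# mixed by an assumed 2x2 matrix standing in for the dual microphones.
n = 20000
S = rng.laplace(size=(2, n))
A = np.array([[0.8, 0.6], [0.4, 0.9]])
X = A @ S                                          # mixed recordings

# Whiten the mixtures to zero mean and identity covariance.
X = X - X.mean(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh(np.cov(X))
Z = (vecs * vals ** -0.5).T @ X

def excess_kurtosis(y):
    """Fourth-moment measure; positive for super-Gaussian signals."""
    y = y - y.mean()
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# After whitening, only a rotation ambiguity remains; scan for the angle
# that maximizes the total positive kurtosis of the two outputs.
best_theta = max(
    np.linspace(0.0, np.pi / 2, 180, endpoint=False),
    key=lambda t: sum(excess_kurtosis(y) for y in rotation(t) @ Z),
)
Y = rotation(best_theta) @ Z                       # sources up to order/sign
```

The recovered rows of `Y` correlate strongly with the original sources, up to the permutation and sign ambiguities inherent to ICA.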

## 7.

## Conclusion

In this paper, we have presented an efficient time-order reconstruction algorithm that sorts facial boxes in order to associate remote video with local audio, answering the surveillance challenge of "who speaks what, where, and when." We show that it is crucial to begin with a morphological image preprocessing step, the binarized centroid registration, in order to find a maximum overlapping common region among facial boxes of different sizes and coordinates. A ZC algorithm is then developed by considering products of pair correlation functions in tandem verification. Consequently, the face boxes are linked together in a time-order, as if the zipper's teeth were mounted overlapping one another. This ZC algorithm reconstructs the time-order linking the video manifolds of walking facial boxes and the audio manifolds of *in situ* voice recorders. Independent of the camera setup configuration, the video sampling rate is an important parameter: our experience shows 20 to 30 Hz to be sufficient for an average walking pace of 3 to 6 miles per hour. However, if a passenger moves faster than this walking limit, the camera sampling rate must be increased to reduce image blur. We combine the anchor face, discovered automatically by means of the self-reference matched filter, with the time-order ZC sorting capability, and we thereby hope to find the time-order of each person so as to associate each with his or her own voice recording. In general, the average speaking speed is 3 to 4 words per second. In the movie industry, a technician can sync the voice recording by lip reading when close-up facial detail is available; remote video, however, may lack such detail. In that situation, our goal is to identify the corresponding speakers from the recorded audio.

The time-order reconstruction of facial poses belongs to the class of nondeterministic polynomial-time (NP) complete problems; the proof is in Appendix B. Given *N* image poses, the ordering index runs from 1 to *N*, so there are *N*! possible assignments of the ordering index to the poses. We plot the LMS distance as a contour map over the *N*^{2}-dimensional surface in Sec. 3.3. The time-order reconstruction is equivalent to finding the geodesic order that has the lowest total neighborhood distance on an abstract manifold. The ZC algorithm is a heuristic solution to this NP complete problem. Equation 11 is the constraint on the solution: each time-ordering trajectory on the abstract manifold is prohibited from intersecting itself or other time-ordering trajectories.
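The *N*! search space and its geodesic-minimum characterization can be verified by brute force for a small *N* (feasible only for toy sizes, which is exactly the NP-hardness point). Synthetic drifting vectors again stand in for facial poses; this is an illustration, not the ZC heuristic itself.

```python
import itertools

import numpy as np

rng = np.random.default_rng(4)

# Six synthetic poses drifting along a line in "pixel" space; the true
# time-order is scrambled, then recovered by exhaustive search.
N = 6
frames = np.linspace(0.0, 6.0, N)[:, None] + rng.normal(0.0, 0.2, (N, 12))
perm = rng.permutation(N)
boxes = frames[perm]

def path_cost(order):
    """Total distance between successive boxes under a candidate order."""
    return sum(np.linalg.norm(boxes[a] - boxes[b])
               for a, b in zip(order, order[1:]))

# N! candidate orderings; only feasible for toy N, which is the point.
best = min(itertools.permutations(range(N)), key=path_cost)
recovered = [int(perm[i]) for i in best]
print(recovered)   # the true time-order, up to reversal
```

The ZC algorithm avoids this factorial search by committing to the two heaviest-weighted neighbors per node, trading exhaustive optimality for linear-time chaining.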

## Appendices

### Appendix A: Real World Challenges of Cross-Track Association: Time-Ordered Reconstruction

Cross-sensory track association is an unsolved Empire 2010 sensor challenge problem proposed by the Office of Naval Research. A real-world field test is much more difficult than a laboratory test: in the real world, many more uncontrollable factors may affect the results. For example, on a rainy day an outdoor video may be unclear, and on a sunny day passengers in the video may carry heavy shadows. In addition, passengers may be occluded by one another or by obstacles such as trees, pillars, and cars. To overcome these uncontrollable environmental factors, we take advantage of the strategy adopted in this paper: cutting out facial boxes and removing redundant background information can save up to 6 orders of magnitude in data storage (see Table 1), and adding cameras increases the coverage density. The problem of facial continuity under occlusion and unpredictable environmental factors may thus be mitigated by an increase in sensor density.

It is known that our HVS gracefully degrades the scale of faces by focusing on the centroid of the face, say the nose and/or eyes. This is accomplished by taking a log transform along the radial direction from the facial center, so that the peripheral boundary sizes are less sensitive to our eyes. Thus, if we wish to ignore varying facial sizes, we take, for efficiency reasons, a binary truncation of the gray-scale image (equivalent to zero at the threshold, since the log of unity vanishes, and to one when the gray value is above the threshold). If the centroid of a face is located to within 1 or 2 pixels, the peripheries of the face become less significant, as demonstrated in Fig. 11, where binarized centroid sensitivity tests of different faces in different poses and sizes are given for different thresholds. In the results, the binarized centroids of the facial boxes degrade gracefully as the threshold changes between 60 and 140 over a 256-level dynamic range. In our analysis, this is because the face covers at least 50% of each face box, so that the binarized centroid is insensitive to threshold changes. In general, most correct face boxes from FD systems-on-chip also satisfy this condition, with very few exceptions.
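The threshold-insensitivity argument can be reproduced on a synthetic face box. The elliptical "face" below is an assumption for illustration, not data from Fig. 11; the point is that the centroid of the binarized mask barely moves as the threshold sweeps 60 to 140.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic "face box": a bright elliptical blob (an assumed face) over a
# dark noisy background, standing in for a gray-scale detector crop.
H = W = 64
y, x = np.mgrid[0:H, 0:W]
face = 200.0 * np.exp(-(((y - 30) / 18.0) ** 2 + ((x - 34) / 14.0) ** 2))
img = np.clip(face + rng.normal(0.0, 10.0, (H, W)), 0.0, 255.0)

def binarized_centroid(image, threshold):
    """Centroid (row, col) of the pixels above the binarization threshold."""
    ys, xs = np.nonzero(image > threshold)
    return ys.mean(), xs.mean()

# Sweep the threshold range reported above (60 to 140 out of 256): the
# centroid should move by only a pixel or two.
centroids = np.array([binarized_centroid(img, t) for t in range(60, 141, 20)])
spread = np.ptp(centroids, axis=0)      # max - min per coordinate
print(spread)
```

The small spread mirrors the graceful degradation reported in Fig. 11: as long as the face dominates the box, the binarized centroid is a stable registration point.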

In our ZC algorithm, the product of triple correlations sharing a common middle node, i.e., *W*_{i, j, k}*W*_{k, l, m}, is equivalent to products of pair correlations in a double verification, which saves computation cost.

Time difference, time-ordering, and the arrow of time are different concepts. Time-ordering is the change of the time difference, while the arrow of time is the flow of the time difference. In this paper, we have presented a theory and a heuristic implementation of time-ordered image reconstruction. Understanding the arrow of time may be achieved by calculating the Boltzmann entropy of the time-ordered images, with the direction of flow given by the increase of entropy.

We have re-established the time-ordering, in an airport persistent surveillance case, of a random collection of facial poses of arbitrary sizes, after they were cut out automatically by a video system-on-chip using facial color hue. Given multiple cameras in the transit corridors, we have demonstrated that we can determine their cross-sensor-modality correlation by re-establishing the time-order of each sensor. However, for Navy large-area marine harbor surveillance and Department of Homeland Security open-border surveillance, the challenge remains to be formulated and tackled.

### Appendix B: Geodesic Neighborhood Re-Ordering is a NP Complete Problem

**Lemma:**

Geodesic spatiotemporal neighborhood re-ordering of *N* projected poses of a moving person is an NP complete problem.

There are combinatorial possibilities in the re-ordering of *N* poses: the first place has *N* choices, the second place has *N* − 1, and so on. The total number explodes as *N* increases:

## Eq. 12

[TeX:]
$\mathrm{total\ number\ of\ orderings} = N(N-1)(N-2)\cdots 1 / 2 = N!/2.$

**Theorem:** The geodesic neighborhood ordering of the video projections of a moving person in space and time must be at the absolute minimum of the total difference distances between all successive poses.

*Proof:* By induction

Given *N* = 3 poses, assume the correct spatiotemporal neighboring order is A, B, C; then

## Eq. 13

[TeX:]
$|\mathrm{A} - \mathrm{B}|^2 + |\mathrm{B} - \mathrm{C}|^2 \le |\mathrm{A} - \mathrm{C}|^2 + |\mathrm{C} - \mathrm{B}|^2.$

If the inequality is satisfied, then the correct neighborhood order is A, B, C.

Assuming the claim holds for *N*, it is straightforward to show that it holds for *N* + 1.

[TeX:] $\hfill{\rm Q.E.D.}$ $\mathrm{Q}.\mathrm{E}.\mathrm{D}.$
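The *N* = 3 base case of the proof can be checked mechanically: for three collinear stand-in poses (our own toy values, not data from the paper), only the true order A, B, C or its reverse minimizes the total squared successive distance of Eq. 13.

```python
import itertools

import numpy as np

# Three collinear stand-in poses for the N = 3 base case of the theorem.
poses = {"A": np.array([0.0]), "B": np.array([1.0]), "C": np.array([2.0])}

def cost(order):
    """Sum of squared successive distances for a candidate ordering."""
    p, q, r = (poses[k] for k in order)
    return float(np.sum((p - q) ** 2) + np.sum((q - r) ** 2))

costs = {"".join(o): cost(o) for o in itertools.permutations("ABC")}
print(min(costs, key=costs.get))   # "ABC" (tied with its reverse "CBA")
```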

## Acknowledgments

We thank Mr. Louis Larsen and Mr. Jeffrey Jenkins at NVESD and Professor Harrington, Professor Carroll, and Professor Doroslovacki at ECE Department in GWU for useful discussions.

## References

"A sampling theorem for wavelet subspaces," IEEE Trans. Inf. Theory, 38(2), 881–884 (1992). https://doi.org/10.1109/18.119745

"Compressed sensing," IEEE Trans. Inf. Theory, 52(4), 1289–1306 (2006). https://doi.org/10.1109/TIT.2006.871582

"Mathematics of adaptive wavelet transforms: relating continuous with discrete transforms," Opt. Eng., 33(7), 2111–2124 (1994). https://doi.org/10.1117/12.173205

"A global geometric framework for nonlinear dimensionality reduction," Science, 290, 2319–2323 (2000). https://doi.org/10.1126/science.290.5500.2319

"Nonlinear dimensionality reduction by locally linear embedding," Science, 290, 2323–2326 (2000). https://doi.org/10.1126/science.290.5500.2323

"Laplacian eigenmaps for dimensionality reduction and data representation," Neural Comput., 15(6), 1373–1396 (2003). https://doi.org/10.1162/089976603321780317

"A tutorial on spectral clustering," Stat. Comput., 17(4) (2007). https://doi.org/10.1007/s11222-007-9033-z

"Diffusion maps," Appl. Comput. Harmon. Anal., 21, 5–30 (2006). https://doi.org/10.1016/j.acha.2006.04.006

"Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps," Proc. Natl. Acad. Sci. U.S.A., 102(21), 7426–7431 (2005). https://doi.org/10.1073/pnas.0500334102

"On lines and planes of closest fit to systems of points in space," Philos. Mag., 2(6), 559–572 (1901). https://doi.org/10.1080/14786440109462720

"Neighborhood preserving embedding," 1208–1213 (2005). https://doi.org/10.1109/ICCV.2005.167

"Discriminant analysis with tensor representation," Proc. Int. Conf. Computer Vision and Pattern Recognition, 1, 526–532 (2005). https://doi.org/10.1109/CVPR.2005.131

"Neighborhood discriminant embedding in face recognition," Opt. Eng., 49(7), 077203 (2010). https://doi.org/10.1117/1.3465582

"Graph embedding and extensions: a general framework for dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., 29, 40–51 (2007). https://doi.org/10.1109/TPAMI.2007.250598

"Asymmetric GT of social networks," Proc. SPIE, 7703, 770304 (2010). https://doi.org/10.1117/12.853682

"A survey of recent advances in face detection," (2010). http://www.research.microsoft.com

"Sex-specific correlates of walking speed in a wide age-ranged population," J. Gerontol. B Psychol. Sci. Soc. Sci., 65B(2), 174–184 (2010). https://doi.org/10.1093/geronb/gbp130

"Modeling of pedestrian behavior and its applications to spatial evaluation," Proc. of AAMAS, 2 (2004). https://doi.org/10.1109/AAMAS.2004.10119

"Self-reference spatio-temporal image-restoration technique," J. Opt. Soc. Am., 72, 1666–1669 (1982). https://doi.org/10.1364/JOSA.72.001666

"Adaptive invariant novelty filters," Proc. IEEE, 74, 518–520 (1986). https://doi.org/10.1109/PROC.1986.13494

"Local instances of good seeing," Opt. Commun., 35, 317–322 (1980). https://doi.org/10.1016/0030-4018(80)90042-5

"Video-based framework for face recognition in video," 330–338 (2005).

"The manifold ways of perception," Science, 290, 2268–2269 (2000). https://doi.org/10.1126/science.290.5500.2268

"Airport anti-terror systems flub tests; face-recognition technology fails to flag 'suspects'," USA Today (2003). http://www.usatoday.com/usatonline/20030902/5460651s.htm

"Video-based face recognition using probabilistic appearance manifolds," 313–320 (2003).

"Video surveillance of passengers with mug shots," Proc. SPIE, 7703, 770305 (2010). https://doi.org/10.1117/12.850546

## Biography

**Ming Kai Hsu** received his MS degree in computer engineering from Tatung University, Taiwan, in 1998 and his MS degree in electrical engineering from George Washington University (GWU), Washington DC, in 2003. He is a PhD candidate in electrical engineering at GWU (final stage of dissertation defense). Besides his academic research, he has rich industrial and information technology working experience. In 2001, he worked in the IT industry as a research engineer in Taiwan. From 2003 to 2005 he was a technical assistant and information specialist in the SEAS computing facility at GWU. From 2005 to 2007, he was a research assistant at GWU. He has been involved in Internet web delivery systems, FPGA firmware, systems on chips, and smart sensor-on-chip processing. He has worked on neural networks, unsupervised learning, chaos, fuzzy logic, security codecs, and moving-platform video imaging. His current focus and interests are multiband video image processing for biomedical wellness and surveillance applications.

**Ting N. Lee** is a professor in the Department of Electrical and Computer Engineering at George Washington University (GWU). His main research interests are in systems theory and network synthesis. Recent projects have centered on optimal distributed network theory and neural network theory, digital, and analog.

**Harold Szu** received his PhD in theoretical physics in 1971 from the Rockefeller University, New York. He worked at the Naval Research Lab (1977 to 1990) and was the leader of the Naval Surface Warfare Center at White Oak (1990 to 1996), followed by his relocation to Dahlgren, Virginia (1997 to 2008). He is now a senior scientist at the Army Night Vision and Electronic Sensors Directorate, located in Ft. Belvoir, Virginia. He is a founder, former president, and a current governor of the International Neural Network Society (INNS), receiving the INNS Dennis Gabor Award in 1997 and the Eduardo R. Caianiello Award in 1999 from the Italian Academy. Recently, SPIE awarded him the Nanoengineering Award and the Biomedical Wellness Engineering Award. He is a fellow of the American Institute for Medical and Biological Engineering (2004), a fellow of the IEEE (1997), a foreign academician of the Russian Academy of Nonlinear Sciences (1999), a fellow of the Optical Society of America (1996), a fellow of SPIE, the International Society for Optical Engineering (1995), and a fellow of the INNS (2010).