An algorithm has been developed for the automatic identification of human faces. Because the algorithm uses facial features restricted to the nose and eye regions of the face, it is robust to variations in facial expression, hair style and the surrounding environment. The algorithm uses coarse to fine processing to estimate the location of a small set of key facial features. Based on the hypothesized locations of the facial features, the identification module searches the database for the identity of the unknown face. The identification is made by matching pursuit filters. Matching pursuit filters have the advantage that they can be designed to find the differences between facial features needed to identify unknown individuals. The algorithm is demonstrated on a database of 172 individuals.
In this paper we describe experiments using eigenfaces for recognition and interactive search in the FERET face database. A recognition accuracy of 99.35% is obtained using frontal views of 155 individuals. This figure is consistent with the 95% recognition rate obtained previously on a much larger database of 7,562 `mugshots' of approximately 3,000 individuals, consisting of a mix of all age and ethnic groups. We also demonstrate that we can automatically determine head pose without significantly lowering recognition accuracy; this is accomplished by use of a view-based multiple-observer eigenspace technique. In addition, a modular eigenspace description is used which incorporates salient facial features such as the eyes, nose and mouth, in an eigenfeature layer. This modular representation yields slightly higher recognition rates as well as a more robust framework for face recognition. In addition, a robust and automatic feature detection technique using eigentemplates is demonstrated.
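The core eigenface matching step can be sketched in a few lines. This is a toy illustration (random 16-pixel "faces" and invented identities, not the FERET pipeline described above): eigenfaces are the principal components of the centered training images, and recognition is nearest-neighbor matching in the projected space.

```python
import numpy as np

def eigenface_recognizer(train, labels, k):
    """Build a toy eigenface matcher from flattened face vectors.

    train: (n_images, n_pixels) array; labels: identity per row.
    Returns a function mapping a probe vector to the nearest identity.
    """
    mean = train.mean(axis=0)
    centered = train - mean
    # Principal components of the training set are the "eigenfaces".
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                      # top-k eigenfaces
    coords = centered @ basis.T         # training-set projections

    def identify(probe):
        w = (probe - mean) @ basis.T    # project probe into face space
        dists = np.linalg.norm(coords - w, axis=1)
        return labels[int(np.argmin(dists))]

    return identify

# Toy usage: three "identities" with 2 images each, 16-pixel faces.
rng = np.random.default_rng(0)
faces = np.repeat(rng.normal(size=(3, 16)), 2, axis=0)
faces += 0.05 * rng.normal(size=faces.shape)
ids = ["A", "A", "B", "B", "C", "C"]
match = eigenface_recognizer(faces, ids, k=3)
probe = faces[2] + 0.05 * rng.normal(size=16)
print(match(probe))
```

The view-based and modular (eigenfeature) variants in the abstract follow the same pattern, with one such projection per pose or per facial feature.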
A face recognition system has been developed and demonstrated at the Rutgers University Center for Computer Aids for Industrial Productivity. The system uses a preliminary data reduction step, gray scale projection, and a fast transform technique to greatly reduce the computational complexity of the problem and, consequently, the cost of high-speed implementation. The decision function is a new, extremely cost-effective neural network, the Mammone/Sankar Neural Tree Network. This paper examines the use of gray scale projection in detail, and demonstrates the use of 1D signal processing techniques in 2D imaging applications. Results are presented showing immunity to changes in expression and small rotations about the vertical axis.
A method of face recognition based on second-order statistics (SOS) is proposed, and an optimal size for the SOS feature space is determined experimentally. Experimental results show that a correct recognition rate of 97.1% is obtained for 91 subjects. The frontal face recognition system is then compared with a profile recognition system based on the P-Fourier descriptor. For input patterns without rotations or inclinations, the frontal face recognition system achieved a higher recognition rate than the profile system. However, for input images with tilts, the profile recognition system achieved higher accuracy than the frontal face system.
The NRC laboratories have developed a laser scanning technique to digitize shapes and colors in registration. The technique, known as synchronized scanning, is capable of digitizing topography as small as the relief of a bare finger tip, showing a clear picture of the skin structure (essentially a clean fingerprint without distortion), as well as the shape and size of body components such as hands, face, and feet, and the full body of one or more subjects simultaneously. The laser scanner uses a RGB laser, coupled to an optical fiber, which is projected in the field of view. The 3D color measurements are made by optical triangulation to a resolution of 10 micrometers for finger tip scans and a resolution of 1 mm for whole body scans. Experimental results are presented and discussed. Potential applications of this technology in the field of identification and inspection of humans include face recognition, finger, foot and teeth print identification, and 3D mugshots that can be rapidly broadcast through satellite communication. One of the unique properties of this technology is that absolute measurements, not only appearance and relative position of features, can be used for identification purposes.
This paper proposes a new technique for the identification of face images. The basic idea is that the frontal face images of a person are considered as samples drawn from multiple classes, each class corresponding to the face images of one head orientation. Therefore, for each person, we can take his frontal facial images from a number of head orientations as training data, based on which an algebraic feature extractor and a classifier can be built for this person. The problems of feature extraction, classifier design, face verification, and recognition are discussed in this paper. Experimental results are also provided.
A facial-imaging system to verify a person's supplied identity as part of a secure access control system is outlined. Classical image processing techniques transform the live-scan image to a standard position, scale, and lighting level. Two neural network classifiers, trained in a previous enrollment session, make the access decision. One neural net classifies the grayscale image directly. The other network uses as features the live-scan image's projection onto a general face-space similar to the approach of Turk and Pentland. This paper develops a method to generate additional dimensions, peculiar to the enrolled user, to augment the general face-space. This enhanced face-space enables the network to verify a specific person. A system with 16 enrolled users was attacked by 40 imposters with a false acceptance rate of 0.2%.
A new system is presented for text-dependent speaker verification. The system uses data fusion concepts to combine the results of distortion-based and discriminant-based classifiers. Hence, both intraspeaker and interspeaker information are utilized in the final decision. The distortion and discriminant-based classifiers used are dynamic time warping and the neural tree network, respectively. The system is evaluated with several hundred one word utterances collected over a telephone channel. All handsets considered in this experiment use electret microphones. The new system is found to perform exceptionally well for this task. A second experiment uses handsets having both electret and carbon button microphones. Here, a channel detection scheme is proposed that improves performance under these conditions.
In this paper, various linear predictive (LP) analysis methods are studied and compared from the points of view of robustness to noise and of application to speaker identification. The key to the success of LP techniques lies in separating the vocal tract information from the pitch information present in a speech signal, even under noisy conditions. In addition to considering the conventional, one-shot weighted least-squares methods, we propose three other approaches with the above point as motivation. The first is an iterative approach that leads to the weighted least absolute value solution. The second is an extension of the one-shot least-squares approach that achieves an iterative update of the weights; the update is a function of the residual and is based on minimizing a Mahalanobis distance. Thirdly, the weighted total least-squares formulation is considered. A study of the deviations in the LP parameters was conducted when noise (white Gaussian and impulsive) is added to the speech; it revealed that the most robust method depends on the type of noise. A closed-set speaker identification experiment with 20 speakers was conducted using a vector quantizer classifier trained on clean speech. For a modest codebook size of 32, all of the approaches are comparable when the testing condition corresponds to clean speech or speech degraded by white Gaussian noise. When the test involves speech degraded by impulse noise, the proposed approach based on minimizing a Mahalanobis distance, which was found to be the most robust, is also the best for speaker identification.
This paper describes an automated method of comparing a voice sample of an unknown individual with samples from known speakers in order to establish or verify the individual's identity. The method is based on a statistical pattern matching approach that employs a simple training procedure, requires no human intervention (transcription, word or phonetic marking, etc.), and makes no assumptions regarding the expected form of the statistical distributions of the observations. The content of the speech material (vocabulary, grammar, etc.) is not assumed to be constrained in any way. An algorithm is described which incorporates frame pruning and channel equalization processes designed to achieve robust performance with reasonable computational resources. An experimental implementation demonstrating the feasibility of the concept is described.
In this paper, we introduce a new methodology called Pole Filtering to remove the residual effects of speech from the cepstral-mean channel estimate, for extracting features robust to transmission channel degradations. The approach is based on filtering the eigenmodes of speech that are most susceptible to the convolutional distortions caused by transmission channels. Poles and their corresponding eigenmodes for a frame of speech are investigated when there is a channel mismatch in speaker identification systems. Linear predictive (LP) cepstra of speech have been found to be a useful feature set for recognition systems, and the relation between the LP cepstral coefficients and the eigenmodes of speech has been exploited to develop a robust feature set. An algorithm based on pole filtering is developed to improve the cepstral features for channel normalization. Experiments in speaker identification are presented using speech from the TIMIT database processed through a telephone channel simulator, and on the San Diego portion of the KING database. The technique is shown to offer improved recognition accuracy under cross-channel scenarios when compared to conventional methods.
The two largest factors affecting automatic speaker identification performance are the size of the population to be distinguished among and the degradations introduced by noisy communication channels (e.g., telephone transmission). To examine these two factors experimentally, this paper presents text-independent speaker identification results for speaker population sizes up to 630 speakers, for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification, and experiments are conducted on the TIMIT and NTIMIT databases. The aims of this study are to (1) establish how well text-independent speaker identification can perform under near-ideal conditions for very large populations (using the TIMIT database), (2) gauge the performance loss incurred by transmitting the speech over the telephone network (using the NTIMIT database), and (3) examine the validity of current models of telephone degradations commonly used in developing compensation techniques (using the NTIMIT calibration signals). These are believed to be the first speaker identification experiments on the complete 630-speaker TIMIT and NTIMIT databases, and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively.
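The Gaussian-mixture scoring at the heart of such a system can be sketched as follows. The two-component, 2-D "speaker models" below are invented for illustration, not trained models: identification picks the speaker whose mixture assigns the highest total log-likelihood to the test frames.

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (M,); means, variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                # (T, M, D)
    log_comp = (-0.5 * np.sum(diff**2 / variances
                              + np.log(2 * np.pi * variances), axis=2)
                + np.log(weights))                                # (T, M)
    m = log_comp.max(axis=1, keepdims=True)                       # log-sum-exp
    return m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))

def identify(frames, models):
    """Return the speaker whose GMM gives the highest total log-likelihood."""
    scores = {spk: gmm_loglik(frames, *p).sum() for spk, p in models.items()}
    return max(scores, key=scores.get)

# Invented toy models for two "speakers": (weights, means, variances).
models = {
    "alice": (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [2.0, 2.0]]),
              np.ones((2, 2))),
    "bob":   (np.array([0.5, 0.5]), np.array([[5.0, 5.0], [7.0, 7.0]]),
              np.ones((2, 2))),
}
rng = np.random.default_rng(2)
test_frames = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))
print(identify(test_frames, models))
```

In a real system the frames would be cepstral features and each mixture would be trained on that speaker's enrollment speech.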
Hands-free operation of speech processing equipment is sometimes desired so that the user is unencumbered by hand-held or body-worn microphones. This paper explores the use of microphone arrays and neural networks (MANN) for robust speech/speaker recognition in a reverberant and noisy environment. Microphone arrays provide high-quality, hands-free sound capture at a distance, and neural network processors compensate for environmental interference by transforming speech features of the array input to those of close-talking microphone input. The MANN system is evaluated using both computer-simulated degraded speech and real-room collected speech. It is found that the MANN system is capable of elevating recognition accuracies under adverse conditions, such as room reverberation, noise interference, and mismatch between the training and testing conditions, to levels comparable to those obtained with close-talking microphone input under matched training and testing conditions.
In this paper, a new algorithm for text-dependent speaker verification is presented. The algorithm uses a set of concatenated Neural Tree Networks (NTNs) trained on sub-word units for speaker verification. The conventional NTN, trained on all the words in the training data, achieves good results on the text-independent task. The proposed method is as follows: first, the predetermined password in the training data is segmented into sub-word units by a hidden Markov model (HMM); second, for each sub-word unit, an NTN is trained on only the data segmented into that unit. The method integrates the discriminatory ability of the NTN with the HMM framework, which has demonstrated ability in modeling the temporal variation of speech. The sub-word NTNs, trained with clustered data, reduce the complexity of the NTN structure and are more powerful in discriminating speakers. The new algorithm was evaluated in experiments on a TI isolated-word database containing 16 speakers, and an improvement over the baseline performance of conventional methods was obtained.
In this paper, an evaluation of various discriminant neural network classifiers for the text-independent speaker verification problem is presented. Each person to be verified has a personalized neural network model. A new classifier called the neural tree network (NTN) is also examined for this application. The memoryless feedforward neural network architecture makes decisions based on static features, whereas time delay neural networks (TDNNs) have proved to be an efficient way to handle the dynamic nature of speech. Furthermore, a model called the recurrent time delay neural network (RTDNN), obtained through a local feedback connection at the first hidden layer of the TDNN, is investigated. Training is carried out by the backpropagation-for-sequences algorithm. The database used is a subset of the TIMIT database consisting of 38 speakers from the same dialect region. The NTN is compared with the MLP, TDNN, and RTDNN, and is found to perform better than the other neural network classifiers. A slight performance improvement was also achieved through the addition of temporal information in the TDNNs and RTDNNs for the text-independent speaker verification problem.
In this paper, a technique for audio indexing based on speaker identification is proposed. When the speakers are known a priori, a speaker index can be created in real time using the Viterbi algorithm to segment the audio into intervals from a single talker. Segmentation is performed using a hidden Markov model network consisting of interconnected speaker sub-networks. Speaker training data is used to initialize a sub-network for each speaker. Sub-networks can also be used to model silence or non-speech sounds such as musical themes. When no prior knowledge of the speakers is available, unsupervised segmentation is performed using a non-real-time iterative algorithm. The speaker sub-networks are first initialized, and segmentation then proceeds by iteratively generating a segmentation with the Viterbi algorithm and retraining the sub-networks on the results of that segmentation. Since the accuracy of the speaker segmentation depends on how well the speaker sub-networks are initialized, agglomerative clustering is used to approximately segment the audio according to speaker for the initialization of the speaker sub-networks. The distance measure for the agglomerative clustering is a likelihood ratio in which speech segments are characterized by Gaussian distributions. The distance between merged segments is recomputed at each stage of the clustering, and a duration model is used to bias the likelihood ratio. Segmentation accuracy using agglomerative clustering initialization matches the accuracy obtained using initialization with speaker-labeled data.
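A common form of this likelihood-ratio distance between two segments, each modeled by a Gaussian, is the generalized likelihood ratio: the log-likelihood lost by forcing one Gaussian to model the merged data. The sketch below uses synthetic feature vectors and omits the duration-model bias described in the abstract.

```python
import numpy as np

def glr_distance(x, y):
    """Generalized likelihood ratio distance between two segments.

    Each segment is modeled by a full-covariance Gaussian; the distance
    is the log-likelihood loss incurred by merging both segments into a
    single Gaussian, so it is always nonnegative.
    """
    def logdet_cov(z):
        cov = np.cov(z, rowvar=False, bias=True)   # ML covariance estimate
        return np.linalg.slogdet(cov)[1]

    nx, ny = len(x), len(y)
    merged = np.vstack([x, y])
    return ((nx + ny) * logdet_cov(merged)
            - nx * logdet_cov(x) - ny * logdet_cov(y)) / 2.0

# Segments from the same "speaker" vs. from different "speakers".
rng = np.random.default_rng(3)
same = glr_distance(rng.normal(0, 1, (200, 3)), rng.normal(0, 1, (200, 3)))
diff = glr_distance(rng.normal(0, 1, (200, 3)), rng.normal(4, 1, (200, 3)))
print(same, diff)
```

Agglomerative clustering then repeatedly merges the closest pair of segments under this distance, recomputing distances after each merge.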
A new classification system for text-independent speaker recognition is presented. Text-independent speaker recognition systems generally model each speaker with a single classifier. The traditional methods use unsupervised training algorithms, such as vector quantization (VQ), to model each speaker. Such methods base their decision on the distortion between an observation and the speaker model. Recently, supervised training algorithms, such as neural networks, have been successfully applied to speaker recognition. Here, each speaker is represented by a neural network. Due to their discriminative training, neural networks capture the differences between speakers and use this criterion for decision making. Hence, the output of a neural network can be considered as an interclass measure. The VQ classifier, on the other hand, uses a distortion which is independent of the other speaker models, and can be considered as an intraclass measure. Since these two measures are based on different criteria, they can be effectively combined to yield improved performance. This paper uses data fusion concepts to combine the outputs of the neural tree network and VQ classifiers. The combined system is evaluated for text-independent speaker identification and verification and is shown to outperform either classifier used individually.
An important problem in speech coding is the quantization of linear predictive coefficients (LPC) with the smallest possible number of bits while maintaining robustness to a large variety of speech material and transmission media. Since direct quantization of LPCs is known to be unsatisfactory, we consider this problem for an equivalent representation, namely, the line spectral frequencies (LSF). To achieve an acceptable level of distortion a scalar quantizer for LSFs requires a 36 bit codebook. We derive a 30 bit two-quantizer scheme which achieves a performance equivalent to this scalar quantizer. This equivalence is verified by tests on data taken from various types of filtered speech, speech corrupted by noise and by a set of randomly generated LSFs. The two-quantizer format consists of both a vector and a scalar quantizer such that for each input, the better quantizer is used. The vector quantizer is designed from a training set that reflects the joint density (for coding efficiency) and which ensures coverage (for robustness). The scalar quantizer plays a pivotal role in dealing with regions of the space that are sparsely covered by its vector quantizer counterpart. A further reduction of 1 bit is obtained by formulating a new adaptation algorithm for the vector quantizer and doing a dynamic programming search for both quantizers. The method of adaptation takes advantage of the ordering of the LSFs and imposes no overhead in memory requirements. Subjective tests in a speech coder reveal that the 29 bit scheme produces equivalent perceptual quality to that when the parameters are unquantized.
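The "better of the two quantizers" selection at the heart of the two-quantizer format can be sketched as follows. The codebooks here are random stand-ins, not the trained 30-bit design described above: each input is quantized both ways, and the reconstruction with lower distortion is kept.

```python
import numpy as np

def quantize_two_way(lsf, vq_codebook, scalar_levels):
    """Quantize an LSF vector with both a VQ and a scalar quantizer,
    then keep whichever reconstruction has lower distortion.

    vq_codebook: (K, D) candidate vectors; scalar_levels: (D, L)
    per-dimension reproduction levels. Returns (reconstruction, which).
    """
    # Vector quantizer: nearest codebook entry in squared error.
    vq_out = vq_codebook[np.argmin(((vq_codebook - lsf) ** 2).sum(axis=1))]
    # Scalar quantizer: nearest level independently per dimension.
    sq_out = np.array([levels[np.argmin((levels - v) ** 2)]
                       for v, levels in zip(lsf, scalar_levels)])
    dv = ((vq_out - lsf) ** 2).sum()
    ds = ((sq_out - lsf) ** 2).sum()
    return (vq_out, "vq") if dv <= ds else (sq_out, "scalar")

# Toy 3-dimensional "LSF" example with invented codebooks.
rng = np.random.default_rng(4)
codebook = rng.uniform(0, np.pi, size=(8, 3))
levels = np.tile(np.linspace(0, np.pi, 16), (3, 1))
recon, which = quantize_two_way(np.array([0.5, 1.2, 2.5]), codebook, levels)
print(which, recon)
```

This shows why the scalar quantizer acts as a safety net: inputs in regions sparsely covered by the VQ codebook fall back to the per-dimension quantizer.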
The Pseudo 2D Hidden Markov Model (PHMM), which is an extension of the 1D HMM, has been shown to be an effective approach for recognizing highly degraded and connected text. In this paper, the PHMM is extended to directly recognize poorly printed gray-level document images. The performance of the system is further enhanced by an N-best hypothesis search coupled with a duration constraint. Experimental results show that the new system significantly improves performance when compared to a similar system using thresholded binary images as inputs. The recognition rate improves from 97.7% for the binary system to 99.9% for the gray-level system with the modified N-best search, over a testing set with blur and noise conditions similar to those of the training set. For a much more degraded testing set, it improves from 89.59% to 98.51%, which also demonstrates the robustness of the proposed system.
The future of the "secure transaction," and the success of all undertakings that depend on absolute certainty that the individuals involved really are who and what they represent themselves to be, are dependent upon the successful development of absolutely accurate, low-cost, and easy-to-operate Biometric Identification Systems.

Whether these transactions are political, military, financial, or administrative (e.g., health cards, drivers' licenses, welfare entitlement, national identification cards, credit card transactions, etc.), the need for such secure and positive identification has never been greater, and yet we are only at the beginning of an era in which we will see the emergence and proliferation of Biometric Identification Systems in nearly every field of human endeavor.

Proper application of these systems will change the way the world operates, and that is precisely the goal of Comparator Systems Corporation. Just as with the photocopier 40 years ago and the personal computer 20 years ago, the potential applications for positive personal identification are going to make the Biometric Identification System a commonplace component in the standard practice of business, and in interhuman relationships of all kinds.

The development of new and specific application hardware, as well as the necessary algorithms and related software required for integration into existing operating procedures and newly developed systems alike, has been a more-than-decade-long process at Comparator, and we are now on the verge of delivering these systems to the world markets so urgently in need of them.

An individual could feel extremely confident and satisfied if he could present his credit, debit, or ATM card at any point of sale and, after inserting his card, simply place his finger on a glass panel and in less than a second be positively accepted as the person the card purported him to be; not to mention the security and satisfaction of the vendor in knowing that his fraud risk had been reduced to virtually zero.

In highly sensitive security applications, such a system would be imperative, and when combined, if necessary, with other biometric identifiers such as signature and/or voice recognition for simultaneous verification, one would have a nearly foolproof system.

These are the tools of what we call Transaction Facilitation, and this is the realm of Comparator Systems Corp. Our technological developments over the last ten years have moved our Company into a position of potential leadership in what is fast becoming a worldwide market, and it is toward this end that we have applied all of our efforts.
The application of DNA fingerprinting has become very broad in forensic analysis, patient identification, diagnostic medicine, and the investigation of wildlife poaching, since every individual's DNA structure is identical within all tissues of their body. DNA fingerprinting was initiated by the use of restriction fragment length polymorphisms (RFLP). In 1987, Nakamura et al.2 found that a variable number of tandem repeats (VNTR) often occurred in the alleles. The probability of different individuals having the same number of tandem repeats in several different alleles is very low. Thus, the identification of VNTRs from genomic DNA became a very reliable method for the identification of individuals. Take the Huntington gene as an example: it contains CAG trinucleotide repeats. For normal people, the number of CAG repeats is usually between 10 and 40. Since people have chromosomes in pairs, the probability of two individuals having the same VNTR in the Huntington gene is less than one percent, if we assume an equal distribution over the various repeat counts. When several alleles containing VNTRs are analyzed for the number of repeats, the probability of two individuals being exactly identical becomes very small. Thus, DNA fingerprinting is a reliable tool for forensic analysis. In DNA fingerprinting, knowledge of the sequence of tandem repeats and of restriction endonuclease sites provides the basis for identification.
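Under the stated simplifying assumption of a uniform distribution over repeat counts, the less-than-one-percent figure for the Huntington gene can be checked with a short calculation:

```python
from fractions import Fraction

# CAG repeat counts 10..40 inclusive: 31 equally likely allele values
# (the simplifying uniform-distribution assumption made in the text).
n = 31

# A genotype is an unordered pair of alleles, one per chromosome.
# P(two independent individuals share a genotype) = sum over all
# genotypes g of P(g)^2.
p_hom = Fraction(1, n * n)             # each of the n homozygous genotypes
p_het = Fraction(2, n * n)             # each of the n(n-1)/2 heterozygous ones
p_match = n * p_hom**2 + (n * (n - 1) // 2) * p_het**2

print(float(p_match))                  # about 0.002, well under one percent
```

Combining several such loci multiplies these small match probabilities together, which is why multi-locus VNTR profiles are so discriminating.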
A novel technique for automated fingerprint authentication is presented which utilizes pore information extracted from live scanned images. The position of the pores on the fingerprint ridges is known to provide information that is unique to an individual and is sufficient for use in identification. By combining the use of ridge and pore features, we have developed a unique multilevel verification/identification technique that possesses advantages over systems employing ridge information only. An optical/electronic sensor capable of providing a high resolution fingerprint image is required for extraction of pertinent pore information, which makes it unlikely that electronically scanned inked fingerprints would contain adequate pore data that is sufficient, or consistent enough, for use in authentication. The feasibility of this technique has been demonstrated by a working system that was designed to provide secure access to a computer. Low false reject and zero false accept error rates have been observed based on initial testing of the prototype verification system.
We consider the following pair of problems related to orthonormal compactly supported wavelet expansions: (1) Given a wavelet coefficient with its nominal scale and position indices, find the precise location of the transient signal feature which produced it; (2) Given two collections of wavelet coefficients, determine whether they arise from a periodic signal and its translate, and if so find the translation which maps one into the other. Both problems may be solved by traditional means after inverting the wavelet transform, but we propose two alternative algorithms which rely solely on the wavelet coefficients themselves.
We use the Battle-Lemarie scaling function in an algorithm for fast computation of the Fourier transform of a piecewise smooth function $f$. Namely, we compute, for $-N \le m, n \le N$ and to a given accuracy $\epsilon$, the integrals
$$\hat f(m,n) = \int_0^1 \int_0^1 f(x,y)\, e^{-2\pi i m x}\, e^{-2\pi i n y}\, dx\, dy \qquad (0.1)$$
in $O(N_D) + O(N^2 \log N)$ operations, where $N_D$ is the number of subdomains on which the function $f$ is smooth. We consider an application of this algorithm to image processing. Notwithstanding that it might be advantageous to consider an image as a piecewise smooth function $f$, it is common practice in image processing to simply take the FFT of the pixel values of the image in order to evaluate the Fourier transform. We propose our algorithm as a tool for the accurate computation of the Fourier transform of an image, since the direct evaluation of (0.1) is very costly.
Optical flow is an estimate of the velocity field based on the change of intensity patterns in successive images, and is an important quantity in computational vision for dense images. Because of the aperture problem, optical flow computations can be ill-posed. This problem is compounded by derivative estimation errors. This paper presents an aggregate velocity scheme that uses iterative velocity refinement along object edge contours obtained via the Mallat-Zhong-Hwang wavelet and chaining algorithms. By working with edge information and aggregate velocities we avoid the aperture problem; iterative refinement compensates for errors in the derivative estimation. Our approach assigns a common velocity to the edge points of an image. When combined with a constant brightness assumption this yields an overdetermined set of linear equations. Since the data vector and matrix coefficients of this linear system consist of temporal and spatial derivative estimates, respectively, and both are subject to errors, the overdetermined system is solved using a total least squares approach. The resulting velocity estimate is then subtracted from the image sequence and the velocity estimation procedure is repeated for the new image sequence. This approach is very fast and accurate for images that have nearly the same edge velocity vectors, as is usually the case for distant objects. A convergence analysis is given for the special case of 1D convected flow, and it is shown that spatial and/or temporal smoothing enhances the convergence.
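The total least squares solve at the core of this scheme can be sketched as follows. The derivative values below are synthetic, not wavelet-derived edge estimates: brightness constancy at each edge point gives one noisy linear equation in the common velocity (u, v), and TLS solves the overdetermined system via the smallest right singular vector of the augmented matrix.

```python
import numpy as np

def tls_velocity(Ix, Iy, It):
    """Total least squares estimate of a single (u, v) edge velocity.

    Brightness constancy gives Ix*u + Iy*v + It = 0 at each edge point.
    Since derivative estimates on BOTH sides of the system are noisy,
    solve by TLS: the smallest right singular vector of [Ix Iy It]
    gives the direction (u, v, 1) up to scale.
    """
    A = np.column_stack([Ix, Iy, It])
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    v = vt[-1]                   # right singular vector of smallest singular value
    return v[:2] / v[2]          # rescale so the It coefficient is 1

# Synthetic edge points moving with common velocity (u, v) = (1.0, -0.5).
rng = np.random.default_rng(5)
Ix = rng.normal(size=200)
Iy = rng.normal(size=200)
It = -(Ix * 1.0 + Iy * -0.5) + 0.01 * rng.normal(size=200)
print(tls_velocity(Ix, Iy, It))
```

An ordinary least squares solve would treat only It as noisy; TLS accounts for errors in the spatial derivatives as well, which is the motivation given in the abstract.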
A new language, compiler, and user interface have been developed to facilitate the research and development of image processing algorithms. Algorithms are written in a high-level language specifically designed for image processing and are compiled into machine code that performs as well as algorithms hand-coded in the C language. The new language and compiler are introduced, and many examples are presented. Directions for future work relating to the user interface and support for parallel processing are proposed.
Based on modern invariant theory and symmetry groups, a high level way of defining invariant geometric flows for a given Lie group is described in this work. We then analyze in more detail different subgroups of the projective group, which are of special interest for computer vision. We classify the corresponding invariant flows and show that the geometric heat flow is the simplest possible one. Results on invariant geometric flows of surfaces are presented in this paper as well. We then show how the planar curve flow obtained for the affine group can be used for geometric smoothing of planar shapes and edge preserving enhancement of MRI. We conclude the paper with the presentation of an affine invariant geometric edge detector obtained from the classification of affine differential invariants.