In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are
present either in the datasets (high resolution, for example) or in the methods (recognition of spoken visual units
called "visemes" for example). Here we review these and other assumptions and show the surprising result that
computer lip-reading is not heavily constrained by video resolution, pose, lighting and other practical factors.
However, the working assumption that visemes, which are the visual equivalent of phonemes, are the best unit
for recognition does need further examination. We conclude that visemes, which were defined over a century
ago, are unlikely to be optimal for a modern computer lip-reading system.
Human lip-readers are increasingly being presented as useful in the gathering of forensic evidence but, like all humans, suffer from unreliability. Here we report the results of a long-term study in automatic lip-reading with the objective of converting video to text (V2T). The V2T problem is surprising in that some aspects that look tricky, such as real-time tracking of the lips on poor-quality interlaced video from hand-held cameras, prove to be relatively tractable, whereas the problem of speaker-independent lip-reading is very demanding due to unpredictable variations between people. Here we review the problem of automatic lip-reading for crime fighting and identify the critical parts of the problem.
A recent trend in law enforcement has been the use of forensic lip-readers. Criminal activities are often recorded
on CCTV or other video gathering systems. Knowledge of what suspects are saying enriches the evidence gathered
but lip-readers, by their own admission, are fallible so, based on long-term studies of automated lip-reading, we
are investigating the possibilities and limitations of applying this technique under realistic conditions. We have
adopted a step-by-step approach and are developing a capability when prior video information is available for the
suspect of interest. We use the terminology video-to-text (V2T) for this technique by analogy with speech-to-text
(S2T) which also has applications in security and law-enforcement.
Color texture classification has been an area of intensive research activity. From the very onset, approaches to combining color and texture have been the subject of much discussion, in particular whether they should be considered jointly or separately. We present a comprehensive comparison of the most prominent approaches from both a theoretical and an experimental standpoint. The main contributions of our work are: (i) the establishment of a generic and extensible framework to classify methods for color texture classification on a mathematical basis, and (ii) a theoretical and experimental comparison of the most salient existing methods. Starting from an extensive set of experiments based on the Outex dataset, we highlight those texture descriptors that provide good accuracy along with low dimensionality. The results suggest that separate color and texture processing is the best practice when one seeks an optimal compromise between accuracy and a limited number of features. We believe that our work may serve as a guide for those who need to choose the appropriate method for a specific application, as well as a basis for the development of new methods.
We consider the problem of classifying textures. First, we consider images where the orientation of the texture is known. Then, we consider the classification of textures where the orientation is unknown. Last, classification in real scenes is considered. A wide variety of techniques are tested using the Outex framework. We introduce a new grayscale multiscale texture classification method based on a class of morphological filters called sieves. The method, denoted Tex-Mex because it extracts TEXture features using Morphological EXtrema filters, is shown to be among the best performing texture feature extraction methods. Tex-Mex features can be computed rapidly and are shown to be more robust and compact than the alternatives. Furthermore, they may be applied over windows of arbitrary size and orientation, a useful attribute when classifying texture in real scenes.
Accurate lip-reading techniques would be of enormous benefit for agencies involved in counter-terrorism and other law-enforcement areas. Unfortunately, there are very few skilled lip-readers, and it is apparently a difficult skill to transmit, so the area is under-resourced. In this paper we investigate the possibility of making the lip-reading task more amenable to a wider range of operators by enhancing lip movements in video sequences using active appearance models. These are generative, parametric models commonly used to track faces in images and video sequences. The parametric nature of the model allows a face in an image to be encoded in terms of a few tens of parameters, while the generative nature allows faces to be re-synthesised using the parameters. The aim of this study is to determine whether exaggerating lip motions in video sequences by amplifying the parameters of the model improves lip-reading ability. We also present results of lip-reading tests undertaken by experienced (but non-expert) adult subjects who claim to use lip-reading in their speech recognition process. The results, which are comparisons of word error rates on unprocessed and processed video, are mixed. We find that there appears to be the potential to improve the word error rate but, for the method to improve intelligibility, there is a need for more sophisticated tracking and visual modelling. Our technique can also act as an expression or visual gesture amplifier and so has applications to animation and the presentation of information via avatars or synthetic humans.
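The encode, amplify and re-synthesise idea described above can be illustrated with a minimal sketch. It assumes a purely linear model (a mean vector plus a small set of orthonormal modes) and a single scalar gain applied to deviations from the mean; the function names, the gain parameter and the toy numbers are hypothetical and are not taken from the study, whose active appearance model also covers appearance and tracking.

```python
def encode(shape, mean, basis):
    """Project a shape onto the model modes: b_j = <mode_j, shape - mean>.

    Assumes the modes in `basis` are orthonormal (an assumption of this sketch).
    """
    diff = [s - m for s, m in zip(shape, mean)]
    return [sum(p * d for p, d in zip(mode, diff)) for mode in basis]


def synthesise(params, mean, basis):
    """Rebuild a shape from model parameters: mean + sum_j b_j * mode_j."""
    out = list(mean)
    for b, mode in zip(params, basis):
        for i, p in enumerate(mode):
            out[i] += b * p
    return out


def amplify(shape, mean, basis, gain):
    """Exaggerate a shape by scaling its model parameters by `gain`."""
    params = encode(shape, mean, basis)
    return synthesise([gain * b for b in params], mean, basis)


# Toy example: four coordinates, two modes (hypothetical values).
mean = [1.0, 2.0, 3.0, 4.0]
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
shape = [1.5, 1.0, 3.0, 4.0]
exaggerated = amplify(shape, mean, basis, gain=2.0)
```

With a gain of 1 the shape is returned unchanged; a gain above 1 moves it further from the mean along each mode, which is the sense in which motion is exaggerated in this sketch.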
We will discuss the simplified implementation of Recursive Median (RM) filters. It will be shown that every RM filter has an alternative implementation. This implies a fast algorithm (O(1) per pixel on average) for the one-dimensional RM filter. We also consider the case when RM filters are applied in a cascade of increasing filter window lengths, that is, the RM sieve. We will show that the RM sieve can be implemented in constant time per scale by applying only 3-point median operations. Both of the above-mentioned fast implementations are viewed in a new light by constructing the corresponding Finite State Machines (FSM) and observing the achievable state reduction. A radical reduction of complexity is obtained by applying standard state reduction techniques. FSM models also open new possibilities for the analysis of these systems. Finally, we discuss the benefits of using the RM sieve instead of the RM filter. We consider the streaking problem of the RM filter. It is demonstrated that the RM filter is not in itself a reliable estimator of location. As the cascading element in the structure of the sieve, however, it is very useful. It turns out that the use of the RM sieve reduces the streaking problem to a manageable level.
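For orientation, the two structures named above can be sketched directly from their definitions. This is a naive reference version, not the fast algorithm or the 3-point-median implementation the abstract refers to; the end-sample padding and the function names are assumptions of this sketch.

```python
from statistics import median


def recursive_median(x, k=1):
    """Recursive median filter with window length 2*k + 1.

    Each output is the median of the k previous OUTPUTS, the current input
    and the k following inputs. Ends are padded by repeating the end samples
    (an assumption of this sketch).
    """
    n = len(x)
    if n == 0:
        return []
    y = [x[0]] * k + list(x) + [x[-1]] * k
    for i in range(k, n + k):
        # y[i - k:i] already holds filtered values, y[i:] still holds inputs.
        y[i] = median(y[i - k:i + k + 1])
    return y[k:n + k]


def rm_sieve(x, max_scale):
    """Cascade of RM filters with window lengths 3, 5, ..., 2*max_scale + 1.

    Returns the list of stages; stages[0] is the input itself.
    """
    stages = [list(x)]
    for k in range(1, max_scale + 1):
        stages.append(recursive_median(stages[-1], k))
    return stages


signal = [0, 0, 5, 0, 0, 3, 3, 0, 0]
stages = rm_sieve(signal, 2)
```

In this example the first stage removes the one-sample impulse and keeps the two-sample pulse, and the second stage removes the pulse as well, so each feature disappears at the stage matching its width.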
The theory of an image decomposition that we refer to as a sieve is developed for images defined in any finite number of dimensions. The decomposition has many desirable properties including the preservation of scale-space causality and the localization
of sharp-edged objects in the transformation domain. The decomposition has the additional properties of manipulability, which means that it is easy to construct pattern recognition systems, and scale-calibration, which means that it may be used for accurate measurement.
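A one-dimensional illustration may help make the decomposition concrete. The sketch below assumes an opening-based variant of the sieve built from flat openings of increasing length, with the difference between successive stages taken as the detail (granules) at each scale; the function names and the choice of opening are assumptions of this sketch and only stand in for the general theory.

```python
def opening(x, s):
    """Flat 1D opening with a segment of s + 1 samples.

    Removes maxima that are narrower than s + 1 samples. Windows are kept
    inside the signal (a boundary assumption of this sketch).
    """
    n, w = len(x), s + 1
    if n == 0:
        return []
    if w > n:
        return [min(x)] * n
    # Erosion: minimum over every window of w samples.
    mins = [min(x[j:j + w]) for j in range(n - w + 1)]
    # Dilation: each sample takes the largest minimum among windows covering it.
    return [max(mins[max(0, i - w + 1):min(i, n - w) + 1]) for i in range(n)]


def opening_sieve(x, max_scale):
    """Cascade of openings at scales 1..max_scale.

    Returns the final residual and a dict mapping each scale to its granules
    (the signal removed at that scale).
    """
    stage = list(x)
    granules = {}
    for s in range(1, max_scale + 1):
        nxt = opening(stage, s)
        granules[s] = [a - b for a, b in zip(stage, nxt)]
        stage = nxt
    return stage, granules


signal = [0, 0, 4, 0, 0, 2, 2, 2, 0, 0]
residual, granules = opening_sieve(signal, 3)
rebuilt = [r + sum(granules[s][i] for s in granules) for i, r in enumerate(residual)]
```

Here the one-sample peak appears only in the scale-1 granules and the three-sample pulse only in the scale-3 granules, each at its original position, and adding the granules back to the residual recovers the input. This is the sense in which such a decomposition localizes sharp-edged objects and can be used for measurement.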
This paper provides an introduction to some novel aspects of the transmission line matrix (TLM) numerical technique with particular reference to the modeling of processes in semiconductor materials and devices. It covers the relative merits of different forms of network representation of physical problems. Current progress with TLM algorithms for the Laplace and Poisson equations is reviewed.