Although tracking people or objects in uncontrolled environments has been addressed in the literature, the accurate
localization of a subject with respect to a reference ground plane remains a major issue. This study describes an early
prototype for the tracking and localization of pedestrians with a handheld camera. One application envisioned here is to
analyze the trajectories of blind people going across long crosswalks when following different audio signals as a guide.
This kind of study is generally conducted manually with an observer following a subject and logging his/her current
position at regular time intervals with respect to a white grid painted on the ground. This study aims at automating the
manual logging activity: with a marker attached to the subject’s foot, a video of the crossing is recorded by a person
following the subject, and a semi-automatic tool analyzes the video and estimates the trajectory of the marker with
respect to the painted markings. Challenges include robustness to variations in lighting conditions (shadows, etc.),
occlusions, and changes in camera viewpoint. Results are promising when compared to GNSS measurements.
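
The localization step described above, mapping a marker seen in the image to a position on the ground plane, can be illustrated with a planar homography. The following is a minimal sketch, not the prototype's actual method: it assumes that four corners of the painted grid have been identified in a frame (for instance by an operator) and that their ground coordinates are known. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_4_points(image_pts: Sequence[Point],
                             ground_pts: Sequence[Point]) -> List[float]:
    """Return the 8 homography entries (the ninth is fixed to 1) that map
    image pixels (u, v) to ground-plane coordinates (x, y)."""
    a: List[List[float]] = []
    b: List[float] = []
    for (u, v), (x, y) in zip(image_pts, ground_pts):
        a.append([u, v, 1.0, 0.0, 0.0, 0.0, -u * x, -v * x])
        b.append(x)
        a.append([0.0, 0.0, 0.0, u, v, 1.0, -u * y, -v * y])
        b.append(y)
    return solve_linear(a, b)


def image_to_ground(h: Sequence[float], pt: Point) -> Point:
    """Project an image point onto the ground plane with homography h."""
    u, v = pt
    w = h[6] * u + h[7] * v + 1.0
    return ((h[0] * u + h[1] * v + h[2]) / w,
            (h[3] * u + h[4] * v + h[5]) / w)
```

Because the camera is handheld, such a mapping would have to be re-estimated as the viewpoint changes; the sketch handles a single frame only and ignores lens distortion.
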
Producing off-line captions for deaf and hearing-impaired people is a labor-intensive task that can require up to 18
hours of production per hour of film. Captions are placed manually close to the region of interest, but they must avoid
masking human faces, text or any moving objects that might be relevant to the story flow. Our goal is to use image
processing techniques to reduce the off-line caption production effort by automatically placing the captions on the
proper consecutive frames. We implemented a computer-assisted captioning software tool which integrates detection of
faces, texts and visual motion regions. Near-frontal faces are detected using a cascade of weak classifiers and tracked
through a particle filter. Then, frames are scanned to perform text spotting and build a region map suitable for text
recognition. Finally, motion mapping is based on the Lucas-Kanade optical flow algorithm and provides MPEG-7
motion descriptors. The combined detected items are then fed to a rule-based algorithm to determine the best caption
placement for the related sequences of frames. This paper focuses on the rules defined to assist human captioners
and the results of a user evaluation for this approach.
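
To make the idea of rule-based placement concrete, the sketch below scores candidate caption boxes by how much they would mask detected faces, text and motion regions, and keeps the least obstructive one. It is an illustration under stated assumptions, not the rule set evaluated in the paper: the `Box` type, the `WEIGHTS` values and the preference order of candidates are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    x: float  # left edge, pixels
    y: float  # top edge, pixels
    w: float  # width
    h: float  # height


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return dx * dy if dx > 0 and dy > 0 else 0.0


# Hypothetical weights: masking a face is assumed to be worse than masking
# on-screen text, which is assumed to be worse than masking motion.
WEIGHTS = {"face": 3.0, "text": 2.0, "motion": 1.0}


def place_caption(candidates: List[Box],
                  detections: List[Tuple[str, Box]]) -> Box:
    """Return the candidate with the lowest weighted masked area.

    Candidates are listed in order of preference; min() keeps the first
    one in case of a tie, so the preferred position wins when nothing
    is masked.
    """
    def penalty(candidate: Box) -> float:
        return sum(WEIGHTS[kind] * overlap_area(candidate, box)
                   for kind, box in detections)

    return min(candidates, key=penalty)
```

For a sequence of frames, the detections of all frames in the sequence could be pooled before calling `place_caption`, so that the caption stays in one place for its whole display time.
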
Deaf and hearing-impaired people capture information in video through visual content and captions. Those activities
require different visual attention strategies and, up to now, little is known about how caption readers balance these two
visual attention demands. Understanding these strategies could suggest more efficient ways of producing captions. Eye
tracking and attention-overload detection are used to study these strategies. Eye tracking is monitored using a
pupil-center corneal-reflection apparatus. Afterward, gaze fixation is analyzed for each region of interest, such as the
caption area, high-motion areas and face locations. This data is also used to identify the scanpaths. The collected data is used to
establish specifications for a caption adaptation approach based on the location of visual action and the presence of
characters' faces. This approach is implemented in a computer-assisted captioning software tool that uses a face detector and a motion
detection algorithm based on the Lucas-Kanade optical flow algorithm. The different scanpaths obtained among the
subjects provide us with alternatives for conflicting caption positioning. This implementation is now undergoing a user
evaluation with hearing-impaired participants to validate the efficiency of our approach.
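
The per-region fixation analysis and the scanpath extraction mentioned above can be sketched as follows. This is a simplified illustration, assuming fixations have already been extracted from the raw eye-tracker samples and that regions of interest are axis-aligned rectangles; the `Fixation` and `Region` types and the "other" label are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Fixation:
    x: float            # gaze position on screen, pixels
    y: float
    duration_ms: float  # fixation duration


@dataclass(frozen=True)
class Region:
    name: str
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float

    def contains(self, fx: Fixation) -> bool:
        return (self.x <= fx.x < self.x + self.w
                and self.y <= fx.y < self.y + self.h)


def label(fx: Fixation, regions: List[Region]) -> str:
    """Name of the first region containing the fixation, else 'other'."""
    for region in regions:
        if region.contains(fx):
            return region.name
    return "other"


def dwell_time_by_region(fixations: List[Fixation],
                         regions: List[Region]) -> Dict[str, float]:
    """Total fixation time (ms) spent in each region of interest."""
    totals = {region.name: 0.0 for region in regions}
    totals["other"] = 0.0
    for fx in fixations:
        totals[label(fx, regions)] += fx.duration_ms
    return totals


def scanpath(fixations: List[Fixation], regions: List[Region]) -> List[str]:
    """Sequence of visited regions, with consecutive repeats collapsed."""
    path: List[str] = []
    for fx in fixations:
        name = label(fx, regions)
        if not path or path[-1] != name:
            path.append(name)
    return path
```

When regions overlap (a face inside a high-motion area, for example), the order of the `regions` list decides which one a fixation is attributed to; that choice is an assumption of this sketch.
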
This paper reports on the development status of a Multimedia Asset Management (MAM) test-bed for content-based indexing and retrieval of audio-visual documents within the MPEG-7 standard. The project, called "MPEG-7 Audio-Visual Document Indexing System" (MADIS), specifically targets the indexing and retrieval of video shots and key frames from documentary film archives, based on audio-visual content analysis such as face recognition, motion activity, speech recognition and semantic clustering. The MPEG-7/XML encoding of the film database is done off-line. The description is based on a temporal decomposition into visual segments (shots), key frames and audio/speech sub-segments. The visible outcome will be a web site that allows video retrieval using a proprietary XQuery-based search engine and is accessible to members of the Canadian National Film Board (NFB) Cineroute site. For example, an end user will be able to request the movie shots in the database that were produced in a specific year, that contain the face of a specific actor who says a specific word, and in which there is no motion activity. Video streaming is performed over the high-bandwidth CA*net network deployed by CANARIE, a public Canadian Internet development organization.
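
The example query above (year, actor's face, spoken word, no motion) can be illustrated on a toy XML description. The sketch below is only an assumption-laden stand-in: the element and attribute names (`archive`, `film`, `shot`, `face`, `speech`, `motion`) are invented for illustration and are not the MPEG-7 schema, and Python's `xml.etree.ElementTree` with its limited XPath support replaces the proprietary XQuery engine.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, heavily simplified description of a film archive.
# motion="0" is assumed here to mean "no motion activity".
SAMPLE = """
<archive>
  <film title="Film A" year="1965">
    <shot id="a1" motion="0">
      <face actor="Actor One"/>
      <speech speaker="Actor One">hello winter</speech>
    </shot>
    <shot id="a2" motion="3">
      <face actor="Actor One"/>
      <speech speaker="Actor One">winter again</speech>
    </shot>
  </film>
  <film title="Film B" year="1970">
    <shot id="b1" motion="0">
      <face actor="Actor Two"/>
      <speech speaker="Actor Two">winter</speech>
    </shot>
  </film>
</archive>
"""


def find_shots(root: ET.Element, year: int, actor: str, word: str) -> List[str]:
    """Return ids of motionless shots from the given year in which the
    actor's face appears and the actor says the given word."""
    hits: List[str] = []
    for film in root.findall(f"film[@year='{year}']"):
        for shot in film.findall("shot[@motion='0']"):
            has_face = any(face.get("actor") == actor
                           for face in shot.findall("face"))
            says_word = any(
                speech.get("speaker") == actor
                and word.lower() in (speech.text or "").lower().split()
                for speech in shot.findall("speech")
            )
            if has_face and says_word:
                hits.append(shot.get("id", ""))
    return hits


def load_sample() -> ET.Element:
    """Parse the toy archive description."""
    return ET.fromstring(SAMPLE.strip())
```

Calling `find_shots(load_sample(), 1965, "Actor One", "winter")` selects the shots of the 1965 film that satisfy all four criteria at once, which mirrors how the combined descriptors are meant to narrow a search.
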