Most verbal communication uses cues from both the visual and acoustic modalities to convey a message. During speech production, the visible movements of the external articulatory organs can influence how language is understood, as listeners fuse the combined information into meaningful linguistic expressions. Automated systems that integrate speech and image data can emulate this bimodal human interaction. Such systems have a wide range of applications, for example in videophone systems, where the interdependencies between the image and speech signals can be exploited for data compression and for lip synchronization, which remains a major problem. The objective of this work is therefore to investigate and quantify this relationship, so that the knowledge gained can support longer-term multimedia and videophone research.