The structure of document images plays a significant role in document analysis, and thus considerable efforts have
been made towards extracting and understanding document structure, usually in the form of layout analysis
approaches. In this paper, we first employ Distance Transform based MSER (DTMSER) to efficiently extract
stable document structural elements in terms of a dendrogram of key-regions. Then a fast structural matching
method is proposed to query the structure of a document (its dendrogram) using a spatial database, which facilitates
the formulation of advanced spatial queries. The experiments demonstrate a significant improvement in
a document retrieval scenario when compared to the use of typical Bag of Words (BoW) and pyramidal BoW
Indexing and searching for WWW pages rely on analyzing text. Current technology cannot process the text embedded in images on WWW pages. This paper argues that this is a significant problem, as text in image form is usually semantically important (e.g. headers, titles). The results of a recent study are presented to show that the majority (76%) of words embedded in images do not appear elsewhere in the main text and that the majority (56%) of ALT tag descriptions of images are incorrect or do not exist at all. Research under way to devise tools to extract text from images, based on the way humans perceive color differences, is outlined and results are presented.