As video databases become increasingly important for full exploitation of multimedia resources, this paper aims at describing our recent efforts in feasibility studies towards building up a content-based and high-level video retrieval/management system. The study is focused on constructing a semantic tree structure via combination of low-level image processing techniques and high-level interpretation of visual content. Specifically, two separate algorithms were developed to organise input videos in terms of two layers: the shot layer and the key-frame layer. While the shot layer is derived by developing a multi-featured shot cut detection, the key frame layer is extracted automatically by a genetic algorithm. This paves the way for applying pattern recognition techniques to analyse those key frames and thus extract high level information to interpret the visual content or objects. Correspondingly, content-based video retrieval can be conducted in three stages. The first stage is to browse the digital video via the semantic tree at structural level, the second stage is match the key frame in terms of low-level features, such as colour, shape of objects, and texture etc. Finally, the third stage is to match the high-level information, such as conversation with indoor background, moving vehicles along a seaside road etc. Extensive experiments are reported in this paper for shot cut detection and key frame extraction, enabling the tree structure to be constructed.