As multimedia content grows in size, the chief concern in multimedia computing is shifting from bandwidth to management. MPEG-7, formally named "Multimedia Content Description Interface", is a standard for describing multimedia content that aims to address the problems of management and querying. Many application scenarios have been discussed in the literature, especially surveillance, which is often considered one of the most suitable applications of MPEG-7. In this paper, we propose a video surveillance and retrieval system that employs MPEG-7 as its basic storage format. This clear, standard, human-readable format ensures that our system can interoperate with other MPEG-7-based systems. Taking advantage of this format, we separate videos completely from the metadata that describe them and store the two in different files. This separation significantly reduces query time, since the system only needs to process the metadata when they contain enough information for the search. We employ a kernel-based tracking algorithm, fast enough for real-time processing, to extract semantic information from videos. We also propose a flexible, extensible architecture that makes it possible to improve the employed algorithm with few changes and to add new visual descriptors without recompiling the program. The system has been tested on several real-world surveillance videos. The experimental results show that it is robust and suitable for many scenarios.