In this paper, we propose a multi-metric protocol for evaluating the performance of user-assisted video object extraction systems. Evaluation metrics are an essential element of any performance evaluation methodology, yet recent work on video object segmentation and extraction mostly relies on a single objective metric to judge the overall performance of an algorithm. Motivated by a recent framework that evaluates image segmentation performance using the Pareto front, our protocol includes metrics for contour-based spatial matching, temporal consistency, user workload, and time consumption. Taking the characteristics of user-assisted video object extraction systems into consideration, we formulate the metrics so that they are simple yet close to the assessment made by the human visual system. For spatial matching, we define three types of error (sharp error, smooth error, and mass error), which together score an extraction result precisely. Temporal consistency is introduced to evaluate the stability of a system over time. In addition, because the system is user-assisted, the workload of the user is also included as a metric. Combining the metrics into a single 4-D fitness space, we adopt the Pareto front to find the best choice of system together with its optimal parameters. Tests of our evaluation method show that the multi-metric protocol is effective.
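The Pareto-front selection over a 4-D fitness space can be illustrated with a minimal sketch. It assumes that each of the four metrics is expressed as a cost to be minimized, which the abstract does not state explicitly. The function names `dominates` and `pareto_front` and the sample values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every metric and
    strictly better on at least one (all metrics are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Return the indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Illustrative values only (assumed, not measured):
# (spatial error, temporal inconsistency, user workload, time consumption)
configs = [
    (0.10, 0.20, 5.0, 30.0),
    (0.15, 0.25, 6.0, 35.0),  # worse than the first entry on every metric
    (0.08, 0.30, 8.0, 40.0),
    (0.20, 0.10, 3.0, 20.0),
]

front = pareto_front(configs)
print(front)  # -> [0, 2, 3]
```

Each entry stands for one system or parameter setting. The points left on the front are those for which no alternative is at least as good on all four metrics and better on one, so the choice among them remains a trade-off rather than a single ranking.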