Multiple object tracking (MOT) is a common computer vision problem that focuses on detecting objects and maintaining their identities across a sequence of image frames. To date, there have been three main approaches to improving MOT performance: 1) improving the detector’s quality, 2) improving the tracker’s quality, or 3) creating novel approaches that jointly model detection and tracking. In this work, we argue that there is a fourth, simpler way to improve MOT performance: fusing the outputs of multiple multiple-object trackers. We introduce a novel approach, TrackFuse, that fuses the final tracks from two different models into a single output, analogous to classification ensembling or weighted box fusion for object detection. The fundamental assumption of TrackFuse is that different trackers, and likewise different detectors, fail in different ways. Thus, by fusing the outputs of multiple MOT approaches, we can improve tracking performance. We test our approach on combinations of several high-performing tracking approaches and show state-of-the-art MOTA results on a held-out validation set of the MOT17 dataset, compared to the individual tracking models. Furthermore, we consistently show that fusing multiple object trackers improves performance on multiple metrics over each of the individual model outputs used as fusion inputs. Our code will be released soon.