Monocular 6D pose estimation is a fundamental task in computer vision, and this paper focuses on class-level 6D pose estimation, which predicts the pose of previously unseen objects. In prior work based on RGB-D images, deep feature extraction has paid little attention to distinguishing the different structural characteristics of the two modalities or to keeping their contributions consistent for pose estimation, so heterogeneous features are typically fused by direct concatenation. In particular, the experimental results of current 6D pose estimation methods show that they remain limited when trained on multiple object categories. The sparse structure of point clouds also makes it easy for most methods to overlook effective features. Moreover, existing studies do not sufficiently exploit the complementarity between the informative parts of heterogeneous features, so fusion lacks an optimal combination that reflects each modality's contribution, introducing a large amount of redundant information and wasting computational resources. To address this, we design a more effective dynamic-graph feature extraction method for point clouds. To meet the inherent requirement of complementary fusion, we further design an adaptive method for extracting features from the heterogeneous data, together with a self-attention fusion method with unequal contributions that extracts the core information from complementary latent features quickly, accurately, and efficiently. We conduct experiments on popular benchmark datasets such as NOCS-REAL [1]. The results show that the proposed method performs multi-class 6D pose estimation end-to-end, achieves good performance on these datasets, and reaches a near real-time inference speed of almost 20 FPS.
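The abstract names two ingredients only at a high level: dynamic-graph feature extraction on the point cloud and an attention-based fusion in which the RGB and geometric features contribute unequally. The sketch below is a minimal, hedged illustration of what such components can look like in PyTorch; the module names, tensor shapes, and neighbourhood size are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the two ideas named in the abstract:
# (1) EdgeConv-style dynamic-graph feature extraction on the point cloud, and
# (2) attention-weighted fusion of RGB and geometric features, where the two
#     modalities contribute unequally via learned per-point attention scores.
import torch
import torch.nn as nn


def knn_graph_features(points, feats, k=16):
    """Build dynamic k-NN edge features [x_j - x_i, x_i] (EdgeConv input).

    points: (B, N, 3) coordinates defining the neighbourhood,
    feats:  (B, N, C) per-point features to be aggregated.
    """
    dists = torch.cdist(points, points)                   # (B, N, N)
    idx = dists.topk(k, largest=False).indices            # (B, N, k)
    nbrs = torch.gather(
        feats.unsqueeze(1).expand(-1, feats.size(1), -1, -1),
        2, idx.unsqueeze(-1).expand(-1, -1, -1, feats.size(-1)))  # (B, N, k, C)
    center = feats.unsqueeze(2).expand_as(nbrs)
    return torch.cat([nbrs - center, center], dim=-1)     # (B, N, k, 2C)


class EdgeConvBlock(nn.Module):
    """One dynamic-graph convolution: per-edge MLP followed by max pooling."""
    def __init__(self, c_in, c_out, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * c_in, c_out), nn.ReLU())

    def forward(self, points, feats):
        edge = knn_graph_features(points, feats, self.k)   # (B, N, k, 2C)
        return self.mlp(edge).max(dim=2).values            # (B, N, c_out)


class AttentionFusion(nn.Module):
    """Fuse RGB and geometric features with learned, unequal contributions."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)     # per-point, per-modality score
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb_feat, geo_feat):
        # rgb_feat, geo_feat: (B, N, dim); stack along a "modality" axis.
        stacked = torch.stack([rgb_feat, geo_feat], dim=2)   # (B, N, 2, dim)
        weights = torch.softmax(self.score(stacked), dim=2)  # (B, N, 2, 1)
        fused = (weights * stacked).sum(dim=2)               # (B, N, dim)
        return self.proj(fused)


if __name__ == "__main__":
    B, N, C = 2, 1024, 64
    pts = torch.randn(B, N, 3)
    geo = EdgeConvBlock(3, C)(pts, pts)   # geometric features from raw xyz
    rgb = torch.randn(B, N, C)            # placeholder per-point image features
    fused = AttentionFusion(C)(rgb, geo)
    print(fused.shape)                    # torch.Size([2, 1024, 64])
```

In this kind of design, the softmax over the modality axis makes the fusion weights data-dependent, so the network can emphasise geometry or appearance per point rather than concatenating both streams with fixed, equal weight.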