Self-supervised learning for interventional image analytics: toward robust device trackers

Abstract. Purpose: The accurate detection and tracking of devices, such as guiding catheters in live X-ray image acquisitions, are essential prerequisites for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, tracking must be highly robust, with no failures. To achieve this, one needs to efficiently tackle challenges such as device obscuration by the contrast agent or other external devices or wires, changes in the field-of-view or acquisition angle, and continuous movement due to cardiac and respiratory motion. Approach: To overcome the aforementioned challenges, we propose an approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream in a lightweight model. Results: Our approach achieves state-of-the-art performance, in particular for robustness, compared to highly optimized reference solutions (that use multi-stage feature fusion or multi-task and flow regularization). The experiments show that our method achieves a 66.31% reduction in the maximum tracking error against the reference solutions (23.20% when flow regularization is used), achieving a success score of 97.95% at a 3× faster inference speed of 42 frames per second (on GPU). In addition, we achieve a 20% reduction in the standard deviation of errors, which indicates a much more stable tracking performance. Conclusions: The proposed data-driven approach achieves superior performance, particularly in robustness and speed, compared with the frequently used multi-modular approaches for device tracking. The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.


Introduction
Tracking the tip of the catheter as a visual guidance facilitates navigation to the desired anatomy. Furthermore, the tip of the catheter serves as an anchor point separating the catheter from the vessel structures. The anchor point can provide a basis for mapping angiography (high-dose X-ray with injected contrast agent) to fluoroscopy (low-dose X-ray), thereby reducing the usage of contrast for visualizing vessels.[1,4]

Fig 1: Tracking error (↓) versus average speed (↑) for catheter tip tracking in coronary X-ray sequences acquired during procedures such as invasive coronary angiography (ICA) or percutaneous coronary intervention (PCI): (a) showing average tracking error; and (b) showing maximum tracking error. Note that the average tracking error has 2 different scales indicated with a horizontal break-point for better visualization. Runtime is measured on a Tesla V100 GPU.
However, tracking the tip of the catheter in X-ray images can be challenging in the presence of various occlusions due to contrast agent and other devices, in addition to the cardiac and breathing motion of the patient. Recently, self-supervised learning methods have been developed with the aim of learning general features from unlabeled data to boost the performance in various natural sequence imaging tasks. Most self-supervised pretraining methods learn such features by identifying and removing inherent redundancies from sequence image data. VideoMAE[8] conducts temporal downsampling on the pixel level followed by symmetrical masking over all the sampled frames with a high masking ratio of 90%. This deliberate design choice prevents the network from learning fine inter-frame correspondences. SiamMAE[9] improves upon this baseline by using highly asymmetric masking. However, the proposed asymmetric masking requires feeding in the first frame entirely with 0% masking, which increases the compute complexity quadratically and prevents the network from learning spatio-temporal features over a longer period of time.
The space-time semantics in interventional cardiac image sequences differ from natural videos in terms of both redundancies and motion. For example, visibility may vary largely based on X-ray dosage, along with varying motion based on the acquisition frame-rate and the patient's breathing and cardiac motion. In angiography sequences, vessels have high structural similarity with devices such as catheters and guidewires and can gradually appear or disappear over time.
To address these challenges, in this work we bring the following contributions in terms of both self-supervised pretraining and downstream device tracking:

1. We pretrain a spatio-temporal encoder on a large database of interventional cardiac X-ray sequences from over 20,000 patients (over 16,000,000 frames) for robust device tracking.
2. We propose a novel frame interpolation masked auto-encoder (FIMAE) to learn generalized spatio-temporal features from this dataset. The pretrained spatio-temporal features play an essential role in feature extraction and feature matching for tracking. Our pretrained features efficiently capture the underlying temporal motion needed for tracking, which is typically accomplished through highly optimized supplementary modules in other device tracking models.[10,11]
3. To the best of our knowledge, this is the first approach which leverages spatio-temporal pretrained features to replace a commonly used Siamese-like architecture for single object tracking.

4. A lightweight Vision Transformer (ViT)[12] based model is designed that leverages the learned features, replacing the traditional two-stage tracking pipeline of feature extraction and feature fusion with a single spatio-temporal encoder for highly accurate and robust real-time device tracking at an inference speed of 42 fps on a single Tesla V100 GPU (refer to Figs. 1 and 2).
5. We conduct comprehensive numerical experiments and demonstrate that our method outperforms other state-of-the-art tracking methods in robustness, accuracy and speed.
6. We conduct a comprehensive analysis of our model's robustness in handling long temporal sequences and demonstrate its ability to maintain consistent performance across diverse scenarios, including angiography, fluoroscopy, and sequences featuring additional obstructions caused by other devices.

Related Work
Some methods[8,33] use symmetrical masking on temporally downsampled video frames to reduce space-time redundancies over a long time period. In contrast, others[9] use asymmetrical masking to learn inter-frame correspondence between frame pairs. However, we propose a method for both reducing space-time redundancies over a long time period and learning fine inter-frame correspondence.
Historical-trajectory-based natural image tracking: These approaches leverage prompt-based methods to integrate relevant information.[45] In particular, the temporal information is passed into the network as prompts to incorporate the historical trajectory information. ARTrack[46] employs a decoder that receives these encodings as well as coordinates of the searched object from previous frames as spatio-temporal prompts for a trajectory proposal. Another approach, SwinTrack,[47] uses a multi-head cross-attention decoder that leverages both the encoder output and a motion token, which represents the past object trajectory given previous bounding box predictions.
Device tracking in X-ray: Specifically for the tracking of devices in X-ray images, multiple approaches have been proposed, including multiple Siamese-based architectures similar to those in natural image object tracking.[34,48] Other methods, such as Cycle Ynet,[10] employ a semi-supervised approach to address the lack of annotated frames in the medical domain, or leverage deep learning-based Bayesian filtering for catheter tip tracking.[1] One of the most recent approaches, ConTrack,[11] uses a Siamese architecture and a transformer-based feature fusion model. To further refine the tracking, it incorporates a RAFT[49] model applied to catheter body masks for estimating optical flow.

Methods
We propose a novel Frame Interpolation Masked Autoencoder (FIMAE) approach to train a transformer model to extract spatio-temporal features from a large internal dataset $D_u$. The model is designed specifically to learn inter-frame correspondences over a large number of frames. The pretrained encoder is then used as the backbone for the downstream tracking task using supervised learning on a dataset $D_l$ (with expert annotations). The pretraining method and the tracking pipeline are explained in the following subsections.

Self-supervised Model Training
Learning space-time embeddings. Given the unlabeled dataset $D_u$, $n$ frames are sampled from an arbitrary sequence $S_k \in D_u$, $\forall k > 0$, where $S_{k,n} = [I_0, I_1, \ldots, I_{n-1}]$.

Masking strategy based on frame interpolation. In order to learn features that capture fine spatial information and fine temporal correspondences between frames, we propose a novel masking strategy based on frame interpolation that overcomes the limitation of the symmetrical tube masking proposed by VideoMAE.[8] Recall that the VideoMAE approach is limited in capturing fine inter-frame correspondences. Traditionally, in the domain of natural imaging, the frame interpolation task[50,51] is defined as a sum of forward warping and backward warping of any two neighboring frames (indexed by $t > 0$):

$$\hat{I}_t = \tau_{\theta_1}(I_{t-1}) + \tau_{\theta_2}(I_{t+1}), \qquad (1)$$

where $\tau_{\theta_1}$ denotes the forward warping operator and $\tau_{\theta_2}$ denotes the backward warping operator (parametrized by $\theta_1$, $\theta_2$). However, the change of appearance in coronary vessel structures in the presence of contrast can be much more complex than in natural images. Hence, a linear combination of forward and backward warping can limit the potential of the network. In our case, we reformulate this as a learning problem, seeking to optimize the parameters $\theta$ of a deep neural network to learn a combined warping operation $\mathcal{F}$ as:

$$\hat{I}_t = \mathcal{F}_{\theta}(I_{t-1}, I_{t+1}). \qquad (2)$$

In our approach, we use tube masking for every alternate frame with a ratio of 75% and combine it with frame masking. However, with such a high tube masking ratio, further masking an entire intermediate frame for frame interpolation can make the task extremely challenging. In addition, masking an entire frame may also lead the network to never attend to certain patch positions during training. Hence, we instead mask the intermediate frame randomly to a high ratio of 98%. See Fig. 4 for a schematic visualization. Let $p_t \in \Omega_{tube}$ be the token indices of the tube-masked tokens for frame $t$, where $\Omega_{tube}$ denotes the set of all tube-masked token indices. Similarly, $q_t \in \Omega_{frame}$ refers to the frame-masked token indices for frame $t$ among all randomly frame-masked token indices. Mathematically, if $\rho$ is the masking probability, then $\Omega_{tube} \sim \mathrm{Bernoulli}(\rho_{tube})$, where different times $t$ share the same draw. On the other hand, $\Omega_{frame} \sim \mathrm{Bernoulli}(\rho_{frame})$ and is drawn uniquely for each frame $t$. Let $p'_t \in \Omega'_{tube}$ and $q'_t \in \Omega'_{frame}$ be the sets of remaining visible token indices. Combining the tube and frame masking strategies, we obtain the following reconstruction objective for any 3 given frames:

$$\left[\hat{I}_{t-1}(p_{t-1}),\ \hat{I}_{t}(q_{t}),\ \hat{I}_{t+1}(p_{t+1})\right] = \mathcal{F}_{\theta}\!\left(I_{t-1}(p'_{t-1}),\ I_{t}(q'_{t}),\ I_{t+1}(p'_{t+1})\right), \qquad (3)$$

where $0 < t < n-1$ denotes the index of an arbitrary frame from the sampled sequence and $I_t(p'_t)$ denotes the visible patches of frame $I_t$ with tube/frame masking applied. The 3-frame objective shown in Eq. 3 can be generalized to all $n$ frames.
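To make the combined masking concrete, the following is a minimal PyTorch sketch of how such masks could be drawn; it is a sketch under our stated assumptions, not the authors' implementation. It assumes 0-indexed frames in which even frames carry the shared 75% tube mask and odd (intermediate) frames each receive an independent 98% random mask, with 1,024 tokens per 512 × 512 frame at a 16 × 16 patch size; `make_fimae_masks` is a hypothetical helper name.

```python
import torch

def make_fimae_masks(n_frames: int = 10, n_tokens: int = 1024,
                     rho_tube: float = 0.75, rho_frame: float = 0.98) -> torch.Tensor:
    """Return a boolean mask of shape (n_frames, n_tokens); True = masked."""
    mask = torch.zeros(n_frames, n_tokens, dtype=torch.bool)
    # Tube mask (Omega_tube): one random draw of token indices, shared by all
    # even-indexed frames so the same spatial positions are hidden across time.
    tube_idx = torch.randperm(n_tokens)[: int(rho_tube * n_tokens)]
    mask[0::2, tube_idx] = True
    # Frame mask (Omega_frame): an independent 98% draw per intermediate frame,
    # so every patch position still gets observed at some point in training.
    for t in range(1, n_frames, 2):
        frame_idx = torch.randperm(n_tokens)[: int(rho_frame * n_tokens)]
        mask[t, frame_idx] = True
    return mask

masks = make_fimae_masks()   # (10, 1024) boolean
visible = ~masks             # the p'_t / q'_t token sets fed to the encoder
```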

Encoder-Decoder Training
The unmasked patches are passed through a ViT encoder which adopts joint space-time attention. That is, each token of frame $t$ is projected and flattened into $D_m$-dimensional query, key, and value embeddings $(q_t, k_t, v_t)$. The joint space-time attention is based on the concatenated vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{D_m}}\right)V, \qquad (4)$$

where the variables $(Q, K, V)$ are defined as $Q = \mathrm{Concat}(q_0, \ldots, q_{n-1})$, $K = \mathrm{Concat}(k_0, \ldots, k_{n-1})$, and $V = \mathrm{Concat}(v_0, \ldots, v_{n-1})$ for $n$ sampled consecutive frames. The encoded visible patches are then concatenated with learnable masked tokens. A lightweight transformer decoder attends to the encoded patches and the masked tokens to reconstruct the initially masked patches. The decoder incorporates additional positional encoding to ensure the correct positions of the masked and unmasked patches as per the original frames.
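As an illustration of Eq. 4, the sketch below applies PyTorch's built-in multi-head attention jointly over the concatenated tokens of all frames. It omits the masking logic and positional encodings, and the dimension and head count are illustrative assumptions (ViT-Base-like), not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class JointSpaceTimeAttention(nn.Module):
    """Self-attention over the tokens of all frames at once, so every visible
    patch can attend across both space and time."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # nn.MultiheadAttention performs the Q/K/V projections internally.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_frames * n_visible_tokens, dim), i.e., the
        # per-frame token sequences concatenated along the token axis.
        out, _ = self.attn(tokens, tokens, tokens)
        return out

x = torch.randn(2, 4 * 256, 768)     # toy batch: 4 frames, 256 visible tokens
y = JointSpaceTimeAttention()(x)     # same shape as x
```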
Pretraining Loss Function. We use a weighted mean squared error (MSE) loss, $L = L_{tube} + \gamma L_{frame}$, between the masked tokens and the reconstructed ones in the pixel space, based on the masking strategy, where $\gamma$ is the weighting factor:

$$L_{tube} = \frac{1}{|\Omega_{tube}|} \sum_{\eta} \sum_{p \in \Omega_{tube}} \left\| I_{2\eta}(p) - \hat{I}_{2\eta}(p) \right\|^2, \qquad L_{frame} = \frac{1}{|\Omega_{frame}|} \sum_{\eta} \sum_{q \in \Omega_{frame}} \left\| I_{2\eta+1}(q) - \hat{I}_{2\eta+1}(q) \right\|^2,$$

where $I$ is the input image, $\hat{I}$ is the reconstructed image, and $0 \le \eta \le (n-2)/2$. We use a weighted loss for reconstruction to compensate for the imbalance between low-masked frames (fewer reconstruction tokens) and highly masked frames (more reconstruction tokens). The variable $\gamma$ is defined as the ratio of the number of $\Omega_{tube}$ tokens to the number of $\Omega_{frame}$ tokens.
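A compact sketch of this weighted objective follows, assuming pixel-space patch targets and the token-count definition of $\gamma$ given above; `fimae_loss` is an illustrative helper, not the reference code.

```python
import torch

def fimae_loss(pred: torch.Tensor, target: torch.Tensor,
               tube_mask: torch.Tensor, frame_mask: torch.Tensor) -> torch.Tensor:
    """Weighted MSE over masked tokens only: L = L_tube + gamma * L_frame.

    pred, target: (n_frames, n_tokens, patch_dim) patches in pixel space.
    tube_mask, frame_mask: boolean (n_frames, n_tokens) selecting the
    reconstructed (masked) tokens of each type.
    """
    l_tube = ((pred[tube_mask] - target[tube_mask]) ** 2).mean()
    l_frame = ((pred[frame_mask] - target[frame_mask]) ** 2).mean()
    # gamma = (#tube-masked tokens) / (#frame-masked tokens), compensating the
    # imbalance between lightly and heavily masked frames.
    gamma = tube_mask.sum() / frame_mask.sum().clamp(min=1)
    return l_tube + gamma * l_frame
```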

Downstream Application: Device Tracking
Particularly for tracking the tip of the catheter, our goal is to track its location $\hat{y}_t = (u_t, v_t)$ at any time $t$, $t > 0$, given a sequence of X-ray images $\{I_t\}_{t=1}^{n}$ with a known initial location of the catheter tip $y_1 = (u_1, v_1)$ on the labeled dataset $D_l$. We consider the sequences $S_k \in D_l$, $\forall k > 0$, to have only a few annotated labels, $S_{k,n} = [(I_1, y_1), (I_2), \ldots, (I_7, y_7), (I_8), \ldots]$. To identify the location of the tip of the catheter in the current search frame, existing approaches build a correlation with a template frame. The template frame is usually a small crop around the catheter tip location from a previously predicted frame. Similar to ConTrack, during training we use three template frames that are cropped from the first annotated frame and the previous two annotated frames, respectively. We use the current frame for template frames if no previously annotated frames are available. During inference, the initial location of the catheter tip serves as the first template crop and is kept intact. The other two template frames are updated dynamically based on the model's predictions.
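The template bookkeeping described above reduces to a few lines; this is a simplified sketch, with `crop_around` standing in for the actual cropping routine (a name we introduce for illustration).

```python
def init_templates(first_frame, y1, crop_around):
    """All three templates start as the crop around the annotated initial tip."""
    t0 = crop_around(first_frame, y1)
    return [t0, t0, t0]

def update_templates(templates, new_crop):
    """templates[0] (the initial-location crop) is kept intact for the whole
    sequence; the two dynamic templates slide forward with each prediction."""
    return [templates[0], templates[2], new_crop]
```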

Feature transfer
The spatio-temporal transformer backbone takes three template frames and a search frame as four distinct inputs. We interpolate the positional encoding from the pretraining frame positions appropriately to ensure that the network distinguishes each template and search frame as a distinct frame. In particular, each template frame and the search frame correspond to the positions of the center crops of individual frames in the pretraining setup. Therefore, the encoder input is $\mathrm{Concat}(te_1, te_2, te_3, se)$, where $te_{1,2,3}$ and $se$ are template patches and search patches, respectively. Given that transformers are isotropic models, we obtain an encoded feature set $f_c = \mathrm{Concat}(f_{te_1}, f_{te_2}, f_{te_3}, f_{se})$. The spatio-temporal transformer backbone is trained to extract fine inter-frame correspondences; hence, this results in a joint feature extraction and feature matching between the template frames and the search frame. The overview of the proposed model is depicted in Fig. 3.
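One plausible reading of this frame-aware positional encoding is sketched below: each downstream input reuses the encodings of the central patch block of "its" pretraining frame, so template and search tokens keep distinct temporal identities. The grid sizes follow the crop sizes reported later (512/16 = 32 pretraining grid, 64/16 = 4 for templates, 160/16 = 10 for the search frame); `centre_pe` is a hypothetical helper and the extraction-instead-of-interpolation shortcut is our simplification.

```python
import torch

def centre_pe(pretrain_pe: torch.Tensor, frame_idx: int,
              grid: int = 32, crop: int = 4) -> torch.Tensor:
    """Positional encodings of the central crop x crop patch block of one
    pretraining frame; pretrain_pe has shape (n_frames * grid * grid, dim)."""
    dim = pretrain_pe.shape[-1]
    pe = pretrain_pe.view(-1, grid, grid, dim)[frame_idx]  # (grid, grid, dim)
    s = (grid - crop) // 2
    return pe[s:s + crop, s:s + crop].reshape(crop * crop, dim)

pe = torch.randn(4 * 32 * 32, 768)                      # toy pretraining table
pos = torch.cat([centre_pe(pe, 0), centre_pe(pe, 1),    # templates te1, te2
                 centre_pe(pe, 2),                      # template te3
                 centre_pe(pe, 3, crop=10)], dim=0)     # search frame
```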

Multi-task Transformer Decoder
We use a lightweight Transformer decoder similar to the original Transformer model.[52] First, all the features $f_c$ are projected to a lower dimension $d_m$. The decoder uses two learnable query tokens $(h_d, m_d)$, one each for a heatmap head and a mask head.
Then, each layer first computes self-attention on the query tokens as per Eq. 4, followed by cross-attention with the encoded features $f_c$, where key and value embeddings are computed by projecting the features $f_c$ to dimension $d_m$. The resulting query tokens are then correlated with the search features, unflattened, and passed through a CNN head:

$$P_h = \mathrm{CNN}_h\!\left(\mathrm{unflatten}(h_d \cdot f_{se}^{\top})\right), \qquad P_m = \mathrm{CNN}_m\!\left(\mathrm{unflatten}(m_d \cdot f_{se}^{\top})\right),$$

where $P_h$ and $P_m$ refer to the predicted heatmap of the catheter tip and the predicted mask of the catheter, respectively. The final tip coordinates are obtained as $\hat{y} = \arg\max(P_h)$. We compute a soft Dice loss, $L_{dice} = L_h + \lambda L_m$, for both heatmap and mask predictions, given by:

$$L_{h/m} = 1 - \frac{2 \sum_{i} P_{h/m,i}\, G_{h/m,i}}{\sum_{i} P_{h/m,i} + \sum_{i} G_{h/m,i}},$$

where $G$ represents the ground truth labels and $\lambda$ is the weight of the mask loss.
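For reference, a minimal soft Dice implementation matching the standard formulation is given below; the smoothing constant is added for numerical stability, and the paper's exact variant may differ.

```python
import torch

def soft_dice_loss(pred: torch.Tensor, target: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    """1 - Dice overlap between a predicted map and its ground truth."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Combined multi-task objective, as in L_dice = L_h + lambda * L_m:
# loss = soft_dice_loss(P_h, G_h) + lam * soft_dice_loss(P_m, G_m)
```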

Dataset
An unlabeled internal dataset $D_u$ of coronary X-ray sequences is utilized to pretrain our model. $D_u$ consists of 241,362 sequences collected from 21,589 patients, comprising 16,342,992 frames in total. It contains both fluoroscopy ("Fluoro") and angiography ("Angio") sequences. We randomly sample 10 frames at a time, with varying temporal gaps between them, ranging from 1 to 4 frames. We repeat the last frame in sequences where the number of frames is less than 10. The model is then pretrained for 200 epochs with a learning rate of 1e-4.

For the downstream tracking task, we use dataset $D_l$. Note that $D_l \cap D_u = \emptyset$. The distribution of field of view for both $D_u$ and $D_l$ is depicted in Fig. 5 and is estimated based on the Positioner angles. The Positioner Primary Angle is defined in the transaxial plane at the imaging device's isocenter, with zero degrees in the direction perpendicular to the patient's chest, +90 degrees at the patient's left-hand side, and -90 degrees at the patient's right-hand side. The Positioner Secondary Angle is defined in the sagittal plane at the imaging device's isocenter, with zero degrees in the direction perpendicular to the patient's chest. Fig. 5 shows that the distributions of the sequences in both datasets are concentrated around similar Positioner angles. Other attributes of both datasets $D_l$ and $D_u$ are listed in Table 1.

The annotations on the frames in $D_l$ represent the coordinates of the tip of the catheter, which are converted to Gaussian heatmaps with a standard deviation of ≈ 5 mm. Mask annotations of the catheter body are also available for a subset of these annotated frames. On average, the catheter body takes up 0.009% of the total area of a frame. The training and validation set consists of 2,314 sequences totaling 198,993 frames, out of which 44,957 have annotations. In this set, 2,098 sequences are Angio and only 216 sequences are Fluoro. The test set consists of 219 sequences, where all 17,988 frames are annotated. For evaluation, we split the test set into three categories: 94 Fluoro sequences (8,494 frames, 82 patients), 101 Angio sequences (6,904 frames, 81 patients), and 24 Devices sequences (2,593 frames, 10 patients).[11] The latter category, "Devices", covers all sequences where sternal wires are present, which cause occlusion, thus further increasing the difficulty of catheter tip tracking. Examples of these cases are illustrated in Fig. 6.

The SNR of the image intensity at the catheter tip with respect to the background is shown in Table 2, further quantifying the challenge of tracking. The SNR was calculated based on the following formula:

$$SNR = 20 \log_{10} \frac{P_w}{\sigma_f}, \qquad (11)$$

where $P_w$ is the mean intensity in a window of 6 × 6 pixels (≈ 2 mm × 2 mm) and $\sigma_f$ denotes the standard deviation of the intensity of the background in a window of 30 × 30 pixels (≈ 10 mm × 10 mm), with the catheter tip as the centre of both windows.

We follow the same image pre-processing pipeline as ConTrack, i.e., we resample and pad to a size of 512 × 512 with 0.308 mm isotropic pixel spacing. We use 160 × 160 crops for the search image and 64 × 64 crops for the template images. We train our model for 100 epochs, with a learning rate of 2e-4, using the AdamW optimizer and a Cosine Annealing scheduler with warm restarts. We evaluate our work against state-of-the-art methods, explore the impact of the proposed pretraining strategy, and investigate whether complex additional tracking refinement modules are necessary. All evaluations are performed based on expert annotations.
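In code, the SNR of Eq. 11 can be computed as below. We assume the tip lies far enough from the image border for both windows; whether the inner 6 × 6 region is excluded from the background statistics is not specified, so the sketch takes $\sigma_f$ over the full 30 × 30 window.

```python
import numpy as np

def tip_snr(img: np.ndarray, tip_uv: tuple) -> float:
    """SNR = 20 * log10(P_w / sigma_f), Eq. 11: P_w is the mean intensity in
    a 6x6 (~2mm x 2mm) window and sigma_f the intensity standard deviation in
    a 30x30 (~10mm x 10mm) window, both centred on the catheter tip (u, v)."""
    u, v = tip_uv
    p_w = img[v - 3:v + 3, u - 3:u + 3].mean()
    sigma_f = img[v - 15:v + 15, u - 15:u + 15].std()
    return 20.0 * np.log10(p_w / sigma_f)
```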

Performance Evaluation
Benchmarking against State-of-the-Art. We report the performance of our model against state-of-the-art device tracking models in Table 3. Here, we evaluate the Euclidean distance error in mm between the prediction and the ground truth annotations. Overall, our method demonstrates the best performance on the test dataset, excelling in both precision and robustness. Our approach significantly reduces the overall maximum error, e.g., by 66.31% against the comparable version of ConTrack (ConTrack-mtmt) and by 23.20% against ConTrack-optim, a highly optimized solution leveraging multi-stage feature fusion, multi-task learning, and flow regularization. In comparison to previous state-of-the-art approaches, our approach results in fewer failures, as depicted by the error distribution in Fig. 7. At least 95% of all test cases have an error below the average diameter of the vessels (≈ 4 mm). Notably, our approach stands out from other tracking models by eliminating the need for a two-stage process involving the extraction of spatial features and subsequent matching using feature fusion. Instead, our spatio-temporal encoder jointly performs both. Other approaches often require two or more forward passes for the two-stage processing to accommodate varying template-search sizes, which increases computational complexity. This is further amplified by the inclusion of additional modules, such as multi-task decoders and the flow-refinement network in ConTrack-optim.[11] In contrast, our model accomplishes the task with a single forward pass for both the multiple templates and the search frame. The only additional modules in our model are the two CNN heads for multi-task decoding. This design choice enables us to achieve a significantly higher real-time inference speed of 42 fps on a single Tesla V100 GPU without compromising accuracy, as shown in Fig. 1. Although Cycle Ynet[10] also relies on multiple forward passes for feature extraction, its simplicity and computationally friendly CNN architecture allow it to reach higher speed, albeit at the expense of accuracy and robustness.
Impact of Pretraining. Next, we focus on the impact of pretraining by comparing tracking performance using our proposed pretraining strategy (FIMAE) against currently prevalent pretraining methods for sequential image processing; see Table 4. The findings indicate that pretraining on domain-specific data, as opposed to natural images (VideoMAE-Kinetics), offers significant benefits. Even when including the models trained on $D_u$ (VideoMAE and SiamMAE) in the comparison, our model surpasses all of them by more than 30% across all reported metrics. VideoMAE lacks fine temporal correspondence between frames, leading to inefficient feature matching between template and search frames. While SiamMAE has the ability to learn inter-frame correspondence, it relies on only two frames at a time, which is insufficient to fully capture the underlying motion. Qualitative results are shown in Fig. 8, based on a challenging angiography sequence with contrast-based device obstruction and other visible sternal wires. The figure shows how our model is able to handle this challenging case by not losing track of the tip of the catheter, where the other models fail to differentiate the catheter from the sternal wires.

Performance without Complexity
The strength of our approach comes from the pretrained spatio-temporal features that facilitate effective feature matching between the template frames and the search frame. Another key advantage is its prior understanding of the inherent cardiac/respiratory motion. This knowledge significantly reduces or even eliminates the impact of additional modules such as flow refinement. Our approach thereby achieves high robustness in tracking, with minimal variation across different additional-module configurations, such as multi-task decoding. To illustrate this, Fig. 9(a) highlights the relative stability of the maximum error across different versions of our model compared to the high volatility observed in ConTrack under different module configurations. In addition, ConTrack reaches its best performance only when utilizing all modules, in particular flow refinement, which in turn leads to increased inference time. Contrary to ConTrack, adding the flow refinement module to our model even reduced its performance marginally in terms of accuracy (1.54 mm) and robustness (max error of 11.38 mm). We postulate that this is attributable to the fact that while flow refinement can indeed learn intricate temporal correspondences between the previous and current frames, it can also propagate noise originating from inaccurately predicted catheter masks.
To further assess the robustness of the tracking systems, we introduce the Tracking Success Score (TSUC), which draws parallels with most tracking benchmarks prevalent in single object tracking in the natural image domain.[53] TSUC is computed as the ratio of the number of instances (frames or sequences) in which the distance error falls below a specific threshold to the total number of instances. To establish a relevant threshold, we set it at twice the average vessel diameter in our test dataset (≈ 8 mm). Fig. 9(b) and Fig. 9(c) summarize the results for sequence-level and frame-level TSUC, respectively. Our approach consistently achieves an impressive 99.08% sequence-level TSUC across all additional modules, with only a small drop to 98.61% in the multi-task configuration. At the frame level, our optimal version (multi-task multi-template) yields a TSUC of 97.95%, compared to 93.53% for ConTrack under the same configuration. ConTrack achieves its best frame-level TSUC of 95.44% using the flow-refinement variant.
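The metric itself reduces to a thresholded success rate; a short sketch with the 8 mm threshold used above (the aggregation rule for sequence-level errors is our assumption, as it is not spelled out):

```python
import numpy as np

def tsuc(errors_mm: np.ndarray, threshold_mm: float = 8.0) -> float:
    """Tracking Success Score: fraction of instances (frames or sequences)
    whose distance error falls below the threshold, here twice the ~4 mm
    average vessel diameter."""
    return float((errors_mm < threshold_mm).mean())

# frame-level:    tsuc(per_frame_errors)
# sequence-level: tsuc(per_sequence_errors)  # one aggregated error/sequence
```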
The robustness of a method is also influenced by its ability to effectively handle long sequences, as the accuracy of current frame predictions depends on previous frame predictions, resulting in gradual accumulation of errors over time. We examine the mean TSUC for sequences exceeding a certain frame count (nframes) in Fig. 10. The plot shows that our method consistently demonstrates stable TSUC values across various sequence lengths, indicating its robust performance. Conversely, different versions of ConTrack exhibit a gradual decline in mean TSUC as the frame count threshold increases, suggesting reduced reliability in predicting outcomes over extended sequences.
Performance breakdown for different cases. We further conduct a detailed comparison with the best-performing state-of-the-art method, ConTrack, for the different image categories defined earlier; see Fig. 11. We also compare our model's performance with ConTrack for the challenging cases, i.e., angiography and devices, via percentile plots in Fig. 12. In the angiography cases, our method shows 15% improved accuracy and a 45% reduction in maximum error. Similarly, for the devices (occlusion) category, we achieve 43% better accuracy and a 60% reduction in maximum error (Fig. 11 and Fig. 12). Our model's performance on Angio and Devices cases is compared qualitatively with ConTrack in Fig. 13. The example cases in the figure show the effectiveness of our approach in the presence of complex occlusions from the vessels and sternal wires. ConTrack achieves a better performance than our method in Fluoro cases, with a slightly better median and lower maximum error. However, for Fluoro, ConTrack achieves a TSUC of 99.01% (inaccurate in 1 sequence) compared to our model's TSUC of 97.69% (inaccurate in 3 sequences). The inaccuracy of our model is seen in sequences where the visibility of the catheter is faint due to low-dose X-rays. We hypothesise that this is due to the Transformer architecture, which uses 16 × 16 non-overlapping patches, making it less effective for faint structures in low-dose X-rays compared to the CNNs in ConTrack, which use overlapping 3 × 3 windows.

Ablations
The following ablation studies investigate the impact of three key components on overall tracking performance.
Positional Encoding. As reported in Table 5, the positional encoding strategy has a notable impact on downstream task performance. The naive positional encoding simply applies 1D sine-cosine positional encoding over all the patches and hence loses the temporal information about the patches, resulting in unsatisfactory results. If learnable positional encoding is used, the temporal positions still need to be learned, leading to sub-optimal performance. Interpolating from the central patch positions of the pretrained frames (frame-aware positional encoding) gives the best results.
Masking Ratio. We further compare the performance of different intermediate frame masking ratios in Table 6. Best results are obtained with an intermediate frame masking ratio of 98%. While results with 95% are largely equivalent, there is a notable reduction in performance when the entire frame is masked, which may be due to the lack of information about patches and their relative positions during pretraining.

Effect of initialization. Recall that the first template crop during both training and inference is obtained from the initial catheter tip location and is not updated. We explore its impact in Table 7. To assess its importance, we conduct two experiments. First, we dynamically update the initial template frame during inference, like the other templates. Second, we introduce random noise (2 to 16 pixels) to the initial tip location instead of updating the template. Our findings highlight the crucial role of initialization in tracking. Updating the initial template frame worsens performance due to greater accumulated prediction errors over time compared to the original setup. Additionally, even small noise levels of 2 pixels can noticeably affect performance, increasing the maximum error by 5 pixels.

Modality Bias
The distribution between Angio and Fluoro varies to some degree in terms of dosage and presence of contrasted vessel structures. We remind the reader that in our training dataset the ratio of Angio to Fluoro sequences was 2098:216 out of a total of 2,314 sequences. Our objective in this study is to develop a model that exhibits strong performance across both modalities. We present the results of training on individual modalities compared to training on combined data in Table 8. Our findings indicate that training solely on one modality results in suboptimal performance on the other modality. Notably, while training on Angio data yields an improvement in Angio performance, training exclusively on Fluoro data fails to enhance performance in Fluoro. We hypothesize that a possible reason for this effect is the 2098:216 imbalance of Angio to Fluoro sequences, with the following effects:

1. 2,098 Angio sequences is a large enough dataset to ensure good Angio performance when training on this data alone;
2. 216 Fluoro sequences is too little to power the training of a large transformer model, leading to inferior results when training/testing on Fluoro only;
3. transitioning from Angio to using all data for training has a negative effect on the Angio test performance; we hypothesize that adding the few Fluoro sequences to training increases the complexity of the training problem, as the distribution of Angio training cases is enhanced with the distribution of Fluoro cases, based on only 216 examples; and
4. transitioning from Fluoro to using all data for training has a positive effect on the Fluoro test performance; we hypothesize this is because the 216 Fluoro sequences are complemented with many more non-contrasted frames from all Angio sequences, substantially increasing the dataset and thereby improving performance.
Furthermore, the challenges posed by device obstruction exhibit nuanced differences between Fluoro and Angio, contributing to reduced performance when the model is trained on a single modality.

Conclusion

In this study, we present the Frame Interpolation Masked Autoencoder (FIMAE), a Masked Image Modeling (MIM) approach introduced for the purpose of acquiring generalized features from a large unlabeled dataset containing more than 16 million interventional X-ray frames, with the objective of device tracking. FIMAE overcomes the limitation of tube masking as proposed in VideoMAE and applies frame interpolation-based masking to capture fine inter-frame correspondences. The acquired features are subsequently applied to the task of device tracking within fluoroscopy and angiography image sequences. Our pretrained FIMAE encoder surpasses all prevalent MIM-based pretraining methods for sequential image processing. The spatio-temporal features acquired during the pretraining phase significantly influence the extraction and matching of features for the purpose of device tracking. We demonstrate that an efficient spatio-temporal encoder can replace the frequently utilized Siamese-like architecture, yielding a computationally lightweight model that maintains a high degree of precision and robustness in the tracking task. By adopting our methodology, we achieve a noteworthy 23.2% reduction in maximum tracking error, even without the incorporation of supplementary modules such as flow refinement, when compared to the state-of-the-art multi-modular optimized approach. This performance enhancement is accompanied by a frame-level TSUC score of 97.95% at a 3× faster inference speed than the state-of-the-art method. The results also show that our approach achieves superior tracking performance, particularly in challenging cases where occlusions and distractors are present.
Limitations and Future Work. Our investigation is primarily centered on leveraging pretrained features for the tracking of devices within X-ray sequences. Nevertheless, we contend that the pretrained model can be further extended to other tasks within interventional image analytics, such as stenosis detection, guidewire localization, and vessel segmentation. Furthermore, the absence of annotated frames within our sequential imaging dataset imposes a constraint on the utilization of historical trajectory information, a commonly exploited approach in recent single object tracking methodologies in the natural imaging domain. Thus, a more comprehensive investigation is needed to effectively make use of this information in our specific context.

Disclaimer
The concepts and information presented in this paper are based on research results that are not commercially available.

Disclosures
There are no conflicts of interest.

Fig 2 :
Fig 2: Overview of key differences between our approach and previous approaches for device tracking.

Fig 3 :
Fig 3: Overview of our framework. First, the encoder is trained to learn spatio-temporal features from a large unlabeled dataset of angiography and fluoroscopy sequences using the Frame Interpolation Masked Autoencoder (FIMAE) (left). Then, the weights are transferred into a ViT encoder for feature extraction and feature matching for tracking the catheter tip (right).

Fig 5 :
Fig 5: Distribution of the datasets based on the Field of View (Positioner Primary Angle and Positioner Secondary Angle): the left plot denotes the unlabeled dataset ($D_u$) and the right plot denotes the catheter tip dataset ($D_l$).

Fig 6 :
Fig 6: Visualization of tip of the catheter in fluoroscopy, angiography and cases with other devices.

Fig 7 :
Fig 7: Percentile plot of Cycle YNet, ConTrack, and ours, (a) for all test cases and (b) zoomed in for percentiles from 90th to 100th. The 95th percentile of our method's error is less than the average diameter of the vessels (≈ 4 mm).

Fig 8 :
Fig 8: Qualitative results. Comparison of different methods on a challenging angiography sequence, where tracking is obstructed by vessels and sternal wires (other devices). Note that the images have been cropped around the region of interest for better visualization. The mean error depicted in the figure is the average error computed over the entire sequence.

Fig 10 :
Fig 10: Robustness with respect to the sequence length: mean TSUC for all sequences greater than the frame count (nframes). Note that the dataset contains only 4 sequences with a frame count greater than 210.

Fig 11 :
Fig 11: Breakdown of different cases in a violin plot for comparison of our method with ConTrack.

Fig 12 :
Fig 12: Percentile plots of different versions of ConTrack and Ours for (a) Angio Cases and (b) Device cases.

Fig 13 :
Fig 13: Visualization of predictions of ConTrack and our model in two Angio sequences (top two) and an extra device case (bottom).Note that the frames are sampled randomly from the sequence for visualization.

Table 1 :
Dataset statistics (range and median) for the unlabeled dataset ($D_u$) and the catheter tip dataset ($D_l$).

Table 2 :
SNR of different categories in catheter tip dataset (D l )

Table 3 :
Comparison study of sequence-level tracking errors (mean Euclidean distance) and runtime for different methods for catheter tip tracking in coronary X-ray sequences. Best numbers are marked in bold black and the second best in blue. We also show the performance of different versions of ConTrack: ConTrack-base refers to its base version with no additional modules, ConTrack-mtmt refers to the multi-task and multi-template version, and ConTrack-optim is its final optimal version with all modules, including flow refinement.

Table 4 :
Study of the effect of pretraining strategies on the performance of catheter tip tracking. Pretraining is performed either on our internal dataset (denoted as $D_u$) or on natural images (in the case of the first approach).

Table 5 :
Effect of different positional encoding incorporated in the downstream task.

Table 6 :
Tracking performance with FIMAE trained with different intermediate frame masking ratios, i.e., the masking ratio of $\Omega_{frame}$.

Table 7 :
Significance of initialization in catheter tip tracking: how the performance is affected if the first template frame is updated or noise is introduced to the initial tip coordinates.

Table 8 :
Performance variation across modalities based on modality-specific training.