Pruning strategies for efficient online globally consistent mosaicking in fetoscopy

Abstract. Twin-to-twin transfusion syndrome is a condition in which identical twins share a certain pattern of vascular connections in the placenta. This leads to an imbalance in the blood flow that, if not treated, may result in a fatal outcome for both twins. To treat this condition, a surgeon explores the placenta with a fetoscope to find and photocoagulate all intertwin vascular connections. However, the reduced field of view of the fetoscope complicates their localization and general overview. A much more effective exploration could be achieved with an online mosaic created at exploration time. Currently, accurate, globally consistent algorithms such as bundle adjustment cannot be used due to their offline nature, while online algorithms lack sufficient accuracy. We introduce two pruning strategies facilitating the use of bundle adjustment in a sequential fashion: (1) a technique that efficiently exploits the potential of using an electromagnetic tracking system to avoid unnecessary matching attempts between spatially inconsistent image pairs, and (2) an aggregated representation of images, which we refer to as superframes, that allows decreasing the computational complexity of a globally consistent approach. Quantitative and qualitative results on synthetic and phantom-based datasets demonstrate a better trade-off between efficiency and accuracy.


Introduction
Twin-to-twin transfusion syndrome (TTTS) is a fetal condition that affects monochronic diamniotic pregnancies in which the presence of a certain pattern of intertwin vascular connections, known as anastomoses, results in an imbalance of the blood flow between twins. If this condition remains untreated, the outcome is generally fatal for both twins. 1 Minimally invasive surgery is today the standard of care. The surgeon explores the placenta using a fetoscope with a small field of view to find and photocoagulate all anastomoses. Despite the wide literature in endoscopic scenarios, 2,3 the challenges found in fetoscopic imagery complicate the application of available techniques. To illustrate the complexity of the fetoscopic site, Fig. 1 shows three examples of TTTS procedure where the reduced field of view can be observed. The surgeon must remember the explored areas to navigate through the placental surface building a mental map of the placenta. This is an extremely challenging task even for the best-trained surgeons due to this limited field of view, lack of texture, and turbidity, which hinders the visualization and also precludes any assistance from others. As a solution, mosaicking has been suggested to increase the field of view by stitching the images together in a common space, forming a map of the area.
The online creation of such map would allow for a more effective exploration 4,5 and localization of the anastomoses. Online approaches for mosaicking [6][7][8] usually rely on approximations that either summarize past information or do not make use of all available information. In some applications, a simple pairwise estimation, 5,9 where subsequent images are registered, can be appropriate for real-time operation. However, these approaches accumulate drift.
To mitigate the drift, a globally consistent method might be used. Bundle adjustment 10 is considered to be the offline reference method in globally consistent mosaicking due to its accuracy, which comes from the effective use of all the available information in a probabilistic way. When the number of frames increases, however, the limitations in computational complexity become more evident, and online operation becomes prohibitive.
In Ref. 6, the authors proposed to use an electromagnetic tracker (EMT) to guide the estimation and mitigate the drift. However, due to their computational complexity or suboptimality, the proposed strategies were not suitable to obtain accurate online mosaics with clinically acceptable update times.
Some efforts have been made to reduce the computational time of globally consistent approaches. For example, Schroeder et al. 11 proposed closed-form initial estimates to accelerate its convergence. Steedly et al. 12 used the concept of keyframes in an offline setting to denote a set of the most important frames; an all-to-all strategy analogous to bundle adjustment is used with the keyframes whereas a pairwise web connects the rest of frames enforcing only local consistency between them. This permits to reach mosaics of the order of thousands of frames. Despite the computational advantage of this approach, the reduction in the number of connections between nonkeyframe images leads to a decrease in robustness given that not all the information available has been used. The use of the redundancy provided by nonkeyframe images becomes essential when dealing with fetoscopic images due to their relatively poor quality.
An alternative to Ref. 12 consists of matching exclusively the images that are highly likely to share visual corresponding points, avoiding incoherent attempts. A proposed technique for overlap detection in monocular systems is to use a bag of words 13 (BoW), which creates a dictionary of visual words and assigns a signature per image. This signature summarizes the number of occurrences of the visual words in a histogram. In the case that the scene is revisited, the signature of the current image is very similar to the signature of the images initially acquired. However, this strategy works in an offline fashion while we aim for a sequential approach. This is the case for the work of Garcia-Fidalgo et al. 14 where an incremental BoW approach is used. However, pose information provided by an EMT system is much stronger than visual information in the placenta given that some uniform textureless areas of the placenta can lead to challenging scenarios. Therefore, we propose to use the EMT system to infer the topology of the cameras imaging the scene.
The organization of this paper is as follows: In Sec. 2, we review the background in mosaicking using the EMT system, to further detail our methodological contributions. The presentation of our experimental suite is done in Sec. 3, demonstrating a decrease in the computational complexity of bundle adjustment while maintaining similar accuracy. Then, we comment on the results in Sec. 4 and conclude the study in Sec. 5.

Methods
In this section, we present the preliminary background to then detail our algorithms. We start by introducing the scenario and stating the assumptions made to simplify the problem throughout the paper.
• We consider a placenta to be a plane. In Ref. 15, the authors demonstrated that this assumption greatly simplifies the problem.
• We assume the placenta to be static.
• Despite the image is obtained using a fetoscope, we use a pinhole camera model. 13,15 • The EMT field is not perfect. Yet, we assume we are working on the center of the field, and that therefore, that the distortions due to inhomogeneities in the electromagnetic field are neglibible. 6 • We assume 6 the each EMT measurement z k to be centered on the true measurement x k , with a multivariate Gaussian noise of covariance matrix Σ EMT .
We now introduce some concepts that are essential for the rest of the paper: how the visual aligned is performed in a pairwise and globally consistent manner, the link between poses and their imaged scenes over a planar surface, and how the EMT system can be introduced into the mosaicking pipeline.

Background in Mosaicking
Given a sequence of K images I ¼ fI k g K k¼1 , the goal of mosaicking is to find a two-dimensional representation of the scene or mosaic M∶Ω M → R 3 (RGB) where Ω M ⊂ R 2 is denoted the mosaic space. Provided that the camera observes a planar scene, a homography H exists between corresponding points in two views p ¼ ½ p x p y T , p 0 ¼ ½ p 0 x p 0 y T , which lie in their respective image spaces Ω p , Ω p 0 ⊂ R 2 . This homography can be parametrized as a 3 × 3 matrix such that E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 3 2 6 ; 4 3 3 Then, Eq. (1) can be written asp 0 ∝ Hp where H is defined up to scale and the tilde indicates that the points are expressed in homogeneous coordinates. 16 If the same planar scene is observed from both views, a homography H can be directly inferred from image matching information. 17 In this work, we use a landmark-based registration approach 18-20 since these approaches usually allow for sparse feature detection which can be faster than using the information in the whole image.
Pairwise mosaicking (PM) relies on estimations of pairwise homographies H kþ1;k which project any point from the space Ω k of image I k onto the space Ω kþ1 of image I kþ1 . Pairwise homographies are then used to compose homographies from a fixed reference, which without loss of generality, can be placed on the first frame as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 3 2 6 ; 1 9 4 where the product operator denotes the left matrix multiplication, corresponding to the composition of homographies, e.g., Once the homographies relating every frame to the reference are computed, every pixel in every image is projected onto the mosaic space. To further clarify the nomenclature, we name absolute homographies H k with a single subindex meaning that they project any point from the reference Ω M to the image frame space Ω k , whereas pairwise homographies H kþ1;k are defined as mapping points from Ω k to Ω kþ1 . When performing PM, any residual error in the estimation of pairwise homographies is accumulated through the chain in Eq. (2), resulting in a wrong placement of the images in the mosaic that degenerates in an uncontrolled way over time.
Globally consistent approaches such as bundle adjustment 10 take into account the relationship between all pairs of images in the sequence to create globally consistent mosaics. 21 However, they are generally not fast enough for online operation; the main reasons are the acquisition of the correspondences between all images and the run-time of the nonlinear optimization procedure.
Let us define L as the set of all pairs of image indices in which correspondences have been successfully acquired. We aim to obtain a set of estimated homographiesĤ 1 ; : : : ;Ĥ K−1 , where the hat denotes that they are an estimate, that minimizes the reprojection errors of the matching points in all images in the sequence, namely: 1 ; : : : ;Ĥ K−1 ¼ arg min (3) N l;m is the number of matches found in the pair fl; mg, fð·Þ is the conversion from homogeneous to Cartesian coordinates so that p ¼ fðpÞ, and σ 2 v is the variance associated with the location of a feature once propagated to the space of its matched pair. This variance is associated with the fact that we model the error between a fixed point in an image, and its corresponding pair propagated from the other image as a Gaussian random variable. Note that, since one frame is defined as the reference (here chosen as the first frame, without loss of generality), we only require to estimate K − 1 homographies, the remaining one being set to identity.
Let X ¼ fx k g K k¼1 be the set of corresponding true camera poses. In addition, let Z ¼ fz k g K k¼1 be the respective EMT measurements of these poses, where we assume z k to be a noisy instance of the true camera x k . We parametrize each camera pose x k ¼ ½r k ; t k T as a rotation r k and translation t k . The vector z k is parametrized in the same way. The three-parameter rotation vector r k is extracted from the skew symmetric matrices ½r k x ∈ soð3Þ, which can be converted into the rotation matrix R k ¼ expð½r k x Þ ∈ SOð3Þ. 22 SOð3Þ refers to the special orthogonal group of matrices of dimension 3 × 3 with determinant 1, and soð3Þ is its corresponding lie algebra. With a rotation matrix R k and a translation t k , we can compose the rigid transformation T k ∈ SEð3Þ, where SEð3Þ is the special Euclidean group corresponding to rigid transformations in three-dimensional space E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 6 3 ; 1 8 7 A set of camera poses (provided by the EMT system) on their own do not give us enough information to create a mosaic. To that end, a set of homographies is necessary, requiring also the planar structure modeling the scene. We now detail how to obtain a set of homographies from both poses and the surface plane. Provided that only camera poses are measured, we need to establish a link between these and imagery.
Let us consider a virtual camera with its virtual image plane located in the origin of coordinates, which we use as a reference, whose image plane space is Ω M . We can then link the spaces of the images obtained by the cameras, and their respective motions provided that the imaged plane is known. Since it is so, there is a homography H defined that propagates any point in the virtual image to any other image through the following equation, that for convenience we define as gð·Þ E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 5 ; 3 2 6 ; 6 6 4 where K is the intrinsic camera calibration matrix, and v corresponds to the unit normal vector to the imaged surface n observed from the origin of coordinates, divided by the distance d between origin of coordinates and the plane E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 6 ; 3 2 6 ; 5 8 9 v ¼ n d : We use this relation in order to relate the visual content in the images with EMT readings of the camera poses. The complete derivation of Eq. (5) can be found in Ref. 16.
In the following work, Ref. 6 introduced a probabilistic graphical model that infers the set of poses X and planar structure v by solving the minimization problem whereX ,v are the set of estimated camera poses and plane, respectively, the cost C v represents the sum of all reprojection errors between matching landmarks with pairs of indices fl; mg ∈ L found between all images where the homographies H l , H m are obtained using the three components; the two camera poses x l , x m , and plane v as in Eq. (5). The EMT information is incorporated in the second cost C EMT as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 8 ; 3 2 6 ; 3 6 4 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 9 ; 3 2 6 ; 3 0 0 This prevents the poses to be estimated from drifting provided that the EMT measurements have a common reference, which do not allow drift to separate the cameras from the measured position. In other words, the solution obtained as a result of Eq. (7) reflects a guided estimation of the parameters using the EMT system.

Pruning Strategies for Sequential Bundle Adjustment
In this section, we present our two pruning strategies: (i) the use of the EMT system to identify and discard nonoverlapping frames for which no matching should be attempted, and (ii) the introduction of the concept of superframe; an extension of a typical frame that allows for more efficient mosaicking schemes. We present two pipelines that increasingly incorporate the contributions mentioned above.

Frame pruning using the EMT system
Given the small field of view in fetoscopy and the relatively large area to explore, only a small subset of images shares visual content. Global alignment schemes typically attempt to find correspondences between each pair of images, although in many cases such as ours, the majority of image pairs do not overlap spatially and thus cannot be matched. A given image will typically only match to a small spatially adjacent subset. For this reason, we propose to use the information of an EMT system to find plausible image candidates, bypassing the computation of unnecessary failed matching attempts and reducing, thereby the computational complexity of bundle adjustment. At a given time instant k, we have the current EMT measurement z k and the previous estimation of the plane v k−1 by solving the nonlinear optimization problem posed by Eq. (7). Therefore, we can compute a noisy homography using Eq. (5) as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 0 ; 6 3 ; This estimation of the homography, even if not very accurate, can be used to decide whether two images are likely to overlap, thus allowing to filter out spatially unreasonable candidates. To do so, once the homography is obtained, we project the corners of the bounding box containing the circular fetoscopic region of interest through H EMT k onto the mosaic space as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 1 ; 6 3 ; 4 7 1p where the pointsp k;c andp Ω M k;c represented in homogeneous coordinates, are the original and propagated image corners to the mosaic space Ω M , respectively, where c denotes the corner index. Then, the overlap between the current and previous frames can be easily and efficiently obtained. Kekec et al. 8 showed how this problem can be seen as a simple convex polygon intersection problem. The estimation of the overlap given two convex polygons can be efficiently solved using the separating axis theorem (SAT); 8 a theorem from computer graphics that states that if a straight line can be drawn between two convex polygons, then the polygons do not overlap.

Pairwise compression
A practical problem in mosaicking 12 algorithms is that hundreds of correspondences are typically stored and used in the optimization procedure per image as can be seen in Eq. (9). Instead, we parametrize the pairwise relation as five equivalent correspondences in the following way: (i) after the pairwise homography estimation using RANSAC simultaneously with the correspondences acquisition, 17 we take the bounding box that includes all the interest points in the image. In particular we take the topcentered, bottom-centered, left, and right point locations as well as the center of the bounding box to account for the spread of these points in the original image space. (ii) We propagate them using the estimated homography that relates both images to obtain the second set of points, which completes the collection of correspondences. These correspondences will then be kept for the optimization, reducing the run-time of the cost function greatly.
The more correspondences we acquire, the more correspondences must be taken into account in the cost function. This might not be a problem at the beginning of the sequence, but as the number of correspondences increases, the computational cost increases as well, hindering online operation. We now introduce an additional strategy that generalizes a regular frame in a more efficient representation for mosaicking. Figure 2(a) shows the proposed pipeline proposed in this section while Fig. 2(b) shows the diagram of the pipeline proposed in Sec. 2.2.3. In both pipelines, the pairwise compression mentioned in Sec. 2.2.2 is applied. Contributions to the mosaicking pipeline are outlined with a blue border.

Superframe representation
The superframe representation is a generalization of a frame that incorporates one or more frames. The main idea is to partition the image set I ¼ fI k g K k¼1 into subsets of W images grouped into N superframes with K ¼ NW. Since it is a generalization of an image, the superframe can be incorporated into the standard bundle adjustment pipeline reducing its computational burden drastically. Let us formally define the concept of superframe.
We define a superframe as a representation of a subset of W frames I i ¼ fI k g k∈K i that encodes the most salient information of the region observed by all images in the superframe which are indexed by u being an integer. If the lead image was taken in isolation, then this would be equivalent to the concept of keyframes. 12 In contrast, the superframe uses information of all the images in the window.
Analogously to the standard pipeline, interest point locations P k and descriptors D k are extracted for all frames within each superframe. We propose to use the lead image space as a common space where all interest points in the superframe lay, defining the superframe as S i ¼ fP S i ; D S i g. To propagate the locations of interest points into this common space, we definẽ P k ∈ R 3×N k as a matrix containing all point locations in homogeneous coordinates, where N k is the number of interest points found in image k. If expressed this way, the points can be propagated onto the lead image space as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 2 ; 3 2 6 ; 3 2 6 while the descriptors are grouped as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 3 ; 3 2 6 ; 2 8 1 where the homography H LF i ;k is the homography that relates the frame k within the superframe with the lead frame LF i . This homography is obtained by first running a local, small-scale visual bundle adjustment with all W images following the methodology described in Sec. 2.1. Within the superframe, the aligned homographies are then considered fixed in the rest of the pipeline. Figure 3(a) depicts the process of the creation of a superframe, and Fig. 3(b) shows the differences in matching between two images (top) and two superframes (bottom).

Superframe in the mosaicking pipeline
To integrate the superframe into the mosaicking pipeline, we extend the probabilistic framework 6 summarized in Sec. 2.1. However, now superframes replace single frames. To highlight why this is possible, let us describe the simple scenario where two superframes are created. In that case, their descriptors can be matched directly since the input to the matching algorithm is two sets of descriptors. From there, the matching process is analogous to the one described in the PM in Sec. 2.1. Figure 2(c) shows the diagram of the pipeline using the superframe, marking in blue the contributions in the pipeline. In applying directly, 6 only one EMT measurement is assumed to be associated with each frame. Therefore, there would be W − 1 EMT measurements unused per superframe. Instead, we propose to modify the pipeline to include all EMT measurements in the window W for better placement of the superframe in the mosaic. These measurements can be used to obtain homographies H EMT k using Eq. (5) and the previously estimated v i−1 resulting from Eq. (7). Figure 4 shows the nomenclature and indexing,  displaying each homography composed using the EMT system. Using the EMT measurements and the fixed homographies H LF i ;k found in the initial bundle adjustment to build the superframe,, we can obtain measurements of the lead homography H EMT;k LF i coming from each of the EMT measurements in a given superframe as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 4 ; 6 3 ; 3 5 2 H EMT;k We consider these homographies as measurements of the true homography that places the lead image into the mosaic space, and formulate the problem as where λ and λ 0 are weights. The two first costs are precisely the costs defined before in Eqs. (9) and (8), and the proposed additional term for the global alignment cost function C H is defined as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 6 ; 6 3 ; 2 0 4 ϵ k ðc; E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 7 ; 6 3 ; where p c represents Q control points. Specifically, we used the four corners and the center point of an image as control points. Setting P hVis is not trivial since it depends on the accuracy of the EMT system and previous estimation of the normal vector. Therefore, we use it conservatively setting a standard deviation of 10 pixels.
Once all homographies from the mosaic space to the lead frames of each superframe are obtained, one must place the superframes into the mosaic space. To do that, each one of the images within the superframe has to be placed in the mosaic space. Therefore, we must obtain individual homographies H k for each frame in the sequence that relates it with the mosaic space. One must note, however, that when running bundle adjustment on the superframes, the reference is in the space of the first lead frame, which corresponds to index LF 1 . For simplicity, we denote the estimated homographies as H LF i , skipping the second subindex, which would be LF 1 . However, one must keep in mind that they are actually from the first lead frame LF 1 to LF i as H LF i ;LF 1 . Therefore, to compose a mosaic in the space of the first image, not the first superframe, one must use E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 8 ; 3 2 6 ; 2 6 8 Once the absolute set of homographies is obtained, one can project the images onto the mosaic space to be blended.

Efficient implementation of the pipeline using superframes
The main idea of the superframe is to summarize many frames into a region by first performing a local bundle adjustment of W frames, each with a complexity of OðW 2 Þ. This entails all-to-all matching of all images within the superframe. On the other side, the complexity of a full bundle adjustment (FBA) is quadratic in the number of frames OðK 2 Þ. By using superframes instead of single frames, one allows reducing the number of effective images used in a general bundle adjustment. This procedure now takes into account K W superframes, so the total complexity is only The first term accounts for the small bundle adjustments of size W to build each superframe and then the second term accounts for the complete bundle adjustment of the superframe. For example, in a sequence of K ¼ 100 frames, where the window W ¼ 5, we only have N ¼ 20 superframes. Therefore, the construction of the superframes would take ∼500τ þ 400τ 0 whereas a single FBA would take 10;000τ, where τ and τ 0 are timings related to interest point detection and matching, which differ for single frame and superframe, depending on the number of frames W used to create each superframe.
The number of interest points contained in a superframe would be the sum of the number of interest points detected in each image. As W increases, this number can quickly become too large and prohibitive for matching. To reduce the number of interest points in the superframe drastically, we put two strategies in place. First, we greedily select only interest points resulting in being inlier matches within every frame in the creation of the superframe. Intuitively, interest points that have already been matched should have a higher probability of being matched in the superframe compared to other points that might not be as well described. The second strategy is to keep a precomputed a KD-tree, 23 implementing an approximate nearest neighbors strategy for efficient matching in the creation of the superframe.
In addition, the matching candidate selection strategy mentioned in Sec. 2.2.1 is adapted to be used with superframes instead of single frames. To apply it, only the meaning of overlap for two superframes has to be redefined: two superframes overlap if any of their single frames overlap. In this way, we can incorporate the computational advantage of using superframes that leverage the EMT system to filter out spatially incoherent matching candidates.

Experiments
In this section, we present our experimental suite to demonstrate the computational advantage of the proposed pruning strategies compared against a baseline implementation of bundle adjustment. We start by presenting the datasets, algorithms, metrics, and implementation details.

Datasets
We generated two datasets formed by a set of images and EMT data; synthetic (SYN, 273 frames, 373 × 378 px) and a phantom based (PHB, 701 frames, 780 × 781 px) datasets. SYN is a translation-based raster scan generated by simulating ground truth camera motions, from which EMT measurements are generated by applying Gaussian noise, 6 and for which the images were extracted as projections of the scene observed by the cameras. In that case, the scene consisted of a planar, high-resolution digital image. PHB was recorded by imaging a printed image of a placenta. The setup consisted of a camera head IMAGE1 H3-Z SPIES mounted on a 3-mm straight scope 26007 AA 0°(Karl Storz Endoskope, Tuttlingen, Germany), an EMT system NDI Aurora with a planar field generator, and a Mini 6 DoF sensor. A collection of homographies by semiautomated registration of each fetoscopic image to the original image of the placenta, where a landmark-based approach was used to align the images after the initial manual alignment.

Algorithms
As a baseline, we compare our algorithms to both the established FBA as well as the standard PM approach, which is very fast but accumulates drift. We apply the matching compression strategy presented in Sec. 2.2.2 as well as the use of the EMT system to discard inconsistent matching attempts (SAT). Our second contribution is the introduction of the superframe (SF) and its incorporation to the mosaicking pipeline. By default, we have chosen W ¼ 5. We explicitly name SFðWÞ to a run where W corresponds to the size of the window used in the superframe.

Metrics and Implementation Details
We compare the homography H and the ground truth homography G by computing the mean residual error ϵ k of a projected grid of N g points ρ i from the image space to the mosaic space. Then, the error ϵ between two mosaics is computed as the mean of individual errors ϵ k . This error is defined in the mosaic space, which is the space in which the final composition is performed. Note that we use ϵ k with superindex k to refer to the residual error between two sequential mosaics at each time instant whereas ϵ k with subindex k refers to the error of the alignment of a single image k. To make this metric independent to the choice of reference space, we compute the average of the errors using all images as a reference, as in E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 9 ; 3 2 6 ; 5 0 2 In terms of feature-based method, we used SURF. 20 The matching was performed using fast approximate nearest neighbors. 23 The fetoscope was precalibrated using the Matlab Camera Calibration Toolbox. We also precomputed and applied the Hand-eye Calibration 24 matrix from a sequence of images of a checkerboard as well as synchronized sensor poses. The window size of the superframe was determined from the trade-off analysis as W ¼ 5, which is presented later in Sec. 3.3. In addition, given that superframes contain many more correspondences than actual frames, a kd-tree 23 can be trained and stored per superframe. Then, fast approximate nearest neighbor can be used to accelerate the matching between superframes. The experiments have been performed in an Ubuntu 16.04 with Intel Core I7 at 2.5 GHz and 6 GB of memory.
Next, we evaluate sequential bundle adjustment and evidence the decrease in computational time that the proposed strategies provide.

Speedup: Correspondence Acquisition
In this section, we analyze the computational cost of a sequential bundle adjustment in an online setting. More precisely, we focus on correspondence acquisition.
In Fig. 5, we display the run-time of correspondences acquisition in the three algorithms for both datasets. As expected, SAT is faster than FBA in both cases, and SF shows the best in terms of computational cost. In the case of the phantom-based dataset, we can observe that SAT is not necessarily faster all the time. This is due to the trajectory that the camera has followed in this particular dataset. If the overlap between images is too large, then the algorithm suggests all images as possible candidates.
To further evidence where the computational savings have happened spatially, we analyze the percentage of potential image candidates that have been selected using an occupation matrix. This matrix shows whether there has been a match from image i (corresponding to row i) to image j (corresponding to column j). For a more informative representation, we have color-coded the matrices. Out of all image matches that would have been obtained using an FBA (white), only a portion of them would be selected by SAT (green). Due to the use of symmetrical matching, we only display the upper right triangle in the occupancy matrix. Figure 6 shows that, in the case of SYN (a), 100% of the proposed candidates are correct, and no matches have been missed, saving a total of 54.2% matching attempts. As mentioned before, the path followed by the camera in the PHB [ Fig. 6(b)] makes saving matching attempts more complicated. 100% of the proposed candidates are correct; however, there is only a total saving of 5.2% attempts, which happened to be toward the end of the sequence.
We now study the run-time of the nonlinear optimization as the second bottleneck for bundle adjustment to reach sequential operation.

Speedup: Nonlinear Optimization
The computational cost of the nonlinear optimization is due to the run-time of the cost functions in Eqs. (9) and (8) and the number of iterations to convergence. In contrast to standard bundle adjustment where we do not have any initial estimate a priori, subsequent estimations can be used as an initial estimate in a sequential version of bundle adjustment, providing a speedup at each iteration due to the proximity to the minimum. Although this is a desirable property, the question that might prevent desirable update rates is whether the number of iterations to convergence grows with time.
In Fig. 7, we show the number of iterations to convergence in both SYN and PHB datasets.
The fact that the approaches show a constant tendency over time is crucial since it indicates that the number of iterations in the nonlinear optimization does not seem to be a bottleneck for offline operation. Given the textureless nature of fetoscopic images, using only a subset of all frames is not a very robust option. Instead, we would like to use the redundancy in the complete video sequence. Figure 8 shows the optimization times for FBA, SAT, and SF. It can be seen how if compression is used (SAT), the slope decreases with the number of iterations. Now that we have demonstrated the decrease in computational burden in the proposed algorithms, we need to evaluate their accuracy and show that despite the improvement in speed, there is no significant loss in accuracy.

Efficiency
In this section, we analyze the performance of the proposed algorithms for classical pipelines. In particular, we computed the error in each sequential mosaic for NV and FBA as a baseline, comparing it to SAT, and SF with different window sizes W ¼ 1;3; 5;7; 9 in SYN and PHB datasets, respectively. Figure 9 shows the errors for each of the methods. The error in NV spikes from the beginning. This is well known due that there is no global correction and therefore, drift is accumulated. While SF results in the best performance regarding computational advantage, the decrease in accuracy for all the window sizes except W ¼ 1 for which the behavior should be very similar to SAT. When the window increases, the accuracy of the SF remains stable.
Since the data reflects this trade-off between computational power and accuracy, we propose to analyze this trade-off with the graph in Fig. 10. This graph shows the error in the y axis and the time per iteration in the x axis. We have plotted each algorithm in a different color. In here, temporal evolution is not shown; however, one can observe the temporal trail by looking at the lines connecting the dots, taking into account that first iterations are the fastest. For an efficient algorithm, we would like the error and computational time to be low. Therefore, we identify the most efficient algorithm as the one whose results lie in the bottom left corner. The same tendency can be seen for both datasets; despite being efficient, the NV drifts in large measure, making the algorithm not worth pursuing. The FBA is not a good option concerning the trade-off since the computational time is too large, yet it does keep the accuracy very low. SAT performs similarly, albeit much more efficiently than FBA. On top of that, all versions of SF perform more efficiently than SAT. In particular, we see that as we increase the window size, the efficiency starts to be better until W ¼ 5. The selection of the best algorithm is not trivial since W ¼ 3 shows a faster start yet slower end than W ¼ 5 in both datasets due to the need for matching more superframes. After this, efficiency starts to fall systematically. We believe that the increase of computational time that takes to perform the first bundle adjustment causes this effect, growing with W. Since many of the operations in the creation of the superframe could be performed in parallel, we expect slightly better results for a parallelized version of the approach.
In Figs. 11 and 12, we show mosaics for SYN and PHB datasets, respectively, for the NV, FBA, and SFW5. While the figure shows some steps of the sequential estimation of the mosaic, we encourage the reader to visualize the videos included in the supplementary material for a more thorough visualization of the results.

Discussion
Bundle adjustment is in general slow due to the mentioned reasons. In this work, we propose the use of the EMT system to identify potential matching candidates, avoiding unnecessary matching attempts. However, the computational savings will greatly depend on the motion of the camera. For example, if the camera remains in the same position, then the EMT system will determine that all the frames are candidates being equivalent to an all-to-all matching scenario. On the contrary, if the camera moves, the EMT system is able to identify a reduced set of potential candidates. A simple solution to this problem could be set up a preprocessing step of filtering out the frames that overlap more than a certain area threshold.
The accuracy of the EMT is limited. Discarding nonmatching candidates is a process that can be made more loose or aggressive by just increasing or decreasing the size of the images when determining overlapping frames. This will depend on the accuracy of the EMT when projected to the image space, i.e., the noise on the EMT system, but also the camera parameters, and the placement of the planar object. In our datasets, results seem to suggest that taking the image size is enough to discard incoherent matching candidates.
However, the estimation of the overlap is dependent on having a reliable normal estimation. When the camera has not observed enough scene, the estimation of the normal vector is less accurate due to the lack of depth perception in a small baseline. When more parts of the scene are observed, the normal vector converges, and only a few constant number of iterations are necessary for convergence in subsequent iterations. In fact, the results suggest that only a constant number of iterations are enough for the nonlinear optimization to converge. This is a significant result that reveals that the number of steps to convergence is not a problem for achieving clinically feasible update rates. The obvious exception to this point is if there are outliers in an image. If so, then the cost function does not reflect the problem to solve and as a consequence, the number of iterations to converge grows, not reaching the desired minimum. Therefore, it is essential to obtain an outlier-free set of correspondences. The creation of a superframe requires a first bundle adjustment. This process, even if it takes some time, is compensated by the fact that the general bundle adjustment takes only K W frames into account. The efficiency gain is expected, whenever, the number of superframes N is less than the number of frames K. However, errors in the local bundle adjustment impact in the second, global, bundle adjustment since it is only optimizing for the lead frame in the superframe. If there is an error, one of the homographies of the superframe, then the interest points will be erroneously placed within the space of the lead frame in the superframe. Then, a geometrical validation such as RANSAC could filter them out for being geometrically inconsistent with a homographic model. This results in a trade-off between the accuracy and the window size W. A second trade-off is that concerning the window size concerning efficiency. When the first bundle adjustment is large, then the number of efficient superframes to optimize in the global bundle adjustment is smaller and vice-versa. However, the larger is the superframe, the more rigid the system is, and this makes it more prone to alignment errors. Also, as the window size is larger, the distance between lead frames also increases and matching becomes more complicated.  In the described case where a fixed window is used, the optimal W will greatly depend on the motion on the camera making the choice of the value not trivial. We have chosen W ¼ 5 as the best trade-off between accuracy and computational cost in the evaluation. An interesting fact is that the accuracy of the mosaic does not seem to decrease much between different instances of superframes even though its computational time does. We propose a system in which single frames are not shared by the superframes, i.e., each frame index is only in one superframe. However, an interesting research line could be to inspect the impact of incorporating each index in more than one superframe. This implies that the number of frames in the nonlinear optimization procedure grows, trading an increase in the runtime of the algorithm for more robustness.

Conclusions
In this work, we propose two different pruning strategies allowing the sequential application of bundle adjustment for mosaicking. To make this possible, we tackle the computational bottlenecks of bundle adjustment using two different pruning strategies. First, we use the EMT system to discard a high percentage of spatially inconsistent matching attempts. Second, we introduce the concept of superframe and introduce it into the mosaicking pipeline. This concept is a generalization of an image, which can be used to reduce the computational complexity drastically in both the correspondence acquisition and optimization phases. We show a large decrease in computational complexity with respect to a standard bundle adjustment, on both synthetic and phantom-based datasets. This makes possible the use of a sequential bundle adjustment, which achieves a better compromise between efficiency and accuracy than standard approaches. The results of this work open new avenues to online operation, leading the state of the art in online mosaicking one step closer to clinical translation.

Disclosures
The authors declare no conflicts of interest.
Danail Stoyanov is an associate professor at UCL. He received his PhD in computer science from Imperial College London specializing in medical image computing. Previous to that, he studied electronics and computer systems engineering at King's College London. His research interests and expertise are in surgical vision and computational imaging, surgical robotics, image-guided therapies, and surgical process analysis.
Tom Vercauteren is a professor of interventional image computing at King's College London where he holds the Medtronic/RAEng Research Chair in Machine Learning for Computer-Assisted Neurosurgery. Prior to this, he was an associate professor at UCL and deputy director of the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS). He has ten years of MedTech industry experience with a first-hand track record in translating innovative research. His main interest is in translational computer-assisted intervention.
Sebastien Ourselin is head of the School of Biomedical Engineering and Imaging Sciences and chair of Healthcare Engineering at King's College London. He was a professor at UCL, London, United Kingdom. Prior to that, he founded and led the CSIRO BioMediA Lab, Australia. He is an associate editor for Transactions on Medical Imaging, the Journal of Medical Imaging, Scientific Reports, and Medical Image Analysis. His research interests include image registration, segmentation, statistical shape modeling, surgical simulation, image-guided therapy, and minimally invasive surgery.