Intraoperative on-the-fly organ-mosaicking for laparoscopic surgery

Abstract. The goal of computer-assisted surgery is to provide the surgeon with guidance during an intervention, e.g., using augmented reality. To display preoperative data, soft tissue deformations that occur during surgery have to be taken into consideration. Laparoscopic sensors, such as stereo endoscopes, can be used to create a three-dimensional reconstruction of stereo frames for registration. Due to the small field of view and the homogeneous structure of tissue, reconstructing just one frame, in general, will not provide enough detail to register preoperative data, since every frame only contains a part of an organ surface. A correct assignment to the preoperative model is possible only if the patch geometry can be unambiguously matched to a part of the preoperative surface. We propose and evaluate a system that combines multiple smaller reconstructions from different viewpoints to segment and reconstruct a large model of an organ. Using graphics processing unit-based methods, we achieved four frames per second. We evaluated the system with in silico, phantom, ex vivo, and in vivo (porcine) data, using different methods for estimating the camera pose (optical tracking, iterative closest point, and a combination). The results indicate that the proposed method is promising for on-the-fly organ reconstruction and registration.


Introduction
The number of minimally invasive surgeries performed yearly is increasing rapidly. This is largely due to the numerous benefits these types of intervention have for the patient: a shorter stay in hospital, less trauma, minimal scarring, and a lower chance of postsurgical complications. There are several drawbacks for the surgeon, though: limited hand-eye coordination, no haptic feedback, no direct line of sight, and a limited field of view.
Computer-assisted surgery tries to alleviate some of these drawbacks by providing the surgeon with information relevant to the state of the intervention. Prior to the intervention, preoperative data are acquired for diagnosis and surgical planning. Elaborate equipment (e.g., CT or MRI) generates precise data and also allows imaging of the interior of the body. Three-dimensional (3-D) models created from these data can provide the surgeon with a virtual view inside the patient during surgery. To this end, the models have to be registered to the current surgical scene, i.e., the current location and orientation of the real structure have to match those of the virtual one. The available tools for intraoperative imaging (e.g., endoscope) are limited in image quality and field of view, but they can be used to create intraoperative surface models that enable registration with the preoperative data.
Many groups have explored ways to obtain intraoperative surface models. To sample an intraoperative surface, Herline et al. 1 moved the tip of a probe over the visible parts of the liver; the probe was localized with an active position sensor. To avoid possible tissue damage, newer approaches commonly rely on ranged sensors. Laser range scanners, as used by Clements et al., 2 offer high reconstruction quality for conventional liver surgery; the downside is the need for additional hardware in the operating room. Dumpuri et al. 3 extended this approach to take intraoperative soft tissue deformation into account: after an initial rigid registration of the laser scan and CT surfaces, the residual closest-point distances between the rigidly registered surfaces are minimized using a computational approach. The method was further refined by Rucker et al. 4 using a tissue mechanics model subjected to boundary conditions, which were adjusted for liver resection therapy.
For registering preoperative data in laparoscopic surgery, the organ surface can be observed with optical laparoscopic sensors that provide a 3-D reconstruction of a single video frame. There are many methods for reconstructing 3-D surface structures. 5 The most commonly used methods rely on multiple-view geometry: through correspondence analysis between two or more images, a 3-D reconstruction can be obtained via triangulation. Structure from motion (SfM) uses one camera with images from at least two different perspectives for triangulation. A similar approach is the stereo camera, which uses two image sensors that can be calibrated to each other; the known transformation between the two stereo images allows a more precise reconstruction. Instead of relying on naturally given correspondences, structured-light approaches project a known pattern onto the scene for active triangulation; projecting the pattern, however, has proven difficult in surgical practice. The methods mentioned previously only reconstruct a small field of view, and due to the homogeneous structure of tissue, a single frame, in general, will not provide enough detail to rule out geometrical ambiguities (i.e., an intraoperative surface patch has multiple possible matches on the preoperative model surface) during registration.
To remedy this problem, Plantefève et al. 6 used anatomical landmarks to achieve a stable initial registration. The preoperative landmarks were labeled automatically while the intraoperative labeling required manual interaction. After the initial registration, a biomechanical model and the established correspondences between the landmarks were used to counteract intraoperative soft tissue deformation and movement.
To expand the reconstructed surface, methods to associate multiple frames are needed. One of these is localizing the camera in the world while simultaneously mapping the environment, known in the literature as simultaneous localization and mapping (SLAM). SLAM is a well-known approach in robotic mapping and has also found its way into computer-assisted laparoscopic surgery. Mountney et al. 7 introduced a SLAM approach using a stereo endoscope to map the soft tissue of the liver. They worked with a sparse set of image texture features, which are tracked by an extended Kalman filter. In later work, the system was expanded to compensate for breathing motion. 8 To recover from occlusions or sudden camera movements, Puerto-Souza et al. 9,10 developed a robust feature-matching method, the hierarchical multiaffine (HMA) algorithm. In tests with real intervention data sets, the HMA algorithm exceeded existing feature-matching methods in the number of image correspondences, speed, accuracy, and robustness.
SLAM can also be achieved with a single moving camera: with the previously mentioned SfM technique, reconstructing 3-D scene information is possible. In the work of Grasa et al., 11,12 this method is used to create a sparse reconstruction of a laparoscopic scene in real time. However, reconstructions from single-camera solutions do not provide an absolute scale. To approach this problem, Scaramuzza et al. 13 used nonholonomic constraints. Recently, Newcombe et al. 14 introduced the KinectFusion method, which provides dense reconstructions of medium-sized (nonmedical) scenes in real time using a Microsoft Kinect for data acquisition. In the work of Haase et al., 15 an extension of Newcombe et al. 14 is used to reconstruct the surgical situs with multiple views taken by a 160 × 120 pixel time-of-flight camera.
In this paper, we present a system that combines 3-D reconstructions generated online by a stereo endoscope from multiple viewpoints, while simultaneously segmenting structures on-the-fly. It is based on our previous work 16 and extends it with a detailed description of the method and an extensive evaluation on in silico, phantom, ex vivo, and in vivo data. In our system, the reconstructions and the segmentations are combined into one organ model. To compute a 3-D point cloud from a stereo image pair, the hybrid recursive matching (HRM) algorithm outlined by Röhl et al. 17 was used; it was compared with other 3-D surface reconstruction methods by Maier-Hein et al. 18 and achieved the best results. The segmentation of the organ of interest is done on the basis of color images: using a random forest based classifier, 19 each pixel is labeled as part of an organ of interest or as background. The resulting point clouds and their respective labels are then integrated into a voxel volume using a KinectFusion-based algorithm. 14 Given enough viewpoints, the voxel volume will contain a combined model better suited for registration than a model generated from a single shot.
The novelty of the approach presented in this work is the application of a stereo endoscope, a modality already available in the surgical workflow, to reconstruct an entire scene from multiple viewpoints online, while simultaneously segmenting one or more organs of interest. Our main contributions are as follows:
• Mosaicking of frame reconstruction parts using a frame-to-model registration with the possible use of a tracking device (e.g., NDI Polaris).
• Dense surface model that is generated online and is available after each image frame.
• Per-frame segmentation of organs is achieved through a fast graphics processing unit (GPU) random forest approach.
• Global segmentation allows accumulation of the single-frame segmentation probabilities for each global surface point. The combined segmentation results lead to a higher and more robust recognition rate.
In the following, we will present a more detailed description of our reconstruction workflow, followed by an evaluation using in silico, phantom, ex vivo, and in vivo data (porcine). Three methods for determining the camera pose are also evaluated: optical tracking, iterative closest point (ICP) tracking, and a combination of these two methods. The evaluation and workflow are described in the context of laparoscopic liver surgery.

Methods
Our system for reconstructing the scene consists of multiple steps ( Fig. 1). First, we reconstruct a 3-D point cloud from stereo image frames. At the same time, the organs of interest are segmented in the video image. Afterward, the reconstruction is combined with the segmentation results and integrated into a truncated signed distance (TSD) volume. From this volume, a mosaicked model of the combined reconstructions can be retrieved. Using a TSD volume allows us to incorporate information from different viewpoints to create a larger model than from a single view, while simultaneously reducing noise in the model.

Reconstruction and Segmentation
The stereo endoscope provides left and right camera images, which are first preprocessed to remove distortion and to rectify the image pair. Using correspondence analysis, 17 we first calculate a disparity map between the two images and then triangulate those matches, resulting in a dense 3-D point cloud R_i in camera coordinates for each time step. The preprocessing and the correspondence analysis were both implemented on the GPU. Every pixel in the scene is simultaneously classified using a random forest 19 into foreground, e.g., liver, and background. As features, the hue and saturation channels from the HSV color space and the color-opponent dimensions a and b from the LAB color space were used. The classifier thus provides a mapping C_i(p) → {1, …, n}, p ∈ R_i, from each 3-D point to a class label for each time step.
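The HRM algorithm of Ref. 17 is not reproduced here; as a sketch of only the triangulation step that follows the correspondence analysis, the snippet below converts a rectified disparity map into a camera-space point cloud under a pinhole stereo model. All camera parameters (focal length, baseline, principal point) and the toy disparity values are illustrative assumptions.

```python
import numpy as np

def disparity_to_points(disparity, f, baseline, cx, cy):
    """Triangulate a dense 3-D point cloud from a rectified disparity map.

    Standard pinhole stereo geometry: Z = f*b/d, X = (u - cx)*Z/f,
    Y = (v - cy)*Z/f. Pixels with non-positive disparity are unmatched.
    """
    v, u = np.indices(disparity.shape)           # per-pixel image coordinates
    valid = disparity > 0
    z = np.where(valid, f * baseline / np.where(valid, disparity, 1.0), np.nan)
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    pts = np.stack([x, y, z], axis=-1)
    return pts[valid]                            # N x 3 points, camera coords

# Usage: a 2x2 toy disparity map, f = 800 px, baseline = 4 mm.
d = np.array([[8.0, 0.0], [8.0, 8.0]])
pts = disparity_to_points(d, f=800.0, baseline=4.0, cx=1.0, cy=1.0)
# every matched pixel lies at depth 800*4/8 = 400 mm
```
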
The random forest was trained on multiple previously labeled images. We trained a forest consisting of 50 trees with a maximum depth of 10. To allow real-time processing, the classification portion of the random forest was ported to the GPU.
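The paper's forest runs on a custom GPU implementation; as a CPU stand-in with the same hyperparameters (50 trees, maximum depth 10), the following scikit-learn sketch trains on synthetic (H, S, a, b) feature vectors. The feature distributions and class means are purely illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in features: rows are pixels, columns are (H, S, a, b).
# In the paper these come from HSV and CIELAB conversions of the video frame.
rng = np.random.default_rng(0)
liver = rng.normal([0.02, 0.7, 25.0, 15.0], 0.05, size=(200, 4))      # reddish, saturated
background = rng.normal([0.55, 0.2, -5.0, 5.0], 0.05, size=(200, 4))
X = np.vstack([liver, background])
y = np.array([1] * 200 + [0] * 200)   # 1 = organ of interest, 0 = background

# Same hyperparameters as in the paper: 50 trees, depth <= 10.
forest = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
forest.fit(X, y)

# Per-pixel class probabilities, later accumulated in the TSD volume.
proba = forest.predict_proba(np.array([[0.02, 0.7, 25.0, 15.0]]))
```
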

Integration into Truncated Signed Distance Volume
Assuming the pose P_i of the camera at each time step is known, the point clouds R_i can be transformed into the world coordinate system, R_i^W = P_i(R_i). At every time step, R_i^W is integrated into a TSD volume S_i(p) → [F_i(p), K_i(p, j), W_i(p)], where p is a voxel in the volume. The TSD value F_i(p) and the weight W_i(p) are computed as suggested in Ref. 14:
F_i(p) = [W_{i-1}(p) F_{i-1}(p) + W_{R_i}(p) F_{R_i}(p)] / [W_{i-1}(p) + W_{R_i}(p)],   (1)
W_i(p) = W_{i-1}(p) + W_{R_i}(p),   (2)
where W_{R_i}(p) is the weight of voxel p in the current frame. It can be used to weight the TSD value F_{R_i} computed for the current frame according to the measurement uncertainty, or set uniformly to one. F_{R_i} can be computed as
F_{R_i}(p) = Ψ(λ^{-1} ||t_i − p||_2 − R_i(x)),   (3)
λ = ||K^{-1} ẋ||_2,   (4)
x = ⌊K T_i^{-1} p⌋,   (5)
Ψ(η) = { min(1, η/μ) sgn(η),  η ≥ −μ
       { undefined,           else,   (6)
where K is the camera calibration matrix, ẋ is the homogenized image coordinate x, ⌊·⌋ is the nearest-neighbor lookup, T_i is the camera transformation, and t_i is the translation part of T_i. The factor λ^{-1} converts the ray distance ||t_i − p||_2 to a depth value in the camera coordinate system. The function Ψ(η) specifies the area of influence of R_i over the voxels, and the parameter μ is the maximal distance before the influence of a point on a voxel is truncated.
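A minimal numpy sketch of this per-voxel update: the truncation Ψ (interpreted, as is common for KinectFusion-style fusion, as η/μ clamped to [−1, 1] and undefined more than μ behind the surface), followed by the weighted running average with the per-frame weight set uniformly to one. The value of μ is an illustrative assumption.

```python
import numpy as np

MU = 10.0  # truncation distance mu in mm (illustrative value)

def psi(eta):
    # Psi(eta): eta/mu clamped to [-1, 1]; undefined (NaN) for eta < -mu,
    # i.e., for voxels more than mu behind the observed surface.
    return np.where(eta >= -MU, np.clip(eta / MU, -1.0, 1.0), np.nan)

def integrate(F_prev, W_prev, F_frame, W_frame=1.0):
    # Weighted running average of TSD values, as in Newcombe et al.:
    # F_i = (W_{i-1} F_{i-1} + W_R F_R) / (W_{i-1} + W_R), W_i = W_{i-1} + W_R.
    # Voxels without a valid measurement keep their previous values.
    valid = np.isfinite(F_frame)
    num = W_prev * F_prev + W_frame * np.where(valid, F_frame, 0.0)
    den = W_prev + np.where(valid, W_frame, 0.0)
    F = np.where(valid, num / np.maximum(den, 1e-9), F_prev)
    return F, den

# Three voxels over two frames: in front of the surface, far in front
# (clamped to 1), and behind the surface (truncated away in frame 1).
F = np.zeros(3)
W = np.zeros(3)
F, W = integrate(F, W, psi(np.array([5.0, 20.0, -15.0])))
F, W = integrate(F, W, psi(np.array([-5.0, 20.0, 5.0])))
# F is now [0.0, 1.0, 0.5], W is [2.0, 2.0, 1.0]
```
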
We included K_i(p, j) in the volume to account for the class membership of p:
K_i(p, j) = K_{i-1}(p, j) + k_i(p, j),   (7)
k_i(p, j) = { W_{R_i}(p),  C_i(R_i^W(p)) = j
            { 0,           else,   (8)
where R_i^W(p) represents the point in R_i^W that lies in p and j stands for the classifier category (e.g., background and target structure).
The class membership C_i(p) at the current time step can then be computed as
C_i(p) = argmax_{j ∈ {1, …, n}} K_i(p, j).   (9)
This way of smoothing class membership over time allows our system to cope with potential misclassifications.
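The accumulation of per-frame segmentation evidence and the argmax of Eq. (9) can be sketched as follows. Here per-frame class probabilities are summed (the paper accumulates single-frame segmentation probabilities per global surface point); the toy values illustrate how one misclassified frame is outvoted over time.

```python
import numpy as np

def accumulate_labels(K_prev, probs, w_frame=1.0):
    """Add the current frame's per-class evidence to the per-voxel histogram K.
    K_prev: (n_voxels, n_classes); probs: per-frame class probabilities for the
    surface point falling into each voxel (rows of zeros where nothing was seen).
    """
    return K_prev + w_frame * probs

def voxel_class(K):
    # Eq. (9): C_i(p) = argmax_j K_i(p, j)
    return np.argmax(K, axis=1)

# Two classes (0 = background, 1 = liver), three voxels, three frames.
# Voxel 0 is misclassified in frame 2 but outvoted by the accumulated evidence.
K = np.zeros((3, 2))
K = accumulate_labels(K, np.array([[0.1, 0.9], [0.8, 0.2], [0.0, 0.0]]))
K = accumulate_labels(K, np.array([[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]]))
K = accumulate_labels(K, np.array([[0.2, 0.8], [0.9, 0.1], [0.1, 0.9]]))
labels = voxel_class(K)
# labels is [1, 0, 1]
```
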

Camera Pose
To integrate the point cloud R_i into the TSD volume, the pose P_i of the camera at time step i has to be known. In this paper, we consider three methods for estimating P_i.
1. ICP: We adopt the assumption of Newcombe et al. 14 that the pose of the camera changes only slightly between frames. By registering R_i with a ray cast of the TSD volume using the projective data association ICP algorithm, 20 we estimate P_i. With the small movement assumption and the special ICP variant, all pixels can be used in real time.
2. Polaris: We use the NDI Polaris optical tracking system to track both camera and the patient.
3. Mixed: We combine the two methods by using the tracking information as a seed for the ICP.
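The frame-to-model alignment that the ICP-based modes rely on reduces, at each iteration, to estimating a rigid transform from matched point pairs. Below is a minimal numpy sketch of that core step (the Kabsch/SVD solution); the projective data association of Ref. 20, which supplies the correspondences, is omitted, and correspondences are assumed given.

```python
import numpy as np

def rigid_align(src, dst):
    """Best-fit rotation R and translation t with dst ~= src @ R.T + t (Kabsch)."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)                # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ D @ U.T
    t = cd - R @ cs
    return R, t

# Synthetic check: rotate/translate a small cloud and recover the pose exactly.
rng = np.random.default_rng(1)
src = rng.normal(size=(100, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([5.0, -2.0, 1.0])
dst = src @ R_true.T + t_true
R, t = rigid_align(src, dst)
```

With noiseless, fully known correspondences the pose is recovered exactly; real frame-to-model ICP iterates this step while re-estimating correspondences by projection.
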

Results
We performed five experiments to evaluate our system using in silico, phantom, ex vivo, and in vivo livers. For each liver, a reference was computed by laser scan or CT. In each experiment, we moved the stereo endoscope over the liver and used the captured images to reconstruct and segment the liver simultaneously. For each experiment, three mosaicked models, each with a different method for tracking the camera pose, were constructed as described in Sec. 2.3. Afterward, we computed the average distance of each intraoperatively reconstructed point to the reference for each model. To reduce the influence of tracking errors, the mosaicked porcine liver models were registered to the reference using ICP. For comparison, we also computed the average distance of the unprocessed single-frame point clouds R_i^W to the ground truth; the camera pose used for transforming each point cloud into the world coordinate system was given by an NDI Polaris optical tracking system. For the two silicone and the first ex vivo experiment, a calibrated phase alternating line (PAL) stereo endoscope with a fixed camera unit and a PC workstation were used. Both configurations took, on average, ∼0.25 s for one frame integration, implying a frame rate of ∼4 fps. More run-time information is available in Table 1.
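The average-distance metric used above can be sketched as the mean nearest-neighbor distance from the reconstructed cloud to the reference surface points. A brute-force numpy version is shown below (a k-d tree would be used for clouds of realistic size); the grid and the 1 mm offset are illustrative.

```python
import numpy as np

def mean_nn_distance(points, reference):
    """Average distance of each reconstructed point to its nearest reference
    point (brute force; prefer a k-d tree for large clouds)."""
    d = np.linalg.norm(points[:, None, :] - reference[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Usage: a reconstruction offset 1 mm along z from a small reference grid.
ref = np.array([[x, y, 0.0] for x in range(5) for y in range(5)], dtype=float)
pts = ref + np.array([0.0, 0.0, 1.0])
err = mean_nn_distance(pts, ref)   # 1.0 mm
```
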

In Silico
In order to evaluate the mosaicking without the errors induced by the stereo matching (HRM), we used a simulation framework to generate a circular image sequence of a textured CT liver model (Fig. 2). For each of the 320 images, a depth map and the camera position were computed. With the simulated input data, an accurate mosaicked reconstruction of the model was achieved (Table 2).
The simulation was also used to create noisy depth data to evaluate the mosaicking behavior on imperfect data. The noise was generated with a Perlin noise model, as it resembles the errors made by the HRM. Three different noise levels were used: noise 1 (mean error 1.12 ± 0.86 mm), noise 2 (2.30 ± 1.68 mm), and noise 3 (3.36 ± 2.49 mm). The results show that the mosaicking reduces the noise and produces a more accurate model than the single-shot reconstructions (Fig. 3).
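The exact Perlin noise parameters are not given in the text; as an illustration of how such spatially correlated depth noise can be produced, here is a simple value-noise variant (bilinear upsampling of a coarse Gaussian grid), which shares Perlin noise's smooth, band-limited character. Image size, cell size, amplitude, and the flat 400 mm depth map are all illustrative assumptions.

```python
import numpy as np

def value_noise(shape, cell, amplitude, rng):
    """Spatially correlated noise: bilinearly upsample a coarse random grid.
    A simple stand-in for the Perlin noise used in the paper."""
    gh, gw = shape[0] // cell + 2, shape[1] // cell + 2
    grid = rng.normal(0.0, amplitude, size=(gh, gw))
    ys = np.arange(shape[0]) / cell
    xs = np.arange(shape[1]) / cell
    y0 = ys.astype(int)
    x0 = xs.astype(int)
    fy = (ys - y0)[:, None]                    # fractional parts for blending
    fx = (xs - x0)[None, :]
    g00 = grid[y0][:, x0]
    g01 = grid[y0][:, x0 + 1]
    g10 = grid[y0 + 1][:, x0]
    g11 = grid[y0 + 1][:, x0 + 1]
    top = g00 * (1 - fx) + g01 * fx
    bot = g10 * (1 - fx) + g11 * fx
    return top * (1 - fy) + bot * fy

# Perturb a flat synthetic depth map with smooth noise.
rng = np.random.default_rng(0)
noise = value_noise((120, 160), cell=16, amplitude=2.0, rng=rng)
noisy_depth = np.full((120, 160), 400.0) + noise
```
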

Phantom Liver
After verifying the method in silico, we performed five phantom experiments with three silicone livers (Fig. 4). The first two livers were recorded with both the Wolf and the HD stereo endoscope, and the third only with the HD stereo endoscope (Fig. 5). The first (Wolf endoscope 1 and HD endoscope 1) and third (HD endoscope 3) liver were placed on a flat surface, whereas the second liver (Wolf endoscope 2 and HD endoscope 2) was placed inside a 3-D printed patient phantom (Fig. 4). As previously mentioned, an NDI Polaris optical tracking system was used for endoscope position tracking. To evaluate the ICP-only approach, the Polaris tracking data from the first image frame served as the initial registration to the reference model.
The results show that the use of an HD stereo endoscope increases the quality and stability of the method (Table 3). In combination with the Wolf stereo endoscope, our method produces the best results in Polaris mode; with the HD stereo endoscope, the best results shift toward mixed mode. Figure 6 illustrates an example of a failed reconstruction using the ICP for frame-to-model registration. Multiple consecutive frame-to-model registrations with high errors in position or orientation usually lead to a fracture in the final reconstruction, i.e., the spatial relation of the reconstructed parts before and after the ICP failure(s) is erroneous.
To determine whether the models created by our approach are suitable for registering a preoperative model in the absence of soft tissue deformation, we transformed the model for silicone 1 using multiple random rigid transformations. Thereupon, we performed a rough registration of the model to the reference laser scan with fast point feature histograms 21 and fine-tuned it using ICP. The average distance error for 600 random transformations was 13.19 ± 23.39 mm, with 90% having an

Table 3 The RMS error between the mosaicked models and the reference models in mm. The use of the HD stereo endoscope caused an overall improvement of the results. The first and second recordings (lines 1 and 2) illustrate the strong influence of the HRM quality on the final reconstruction. The error levels in single shot (single HRM reconstructions) are reflected in the results of the final reconstruction.
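The random rigid transformations used in this test can be generated by drawing uniformly distributed rotations; a common recipe, sketched below, is a sign-corrected QR decomposition of a Gaussian matrix. The FPFH and ICP registration steps themselves are not reproduced, and the 50 mm translation range is an illustrative assumption, not the paper's value.

```python
import numpy as np

def random_rigid_transform(rng, t_scale=50.0):
    """Draw a random rotation and translation. QR of a Gaussian matrix with a
    sign correction yields rotations uniformly distributed on the orthogonal
    group; flipping the sign if needed ensures a proper rotation (det = +1)."""
    Q, Rq = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q @ np.diag(np.sign(np.diag(Rq)))   # sign correction
    if np.linalg.det(Q) < 0:
        Q = -Q                              # for 3x3, negation flips det to +1
    t = rng.uniform(-t_scale, t_scale, size=3)
    return Q, t

# Usage: perturb a model point cloud before attempting re-registration.
rng = np.random.default_rng(42)
R, t = random_rigid_transform(rng)
```
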


Ex Vivo Porcine Liver
As a first step into the real operating environment, two ex vivo porcine liver experiments were conducted. In the first experiment (liver 1), the Wolf stereo endoscope was used, and reference data were provided by a laser scan. For the second experiment (liver 2), we used the high-resolution HD Storz stereo endoscope and CT imaging as reference data (Fig. 7). The results from liver 1 are comparable to the phantom data obtained with the same hardware, showing that the HRM can cope with real liver texture. The second experiment, using the HD Storz stereo endoscope, reduced the root-mean-square error from 4.21 to 1.51 mm (Table 4). While slightly different experimental settings could cause small differences, the large improvement is certainly due to the better image quality and resolution.

In Vivo Porcine Liver
To evaluate our system in an in vivo setting, we performed an animal experiment. At first, the pig was prepared for surgery and placed on the CT table (Fig. 8). After applying a pneumoperitoneum as well as placing ports for the endoscope and instruments, we recorded several image sequences featuring a sweep of the porcine liver. Shortly after each sequence, a CT scan was taken in order to evaluate the sequence, using the liver model acquired through the scan. To minimize breathing deformation between the two image modalities, respiration was paused between scan acquisitions. The in vivo results agree with the previous results of the second ex vivo experiment; both were obtained using the HD Storz stereo endoscope (Fig. 9). As in the previous ex vivo experiment, the error in all three sequences was smallest in mixed mode (Table 5). The mean error of the three mixed mode results is 0.86 mm.

Discussion
In this paper, we presented an approach enabling the reconstruction and segmentation of organs from multiple viewpoints online during laparoscopic surgery. We have clearly demonstrated that mosaicking multiple reconstructions reduces the distance error when compared to single-shot reconstructions.

Table 4 The RMS error between the mosaicked models and the ground truth models in mm. The HD stereo endoscope clearly outperforms the older Wolf stereo endoscope. This demonstrates the importance of image quality and image resolution for the reconstruction result.

Furthermore, we have shown that using a mosaicked model for rigid registration produces a significantly smaller error (dropping from 90 to 13 mm). The comparison between results from the Wolf stereo endoscope and the HD stereo endoscope offers insight into the correlation between image quality and the final reconstruction result. The data suggest that image quality and image resolution are important for two steps. First, the HRM reconstruction needs a certain image quality, e.g., good illumination, resolution, and little distortion, to produce satisfactory results; increased sensor noise greatly reduces the reconstruction quality (as shown in Fig. 10). Second, for the ICP-based methods (ICP only and mixed), a bad frame reconstruction not only affects the mosaicked model directly, but also the frame-to-model registration, as the ICP uses the frame 3-D reconstruction to register the frame to the model created so far. Without the use of Polaris tracking, multiple consecutive bad frame registrations usually lead to a complete failure of the mosaicking attempt. The Polaris localization method allows a higher HRM error tolerance since the patches are at least placed at the correct location.
In our experiments, Polaris tracking was necessary to achieve the best results, but advances in hardware, such as HD stereo endoscopes, will make image-based tracking more robust. As shown in our ex vivo experiments, the ICP-only error dropped by 78% due to the use of the better HD endoscope. Also, the mixed mode exceeds the pure Polaris method when used with the HD endoscope, i.e., the small localization errors were reduced by the ICP. This is a synergetic process, as Polaris provides the good initial alignment needed for a stable ICP.
Our work has limitations. Objects, such as instruments, moving between the camera and the organ lead to reconstruction errors. Although the instruments are likely classified as background, they are still integrated into the voxel volume, causing an erroneous morphing of the previously captured organ surface underneath. To fix this problem, the instruments have to be explicitly classified in the image and the associated pixels excluded from the integration process; we are currently working on a stable automatic classification of instruments. A general problem is the HRM reconstruction quality: slight deviations from suitable illumination settings can lead to bad reconstruction results, as shown in Fig. 10, so careful monitoring of the capture settings is needed. Since our method relies on surface sweeps, sufficient space for endoscopic movement is required; otherwise, not enough surface area is captured for reconstruction. Finally, the frame-to-model ICP registration modes (mixed and ICP only) are likely not suitable for organs with a uniform appearance (e.g., prostate), or would at least produce a higher error than with distinctly shaped organs (e.g., liver or kidney).
Future research will focus on accounting for dynamic scenes, as currently only static scenes were considered, meaning that soft tissue deformation was not taken into account. Due to the shown limitations of the frame-to-model ICP registration, other methods for localization should be evaluated to lessen the dependency on optical tracking systems. In particular, feature-based approaches that take advantage of the veined surface of organs, and of color information in general, are a promising addition to depth-only methods.