Quantitative comparison of automatic multi-organ segmentations by means of Dice scores is often unsatisfactory. It is especially challenging when the reference contours themselves may be prone to errors. We developed a novel approach that analyzes regions of high mismatch between automatic and reference segmentations. We extract various metrics characterizing these mismatch clusters and compare them to other metrics derived from volume overlap and surface distance histograms by correlating them with qualitative ratings from clinical experts. We show that some novel features based on the mismatch sets or surface distance histograms performed better than the Dice score. We also show how the mismatch clusters can be used to generate visualizations that reduce the workload of visually inspecting segmentation results. The visualizations directly compare the reference to the automatic result at locations of high mismatch, in orthogonal 2D views and in 3D scenes zoomed to the appropriate positions. This can make it easier to detect systematic problems of an algorithm or to compare recurrent error patterns across variants of a segmentation algorithm, such as differently parameterized or trained CNN models.
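As a minimal illustration of the mismatch-cluster idea (a sketch, not the exact pipeline described above), the Dice score and the connected components of the disagreement between two binary masks can be computed as follows; the minimum cluster size is an assumed, illustrative parameter:

```python
import numpy as np
from scipy import ndimage

def dice_score(a, b):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def mismatch_clusters(auto, ref, min_voxels=5):
    """Sizes of the connected components of the symmetric difference
    between automatic and reference masks; tiny clusters are discarded.
    min_voxels is an illustrative choice, not a value from the paper."""
    mismatch = np.logical_xor(auto, ref)
    labels, n = ndimage.label(mismatch)
    sizes = ndimage.sum(mismatch, labels, index=range(1, n + 1))
    return [int(s) for s in sizes if s >= min_voxels]

# toy example: two slightly shifted squares on a 2D grid
ref = np.zeros((20, 20), dtype=bool); ref[5:15, 5:15] = True
auto = np.zeros((20, 20), dtype=bool); auto[6:16, 6:16] = True
print(dice_score(auto, ref))        # prints 0.81
print(mismatch_clusters(auto, ref)) # prints [19, 19]
```

In this sketch, further per-cluster features (e.g., extent or distance to the organ surface) could then be derived from the labeled components and correlated with expert ratings.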
Adaptive radiotherapy (RT) planning requires segmentation of organs in order to adapt the RT treatment plan to changes in the patient’s anatomy. Daily imaging is often done using cone-beam CT (CBCT) devices, which, due to scatter and artifacts, produce images of considerably lower quality than CT. Involuntary patient motion during the comparatively long CBCT acquisition may cause misalignment artifacts. In the pelvis, the most severe artifacts stem from motion of air and soft-tissue boundaries in the bowel, which appears as streaking in the reconstructed images. Together with the low soft-tissue contrast, this makes segmentation of organs close to the bowel, such as the bladder and uterus, even more difficult. Deep learning (DL) methods have been shown to be promising for difficult segmentation tasks. In this work, we investigate different artifact-driven sampling schemes that incorporate domain knowledge into the DL training. However, global evaluation metrics such as the Dice score, often used in DL segmentation research, reveal little about systematic errors and offer no clear perspective on how to improve the training. Using slice-wise Dice scores, we find a clear difference in performance between slices with and without detected air. Moreover, especially when applied in a curriculum training scheme, specifically sampling slices on which air has been detected may help to increase the robustness of deep neural networks towards artifacts while maintaining performance on artifact-free slices.
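A slice-wise Dice evaluation of the kind described here can be sketched as follows; the HU threshold and the minimum voxel count used to flag a slice as containing air are illustrative assumptions, not the values used in this work:

```python
import numpy as np

def slicewise_dice(pred, ref):
    """Per-slice Dice along the first (z) axis of two binary volumes."""
    scores = []
    for p, r in zip(pred, ref):
        denom = p.sum() + r.sum()
        scores.append(2.0 * np.logical_and(p, r).sum() / denom
                      if denom > 0 else 1.0)
    return np.array(scores)

def air_slices(volume_hu, threshold=-500, min_voxels=50):
    """Boolean array marking slices with at least min_voxels of air.
    Threshold and count are illustrative, assumed parameters."""
    return np.array([(s < threshold).sum() >= min_voxels
                     for s in volume_hu])

# hypothetical use: compare mean Dice on air-affected vs. clean slices
# has_air = air_slices(cbct)
# d = slicewise_dice(pred, ref)
# print(d[has_air].mean(), d[~has_air].mean())
```

Splitting the per-slice scores by the air flag gives the kind of stratified comparison the abstract refers to, and the flag itself could drive the sampling of training slices.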
The segmentation of organs at risk is a crucial and time-consuming step in radiotherapy planning. Good automatic methods can significantly reduce the time clinicians have to spend on this task. Due to its variability in shape and its low contrast to surrounding structures, the parotid gland is challenging to segment. Motivated by the recent success of deep learning, we study the use of two-dimensional (2-D), 2-D ensemble, and three-dimensional (3-D) U-Nets for segmentation. The mean Dice similarity to ground truth is ∼0.83 for all three models. A patch-based approach for class balancing seems promising for false-positive reduction. The 2-D ensemble and 3-D U-Net are applied to the test data of the 2015 MICCAI challenge on head and neck autosegmentation. Both deep learning methods generalize well to independent data (Dice 0.865 and 0.88) and are superior to a selection of model- and atlas-based methods with respect to the Dice coefficient. Since appropriate reference annotations are essential for training but often difficult and expensive to obtain, it is important to know how many samples are needed for training. We evaluate the performance after training with training sets of different sizes and observe no significant increase in the Dice coefficient for more than 250 training cases.
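The patch-based class-balancing idea can be illustrated with a simple sampler that draws a fixed fraction of training patches centered on foreground (organ) voxels and the rest at random; the function name, patch size, and foreground fraction are hypothetical choices for this sketch, not the exact scheme used in the study:

```python
import numpy as np

def sample_balanced_patches(image, mask, n_patches=8, size=32,
                            fg_fraction=0.5, rng=None):
    """Draw 2D patches so that roughly fg_fraction of them are centered
    on foreground voxels; the remainder are drawn uniformly at random.
    All parameters are illustrative assumptions."""
    rng = np.random.default_rng(rng)
    half = size // 2
    fg = np.argwhere(mask)  # coordinates of foreground voxels
    patches = []
    for i in range(n_patches):
        if i < int(n_patches * fg_fraction) and len(fg) > 0:
            cy, cx = fg[rng.integers(len(fg))]
        else:
            cy = rng.integers(half, image.shape[0] - half)
            cx = rng.integers(half, image.shape[1] - half)
        # keep the patch fully inside the image
        cy = int(np.clip(cy, half, image.shape[0] - half))
        cx = int(np.clip(cx, half, image.shape[1] - half))
        patches.append((image[cy - half:cy + half, cx - half:cx + half],
                        mask[cy - half:cy + half, cx - half:cx + half]))
    return patches
```

Oversampling foreground-centered patches in this way exposes the network to more organ voxels per batch, which is one plausible mechanism behind the false-positive reduction mentioned above.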
The segmentation of target structures and organs at risk is a crucial and very time-consuming step in radiotherapy planning. Good automatic methods can significantly reduce the time clinicians have to spend on this task. Due to its variability in shape and its often low contrast to surrounding structures, segmentation of the parotid gland is especially challenging. Motivated by the recent success of deep learning, we study different deep learning approaches for parotid gland segmentation. In particular, we compare 2D, 2D ensemble, and 3D U-Net approaches and find that the 2D U-Net ensemble yields the best results, with a mean Dice score of 0.817 on our test data. The ensemble approach reduces false positives without the need for an automatic region-of-interest detection. We also apply our trained 2D U-Net ensemble to segment the test data of the 2015 MICCAI head and neck auto-segmentation challenge. With a mean Dice score of 0.861, our classifier exceeds the highest mean score in the challenge. This shows that the method generalizes well to data from independent sites. Since appropriate reference annotations are essential for training but often difficult and expensive to obtain, it is important to know how many samples are needed to properly train a neural network. We evaluate the classifier performance after training with training sets of different sizes (50–450 cases) and find that 250 cases (without using extensive data augmentation) are sufficient to obtain good results with the 2D ensemble; adding more samples does not significantly improve the Dice score of the segmentations.
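A common way to build such a 2D ensemble is to average slice-wise probability maps from models applied along the three orthogonal axes; the interface below is an assumption for illustration, not the implementation used in the work:

```python
import numpy as np

def ensemble_segmentation(volume, models, threshold=0.5):
    """Average per-voxel foreground probabilities from 2D models applied
    slice-wise along different axes, then threshold the mean.
    `models` maps an axis index to a callable that returns a probability
    map per slice; this interface is a hypothetical sketch."""
    probs = np.zeros(volume.shape, dtype=float)
    for axis, model in models.items():
        moved = np.moveaxis(volume, axis, 0)      # bring axis to front
        pred = np.stack([model(s) for s in moved])
        probs += np.moveaxis(pred, 0, axis)       # restore orientation
    probs /= len(models)
    return probs >= threshold

# hypothetical use with axial/sagittal/coronal models:
# seg = ensemble_segmentation(ct, {0: axial_net, 1: sagittal_net,
#                                  2: coronal_net})
```

Averaging before thresholding means an isolated false positive from a single orientation is suppressed unless the other views agree, which matches the false-positive reduction attributed to the ensemble above.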