3D ultrasound (US) transducers will improve the quality of image-guided medical interventions if automated detection of the needle becomes possible. Image-based detection of the needle is challenging due to the presence of other echogenic structures in the acquired data, inconsistent visibility of needle parts and the low quality of US imaging. As the currently applied approaches for needle detection classify each voxel individually, they do not consider the global relations between the voxels. In this work, we introduce coherent needle labeling by using dense conditional random fields over a volume, along with 3D space-frequency features. The proposal includes long-distance dependencies in voxel pairs according to their similarities in the feature space and their spatial distance. This post-processing stage leads to better label assignment of volume voxels and a more compact and coherent segmented region. Our ex-vivo experiments based on measuring the F-1, F-2 and IoU scores show that the performance improves significantly, by 10-20%, compared with only using the linear SVM as a baseline for voxel classification.
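For readers unfamiliar with the reported metrics, the following minimal Python sketch shows how F-1, F-2 and IoU scores can be computed from binary voxel labels. It assumes the predicted and ground-truth labels are given as flat sequences of 0/1 values; the function names are illustrative and not taken from the paper.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives and false negatives."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn


def f_beta(pred, truth, beta=1.0):
    """F-beta score; beta=1 gives F-1, beta=2 weights recall more (F-2)."""
    tp, fp, fn = confusion_counts(pred, truth)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def iou(pred, truth):
    """Intersection over union of the predicted and true needle voxels."""
    tp, fp, fn = confusion_counts(pred, truth)
    union = tp + fp + fn
    return tp / union if union else 0.0


# Toy example: 8 voxels, 4 of which belong to the needle.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(f_beta(pred, truth, 1.0), f_beta(pred, truth, 2.0), iou(pred, truth))
```

F-2 penalizes missed needle voxels more strongly than false alarms, which is presumably why it is reported alongside F-1.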
Advanced image analysis can lead to automated examination of histopathology images, which is essential for objective and fast cancer diagnosis. Recently, deep learning methods, in particular Convolutional Neural Networks (CNNs), have shown exceptionally successful performance on medical image analysis as well as computational histopathology. Because Whole-Slide Images (WSIs) have a very large size, the CNN models are commonly applied to classify WSIs per patch. Although a CNN is trained on a large part of the input space, the spatial dependencies between patches are ignored and the inference is performed only on the appearance of the individual patches. Therefore, predictions on neighboring regions can be inconsistent. In this paper, we apply Conditional Random Fields (CRFs) over latent spaces of a trained deep CNN in order to jointly assign labels to the patches. In our approach, compact features extracted from intermediate layers of a CNN are considered as observations in a fully-connected CRF model. This leads to performing inference on a wider context rather than on the appearance of individual patches. Experiments show an improvement of approximately 3.9% on average FROC score for tumorous region detection in histopathology WSIs. Our proposed model, trained on the Camelyon17 ISBI challenge dataset, won 2nd place with a kappa score of 0.8759 in patient-level pathologic lymph node classification for breast cancer detection.
Vestibular schwannomas (VS) are benign brain tumors that can be treated with high-precision focused radiation with the Gamma Knife in order to stop tumor growth. Outcome prediction of Gamma Knife radiosurgery (GKRS) treatment can help in determining whether GKRS will be effective on an individual patient basis. However, at present, prognostic factors of tumor control after GKRS for VS are largely unknown, and only clinical factors, such as size of the tumor at treatment and pre-treatment growth rate of the tumor, have been considered thus far. This research aims at outcome prediction of GKRS by means of quantitative texture feature analysis on conventional MRI scans. We compute first-order statistics and features based on gray-level co-occurrence (GLCM) and run-length matrices (RLM), and employ support vector machines and decision trees for classification. In a clinical dataset, consisting of 20 tumors showing treatment failure and 20 tumors exhibiting treatment success, we have discovered that the second-order statistical metrics distilled from GLCM and RLM are suitable for describing texture, but are slightly outperformed by simple first-order statistics, like mean, standard deviation and median. The obtained prediction accuracy is about 85%, but a final choice of the best feature can only be made after performing more extensive analyses on larger datasets. In any case, this work provides suitable texture measures for successful prediction of GKRS treatment outcome for VS.
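To make the two feature families concrete, the sketch below computes the first-order statistics named above and a simple GLCM-based contrast value. It is an illustration under stated assumptions, not the authors' pipeline: intensities are assumed to be already quantized to integer gray levels, the population standard deviation is used, and the GLCM is a non-symmetric count over horizontally adjacent pixels only.

```python
import statistics
from collections import Counter


def first_order_features(intensities):
    """Mean, (population) standard deviation and median of tumor intensities."""
    return {
        "mean": statistics.fmean(intensities),
        "std": statistics.pstdev(intensities),
        "median": statistics.median(intensities),
    }


def glcm_horizontal(image):
    """Count gray-level pairs of horizontally adjacent pixels (offset 0, +1)."""
    counts = Counter()
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[(left, right)] += 1
    return counts


def glcm_contrast(counts):
    """Contrast: squared gray-level difference averaged over all counted pairs."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n * (i - j) ** 2 for (i, j), n in counts.items()) / total


# Toy region of interest with three gray levels.
roi = [[0, 0, 1],
       [1, 2, 2]]
flat = [value for row in roi for value in row]
print(first_order_features(flat))
print(glcm_contrast(glcm_horizontal(roi)))
```

The first-order features ignore the spatial arrangement of intensities, whereas the GLCM contrast depends on it, which is the distinction the abstract draws between first- and second-order statistics.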
In current clinical practice, the resectability of pancreatic ductal adenocarcinoma (PDA) is determined subjectively by a physician, which is an error-prone procedure. In this paper, we present a method for automated determination of resectability of PDA from a routine abdominal CT, to reduce such decision errors. The tumor features are extracted from a group of patients with both hypo- and iso-attenuating tumors, of which 29 were resectable and 21 were not. The tumor contours are supplied by a medical expert. We present an approach that uses intensity, shape, and texture features to determine tumor resectability. The best classification results are obtained with fine Gaussian SVM and the L0 Feature Selection algorithms. Compared to expert predictions made on the same dataset, our method achieves better classification results. We obtain significantly better results on correctly predicting non-resectability (+17%) compared to an expert, which is essential for patient treatment (negative predictive value). Moreover, our predictions of resectability exceed expert predictions by approximately 3% (positive predictive value).
Volumetric Laser Endomicroscopy (VLE) is a promising technique for the detection of early neoplasia in Barrett’s Esophagus (BE). VLE generates hundreds of high-resolution, grayscale, cross-sectional images of the esophagus. However, at present, classifying these images is a time-consuming and cumbersome effort performed by an expert using a clinical prediction model. This paper explores the feasibility of using computer vision techniques to accurately predict the presence of dysplastic tissue in VLE BE images. Our contribution is threefold. First, a benchmarking is performed for widely applied machine learning techniques and feature extraction methods. Second, three new features based on the clinical detection model are proposed, having superior classification accuracy and speed, compared to earlier work. Third, we evaluate automated parameter tuning by applying simple grid search and feature selection methods. The results are evaluated on a clinically validated dataset of 30 dysplastic and 30 non-dysplastic VLE images. Optimal classification accuracy is obtained by applying a support vector machine and using our modified Haralick features and optimal image cropping, obtaining an area under the receiver operating characteristic curve of 0.95 compared to the clinical prediction model at 0.81. Optimal execution time is achieved using a proposed mean and median feature, which is extracted at least a factor of 2.5 faster than alternative features with comparable performance.
Over the past decade, the imaging tools for endoscopists have improved drastically. This has enabled physicians to visually inspect the intestinal tissue for early signs of malignant lesions. Besides this, recent studies show the feasibility of supportive image analysis for endoscopists, but the analysis problem is typically approached as a segmentation task where binary ground truth is employed. In this study, we show that the detection of early cancerous tissue in the gastrointestinal tract cannot be approached as a binary segmentation problem and that it is crucial and clinically relevant to involve multiple experts for annotating early lesions. By employing the so-called sweet spot for training purposes as a metric, a much better detection performance can be achieved. Furthermore, a multi-expert-based ground truth, i.e. a golden standard, enables an improved validation of the resulting delineations. For this purpose, besides the sweet spot we also propose another novel metric, the Jaccard Golden Standard (JIGS), that can handle multiple ground-truth annotations. Our experiments involving these new metrics and based on the golden standard show that the performance of a detection algorithm for early neoplastic lesions in Barrett's esophagus can be increased significantly, demonstrating a 10 percentage point increase in the resulting F1 detection score.
Esophageal cancer is one of the fastest rising forms of cancer in the Western world. Using High-Definition (HD) endoscopy, gastroenterology experts can identify esophageal cancer at an early stage. Recent research shows that early cancer can be found using a state-of-the-art computer-aided detection (CADe) system based on analyzing static HD endoscopic images. Our research aims at extending this system by applying Random Forest (RF) classification, which introduces a confidence measure for detected cancer regions. To visualize this data, we propose a novel automated annotation system, employing the unique characteristics of the previous confidence measure. This approach allows reliable modeling of multi-expert knowledge and provides essential data for real-time video processing, to enable future use of the system in a clinical setting. The performance of the CADe system is evaluated on a 39-patient dataset, containing 100 images annotated by 5 expert gastroenterologists. The proposed system reaches a precision of 75% and recall of 90%, thereby improving the state-of-the-art results by 11 and 6 percentage points, respectively.
Recently, compressed-sensing based algorithms have enabled volume reconstruction from projection images acquired over a relatively small angle (θ < 20°). These methods enable accurate depth estimation of surgical tools with respect to anatomical structures. However, they are computationally expensive and time-consuming, rendering them unattractive for image-guided interventions. We propose an alternative approach for depth estimation of biopsy needles during image-guided interventions, in which we split the problem into two parts and solve them independently: needle-depth estimation and volume reconstruction. The complete proposed system consists of the previous two steps, preceded by needle extraction. First, we detect the biopsy needle in the projection images and remove it by interpolation. Next, we exploit epipolar geometry to find point-to-point correspondences in the projection images to triangulate the 3D position of the needle in the volume. Finally, we use the interpolated projection images to reconstruct the local anatomical structures and indicate the position of the needle within this volume. For validation of the algorithm, we have recorded a full CT scan of a phantom with an inserted biopsy needle. The performance of our approach ranges from a median error of 2.94 mm for a distributed viewing angle of 1° down to an error of 0.30 mm for an angle larger than 10°. Based on the results of this initial phantom study, we conclude that multi-view geometry offers an attractive alternative to time-consuming iterative methods for the depth estimation of surgical tools during C-arm-based image-guided interventions.
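The triangulation step can be illustrated with a standard two-ray midpoint construction. This is a generic sketch, not the authors' implementation: it assumes that each corresponding image point has already been converted into a 3D ray (source position plus direction), and all names are hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2."""
    w = sub(o1, o2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b  # approaches zero as the rays become parallel
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not recoverable")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = add(o1, scale(d1, s))
    p2 = add(o2, scale(d2, t))
    return scale(add(p1, p2), 0.5)


# Two hypothetical source positions observing the same needle point (1, 2, 3).
tip = triangulate_midpoint((0.0, 0.0, 0.0), (1.0, 2.0, 3.0),
                           (10.0, 0.0, 0.0), (-9.0, 2.0, 3.0))
print(tip)
```

For small viewing angles the two rays are almost parallel and the denominator becomes small, so noise in the correspondences is amplified in depth. This is consistent with the larger error reported at 1° compared with angles above 10°.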
Ultrasound imaging is employed for needle guidance in various minimally invasive procedures such as biopsy guidance, regional anesthesia and brachytherapy. Unfortunately, needle guidance using 2D ultrasound is very challenging, due to poor needle visibility and a limited field of view. Nowadays, 3D ultrasound systems are available and more widely used. Consequently, with an appropriate 3D image-based needle detection technique, needle guidance and interventions may be significantly improved and simplified. In this paper, we present a multi-resolution Gabor transformation for an automated and reliable extraction of the needle-like structures in a 3D ultrasound volume. We study and identify the best combination of the Gabor wavelet frequencies. High precision in detecting the needle voxels leads to a robust and accurate localization of the needle for the intervention support. Evaluation in several ex-vivo cases shows that the multi-resolution analysis significantly improves the precision of the needle voxel detection from 0.23 to 0.32 at a high recall rate of 0.75 (gain 40%), where a better robustness and confidence were confirmed in the practical experiments.
Ultrasound-guided needle interventions are widely practiced in medical diagnostics and therapy, e.g. for biopsy guidance, regional anesthesia or brachytherapy. Needle guidance using 2D ultrasound can be very challenging due to the poor needle visibility and the limited field of view. Since 3D ultrasound transducers are becoming more widely used, needle guidance can be improved and simplified with appropriate computer-aided analyses. In this paper, we compare two state-of-the-art 3D needle detection techniques: a technique based on line filtering from literature and a system employing Gabor transformation. Both algorithms utilize supervised classification to pre-select candidate needle voxels in the volume and then fit a model of the needle on the selected voxels. The major differences between the two approaches are in extracting the feature vectors for classification and selecting the criterion for fitting. We evaluate the performance of the two techniques using manually-annotated ground truth in several ex-vivo situations of different complexities, containing three different needle types with various insertion angles. This extensive evaluation provides a better understanding of the limitations and advantages of each technique under different acquisition conditions, which leads to the development of improved techniques for more reliable and accurate localization. Benchmarking results show that the Gabor features are better capable of distinguishing the needle voxels in all datasets. Moreover, it is shown that the complete processing chain of the Gabor-based method outperforms the line filtering in accuracy and stability of the detection results.
The growing traffic density in cities fuels the desire for collision assessment systems on public transportation. For this application, video analysis is broadly accepted as a cornerstone. For trams, the localization of tramway tracks is an essential ingredient of such a system, in order to estimate a safety margin for crossing traffic participants. Tramway-track detection is a challenging task due to the urban environment with clutter, sharp curves and occlusions of the track. In this paper, we present a novel and generic system to detect the tramway track in advance of the tram position. The system incorporates an inverse perspective mapping and a-priori geometry knowledge of the rails to find possible track segments. The contribution of this paper involves the creation of a new track reconstruction algorithm which is based on graph theory. To this end, we define track segments as vertices in a graph, in which edges represent feasible connections. This graph is then converted to a max-cost arborescence graph, and the best path is selected according to its location and additional temporal information based on a maximum a-posteriori estimate. The proposed system clearly outperforms a railway-track detector. Furthermore, the system performance is validated on 3,600 manually annotated frames. The obtained results are promising, where straight tracks are found in more than 90% of the images and complete curves are still detected in 35% of the cases.
Prematurely born infants receive special care in the Neonatal Intensive Care Unit (NICU), where various physiological parameters, such as heart rate, oxygen saturation and temperature are continuously monitored. However, there is no system for monitoring and interpreting their facial expressions, the most prominent discomfort indicator. In this paper, we present an experimental video monitoring system for automatic discomfort detection in infants’ faces based on the analysis of their facial expressions. The proposed system uses an Active Appearance Model (AAM) to robustly track both the global motion of the newborn’s face, as well as its inner features. The system detects discomfort by employing the AAM representations of the face on a frame-by-frame basis, using a Support Vector Machine (SVM) classifier. Three contributions increase the performance of the system. First, we extract several histogram-based texture descriptors to improve the AAM appearance representations. Second, we fuse the outputs of various individual SVM classifiers, which are trained on features with complementary qualities. Third, we improve the temporal behavior and stability of the discomfort detection by applying an averaging filter to the classification outputs. Additionally, for a higher robustness, we explore the effect of applying different image pre-processing algorithms for correcting illumination conditions and for image enhancement to evaluate possible detection improvements. The proposed system is evaluated in 15 videos of 8 infants, yielding a 0.98 AUC performance. As a bonus, the system offers monitoring of the infant’s expressions when it is left unattended and it additionally provides objective judgment of discomfort.
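The third contribution, temporal smoothing of the classifier outputs, can be sketched as a causal moving average over per-frame scores. The window length, the zero decision threshold and the function names below are assumptions made for illustration; the abstract does not specify them.

```python
def moving_average(scores, window=5):
    """Causal moving average over per-frame classifier scores."""
    smoothed = []
    running = 0.0
    for i, score in enumerate(scores):
        running += score
        if i >= window:
            running -= scores[i - window]  # drop the frame that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def detect_discomfort(scores, window=5, threshold=0.0):
    """Label a frame as discomfort when the smoothed score exceeds the threshold."""
    return [value > threshold for value in moving_average(scores, window)]


# A single negative outlier among positive (discomfort) scores.
frame_scores = [1.0, 1.0, -1.0, 1.0, 1.0]
print([s > 0.0 for s in frame_scores])
print(detect_discomfort(frame_scores, window=3))
```

With a window of three frames, the isolated negative score no longer flips the decision. This is the kind of stability gain the averaging filter is meant to provide, at the cost of a short delay when the true state changes.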
This paper proposes an original moving ship detection approach in video surveillance systems, especially concentrating on occlusion problems among ships and vegetation using context information. Firstly, an over-segmentation is performed to divide the frame into segments, which are classified by a Support Vector Machine (SVM) as water or non-water, while exploiting the context that ships move only in water. We assume the ship motion to be characterized by motion saliency and consistency, such that each ship distinguishes itself. Therefore, based on the water context model, non-water segments are merged into regions with motion similarity. Then, moving ships are detected by measuring the motion saliency of those regions. Experiments on real-life surveillance videos prove the accuracy and robustness of the proposed approach. We especially pay attention to testing in the cases of severe occlusions between ships and between ship and vegetation. The proposed algorithm outperforms, in terms of precision and recall, our earlier work and a proposal using SVM-based ship detection.
The use of contextual information can significantly aid scene understanding of surveillance video. Just detecting people and tracking them does not provide sufficient information to detect situations that require operator attention. We propose a proof-of-concept system that uses several sources of contextual information to improve scene understanding in surveillance video. The focus is on two scenarios that represent common video surveillance situations, parking lot surveillance and crowd monitoring. In the first scenario, a pan–tilt–zoom (PTZ) camera tracking system is developed for parking lot surveillance. Context is provided by the traffic sign recognition system to localize regular and handicapped parking spot signs as well as license plates. The PTZ algorithm has the ability to selectively detect and track persons based on scene context. In the second scenario, a group analysis algorithm is introduced to detect groups of people. Contextual information is provided by traffic sign recognition and region labeling algorithms and exploited for behavior understanding. In both scenarios, decision engines are used to interpret and classify the output of the subsystems and if necessary raise operator alerts. We show that using context information enables the automated analysis of complicated scenarios that were previously not possible using conventional moving object classification techniques.
In port surveillance, video-based monitoring is a valuable supplement to a radar system by helping to detect smaller ships in the shadow of a larger ship and with the possibility to detect nonmetal ships. Therefore, automatic video-based ship detection is an important research area for security control in port regions. An approach that automatically detects moving ships in port surveillance videos with robustness for occlusions is presented. In our approach, important elements from the visual, spatial, and temporal features of the scene are used to create a model of the contextual information and perform a motion saliency analysis. We model the context of the scene by first segmenting the video frame and contextually labeling the segments, such as water, vegetation, etc. Then, based on the assumption that each object has its own motion, labeled segments are merged into individual semantic regions even when occlusions occur. The context is finally modeled to help locating the candidate ships by exploring semantic relations between ships and context, spatial adjacency and size constraints of different regions. Additionally, we assume that the ship moves with a significant speed compared to its surroundings. As a result, ships are detected by checking motion saliency for candidate ships according to the predefined criteria. We compare this approach with the conventional technique for object classification based on support vector machine. Experiments are carried out with real-life surveillance videos, where the obtained results outperform two recent algorithms and show the accuracy and robustness of the proposed ship detection approach. The inherent simplicity of our algorithmic subsystems enables real-time operation of our proposal in embedded video surveillance, such as port surveillance systems based on moving, nonstatic cameras.
This paper presents an automatic ship detection approach for video-based port surveillance systems. Our approach combines context and motion saliency analysis. The context is represented by the assumption that ships only travel inside a water region. We perform motion saliency analysis since we expect ships to move with higher speed compared to the water flow and static environment. A robust water detection is first employed to extract the water region as contextual information in the video frame, which is achieved by graph-based segmentation and region-based classification. After the water detection, the segments labeled as non-water are merged to form the regions containing candidate ships, based on the spatial adjacency. Finally, ships are detected by checking motion saliency for each candidate ship according to a set of criteria. Experiments are carried out with real-life surveillance videos, where the obtained results prove the accuracy and robustness of the proposed ship detection approach. The proposed algorithm outperforms a state-of-the-art algorithm when applied to the same sets of surveillance videos.
Esophageal cancer is the fastest rising type of cancer in the Western world. The recent development of High-Definition (HD) endoscopy has enabled the specialist physician to identify cancer at an early stage. Nevertheless, it still requires considerable effort and training to be able to recognize these irregularities associated with early cancer. As a first step towards a Computer-Aided Detection (CAD) system that supports the physician in finding these early stages of cancer, we propose an algorithm that is able to identify irregularities in the esophagus automatically, based on HD endoscopic images. The concept employs tile-based processing, so our system is not only able to identify that an endoscopic image contains early cancer, but it can also locate it. The identification is based on the following steps: (1) preprocessing, (2) feature extraction with dimensionality reduction, (3) classification. We evaluate the detection performance in RGB, HSI and YCbCr color space using the Color Histogram (CH) and Gabor features and we compare with other well-known features to describe texture. For classification, we employ a Support Vector Machine (SVM) and evaluate its performance using different parameters and kernel functions. In experiments, our system achieves a classification accuracy of 95.9% on 50×50 pixel tiles of tumorous and normal tissue and reaches an Area Under the Curve (AUC) of 0.990. In 22 clinical examples our algorithm was able to identify all (pre-)cancerous regions and annotate those regions reasonably well. The experimental and clinical validation are considered promising for a CAD system that supports the physician in finding early stage cancer.
Interactive free-viewpoint selection applied to a 3D multi-view video signal is an attractive feature of the rapidly
developing 3DTV media. In recent years, significant research has been done on free-viewpoint rendering algorithms
which mostly have similar building blocks. In our previous work, we have analyzed the principal
building blocks of most recent rendering algorithms and their contribution to the overall rendering quality. We
have discovered that the first step, warping, determines the basic quality level of the complete rendering chain.
In this paper, we have analyzed the warping step in more detail since it leads to ways for improvement. We have
observed that the accuracy of warping is mainly determined by two factors which are sampling and rounding
errors when performing pixel-based warping and quantization errors of depth maps. For each error factor, we
have proposed a technique that can reduce the errors and thus increase the warping quality. Pixel-based warping
errors are reduced by employing supersampling at the reference and virtual images and we decrease depth map
errors by creating depth maps with more quantization levels. The new techniques are evaluated with two series of
experiments using real-life and synthetic data. From these experiments, we have observed that reducing warping
errors may increase the overall rendering quality and that the impact of errors due to pixel-based warping is
much larger than that of errors due to depth quantization.
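The effect of using more depth quantization levels can be illustrated with a small numerical sketch. It assumes uniform quantization of depth between a near and a far plane, which is a simplification chosen here for clarity (practical depth maps are often quantized non-uniformly); the function names and the depth range are hypothetical.

```python
def quantize_depth(z, z_near, z_far, levels):
    """Snap depth z to the nearest of `levels` uniform steps in [z_near, z_far]."""
    step = (z_far - z_near) / (levels - 1)  # requires levels >= 2
    index = round((z - z_near) / step)
    index = max(0, min(levels - 1, index))
    return z_near + index * step


def max_quantization_error(z_near, z_far, levels, samples=1000):
    """Largest absolute depth error over evenly spaced test depths."""
    worst = 0.0
    for k in range(samples + 1):
        z = z_near + (z_far - z_near) * k / samples
        worst = max(worst, abs(z - quantize_depth(z, z_near, z_far, levels)))
    return worst


# Compare an 8-bit depth map (256 levels) with a 10-bit one (1024 levels).
print(max_quantization_error(1.0, 10.0, 256))
print(max_quantization_error(1.0, 10.0, 1024))
```

The worst-case error is bounded by half a quantization step, so each additional bit roughly halves it. Since warping displaces pixels according to depth, a smaller depth error translates into a smaller position error in the virtual view.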
Interactive free-viewpoint selection applied to a 3D multi-view signal is a possible attractive feature of the
rapidly developing 3D TV media. This paper explores a new rendering algorithm that computes a free-viewpoint
based on depth image warping between two reference views from existing cameras. We have developed three
quality enhancing techniques that specifically aim at solving the major artifacts. First, resampling artifacts are
filled in by a combination of median filtering and inverse warping. Second, contour artifacts are processed while
omitting warping of edges at high discontinuities. Third, we employ a depth signal for more accurate disocclusion
inpainting. We obtain an average PSNR gain of 3 dB and 4.5 dB for the 'Breakdancers' and 'Ballet' sequences,
respectively, compared to recently published results. While experimenting with synthetic data, we observe that
the rendering quality is highly dependent on the complexity of the scene. Moreover, experiments are performed
using compressed video from surrounding cameras. The overall system quality is dominated by the rendering
quality and not by coding.
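The PSNR figures quoted above follow the usual definition based on the mean squared error between a rendered view and the corresponding reference view. A minimal sketch, assuming 8-bit pixel values given as flat sequences of equal length:

```python
import math


def psnr(reference, rendered, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Toy example: every pixel of the rendered view is off by one gray level.
reference = [100, 100, 100, 100]
rendered = [101, 99, 101, 99]
print(psnr(reference, rendered))
```

Because PSNR is logarithmic, a gain of 3 dB corresponds to roughly halving the mean squared error.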
Virtual views in 3D-TV and multi-view video systems are reconstructed images of the scene generated synthetically
from the original views. In this paper, we analyze the performance of streaming virtual views over
IP-networks with a limited and time-varying available bandwidth. We show that the average video quality
perceived by the user can be improved with an adaptive streaming strategy aiming at maximizing the average
video quality. Our adaptive 3D multi-view streaming can provide an average quality improvement of 2 dB
over non-adaptive streaming. We demonstrate that an optimized virtual view adaptation algorithm needs to
be view-dependent and achieve an improvement of up to 0.7 dB. We analyze our adaptation strategies under
dynamic available bandwidth in the network.
We describe our work on text-image alignment in the context of building a historical document retrieval system. We
aim at aligning images of words in handwritten lines with their text transcriptions. The images of handwritten
lines are automatically segmented from the scanned pages of historical documents and then manually transcribed.
To train automatic routines to detect words in an image of handwritten text, we need a training set: images of
words with their transcriptions. We present our results on aligning words from the images of handwritten lines and
their corresponding text transcriptions. Alignment based on the longest spaces between portions of handwriting
is a baseline. We then show that relative lengths, i.e. proportions of words in their lines, can be used to improve
the alignment results considerably. To take into account the relative word length, we define the expressions for
the cost function that has to be minimized for aligning text words with their images. We apply right to left
alignment as well as alignment based on exhaustive search. The quality assessment of these alignments shows
correct results for 69% of words from 100 lines, or 90% of partially correct and correct alignments combined.
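The idea of aligning by relative word length with an exhaustive search can be sketched as follows. The cost function here is a hypothetical stand-in (sum of absolute differences between relative image widths and relative character counts), not the expression defined in the paper, and the gap positions are assumed to come from an earlier segmentation step.

```python
from itertools import combinations


def relative_lengths(words):
    """Share of each word in the total character count of the line."""
    total = sum(len(word) for word in words)
    return [len(word) / total for word in words]


def alignment_cost(boundaries, line_width, words):
    """Mismatch between relative image widths and relative text lengths."""
    edges = [0] + list(boundaries) + [line_width]
    widths = [(b - a) / line_width for a, b in zip(edges, edges[1:])]
    return sum(abs(w - r) for w, r in zip(widths, relative_lengths(words)))


def best_alignment(gaps, line_width, words):
    """Try every choice of len(words) - 1 candidate gaps as word boundaries."""
    candidates = combinations(sorted(gaps), len(words) - 1)
    return min(candidates, key=lambda b: alignment_cost(b, line_width, words))


# Hypothetical line of 140 pixels with four candidate gaps (x positions).
words = ["in", "the", "beginning"]
gaps = [20, 35, 50, 90]
print(best_alignment(gaps, 140, words))
```

The number of candidates grows combinatorially with the number of gaps, so exhaustive search of this kind is only practical for short lines. The sketch also requires at least `len(words) - 1` candidate gaps, otherwise `min` raises a `ValueError`.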
In our effort to contribute to the closing of the "semantic gap" between images and their semantic description, we are building a large-scale ontology of images of objects. This visual catalog will contain a large number of images of objects, structured in a hierarchical catalog, allowing image processing researchers to derive signatures for wide classes of objects. We are building this ontology using images found on the web. We describe in this article our initial approach for finding coherent sets of object images. We first perform two semantic filtering steps: the first involves deciding which words correspond to objects and using these words to access databases which index text found associated with an image (e.g. Google Image search) to find a set of candidate images; the second semantic filtering step involves using face recognition technology to remove images of people from the candidate set (we have found that often requests for objects return images of people). After these two steps, we have a cleaner set of candidate images for each object. We then index and cluster the remaining images using our system VIKA (VIsual KAtaloguer) to find coherent sets of objects.