This PDF file contains the front matter associated with SPIE Proceedings Volume 13089, including the Title Page, Copyright information, Table of Contents, and Conference Committee information.
During interventional procedures, clinicians require immediate imaging feedback. This requirement can be fulfilled by online reconstruction of dynamic magnetic resonance imaging (dMRI), where the current frame is reconstructed using only previous data. Unfortunately, existing online reconstruction algorithms fall short in terms of reconstruction quality and time delay, failing to meet the demands of modern interventional treatments. To address this issue, we introduce a more effective and efficient online reconstruction algorithm. To expedite the reconstruction process, our approach employs the concept of subspace tracking. First, each dMRI frame is decomposed into a subspace component and an error component, with the subspace component posited to reside on a Grassmannian manifold of one-dimensional subspaces. Then, the current frame's subspace component is updated along the geodesic, drawing on the prior frame's reconstruction result, while the error component is updated along the gradient direction. Since the update for each frame is completed in a single step without iteration, the proposed algorithm achieves rapid reconstruction. To enhance the quality, we collect the first five frames of undersampled Cartesian k-space data and process them offline with a low-rank plus sparse reconstruction method to generate high-quality initial estimates of the subspace components. This step substantially improves the overall reconstruction quality. A comparative study against various online reconstruction methods at different acceleration rates on an open cardiac cine dataset demonstrates that our algorithm outperforms the others, achieving high spatial-temporal resolution reconstruction with minimal delay.
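To make the single-step update concrete, here is a minimal sketch of a GROUSE-style per-frame update under simplifying assumptions (real-valued data, a hypothetical sampling mask, and illustrative step sizes); it is not the authors' implementation, only an illustration of moving a one-dimensional subspace along a Grassmannian geodesic while the error component takes a gradient step.

```python
# Minimal sketch (assumed variable names, not the authors' code) of a single-step
# online update: each new frame x_t is split into a subspace component w*U and an
# error component e, with U (a unit vector spanning a 1-D subspace) moved along a
# Grassmannian geodesic and e obtained from one gradient step.
import numpy as np

def online_update(U, x_t, mask, step_u=0.5, step_e=0.5):
    """U: (N,) unit vector from the previous frame; x_t: (N,) current-frame
    estimate (assumed zero-filled from sampled k-space); mask: (N,) 0/1 sampling mask."""
    w = U @ x_t                          # projection coefficient onto the 1-D subspace
    residual = mask * (x_t - w * U)      # data residual restricted to sampled entries
    # Geodesic step on the Grassmannian: rotate U toward the residual direction.
    g = residual - (U @ residual) * U    # gradient projected onto the tangent space
    norm_g = np.linalg.norm(g)
    if norm_g > 1e-12:
        theta = step_u * norm_g
        U = np.cos(theta) * U + np.sin(theta) * g / norm_g
        U /= np.linalg.norm(U)           # remain a unit vector (1-D subspace)
    # One gradient step for the sparse error component.
    e = step_e * mask * (x_t - (U @ x_t) * U)
    return U, (U @ x_t) * U + e          # updated subspace and reconstructed frame
```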
Machine learning is good at learning general and predictive knowledge from known, limited environments and performs well in similar environments. However, its performance in unknown environments is not satisfactory. Machine learning focuses on correlation learning, but correlation is not causality. The non-causal part of the correlation forms spurious correlations that degrade the model's generalization ability. It is therefore necessary to examine machine learning from a causal perspective. By constructing structural causal models, we find that the targets of different domains are instrumental variables of each other and can be mutually represented. They share the same essential features but have different domain features, which lead to causal correlations and spurious correlations with the labels, respectively. In this paper, a causal-inspired contrastive learning supervised model is designed to strengthen essential features and weaken domain features. With the goal of improving the model's ability to capture causal correlations and reducing the interference of spurious correlations, we combine transfer learning and an autoencoder with an image classification model to form a cross-domain contrastive learning model. Experiments show that the proposed framework has a simple, easy-to-implement structure and performs well on public datasets. Visualization further demonstrates the effects of the method intuitively.
In practical scenarios, sparse-view computed tomography (CT) is a successful method for reducing the X-ray radiation dose and imaging time. Generating high-quality images from undersampled projection data with conventional techniques is difficult. To reconstruct high-quality CT images under undersampling, we propose a weighted sparse constraint reconstruction (WSCR) framework in this work. Motivated by the concept of deep convolutional neural networks, we unroll the iterative reconstruction scheme for constructing a reweighted sparse representation, with a fixed number of iterations trained in a data-driven manner, resulting in a reconstruction network based on WSCR. The WSCR network is composed of several iteration blocks, each containing two modules: an image reconstruction module and a sparse representation network module. The CT image is updated using a deep learning-based prior constraint in the image reconstruction module. To further constrain the reconstruction process, the reweighted feature maps and filter updates are implemented by the sparse representation network. Experimental results show that the proposed WSCR network achieves superior performance in removing artifacts and preserving edges.
Although the fusion of panchromatic (PAN) images and multispectral (MS) images can improve the resolution of remote sensing MS images, jointly adjusting the PAN and MS images causes the original data to be completely lost. Given the requirements of on-orbit image processing and multiple downstream applications, obtaining fused images while retaining the source images has become a problem that must be considered. To address the loss of original data after fusion, the proposed method embeds the PAN image object data and compressed data into the fused image. The experimental results show that the proposed method can reconstruct the object data of the PAN image losslessly and reconstruct the entire PAN image with high quality, with no impact on the fused-image quality indicators.
Unsupervised learning methods in computer vision have achieved remarkable success, exceeding the performance of supervised learning methods. Notably, current unsupervised learning methods share certain similarities, particularly in their data augmentation techniques. Masking, a type of data augmentation, can be utilized for both contrastive learning and masked image modeling. This paper presents a novel deep learning approach to visual unsupervised learning. It integrates previous methods such as contrastive learning, perceptual learning, self-distillation, and masked image modeling. In our method, we treat the network that handles the original images as the teacher network and the network that handles the masked images as the student network. The student network employs the representations extracted by the projection head for contrastive learning, while the features generated by the decoder are employed for masked image modeling. Self-knowledge distillation is facilitated by perceptual learning between the teacher and student networks. This model aligns with the main idea of contrastive learning, which aims to pull similar images closer while pushing dissimilar images further apart. Simultaneously, it reflects the main idea of masked image modeling, which extracts semantic information from large-scale masked pixel reconstruction tasks. Additionally, we compare the effect of different self-supervised methods on model performance. Our results show that with only 75 epochs of fine-tuning, our 29M-parameter model achieves 78.5% top-1 accuracy on the ImageNet-1k dataset.
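As one concrete illustration of the contrastive component described above, the following is a minimal InfoNCE-style loss between student embeddings of masked views and teacher embeddings of the original images; the temperature, batch pairing, and variable names are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: InfoNCE-style contrastive loss, with positives on the diagonal
# of the student-teacher similarity matrix within a batch.
import torch
import torch.nn.functional as F

def info_nce(student_z, teacher_z, temperature=0.2):
    """student_z, teacher_z: (B, D) projected representations of paired views."""
    s = F.normalize(student_z, dim=1)
    t = F.normalize(teacher_z, dim=1)
    logits = s @ t.T / temperature                        # (B, B) similarity matrix
    labels = torch.arange(s.shape[0], device=s.device)    # i-th student matches i-th teacher
    return F.cross_entropy(logits, labels)
```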
Most current weakly supervised semantic segmentation (WSSS) methods follow the pipeline of first generating pseudo-masks via class activation maps (CAMs) and then using the pseudo-masks to train a fully supervised semantic segmentation model. However, these methods essentially use weak labels (lacking pixel-level annotations) to generate pseudo-masks, and as a result, even with high-quality pseudo-masks, noisy labels are inevitably present, which degrades the subsequent segmentation performance. Therefore, in this paper, we propose a method based on positive and negative hybrid learning to improve WSSS by mitigating the effect of noise. Instead of using the generated pseudo-masks directly for training the segmentation network, our method first applies a K-L divergence criterion to separate the pseudo-masks into clean labels and noisy labels, and then applies positive learning to the clean labels and negative learning to the noisy labels when training the semantic segmentation network, effectively improving its performance. Experimental results on the PASCAL VOC 2012 val and test sets show that the mean Intersection over Union (mIoU) of the proposed method reaches 70.6% and 71.7%, respectively, outperforming current weakly supervised semantic segmentation methods and proving the effectiveness of the proposed method.
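A minimal sketch of how such a K-L divergence criterion could be realized is shown below: the per-pixel divergence between the one-hot pseudo-label and the network's predicted distribution is thresholded to split pixels into clean (for positive learning) and noisy (for negative learning) sets. The threshold tau and the tensor layout are illustrative assumptions, not the paper's exact criterion.

```python
# Hedged sketch of a K-L divergence criterion for splitting pseudo-mask pixels.
import torch
import torch.nn.functional as F

def split_clean_noisy(logits, pseudo_mask, tau=0.5, eps=1e-8):
    """logits: (B, C, H, W) segmentation outputs; pseudo_mask: (B, H, W) integer labels."""
    prob = F.softmax(logits, dim=1).clamp_min(eps)                # predicted distribution
    onehot = F.one_hot(pseudo_mask, num_classes=logits.shape[1])  # (B, H, W, C)
    onehot = onehot.permute(0, 3, 1, 2).float().clamp_min(eps)    # pseudo-label distribution
    kl = (onehot * (onehot.log() - prob.log())).sum(dim=1)        # per-pixel KL divergence
    clean = kl <= tau    # low divergence: treat pseudo-label as clean (positive learning)
    noisy = ~clean       # high divergence: treat pseudo-label as noisy (negative learning)
    return clean, noisy
```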
Melanoma is one of the most malignant tumors in the world, with high morbidity and mortality, and poses a serious threat to human health. Thanks to the powerful learning ability of deep learning, methods based on convolutional neural networks have made remarkable progress in melanoma image segmentation, but segmentation results remain poor when the skin condition is complex and the image is not ideal. In this paper, a feature augmentation network with a Transformer, termed FAuNet, is designed for melanoma image segmentation by embedding a reconstruction feature (Refa) module and an enhanced feature pyramid (EFP) module and by communicating shallow features with the encoder segment. At the encoder stage, the features are first passed through the Transformer model to provide global information while retaining the shallowest information to establish the association between the encoder and decoder. After that, the Refa module is used to recover the features step by step, and the EFP module connects the features horizontally to enrich the feature information. Then, the shallow features of the encoding and decoding layers are fused to compensate for the spatial information loss caused by downsampling and upsampling. Experiments on the ISIC2017 and ISIC2018 datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
Gastric cancer is a serious health threat, and pathological images are an important criterion in its diagnosis. These images help doctors accurately identify cancerous regions and provide important evidence for clinical decision-making. Thanks to the remarkable achievements of deep learning in image processing, an increasing number of strong image segmentation models have emerged. The Swin-Unet model has achieved great success in image segmentation. However, when applied to segmenting gastric cancer pathological section data, its segmentation boundaries appear jagged. We put forth two solutions. First, we devised an attention connection module to replace the skip connections within the model, thereby enhancing the model's predictive precision. Second, we designed a post-prediction processing unit that takes the model's predictions as input and employs a Conditional Random Field (CRF) for further refinement. The enhanced model increases the DSC by 2% and decreases the HD by 17%, and the jagged boundaries in the prediction results are noticeably reduced. Comparative and ablation experiments show that our improvements increase the accuracy of the model's predictions and reduce the jaggedness of the results at the boundary.
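For reference, the Dice similarity coefficient (DSC) used to report the 2% improvement above is the standard overlap measure; a minimal computation is sketched below (the Hausdorff distance, HD, is omitted for brevity).

```python
# Hedged sketch of the standard Dice similarity coefficient between binary masks.
import numpy as np

def dice(pred, gt, eps=1e-8):
    """pred, gt: binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
```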
Recently, facial attribute editing has been widely used in human-computer interaction and entertainment/social applications. However, most existing facial attribute editing methods have limitations such as coarse segmentation granularity and an inability to accurately edit target regions. To overcome these problems, we present Semantic Rendering Generative Adversarial Networks, which combine semantic segmentation and color rendering for facial attribute editing. First, a semantic segmentation network, which limits operations to the target area without modifying any attribute-unrelated details, is constructed to generate masks of attribute-related regions. Second, to effectively generate color masks for synthesizing higher-quality images, a color rendering network is derived by merging a Transformer-based UNet encoder and a ColorMapGAN decoder as the generator of the color rendering network. To verify the effectiveness of the proposed method, the constructed models were trained on the CelebA and CelebAMask-HQ datasets. The experimental results show that, compared with several existing facial attribute editing methods, the proposed method not only finely segments attribute-related and unrelated areas but also generates more realistic face images.
This paper proposes an improved UNet segmentation algorithm to further improve segmentation performance and adaptability in the face of complex retinal blood vessel structures, low image contrast, and inaccurate segmentation of detailed areas. To achieve this goal, two main strategies are adopted: a residual module that introduces depthwise separable convolution, and the SE (Squeeze-and-Excitation) attention mechanism. First, a new residual module is designed by combining a depthwise separable convolutional network with a residual network to replace the traditional convolution operation in the original UNet. This module not only improves the network's feature learning and expression capabilities but also enhances its ability to capture details and feature changes in images. Second, the SE attention mechanism is introduced to adaptively adjust the weights according to the importance of the channels in the feature map, allowing the network to focus more on channels containing important feature information. Experimental results show that, compared with retinal blood vessel segmentation algorithms from recent years, the proposed algorithm performs better.
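The two building blocks named above are standard components; a minimal PyTorch-style sketch of a depthwise separable residual block followed by SE channel attention is given below. Channel sizes, the reduction ratio, and the exact placement inside the UNet are assumptions, not the paper's configuration.

```python
# Hedged sketch: a residual block built from depthwise separable convolutions,
# followed by an SE (Squeeze-and-Excitation) channel-attention module.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())   # excitation: channel weights

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                        # reweight channels

class DSResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),  # depthwise conv
            nn.Conv2d(ch, ch, 1, bias=False),                        # pointwise conv
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            SEBlock(ch))

    def forward(self, x):
        return x + self.conv(x)                             # residual connection
```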
Historical Tibetan documents hold a significant place within the surviving ancient literature of China for their representation of Tibetan culture and history. These documents were all handwritten and thus suffer from problems such as adhesion, blurred handwriting, and background stains. Segmenting syllables from the text is a crucial step in analyzing historical Tibetan document images and must be completed prior to syllable recognition. Syllable segmentation is challenged by pre-syllable and post-syllable stroke sticking, tsheg sticking, and syllable stroke sticking. In this paper, we enhance the K-Net model to improve its efficacy in segmenting syllables in historical Tibetan document texts. The main work includes: (1) to better classify the sticking syllables, we change the backbone network to ResNeXt; (2) before kernel learning, we convolve the feature mask with the segmentation mask to increase the accuracy of mask prediction; (3) information is proofread at each step of kernel updating, and the mask prediction obtained from kernel learning is convolved with the feature mask to obtain a new mask prediction with higher accuracy; and (4) part of the instance segmentation code is streamlined to lighten the network model. The experimental results demonstrate that the proposed technique effectively solves the syllable segmentation issues of syllable-to-syllable stroke sticking and tsheg-to-syllable stroke sticking. Consequently, syllable segmentation attains an mAcc of 95.66% and an mIoU of 76.59% on the historical Tibetan documents.
To address the errors that often arise in brain tumor image segmentation due to uneven gray scale and blurred boundaries in brain MRI images, a brain tumor segmentation method is proposed that combines a graph cut algorithm, the MAP-MRF algorithm, and a cellular automata model. An MRF_GCGMM model is established to automatically select the initial seed points based on the grayscale features of the image, and the state transfer function of the cellular automata is then modified to achieve accurate segmentation of the brain tumor area. Experimental results show that the proposed algorithm achieves higher segmentation accuracy on brain tumor images in the BraTS dataset than traditional brain tumor segmentation methods.
Minimal paths are widely used in image segmentation problems. Most existing methods exploit only local, point-wise image features to track minimal paths for delineating boundaries, increasing the risk of shortcuts, especially in complicated scenarios. In this work, we introduce a new circular minimal path model that invokes a graph-based boundary proposal grouping scheme and an adaptive cut for interactive image segmentation. The boundary proposals are composed of edge segments, incorporating nonlocal connectivity information into the proposed model. The target contours are made up of boundary proposals and minimal paths selected by a graph-based optimization scheme. The adaptive cut disconnects the image domain so that the target contours are forced to pass through this cut only once, allowing the model to deal with a great variety of segmentation tasks. The effectiveness of the proposed model has been validated on image segmentation tasks.
In this paper, we propose a method to remove the imaging ambiguity caused by scattering on foggy days based on the generalized Gaussian function. We use the generalized Gaussian function to model light transmission in foggy weather and obtain its attenuation characteristics, from which image defogging results are derived. We also propose a zonal dehazing method based on the distribution of fog in the hazy image, further improving the quality of the dehazed image. We use real fog datasets to quantitatively analyze and visualize the results. The simulation results show that the image information entropy and visible edge ratio obtained by the algorithm are improved, which verifies the effectiveness and superiority of the proposed algorithm.
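For reference, the generalized Gaussian function mentioned above has the standard form below, where alpha controls the width and beta the shape (beta = 2 recovers the Gaussian and beta = 1 the Laplacian); how the paper maps these parameters to foggy-weather attenuation is not reproduced here.

```latex
f(x;\mu,\alpha,\beta) \;=\; \frac{\beta}{2\alpha\,\Gamma(1/\beta)}
  \exp\!\left[-\left(\frac{|x-\mu|}{\alpha}\right)^{\beta}\right]
```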
Automatic rebar counting in bundled steel bar images requires an urgent solution in steel production and construction scenarios. Traditional image processing methods perform poorly because the rebar end faces are irregularly shaped and highly dense. Focusing on the characteristics of dense small objects in bundled rebar images, this paper proposes the YOLACT_REBAR instance segmentation network. We make improvements in two aspects: the network structure and the inference process. For the network structure, since the steel bar end faces are small targets, we optimize the original model by connecting the second layer of its backbone network to the FPN (Feature Pyramid Network) to obtain larger feature maps, thereby enhancing the model's accuracy in dense small-target segmentation. For the inference process, given the dense arrangement of steel bar end faces, traditional non-maximum suppression (NMS) replaces fast NMS, and an IoU (Intersection-over-Union) deduplication strategy removes redundant bounding boxes, thereby reducing false detections. Experimental results on a self-constructed dataset show that, compared with the original YOLACT model and other object detection models, our proposed model demonstrates improved Precision, Recall, and mIoU. The proposed YOLACT_REBAR model can precisely segment each rebar end face, facilitating subsequent applications such as rebar counting and automatic plate welding.
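For clarity, a minimal sketch of the exact (traditional) NMS with IoU-based deduplication referred to above is shown below; the IoU threshold and the box format are illustrative assumptions.

```python
# Hedged sketch of exact NMS: keep the highest-scoring box, drop boxes whose IoU
# with it exceeds the threshold, and repeat on the remainder.
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the top box and the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]   # discard highly overlapping duplicates
    return keep
```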
Medical image segmentation aims to categorize pixels into different regions according to their corresponding tissues/organs in medical images. In recent years, owing to the Transformer's outstanding ability in computer vision, various visual Transformers have been exploited for this task. However, these models often suffer from the quadratic complexity of self-attention and limited multi-scale information interaction. In this paper, we propose a novel dual attention and pyramid-aware network, DAPFormer, to address these limitations. It combines efficient attention and channel attention into a dual attention mechanism to capture spatial and inter-channel relationships in the feature dimensions while maintaining computational efficiency. Additionally, we use a pyramid-aware module to redesign the skip connections, modeling cross-scale dependencies and addressing complex scale variations. Experiments on multi-organ, cardiac, and skin lesion segmentation datasets demonstrate that DAPFormer outperforms state-of-the-art methods.
It is popular to incorporate the active contour evolution scheme into a multiscale image decomposition and reconstruction procedure so as to enhance image segmentation accuracy. However, most of these models are implemented through the level set formulation, which costs considerable computation time. In this paper, we propose a new image segmentation model that combines the circular geodesic model with an adaptive cut and multiscale image processing. As a consequence, the proposed model blends the benefits of both the geodesic models and the multiscale image analysis method. Experimental results show that the proposed multiscale geodesic model indeed outperforms the circular geodesic model with an adaptive cut in solving the image segmentation problem in the presence of strong noise.
Infrared small target detection (IRSTD), widely implemented in both military and civilian contexts, is a pivotal technology within the realm of target detection. Yet the accurate extraction of small target regions is often hampered by interference from complex backgrounds. To overcome this issue, we introduce a novel approach: a contextual semantic information network founded on region features. As an initial step, we develop a target-region-retention backbone network. This design preserves more potential target regions, thereby resolving the issue of small targets being lost as the network deepens. Next, we devise a regional feature enhancement module tailored to the potential target areas, designed to effectively boost the regions' target characterization capabilities. Finally, we employ a global context module to mine the inherent feature information, compensating for any missing details within the target region. When tested on the SIRST dataset, our experimental results demonstrate that the proposed methodology substantially improves both the accuracy and robustness of IRSTD in complex settings, exceeding the performance of existing detection methods.
Deep learning and neural networks have become the prevailing methodologies for processing remote sensing imagery. The collection and use of a single remote sensing modality inevitably encounter limitations, such as the difficulty of distinguishing similar characteristics within hyperspectral data. Efforts to exploit multiple remote sensing data sources are therefore intensifying, aiming to attain more refined and comprehensive observational outcomes. In this article, we propose a comprehensive system framework for multi-source remote sensing image classification, named the adaptive pooling Transformer network (APTnet). First, convolutional neural networks (CNNs) are used to extract distinctive attributes from both hyperspectral images (HSI) and synthetic aperture radar (SAR) images. Furthermore, we propose a technique for the fusion of remote sensing data, the cross-modal correction fusion method (CMCF), enabling interactive learning and rectification of multi-source features across diverse layers. Additionally, an adaptive pooling Transformer method (APT) is put forth to strengthen the existing attributes and amplify their expressive capacity. Experiments conducted on the Berlin and Augsburg datasets exhibit commendable performance, complemented by exhaustive comparative analyses with prevailing methodologies.
With the rise of smart devices, there is an escalating need for lightweight methods in human pose estimation (HPE). Although existing 2D HPE techniques have demonstrated impressive performance on public datasets, they still suffer from high model complexity and latency in practical applications. To address this challenge, this paper proposes a novel approach for lightweight 2D HPE. By utilizing shuffle blocks instead of the traditional ResNet, we significantly reduce the model size and computational requirements. Moreover, our method employs the SimCC algorithm to transform the pose estimation task into a coordinate classification task. By discretizing the continuous coordinate values into multiple sub-pixel intervals, we effectively reduce the quantization error encountered in traditional heatmap-based methods. To further enhance the precision of our model, we incorporate a self-attention mechanism into the network, leveraging its benefits for improved accuracy. This mechanism enables refined joint-point representations and improves the robustness of the feedforward network with a gated linear unit in the Transformer layer. We conduct a comprehensive evaluation of our method on the MPII and COCO datasets, assessing its performance in terms of model parameters, computational complexity, and accuracy. Furthermore, we perform ablation experiments on the MPII dataset to analyze the individual impact of each component of our approach. The experimental results demonstrate that our lightweight model achieves pose estimation performance similar to other lightweight models while requiring only 60% of the computational complexity.
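A minimal sketch of the SimCC-style coordinate classification idea described above follows: each joint coordinate is discretized into k bins per pixel along each axis and predicted by a 1-D classifier, then decoded back to sub-pixel coordinates. The splitting factor, feature dimensions, and argmax decoding are illustrative assumptions, not the paper's exact head design.

```python
# Hedged sketch of a SimCC-style coordinate classification head.
import torch
import torch.nn as nn

class SimCCHead(nn.Module):
    def __init__(self, feat_dim, img_w, img_h, k=2):
        super().__init__()
        self.cls_x = nn.Linear(feat_dim, img_w * k)   # bins along the x axis
        self.cls_y = nn.Linear(feat_dim, img_h * k)   # bins along the y axis
        self.k = k

    def forward(self, feat):
        # feat: (B, J, feat_dim) per-joint features; returns bin logits per axis
        return self.cls_x(feat), self.cls_y(feat)

    def decode(self, logits_x, logits_y):
        # argmax over bins, then map the bin index back to continuous pixel coordinates
        x = logits_x.argmax(dim=-1).float() / self.k
        y = logits_y.argmax(dim=-1).float() / self.k
        return torch.stack([x, y], dim=-1)
```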
In recent years, hyperspectral image (HSI) classification technology has received increasing attention, and deep learning methods have gradually been applied to the classification of HSIs. Convolutional neural network-based models, especially residual networks (ResNets), have shown excellent performance. In HSI samples, there are usually noise pixels near the center pixel, which are not conducive to the extraction of spectral-spatial features and negatively impact classification performance. In our previous study, a spectral similarity-based spatial attention module integrated with a 3D ResNet was designed to highlight the effect of center pixels on spatial attention. However, during the generation of the spatial attention, the characteristics of different similarity measures were ignored. Meanwhile, the use of a 3D ResNet may generate a large amount of redundancy and waste computing resources. Therefore, an improved spectral-similarity-based spatial attention module with pre-activation, combined with a 3D inverted residual attention network, is proposed. The pre-activation strategy is designed to follow the characteristics of each similarity measurement, in which the Canberra distance is invoked to simplify the computational complexity of the similarity. A 3D inverted residual module is integrated with squeeze-and-excitation attention modules to handle spatial and spectral-spatial features more efficiently. Experimental results on three public HSI datasets demonstrate better classification performance of the proposal compared with several state-of-the-art methods.
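To illustrate the spectral-similarity idea, the sketch below computes the Canberra distance between every pixel's spectrum and the center pixel's spectrum of an HSI sample cube and turns it into a spatial attention weight; the exponential mapping from distance to weight and the normalization are assumptions, not the paper's exact attention design.

```python
# Hedged sketch: Canberra-distance-based spatial attention for an HSI sample cube.
import numpy as np

def spectral_spatial_attention(patch, eps=1e-8):
    """patch: (H, W, B) HSI sample cube centered on the pixel to classify."""
    h, w, _ = patch.shape
    center = patch[h // 2, w // 2, :]
    # Canberra distance between every pixel spectrum and the center spectrum
    num = np.abs(patch - center)
    den = np.abs(patch) + np.abs(center) + eps
    dist = (num / den).sum(axis=-1)                  # (H, W) distance map
    att = np.exp(-dist)                              # similar pixels -> larger weight
    att /= att.max() + eps
    return patch * att[..., None]                    # reweight the spatial neighborhood
```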
Tensor ring (TR) decomposition is an effective method for deep neural network (DNN) compression. However, TR decomposition has two problems: the TR ranks are typically set equal, and selecting ranks through an iterative process is time-consuming. To address these two problems, a TR network compression method based on Bayesian optimization (TR-BO) is proposed. TR-BO selects the ranks via Bayesian optimization, compresses the neural network layers via TR decomposition using the obtained ranks, and finally fine-tunes the compressed model to recover some of the performance lost to compression. Experimental results show that TR-BO achieves the best results in terms of top-1 accuracy, parameter count, and training time. For example, on the CIFAR-10 dataset with the ResNet20 network, TR-BO-1 achieves 87.67% accuracy with a compression ratio of 13.66 and a running time of only 2.4 hours. Furthermore, TR-BO achieves state-of-the-art performance on the CIFAR-10/100 benchmarks.
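The rank-selection step can be sketched, under assumptions, as a small Bayesian optimization loop; the version below uses scikit-optimize's gp_minimize as the optimizer, and tr_decompose, quick_eval, layer_params, layer, and val_loader are hypothetical placeholders standing in for the TR decomposition of a layer, a short evaluation, and a parameter count. The objective weighting is also an assumption, not the paper's formulation.

```python
# Hedged sketch: Bayesian optimization over the TR ranks of one layer, trading off
# accuracy against compression ratio.
from skopt import gp_minimize
from skopt.space import Integer

def make_objective(layer, val_loader, alpha=1.0):
    def objective(ranks):
        compressed = tr_decompose(layer, ranks)          # hypothetical: TR-decompose the layer
        acc = quick_eval(compressed, val_loader)         # hypothetical: short fine-tune + eval
        ratio = layer_params(layer) / layer_params(compressed)  # hypothetical param counter
        return -(acc + alpha * ratio)                    # minimize the negative objective
    return objective

# Search each of the (here, four) TR ranks of a layer in [2, 32].
result = gp_minimize(make_objective(layer, val_loader),
                     [Integer(2, 32)] * 4, n_calls=25, random_state=0)
best_ranks = result.x
```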
A lightweight network model based on YOLOx is proposed to address the limited resources of transmission line UAV inspection platforms and the high complexity and slow inference speed of target detection algorithms. First, the lightweight ShuffleNetV2_Plus network is used as the backbone for feature extraction. The receptive field of the depthwise convolution (DWConv) in ShuffleNetV2 is expanded by replacing the 3×3 DWConv in the ShuffleUnit module with a 5×5 DWConv, and the convolution layers of the model are pruned, removing the 1×1 pointwise convolution (PWConv) in the ShuffleUnit basic unit to reduce the network parameters while enlarging the receptive field. At the same time, an Efficient Channel Attention (ECA) module is added to the neck feature fusion part so that the network focuses better on important regions and improves detection accuracy at a small computational cost. Finally, the ordinary convolutions in the YOLOx decoupled detection head are replaced with depthwise separable convolutions (DSConv) to further reduce model complexity. The results show that the inference time of the proposed lightweight network model is only 5.8 ms, the model parameters are only 4.361 MB, the FLOPs are only 10.725 G, and the detection accuracy is high on the combined self-built transmission line dataset.
Concrete crack detection can reflect the condition of concrete structures in time, help staff schedule maintenance, and is crucial to ensuring the normal operation of facilities. Traditional manual concrete crack detection methods can no longer meet current real-world needs, and the development of deep learning has injected new vitality into crack detection. However, because of the splicing marks of concrete structures, fake cracks similar to real cracks inevitably occur, and most existing deep learning models cannot effectively identify them. To enhance the reliability of deep learning-based concrete crack detection, this paper proposes a crack identification method based on Mask RCNN and track similarity measurement. By extracting the crack masks output by Mask RCNN, the track morphological characteristics of all cracks are established, and the authenticity of the crack results output by the model is further determined by track similarity measurement. In addition, to ensure detection accuracy, this study applied data augmentation to the training set, including horizontal and vertical flipping, brightness adjustment, and background color changes, and introduced tunnel crack data collected under strong light sources in a dark environment. Experimental verification on the dataset and in a real environment shows that the method can effectively distinguish true cracks from false cracks and improves the reliability of crack detection.
Rail health monitoring plays an important role in modern railway systems. To apply the acoustic emission technique to rail health monitoring, the mechanical noise caused by wheel-rail interactions must first be eliminated with digital signal processing techniques. This paper proposes a wavelet-subband least mean square (LMS) adaptive filter, which can effectively eliminate strong noise and detect crack signals in the rail. In this method, the noisy input signals and the reference noise are first transformed into multi-scale wavelet coefficients. The decomposed noisy input and the reference of the same channel are then fed into separate LMS adaptive filters, and the filter parameters are optimized using a similarity coefficient metric. Finally, the denoised wavelet components at different levels are integrated into the full denoised signal. Experimental tests clearly demonstrate that, compared with the conventional LMS method and the adaptive wavelet filtering method, the proposed algorithm improves detectability, obtains a higher SNR, and at the same time retains the details of the crack signals under actual railway noise interference.
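A minimal sketch of the wavelet-subband LMS idea is given below: both the noisy signal and the reference noise are decomposed with a discrete wavelet transform, one LMS filter is run per subband with the reference as its input, and the denoised subbands are reconstructed. The wavelet, decomposition level, filter length, and step size are illustrative assumptions, and the similarity-coefficient-based parameter optimization described above is not reproduced.

```python
# Hedged sketch: wavelet-subband LMS adaptive noise cancellation.
import numpy as np
import pywt

def lms(noisy, ref, taps=32, mu=0.01):
    """One LMS adaptive filter: estimates the noise from 'ref' and subtracts it."""
    w = np.zeros(taps)
    out = np.zeros_like(noisy)
    for n in range(taps, len(noisy)):        # assumes each subband is longer than 'taps'
        x = ref[n - taps:n][::-1]            # reference-noise regressor
        y = w @ x                            # estimated noise in this subband
        e = noisy[n] - y                     # error = denoised sample
        w += 2 * mu * e * x                  # LMS weight update
        out[n] = e
    return out

def wavelet_subband_lms(noisy, ref, wavelet='db4', level=4):
    cn = pywt.wavedec(noisy, wavelet, level=level)
    cr = pywt.wavedec(ref, wavelet, level=level)
    denoised = [lms(a, b) for a, b in zip(cn, cr)]   # one LMS filter per subband
    return pywt.waverec(denoised, wavelet)           # reconstruct the denoised signal
```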
Battery defect detection is an important task on the battery production line. Fully automating battery defect detection is of great significance for battery factories to ensure production safety and reduce operation and maintenance costs. This paper proposes a battery defect detection method that integrates traditional image processing and deep learning. First, we propose a battery localization method based on image processing techniques. We then employ a deep neural network trained for battery defect detection. Comprehensive experimental results show that the recognition accuracy of the proposed method reaches 99% for battery rolled-edge wrinkles, pits, scratches, and other defects, and the processing time is less than 0.2 seconds per battery image. This is a significant improvement over other advanced detection methods for the task of battery defect detection.
In recent years, deep learning algorithms such as convolutional neural networks have shown promising results in pavement crack detection. However, in practical engineering applications, existing pavement crack detection methods often rely on block-level crack labeling because producing pixel-level pavement crack label images is challenging, and this approach makes it difficult to recognize fine cracks. In this paper, we propose a pavement crack detection method based on adversarial and depth-guided networks. The method consists of two components: a pixel-level pavement crack label extraction algorithm based on edge detection, and a pavement crack detection algorithm based on adversarial and depth-guided networks (UCRGNet). The former extracts pixel-level crack labels from block-level cracks, effectively addressing the difficulty of generating pixel-level crack label images and achieving finer label granularity. The latter is based on the generative adversarial concept and improves the network's feedback on small crack regions by providing supervision on the generated pavement crack segmentation images. Additionally, it incorporates a bootstrap filtering module and an attention mechanism to address information loss, thereby enhancing the model's ability to accurately identify fine cracks. The method has been tested on the NCDataset dataset, and the results demonstrate that its accuracy, precision, and recall in recognizing pavement cracks are higher than those of other comparable algorithms: it achieves an accuracy of 95.89%, a precision of 67.96%, and a recall of 65.93%.
Traditional supervised learning has achieved high accuracy in crack detection tasks. However, because of complex pavement conditions, industrial applications often require millions of samples to train a model, and processing such large-scale data to make it suitable for training is difficult. The highly diverse data also lead to a learning bottleneck for traditional supervised learning. In addition, redundant network units during forward inference waste computing resources and slow inference, which also limits the overall detection effect. To solve these problems, we propose an automatic pavement crack detection algorithm based on reinforcement learning to optimize traditional supervised training. First, an automatic pruning strategy based on reinforcement learning improves the detection efficiency of the traditionally supervised model with little or even no loss of accuracy. Second, after training the traditionally supervised model, selecting and optimizing data with poor detection results breaks the learning bottleneck. The experimental results show that the proposed algorithm significantly improves the accuracy and inference efficiency of crack detection under actual complex pavement conditions and achieves remarkable detection performance on the laboratory engineering project dataset CiCS500 and several other public datasets.
Magnetic resonance imaging (MRI) is a crucial medical imaging technique, but MR images are often corrupted by noise. To address this issue, higher-order singular value decomposition (HOSVD) denoising is one of the mainstream approaches for removing noise in MR images. Notably, the iterative low-rank HOSVD (ILR-HOSVD) algorithm has demonstrated its superiority in terms of peak signal-to-noise ratio (PSNR). However, ILR-HOSVD overlooks the potential impact of the orthogonal bases under high-noise conditions. Moreover, its reliance on multiple iterative optimizations incurs considerable computational overhead and prolonged denoising times. In this study, we propose the coarse-fine HOSVD (CF-HOSVD) denoising algorithm, which consists of two stages: a coarse HOSVD (C-HOSVD) denoising stage for pre-filtering and a fine HOSVD (F-HOSVD) denoising stage based on low-rank tensor approximation theory. Specifically, the conventional HOSVD denoising algorithm is first applied to pre-filter the MR images (the C-HOSVD stage), and then the orthogonal bases generated from the pre-filtered images, which have a lower noise level, are used to assist the F-HOSVD denoising. The proposed CF-HOSVD algorithm was evaluated on simulated noise-free datasets from BrainWeb and MRXCAT and compared with state-of-the-art traditional denoising methods. The experimental results demonstrate the superiority of CF-HOSVD, as it consistently outperforms the other methods in terms of PSNR, structural similarity index measure (SSIM), and visual quality.
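For orientation, the basic HOSVD denoising step shared by both stages can be sketched as follows: similar patches are stacked into a third-order tensor, orthogonal bases are taken from the SVD of each mode unfolding, the core coefficients are hard-thresholded, and the tensor is transformed back. The threshold rule and patch grouping are assumptions, and the coarse-to-fine basis reuse of CF-HOSVD is not shown.

```python
# Hedged sketch of one HOSVD denoising step on a stack of similar patches.
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_denoise(T, sigma, k=2.7):
    """T: (p, p, m) tensor of m similar p-by-p patches; sigma: noise level estimate."""
    # Orthogonal bases: left singular vectors of each mode unfolding
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0] for m in range(3)]
    # Core tensor: project onto the bases
    S = np.einsum('ijk,ia,jb,kc->abc', T, U[0], U[1], U[2])
    S[np.abs(S) < k * sigma] = 0.0                    # hard-threshold small coefficients
    # Transform back with the same bases
    return np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])
```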
Image shadow removal is an essential image preprocessing task. In practical production environments, effective image shadow removal can significantly enhance the performance of subsequent image-based tasks. However, current image shadow removal methods still suffer from artifacts, color deviations, and blurriness due to factors including the capturing environment and algorithmic efficiency. This paper proposes an image shadow removal method using spatial attention that integrates physical and deep learning models. By incorporating multi-scale feature learning and preserving spatial details, the approach integrates shadow spatial attention modules, perceptual loss, and edge loss to improve the shadow removal effect. Experimental results show that the proposed method achieves a PSNR of 36.14 dB and an SSIM exceeding 98% on the ISTD dataset, with the RMSE reduced to 6.54. These outcomes affirm the efficiency and superiority of the proposed method in addressing the challenges of image shadow removal.
Image restoration is a popular and challenging task that is regarded as a classical inverse problem. The Condat-Vũ primal-dual algorithm based on proximal operators is one of the successful optimization methods. It has been further reformulated as a primal-dual proximal network, where one iteration of the original algorithm corresponds to one layer of the network. The drawback of the primal-dual network is that the blur kernels must be given as prior information; however, they are usually very hard to obtain in real situations. In this work, we propose a deep encoder-decoder primal-dual proximal network, named ED-PDPNet. In each layer, the blur kernels and the projections between the primal and dual variables are designed as encoder-decoder modules; in this way, the network can be learned end-to-end and all the parameters of the primal-dual algorithm are learned. The proposed method is applied to the MNIST and BSD68 datasets for image restoration. The preliminary results show that the proposed method, by combining simple encoder-decoder modules, obtains very promising and competitive performance compared with state-of-the-art methods. In addition, the proposed network is shown to be lightweight, with fewer learnable parameters than recent popular Transformer-based methods.
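For reference, one iteration of the Condat-Vũ primal-dual algorithm for minimizing f(x) + g(x) + h(Lx), with f smooth, has the form sketched below; in the unrolled ED-PDPNet each such iteration becomes a network layer, and the operators passed in here as plain callables (the gradient of f, the linear operator and its adjoint, and the two proximal maps) would be replaced by learned encoder-decoder modules rather than the fixed hand-crafted ones assumed here.

```python
# Hedged sketch of one Condat-Vu primal-dual iteration (operators supplied as callables).
def condat_vu_step(x, y, grad_f, L, L_T, prox_g, prox_hstar, tau, sigma):
    x_new = prox_g(x - tau * grad_f(x) - tau * L_T(y), tau)    # primal update
    y_new = prox_hstar(y + sigma * L(2 * x_new - x), sigma)    # dual update on extrapolated point
    return x_new, y_new
```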
At present, two-stage networks are widely used in image inpainting, but existing two-stage networks often generate inpainting results with distorted structures and blurry textures, especially when the reconstructed object is complex. The main reason is that the structure prior is insufficient and inaccurate, which leads to wrong results in the texture generation stage. To solve this problem, a novel image inpainting method based on edge and smooth structure prediction is proposed. The edge structure and smooth structure are completed in the structure reconstruction stage, and the reconstructed edge and smooth structures are then used together as a prior to guide the texture generation stage in filling the damaged area. The proposed method is evaluated on the publicly available Paris StreetView, CelebA-HQ, and Places2 datasets, and extensive experiments show that it obtains excellent results under both subjective and objective indexes compared with mainstream approaches.
Wave-front coding (WFC) is a technique that adds a phase plate at the exit pupil of an optical system. The phase plate encodes the image so that the system's optical transfer function is insensitive to defocus throughout a wide depth of focus (DoF), at the cost of producing a blurred image. Furthermore, image noise is inevitably introduced by device failures or external interference, so the acquired image is both noisy and blurred, and it must be restored to exploit the imaging system's extended DoF. A series of experiments was carried out to find the optimal image restoration scheme. In this paper, we propose a three-step restoration scheme for noisy blurred images produced by a cubic phase plate. First, we use the Min-Max Average Pooling-based Filter (MMAPF) to remove salt-and-pepper noise. Then we use the Shearlet transform to remove Gaussian noise. Finally, for image deblurring, we use the Amended Landweber (ALW) algorithm. Experiments on real infrared images show that this scheme achieves excellent performance and robustness for restoring images captured through a cubic phase plate.
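The deblurring step can be illustrated by the classical Landweber iteration sketched below for a known blur kernel; the amendments of the ALW algorithm are not reproduced, and the kernel, relaxation parameter, and iteration count are illustrative assumptions.

```python
# Hedged sketch: classical Landweber iteration for deblurring with a known kernel.
import numpy as np
from scipy.signal import fftconvolve

def landweber_deblur(blurred, kernel, n_iter=100, tau=1.0):
    x = blurred.copy()
    kernel_T = kernel[::-1, ::-1]                     # adjoint of the convolution
    for _ in range(n_iter):
        residual = blurred - fftconvolve(x, kernel, mode='same')
        x = x + tau * fftconvolve(residual, kernel_T, mode='same')
    return np.clip(x, 0, 1)                           # assumes image values in [0, 1]
```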
Digital Image Transformation and Processing Methods
Midline shift is an important clinical indicator of the severity of hemorrhagic stroke and holds significance in a physician's clinical diagnosis. Segmentation-based methods for midline shift assessment are prevalent in the field, but fully exploiting global information within the network remains a challenge, and empirical integration with clinical knowledge is also essential. In this study, we developed a two-stage method for automatic midline shift assessment. First, in the midline identification stage, we propose a Dual-Path U-Transformer to segment the brain midline. The Dual-Path U-Transformer better capitalizes on global information by integrating a self-attention mechanism while retaining U-Net's ability to make full use of local information and to combine high- and low-dimensional features. In the second stage, following clinical knowledge learned from clinical experts, we calculate the maximum shift distance to assess brain midline shift and determine whether each case has a surgical indication. In the experiments, we used 5-fold cross-validation to train and validate the proposed model. Compared with a traditional U-Net-based method and a Transformer-based method, the proposed Dual-Path U-Transformer-based method achieved the best HD and Dice performance on our in-house dataset, and the experimental results confirmed that it exhibited excellent accuracy in the second stage of midline shift assessment.
In recent years, lip-reading techniques have been actively researched for estimating speech content only from visual information without audio information. Large databases are available for English but not enough for other languages. Therefore, this paper constructs a new database for improving the accuracy of Japanese lip-reading. In previous research, we asked collaborators to record utterance scenes to build a database. This paper uses YouTube videos. We download a weather forecast video from the “Weathernews” YouTube channel. We constructed a database that can be used for lip-reading by applying video and audio processing. Furthermore, we selected 50 Japanese words from our database and applied an existing deep-learning model. As a result, we obtained a word recognition rate of 66%. We have established a method for constructing a lip-reading database using YouTube, although there are still problems with the scale of the database and recognition accuracy.
Aiming at the problems of blurred edge structure, loss of texture detail, distortion, and slow operation speed in medical image fusion, this paper proposes a medical image fusion model based on a residual network. The network consists mainly of an encoder, a fuser, and a decoder. A feature extraction module, MSDN, consisting of a residual attention mechanism and dense blocks, is designed in the encoder to extract multi-scale deep features of the source images. A learnable fusion network is used in the fuser to replace manually designed fusion rules, eliminating the adverse effects of hand-crafted fusion strategies on the fusion result. The decoder obtains the fused image by layer-by-layer decoding and up-sampling. We use a two-stage strategy to train the fusion model: in the first stage, an image reconstruction task drives the training of the encoder-decoder; in the second stage, the trained encoder-decoder is fixed, and the residual fusion network is trained with an appropriate loss function. The experimental results show that the fused images contain rich texture details and color information in terms of subjective visual quality, and that the overall performance on objective evaluation indexes is better than that of the comparison algorithms.
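The two-stage training strategy can be sketched as below; the tiny convolutional modules are stand-ins for the paper's MSDN encoder, residual fusion network, and decoder, and the stage-2 loss is a placeholder rather than the paper's actual fusion loss.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())   # stand-in for MSDN
fuser   = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())  # learnable fusion net
decoder = nn.Sequential(nn.Conv2d(16, 1, 3, padding=1))

def stage1_step(img, opt):
    """Stage 1: train encoder/decoder on image reconstruction only."""
    opt.zero_grad()
    loss = nn.functional.l1_loss(decoder(encoder(img)), img)
    loss.backward()
    opt.step()
    return loss.item()

def stage2_step(mri, ct, opt):
    """Stage 2: encoder/decoder are frozen, only the fusion network is trained."""
    opt.zero_grad()
    feats = torch.cat([encoder(mri), encoder(ct)], dim=1)
    fused = decoder(fuser(feats))
    # placeholder loss: stay close to both sources (not the paper's fusion loss)
    loss = nn.functional.l1_loss(fused, mri) + nn.functional.l1_loss(fused, ct)
    loss.backward()
    opt.step()
    return loss.item()

opt1 = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt2 = torch.optim.Adam(fuser.parameters(), lr=1e-3)

print(stage1_step(torch.rand(2, 1, 64, 64), opt1))
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)                            # freeze before stage 2
print(stage2_step(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64), opt2))
```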
To address the problem of cost-effective, high-precision deployment for UAV indoor positioning, this paper proposes an affordable and distortion-free indoor localization technique based on QR codes. The approach focuses on three problems: QR codes are difficult to deploy densely in real environments, they stain easily after prolonged use, and the heading angle must be computed for navigation. First, the QR code contamination problem is solved by a redundant error-tolerance algorithm. Second, the problem of sparse QR code deployment is solved by lifting the UAV to multiple heights and correcting the estimate. Finally, the offset between the UAV and the QR code is calculated, from which the UAV's map coordinates and heading angle are inversely derived. Actual measurements show that the proposed method performs well and suits practical scenarios: the positioning error is less than 5 cm and the heading deviation is no more than 5°, meeting the requirements of UAV indoor positioning.
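A minimal sketch of the final step, recovering the UAV's map coordinates and heading from a single observed floor QR code, is given below; the camera axis conventions, sign choices, and parameter names are illustrative assumptions, not the paper's exact derivation.

```python
import math

def uav_pose_from_qr(qr_map_xy, qr_map_yaw_deg, qr_offset_px, qr_yaw_in_image_deg,
                     metres_per_pixel):
    """Invert a downward-camera observation of a floor QR code into a UAV map pose.

    qr_map_xy, qr_map_yaw_deg : known map position / orientation of the QR code
    qr_offset_px              : QR-code centre offset from the image centre, in pixels
                                (image x to the right, image y downwards assumed)
    qr_yaw_in_image_deg       : QR-code orientation measured in the image
    """
    # Heading: difference between the code's map orientation and its image orientation.
    heading = (qr_map_yaw_deg - qr_yaw_in_image_deg) % 360.0

    # Offset of the code relative to the UAV, rotated from the camera frame into the map.
    dx_cam = qr_offset_px[0] * metres_per_pixel
    dy_cam = -qr_offset_px[1] * metres_per_pixel        # image y grows downwards
    th = math.radians(heading)
    dx_map = math.cos(th) * dx_cam - math.sin(th) * dy_cam
    dy_map = math.sin(th) * dx_cam + math.cos(th) * dy_cam

    # The code appears displaced by +offset, so the UAV sits at code position - offset.
    return (qr_map_xy[0] - dx_map, qr_map_xy[1] - dy_map), heading

print(uav_pose_from_qr((2.0, 3.0), 90.0, (40, -25), 80.0, 0.002))
```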
Filter pruning is a widely employed model compression technique, with the inter-channel method currently recognized as the most efficient approach for filter pruning. However, existing inter-channel methods have not fully explored the independence between convolutional channels. In this paper, we propose to use the Schatten p-norm to extract rank information between convolutional channels and measure the importance of a specific channel by analyzing the change in rank information after its removal. The principle underlying our pruning approach is that a smaller change in rank information corresponds to a lesser degree of importance for the channel. Besides, to reduce the computation time required for calculating channel importance, we propose employing a prototype-based approach. We have verified the effectiveness and efficiency of our proposed method on various datasets and models. As an example, when applying our approach to ResNet-56, we achieved an accuracy improvement of 0.91% while the model size and FLOPs were reduced by 42.8% and 47.4% respectively on the CIFAR10 dataset.
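The importance criterion can be sketched as follows: flatten the activations into a channel-by-pixel matrix, take its Schatten p-norm from the singular values, and score each channel by how much that norm changes when the channel is removed. The flattening and scoring details are assumptions for illustration, and the paper's prototype-based speed-up is omitted.

```python
import numpy as np

def schatten_p_norm(mat, p=1.0):
    """Schatten p-norm: (sum_i sigma_i^p)^(1/p) over the singular values of `mat`."""
    s = np.linalg.svd(mat, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

def channel_importance(feature_maps, p=1.0):
    """Score each channel by the change in Schatten p-norm after removing it;
    a smaller change marks a less important (more prunable) channel."""
    C = feature_maps.shape[0]
    flat = feature_maps.reshape(C, -1)
    full = schatten_p_norm(flat, p)
    scores = np.empty(C)
    for c in range(C):
        scores[c] = abs(full - schatten_p_norm(np.delete(flat, c, axis=0), p))
    return scores

feats = np.random.default_rng(0).normal(size=(8, 16, 16))   # C x H x W toy activations
print(np.argsort(channel_importance(feats)))                 # channels ordered by prunability
```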
With the rapid advancement of Internet technology and the widespread adoption of smart devices, there has been a substantial increase in multimodal data that conveys identical semantics in diverse coding formats. To foster the advancement of social intelligence, scholars are increasingly investigating the semantic correlations among multimodal data, which is a current research focal point. The primary objective of cross-modal retrieval is to accurately compute the similarity between dissimilar modalities and efficiently retrieve relevant data from other modalities. The objective of this article is to provide a comprehensive overview of the advancements in cross-modal retrieval research. First, it presents a conceptual framework and problem formulation for cross-modal retrieval, elucidating the multimodal nature of image-text cross-modal retrieval. Second, it delves into semantic-representation-learning-based approaches for computing image-text cross-modal similarity and hash-based methods for facilitating cross-modal retrieval. Furthermore, a comparative analysis is conducted on widely adopted evaluation metrics for current cross-modal retrieval techniques, accompanied by an outlook on future research directions.
To enhance the performance of Convolutional Neural Networks (CNNs), channel attention mechanisms have recently been widely employed in CNNs. Most existing channel attention mechanisms assign weights to feature maps to capture their relative importance and thus improve overall network performance. However, they introduce higher latency. Furthermore, many studies indicate that CNNs generate redundant feature maps during convolution, which are relatively similar and unimportant. Empirically, we observe that assigning weights between 0 and 1 to important feature maps can distort their importance. To balance performance and complexity while retaining the information of important feature maps and effectively learning the weights of unimportant ones, we propose a novel Efficient Weight Learning Channel Attention (EWLCA) mechanism. The attention module effectively learns the weights of unimportant feature maps, which prevents distortion without adding high latency, so the proposed EWLCA has stronger representational ability. In addition, we demonstrate the effectiveness of our method through experiments on the CIFAR100, ImageNet, and VOC datasets, including comparisons with other attention modules. On ImageNet classification, our EWLCA achieves a Top-1 accuracy of 81.44% with ResNet34, which is 1.28% higher than the original ResNet34.
Currently, fabric image retrieval faces challenges such as the high cost of image annotation and vulnerability to adversarial perturbations. To minimize manual supervision and enhance the robustness of the retrieval system, this study proposes a robust deep image retrieval algorithm using multi-view self-supervised product quantization for artificially generated fabric images. The method introduces a multi-view module comprising two views enhanced by AutoAugment, an adversarial view, and a high-frequency view of the unlabeled images. AutoAugment generates more varied data, allowing the model to learn more of the different features and structures of fabric texture; fabric images are usually highly complex and diverse, and adding adversarial samples to model training introduces additional noise and variation, which is one of the most effective existing defenses against adversarial attacks; the high-frequency component makes the edges, details, and contrasts in the fabric image clearer. A robust cross-quantized contrastive loss function is also designed to jointly learn codewords and deep visual descriptors by comparing multiple views, effectively increasing the model's robustness and generalization. Experimental results on multiple datasets demonstrate the method's effectiveness: it significantly improves the robustness of the retrieval system compared with other state-of-the-art retrieval algorithms. Our method presents a new approach for fabric image retrieval and is of great significance for improving its performance.
The performance of visual object detection, tracking, and recognition algorithms degrades significantly in low-light environments. To address this problem, we propose a simple yet effective approach for visual tracking under low-light conditions. First, we detect and enhance low-light images using Retinex technology, which boosts image contrast so that our algorithms can extract visual features for tracking. Then, we apply the Kernelized Correlation Filter (KCF) to track the object and refine the tracking result by instance segmentation and bounding-box optimization using Mask R-CNN. Finally, we introduce re-detection into the proposed algorithm to enable the tracker to recapture the object when it reappears after heavy occlusion or out-of-view motion, making the algorithm suitable for long-term tracking. Experiments on VOT2016, OTB50, and UAV20L show that the proposed algorithm largely improves tracking success rate and precision, and its running time meets the needs of real-time applications.
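The enhancement step can be illustrated with a single-scale Retinex sketch: reflectance is estimated as the log ratio of the image to a Gaussian-smoothed illumination map, then rescaled. The Gaussian illumination estimate and the rescaling are standard choices, not necessarily the exact Retinex variant used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def single_scale_retinex(img, sigma=30.0, eps=1e-6):
    """Single-scale Retinex: log(I) - log(Gaussian-smoothed illumination),
    rescaled to [0, 1] to boost contrast before feature extraction."""
    img = img.astype(np.float64) + eps
    illumination = gaussian_filter(img, sigma=sigma) + eps
    r = np.log(img) - np.log(illumination)
    return (r - r.min()) / (r.max() - r.min() + eps)

low_light = np.random.default_rng(1).random((120, 160)) * 0.15   # toy dark frame
enhanced = single_scale_retinex(low_light)
print(float(enhanced.min()), float(enhanced.max()))
```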
Capturing the hidden relationships in 2D pose sequences is crucial for accurate 3D human pose estimation (HPE). Recent studies have shown that frequency-domain information, independent of spatio-temporal information, has strong capability for representing pose sequences. However, few works explore more appropriate ways to fuse these different kinds of information. In this paper, we propose an alternating cyclic approach for fusing spatio-temporal and frequency information to achieve accurate 3D human pose estimation. The designed alternating cyclic fusion network allows a more comprehensive integration of different features, leading to improved accuracy. By leveraging feature splitting and time-frequency convolution, the existing features are processed more appropriately while keeping the model lightweight. Experimental results demonstrate that our approach achieves accuracy comparable to state-of-the-art methods while significantly outperforming mainstream methods in terms of model size. In conclusion, introducing frequency-domain information is of great significance for pose estimation tasks.
The accurate assessment of cardiac function is crucial for preventing and controlling cardiovascular diseases and reducing global mortality rates. In recent years, the rapid advancement of machine learning and deep learning, particularly the utilization of artificial intelligence technologies such as convolutional neural networks and multi-task learning, has significantly improved the objectivity and precision of assessing echocardiogram images. However, existing methods lack a thorough exploration of the intrinsic relationship between ejection fraction (EF), end-diastolic volume (EDV), and end-systolic volume (ESV) calculations, which ultimately influence the accuracy of cardiac function assessment. Therefore, we propose an AI-Based Multi-task Framework for Cardiac Function Assessment through echocardiograms. The framework utilizes a 3DCNN network to concurrently extract spatial and temporal features from echocardiographic videos. It employs multi-task learning by assigning varying weights to sub-tasks, enhancing the prediction accuracy of ejection fraction through joint training. Experimental results on the publicly available Echonet-Dynamic dataset demonstrate that the proposed framework achieves promising performance in ejection fraction prediction, with mean absolute error, root mean square error, and R2 scores of 3.89%, 5.13%, and 0.82, respectively, surpassing other comparative methods. This framework will further aid clinicians in more accurate cardiac function assessment, offering promising prospects for its practical application.
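A minimal sketch of a weighted multi-task objective over (EF, EDV, ESV) and of the reported regression metrics (MAE, RMSE, R2) is given below; the task weights and the L1 form of the per-task losses are illustrative assumptions.

```python
import numpy as np

def weighted_multitask_loss(pred, target, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of per-task L1 losses for EF, EDV and ESV predictions."""
    l1 = lambda a, b: float(np.abs(a - b).mean())
    w_ef, w_edv, w_esv = weights
    return (w_ef * l1(pred["ef"], target["ef"])
            + w_edv * l1(pred["edv"], target["edv"])
            + w_esv * l1(pred["esv"], target["esv"]))

def regression_metrics(y_pred, y_true):
    """MAE, RMSE and R^2, as commonly reported for EF prediction."""
    err = y_pred - y_true
    mae = float(np.abs(err).mean())
    rmse = float(np.sqrt((err ** 2).mean()))
    r2 = 1.0 - float((err ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum())
    return mae, rmse, r2

y_true = np.array([55.0, 60.0, 35.0, 70.0])     # toy EF ground truth (%)
y_pred = np.array([57.0, 58.0, 40.0, 68.0])
print(regression_metrics(y_pred, y_true))
```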
Explaining the complicated mechanisms underlying the image classification capabilities of deep Convolutional Neural Network (CNN) models continues to pose huge challenges within the field of computer vision. To address this concern, numerous interpretability methods have been devised to clarify the image classification process. These include approaches such as sensitivity maps, which compute the gradients of class outputs with respect to input images, and techniques like class activation mapping (CAM). Furthermore, injecting noise into input images has emerged as an effective strategy for improving visualization quality and suppressing noise. In this paper, we make two key contributions: a novel approach that injects noise into network weights to enhance visualization, involving image gradient updates and average gradient computation; and a new indicator for evaluating interpretability, the center of gravity. Comprehensive experiments were conducted on multiple datasets and different deep neural network models. In the experimental sections, we demonstrate that our method achieves superior visualization quality and can be combined with other interpretability methods to enhance their performance.
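The proposed center-of-gravity indicator can be read as the intensity-weighted mean coordinate of a saliency map, to be compared against the object's true centre; the sketch below assumes that reading and is not the paper's exact scoring protocol.

```python
import numpy as np

def center_of_gravity(saliency):
    """Intensity-weighted mean pixel coordinate of a non-negative saliency map."""
    sal = np.clip(saliency, 0, None).astype(np.float64)
    total = sal.sum()
    if total == 0:
        return None
    rows, cols = np.indices(sal.shape)
    return (rows * sal).sum() / total, (cols * sal).sum() / total

toy = np.zeros((10, 10))
toy[2:5, 6:9] = 1.0                       # salient blob centred at (3, 7)
print(center_of_gravity(toy))
```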
Image Processing and Computing Model Based on Machine Learning
Lip-reading technology has the advantage that it can be used even in noisy environments and has been actively studied in recent years. In this paper, we develop a navigation application, "KuchiNavi," as a new application of lip-reading technology. The underlying technology is word-level lip-reading, which utilizes an existing deep-learning model. We quantitatively evaluated lip-reading accuracy by selecting words for navigation, collecting utterance scenes ourselves, building an original dataset, and conducting recognition experiments. In this paper, 101 Japanese words were selected, utterance scenes were collected from 15 people, and recognition experiments were conducted as a speaker-independent recognition task using the leave-one-person-out method. As a result, an average recognition rate of 88.2% was obtained. In addition, we developed an iOS app and conducted an in-car demonstration to confirm its effectiveness.
Image feature extraction and matching are fundamental but computation-intensive tasks in machine vision. This paper proposes a novel FPGA-based embedded system to accelerate feature extraction and matching. It implements SURF feature point detection and BRIEF feature descriptor construction and matching. For binocular stereo vision, feature matching includes both tracking matching and stereo matching, which simultaneously provide feature point correspondences and parallax information. The proposed design can process binocular video at a high frame rate (640 x 480 @ 162 fps). Unlike similar works, the proposed approach takes feature point distribution into account in the hardware design and can homogenize feature points over the image on the fly; its impact on subsequent feature point matching and on homography transform matrix calculation is also evaluated. Experimental results demonstrate that our approach reduces feature point matching overhead and has little adverse effect on homography projection.
A novel intelligent electronic document layout recognition method based on deep learning is proposed. A text detection approach is used to detect string positions and regions, adjacent regions are merged based on the distance between text zones, and the document layout style is then determined by calculating the degree of match between the printed document and a publication template set. The proposed method constructs an electronic document representation tree in which the locations of the area bounding boxes are stored. The maximum match distance between trees is calculated and used to judge the document layout based on structural similarity. Experimental results show that this method can quickly and accurately distinguish electronic documents with different layout styles. Users can not only recognize the layout of a printed publication in real time, but also find the desired layout style among a large number of printed publication images. The method can meet different usage needs in practical applications.
Recently, Transformer-based methods have achieved excellent results in various computer vision tasks, including Single Image Super-Resolution (SISR). In SwinIR, the mechanism of cross-window connection and local self-attention of Swin Transformer are introduced into the SISR task, achieving breakthrough improvements. However, the local self-attention mechanism of Swin Transformer has a limited spatial range of input pixels, which limits the ability of the super-resolution network to extract features in a wide range. Aiming at this problem, an enhanced CNN and Transformer hybrid module is designed for feature extraction by combining self-attention, spatial attention and channel attention. Taking advantage of their complementary strengths, the range of activated pixels is expanded while still maintaining a strong capability for local feature characterization. In addition, simply extending the activation pixel range without constraints is not conducive to reconstruction. Aiming at this problem, the Neural Window Fully-connected Conditional Random Fields (NeW FC-CRFs) are integrated for feature fusion. The shallow features are inputted into NeW FC-CRFs along with deep features, allowing for the utilization of multi-level information during the fusion process. In summary, we propose the Hybrid Attention Super Resolution Network with Conditional Random Field (HANCRF). Extensive experiments show that HANCRF achieves competitive results with a small number of parameters.
Agricultural and forestry pests are numerous in variety, highly harmful, and spread explosively, which greatly affects the growth of crops. It is therefore of great significance to correctly identify insects and provide their characteristics and control methods. An insect recognition method based on convolutional neural networks is studied in this paper. First, the AlexNet and ResNet convolutional network structures are built and analyzed. Then, through training and testing on a dataset of seven insect types, the ResNet network with the better recognition performance is selected; its recognition accuracy on the test set reaches 96.2%. Finally, by rotating, scaling, damaging, and blurring part of the test photos, the average recognition accuracy of ResNet still reaches 94.67%, indicating that ResNet has strong anti-interference ability and can be used as an effective insect recognition algorithm.
Multi-view clustering is a complex and significant task in machine learning and data mining. Most existing multi-view clustering models assume views with complete information. However, data loss inevitably occurs during collection and transmission, leading to the problems of partial individual unalignment (IU) and individual missing (IM). To address these challenges, this article proposes a framework called incomplete multi-view clustering with multiple contrastive learning and attention mechanism (IMCLAM). IMCLAM maximizes the mutual information between different views and enhances the separability of the representation through multiple contrastive learning, and fuses view-specific low-dimensional representations into a joint representation through an attentional fusion layer. Moreover, the effect of negative samples is reduced by adding a noise-robustness loss. Experiments on four multi-view datasets demonstrate the effectiveness of IMCLAM on the multi-view clustering task compared with six state-of-the-art methods.
The least squares method is common and classical in regression analysis and is often used to solve convex optimization problems, but the traditional route of hand-writing solver code shows its disadvantages when dealing with common least squares problems. One significant drawback is that it is hard for non-professional users without knowledge of CPU/GPU architecture to produce high-performance solvers, and it is also difficult to review or improve solvers that have already been written, since many fine details related to the processor architecture may be hard-coded into the source code. In this paper, we propose a new domain-specific language (DSL) for producing non-linear least squares solvers for research purposes, with a backend of Gauss-Newton and Levenberg-Marquardt methods implemented in cuSPARSE and cuBLAS. The DSL, paired with a C/C++ interface, has a user-friendly syntax that can easily be used to write energy functions and generate GPU solvers whose performance is close to hand-written CUDA solvers.
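The solver backend can be illustrated with a dense Levenberg-Marquardt loop; the paper's DSL emits GPU solvers via cuSPARSE/cuBLAS, so this CPU-side NumPy sketch only shows the damped normal-equation step, with a toy exponential-fit energy as the example.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, n_iter=50, lam=1e-2):
    """Solve (J^T J + lam*I) dx = -J^T r each step, shrinking the damping
    factor after accepted steps and growing it after rejected ones."""
    x = np.asarray(x0, dtype=float)
    cost = 0.5 * np.sum(residual(x) ** 2)
    for _ in range(n_iter):
        r, J = residual(x), jacobian(x)
        dx = np.linalg.solve(J.T @ J + lam * np.eye(x.size), -J.T @ r)
        new_cost = 0.5 * np.sum(residual(x + dx) ** 2)
        if new_cost < cost:
            x, cost, lam = x + dx, new_cost, lam * 0.5   # accept: behave like Gauss-Newton
        else:
            lam *= 2.0                                   # reject: behave like gradient descent
    return x

# Toy energy: fit y = a * exp(b * t) to samples generated with a=2, b=1.5.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(1.5 * t)
res = lambda p: p[0] * np.exp(p[1] * t) - y
jac = lambda p: np.stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)], axis=1)
print(levenberg_marquardt(res, jac, [1.0, 1.0]))         # ~ [2.0, 1.5]
```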
Infrared and visible image fusion aims to synthesize a new image with complementary information from the source images, such as thermal radiation information and detailed texture information. However, existing image fusion methods tend to focus on improving visual quality and evaluation indicators while ignoring the importance of semantic information for high-level image processing tasks. In this paper, we present a novel approach to image fusion in which fused images are synthesized using semantic layouts generated via unsupervised learning. With the introduction of an attention mechanism, the relationships between pairs of pixels are obtained to construct soft semantic layouts and to capture global context information, so that regions with the same semantics show the same fusion effect. By leveraging the semantic information, our method automatically learns the fusion weights of the two source images at each spatial position. In comparison with other state-of-the-art image fusion methods, our experiments achieve excellent performance on both qualitative results and quantitative indicators. Moreover, our method retains high-level semantic information to the greatest extent, which is one of its outstanding characteristics.
Big data is the bottleneck of deep learning in the current scenario, as data itself comes with the drawbacks of expensive collection costs and even the inability to gather it at all. How to achieve good learning performance in situations with insufficient sample size has increasingly gained attention. The practical value of small sample learning is self-evident, as this technique aims to learn concepts of new classes through a few labeled samples. Data augmentation is the most intuitive approach to address small sample learning, and recent works have demonstrated its feasibility by proposing various data synthesis models. However, data augmentation during model training has a significant drawback. It can easily lead to over-fitting since it relies on a biased distribution formed by only a few training examples. In this paper, we propose a method to generate high-quality pseudo-samples by calculating a regularization factor that constrains the model generator based on statistical distribution information from a large number of classes.
A novel method is proposed to construct relatively rich vector patterns from existing examples to address the problem of excessively simple and coarse details in automatically generated patterns. This method involves several key steps, including the extraction of vectorized primitives, the construction of primitive relationships, and the intelligent generation of patterns through optimization algorithms. Specifically, vectorized primitives are extracted from raster images, and directed graphs are used to establish relationships between primitives, taking into account the geometric relationships of the graph. Primitive relationships are calculated based on the extracted geometric relationships, and relevant constraints are used to transform the original pattern. The transformed pattern is then optimized to produce a more harmonious and aesthetically pleasing pattern variation. Experimental results show that the proposed algorithm can generate a diverse set of novel pattern variants, and the optimized variants demonstrate high levels of harmony and aesthetics. Users have the ability to influence the direction of pattern generation by adjusting the primitives, enabling them to compare and select the generated pattern variants that align with their implicit preferences. The proposed method provides an effective solution for pattern generation, catering to various requirements in practical applications and delivering a range of diverse pattern graphics for products.
Compared with traditional light field rendering, the neural radiance field can use a neural network to fit ray samples of the scene and implicitly encode the light field of the input images to synthesize new views. An improved neural radiance field view synthesis method is proposed to reduce training time and accelerate rendering, based on pixel-adaptive selection and trapezoidal numerical integration. In the pixel selection stage, a pixel-adaptive selection method guided by the rendering loss reasonably allocates pixel samples, so that regions that render poorly or contain finer detail receive more iterations. In the ray sampling stage, a volume rendering integration formula using trapezoidal numerical integration is proposed: combined with the network structure of the neural radiance field, trapezoidal numerical integration is used in the coarse network stage and rectangular numerical integration in the fine network stage, which maintains view synthesis quality while reducing the number of uniform ray samples. Training and testing experiments were conducted on the Realistic Synthetic 360° dataset and the Real Forward-Facing dataset. Experimental results show that, compared with the original neural radiance field method, this method reduces training time by 23% and improves view synthesis efficiency by 25% while achieving similar view synthesis quality.
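The quadrature change can be sketched by comparing the usual rectangular alpha-compositing rule with a trapezoidal variant that averages the density at adjacent samples; the exact formula used in the paper may differ.

```python
import numpy as np

def render_weights(sigmas, deltas, rule="rectangular"):
    """Volume-rendering weights along one ray.

    rectangular : alpha_i = 1 - exp(-sigma_i * delta_i)              (standard NeRF rule)
    trapezoidal : alpha_i = 1 - exp(-(sigma_i + sigma_{i+1}) / 2 * delta_i)
    """
    if rule == "trapezoidal":
        sig, dlt = 0.5 * (sigmas[:-1] + sigmas[1:]), deltas[:-1]
    else:
        sig, dlt = sigmas, deltas
    alpha = 1.0 - np.exp(-sig * dlt)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))   # transmittance T_i
    return trans * alpha

sigmas = np.abs(np.sin(np.linspace(0, 3, 64)))     # toy density samples along a ray
deltas = np.full(64, 3.0 / 64)
print(render_weights(sigmas, deltas, "rectangular").sum(),
      render_weights(sigmas, deltas, "trapezoidal").sum())
```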
The Fast Approximate Anti-Aliasing (FXAA) algorithm is one of the most widely used real-time anti-aliasing algorithms in practice. As a post-processing algorithm it offers low cost and high efficiency: it achieves a real-time anti-aliasing effect by blending and smoothing object edge pixels using blending factors. However, because FXAA relies on pixel blending to achieve the visual effect of anti-aliasing, it tends to blur the image. To solve this problem, this paper proposes an improved anti-aliasing algorithm that combines FXAA with a Laplacian sharpening variant. The general idea is to improve the blending factor formula and the pixel blending method of the original algorithm, and then further sharpen the blended anti-aliasing result with a sharpening kernel. In this way, the algorithm improves the sharpness of the anti-aliased image while preserving the real-time performance of a post-processing method. Experimental results show that the image generated by the improved algorithm is sharper than that of the original Fast Approximate Anti-Aliasing algorithm.
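The sharpening stage can be illustrated with a classic Laplacian sharpening kernel applied to the anti-aliased frame; the kernel and the blending amount below are generic choices, not the paper's specific Laplacian variant or blending-factor formula.

```python
import numpy as np
from scipy.ndimage import convolve

# Classic Laplacian sharpening kernel (identity minus the 4-neighbour Laplacian).
SHARPEN_KERNEL = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]], dtype=np.float64)

def sharpen(image, amount=1.0):
    """Re-sharpen an anti-aliased (blended) frame; `amount` blends between the
    input (0.0) and the fully sharpened result (1.0)."""
    sharpened = convolve(image, SHARPEN_KERNEL, mode="nearest")
    return np.clip((1.0 - amount) * image + amount * sharpened, 0.0, 1.0)

frame = np.random.default_rng(2).random((32, 32))   # stand-in for an FXAA output frame
print(sharpen(frame, amount=0.5).shape)
```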
In recent years, with the rapid development of the ADVB (Avionics Digital Video Bus), ADVB has gradually replaced the original airborne DVI (Digital Visual Interface), VGA (Video Graphics Array), LVDS (Low-Voltage Differential Signal) and other video transmission modes, and has become an important data carrier in the avionics systems of advanced fighter aircraft, civil aircraft, helicopters, trainers, and other new types of test aircraft. Faced with the testing of this new video bus, existing test methods can no longer meet the requirements. This paper therefore starts from research on the basic protocol, combines the ICD (Interface Control Document) structure of the airborne avionics system with its characteristic information, and completes the adaptive processing of high-speed, diversified video formats through adaptive identification, resampling, and timing reconstruction, realizing generalized testing across multiple aircraft types. On a constructed test platform, the collection and recording of unconventional video such as 1680x1050@60fps and high-resolution video such as 2560x1024@30fps were successfully realized, and the recorded video data plays back completely and clearly, verifying the validity of the proposed method. The designed test system is compatible with the existing video test environment and can effectively reduce the cost and improve the efficiency of flight testing.
To adjust the local color style of real images in a simple, low-threshold way while maintaining the harmony of the image and the realism of its content, this paper proposes a high-fidelity image adjustment method using color envelopes and sliced local optimal transport. The method uses a simple mask to segment the image; extracts the initial palette and palette weight matrix of each image block from the color envelope in the dual RGB and RGBXY color spaces; uses the sliced local optimal transport algorithm to transfer the color style of the image blocks related to the operator's adjustment target, obtaining template image blocks; and derives the optimal palette of each corresponding image block from the template image block and the palette weight matrix, thereby achieving automatic palette adjustment. The experimental results show that the proposed method can automatically process the other image blocks related to the operator's adjustment target, effectively adjust their color styles, and ensure the global harmony of the image.
In this paper, we propose a lightweight approach to address the problems of noise amplification, detail loss, and edge blurring encountered in low-light image enhancement, which typically hinder subsequent computer vision tasks. Our proposed method utilizes transformers and depth-separated convolution, integrates the ISP image processing pipeline, and incorporates semantic information. First, we introduce an enhancement compensation extraction module that utilizes a transformer-based two-branch structure. Notably, depth-separated convolution replaces the original multi-attention transformer to achieve a lightweight design. The local branch estimates pixel-level illumination defects in low-light images, and the global branch enhances global structural information. Subsequently, we design a progressive enhancement module to receive enhancement compensation and reconstruct the enhanced image using the ISP pipeline. Together, these two modules form an enhancement network. Finally, we design a VGG16-based semantic segmentation module to preserve semantic information during the enhancement process and complete the image enhancement. Evaluations on benchmark datasets and extensive experiments with other algorithms demonstrate the effectiveness of our proposed low-light image enhancement method. The reconstructed enhanced images are improved in terms of brightness, contrast and detail sharpness, while effectively mitigating the problems associated with noise amplification and edge blurring.
To solve the problem that image binarization under uneven illumination cannot balance processing speed and segmentation quality, this paper proposes a background estimation and threshold segmentation method for unevenly illuminated images using the second (double) moving average. The method uses double moving average prediction to track the trend of the gray values in each row of the image, locates the transition points between foreground targets and background, and fills in the background behind the missing foreground; the foreground is then separated from the background by differencing, and the target image is extracted for recognition. Experimental results show that the proposed algorithm alleviates, to a certain extent, the hollowing-out of large foreground regions caused by the small neighborhood size of local threshold algorithms, and raises the image processing speed to 140 ms per frame, effectively improving background processing and accelerating image segmentation. Under certain conditions it can be applied to real-time detection scenarios requiring robustness to uneven lighting.
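The row-wise background estimation can be sketched with a double (second) moving-average pass followed by differencing and thresholding; the window size, the offset, and the dark-foreground-on-bright-background assumption below are illustrative, not the paper's tuned parameters.

```python
import numpy as np

def moving_average(x, n):
    """Trailing moving average with left-edge padding so the output keeps x's length."""
    pad = np.concatenate([np.full(n - 1, x[0]), x])
    return np.convolve(pad, np.ones(n) / n, mode="valid")

def segment_uneven_illumination(gray, window=31, offset=15):
    """Estimate the illumination trend of each row with two moving-average passes,
    then extract the foreground by differencing and thresholding."""
    background = np.empty(gray.shape, dtype=np.float64)
    for r in range(gray.shape[0]):
        trend = moving_average(gray[r].astype(np.float64), window)
        background[r] = moving_average(trend, window)        # second smoothing pass
    diff = background - gray.astype(np.float64)              # dark foreground on bright bg
    return (diff > offset).astype(np.uint8)

img = np.tile(np.linspace(60, 200, 200), (100, 1))           # uneven illumination ramp
img[40:60, 80:120] -= 50                                     # dark foreground object
print(int(segment_uneven_illumination(img).sum()))
```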
This paper proposes a fast and accurate method to estimate camera pose and focal length given a known camera position and vertical direction. The proposed method uses only a single 2D-3D point correspondence, making it a perspective-n-point (PnP) problem. First, a camera with known pose is placed in the world frame, and its position is the ground truth. Then, after several intermediate camera frames are established, the main work is to rotate the original camera frame into the final camera frame in which the line through the camera position and the 2D point and the line through the camera position and the 3D point are collinear. When the two lines are collinear, the camera pose can be estimated because the rotation angles are known, and an equation system in two variables (camera pose and focal length) can be written, yielding a single solution for camera pose and focal length. Finally, a thorough test on synthetic data is conducted for the proposed method and several state-of-the-art perspective-n-point solvers. The experimental results show that the proposed method performs better in terms of numerical stability, noise sensitivity, and computational speed, and is robust to errors in the camera position and vertical direction.
Image Detection Technology and Engineering Applications
Object detection continues to be a significant challenge in computer vision. Despite advancements made possible through deep learning, these models predominantly depend on extensive and diverse annotated training data. Such data, unfortunately, often lacks representation of many real-world scenarios. To bridge this gap, we use target images from the original dataset to train a specialized generator. The main intent behind producing these images is to mimic the appearance of targets across a broader spectrum of real-world situations. Once integrated with the primary dataset, these synthetically generated images act as an effective augmentation to the original training set, encompassing scenarios and variations previously absent. This autonomous method eliminates the need for external data sources, proving to be more practical in most situations. Our empirical findings highlight significant improvements: with the ResNet-34 backbone, the mAP for SSD rose notably from 0.185 to 0.233. Furthermore, for small objects detected by Faster R-CNN with the ResNet-101 backbone, there is a pronounced improvement from 0.213 to 0.225. These results underscore our method's efficacy, especially in enhancing detection capabilities for underrepresented scenarios and smaller objects.
To cope with the threat of image content tampering in real scenes, this paper develops a multi-view spatial-channel attention network (MSCA-Net), which can use multi-view features and multi-scale features to detect whether an image has been tampered with and predict tampered regions. By introducing the frequency domain view of the image, the model can use the noise distribution around the tampered region to learn semantically independent features and detect subtle tampering traces that are difficult to detect in the RGB domain. Secondly, a new Efficient Spatial-Channel Attention Module (ESCM) is proposed to capture the correlation between different channels and between global pixels. MSCA-Net improves the localization performance of tampered regions on real-scene images by generating segmentation masks step by step at multiple scales through a progressive guidance mechanism. MSCA-Net runs very fast and is capable of processing 1080P resolution images at 40FPS+. Extensive experimental results demonstrate the promising performance of MSCA-Net on both image-level and pixel-level tampering detection tasks.
In industrial manufacturing, defect detection is essential. Since ViT (vision transformer) appeared in 2020, it has been increasingly used for defect detection tasks in the vision domain. The advantage of ViT over convolutional neural networks (CNNs) is its ability to capture global long-range dependencies and thus learn better features. In addition, contrastive learning based on self-supervised methods has been used successfully in defect detection tasks. In this study, we propose a fabric defect detection strategy that combines a transformer with contrastive learning. First, we propose a new backbone network, CViT (convolutional vision transformer), which improves on ViT by adding a convolutional attention module to the ordinary transformer block and by using depthwise separable convolution instead of linear projection to obtain q, k, and v for attention computation. Second, to compensate for the potential instability of CViT, instead of the large 16 x 16 convolutions used in ViT, we use several stacked small 3 x 3 convolutions to divide each augmented sample into a series of patches. Third, we incorporate conditional position encoding (CPE) and explore the impact of different position encodings on model performance. Finally, the effectiveness of our model is demonstrated on three classical public datasets for fabric defect detection.
Dynamic objects are ubiquitous in real-world environments. However, traditional Simultaneous Localization and Mapping (SLAM) algorithms assume static scenes and fail in dynamic ones, so SLAM systems that achieve precise and efficient localization and mapping in dynamic environments are needed. Toward this goal, this paper proposes a dynamic SLAM algorithm incorporating the target detection model YOLOv7. The general process of the algorithm is as follows. First, the real-time detector YOLOv7 segments the input image into static and dynamic parts. Second, feature information is extracted from the image frame; features located in the dynamic part are eliminated so that the subsequent tracking and mapping are accurate. Next, tracking and mapping are performed on the image frame without dynamic objects. Finally, global Bundle Adjustment (BA) optimization is performed to obtain more accurate pose and map information. Experimental results on TUM and KITTI show that the proposed algorithm achieves the same accuracy as DynaSLAM while being more efficient than both DynaSLAM and Dynamic ORB_SLAM on KITTI and TUM; in particular, it is 33.03% faster than Dynamic ORB_SLAM on TUM. The proposed algorithm therefore allows the SLAM system to achieve high accuracy in dynamic scenarios while maximizing speed.
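The dynamic-feature removal step can be sketched as discarding keypoints that fall inside detected dynamic-object boxes; the sketch below uses OpenCV ORB features and a hand-written detection list in place of ORB-SLAM's front end and YOLOv7, and the dynamic class list is an assumption.

```python
import numpy as np
import cv2

DYNAMIC_CLASSES = {"person", "car", "bicycle"}   # classes treated as potentially moving

def filter_dynamic_keypoints(gray, detections):
    """Detect ORB keypoints and drop those inside any dynamic-object bounding box,
    so tracking and mapping only use the static part of the frame.

    `detections` is a list of (label, (x1, y1, x2, y2)) boxes, e.g. from a detector.
    """
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints = orb.detect(gray, None)
    dyn_boxes = [box for label, box in detections if label in DYNAMIC_CLASSES]

    def in_dynamic_region(kp):
        x, y = kp.pt
        return any(x1 <= x <= x2 and y1 <= y <= y2 for x1, y1, x2, y2 in dyn_boxes)

    static_kps = [kp for kp in keypoints if not in_dynamic_region(kp)]
    return orb.compute(gray, static_kps)          # (keypoints, descriptors)

frame = np.random.default_rng(3).integers(0, 255, (480, 640), dtype=np.uint8)
kps, desc = filter_dynamic_keypoints(frame, [("person", (100, 50, 300, 400))])
print(len(kps))
```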
In this paper, we propose a novel salient object detection framework by constructing a novel saliency tree model integrating low-level and high-level features. In our model, numerous features containing low-level features (e.g., color, texture, gradient, contrast, etc.) and high-level features (e.g., deep features extracted from pre-trained VGG19 net) are firstly selected as candidate features. We develop a novel feature integrating mechanism to acquire an integrated feature descriptor which is more discriminative to capture the contrast between foreground and background for the input image. Then, we construct a novel saliency tree model relied on the integrated features to generate saliency map. We compare the proposed method and other state-of-the-art methods on three datasets, experimental results indicate that the proposed saliency detection algorithm has achieved the top performance.
Anomaly detection is an essential task in industrial applications. Traditional anomaly detection algorithms based on convolutional neural networks (CNNs) struggle to extract global context information, resulting in poor anomaly detection performance. Moreover, the diversity and inherent uncertainty of anomalous samples present considerable constraints on the effectiveness of traditional anomaly detection algorithms. In this paper, a novel Swin Transformer Unet with Random Masks for self-supervised anomaly detection is proposed. First, a random mask strategy (RMS) is adopted to generate simulated anomalies in anomaly-free samples, addressing the limited availability of abnormal samples during the training phase. Furthermore, to enhance the overall feature representation for anomaly detection tasks, the Swin Transformer Unet is utilized as the backbone network to extract local features and global contextual information from multi-scale feature maps. Experimental results on the industrial dataset MVTec AD demonstrate that our model achieves comparable performance in terms of anomaly detection.
Automatic Traffic Sign Detection and Recognition (ATDR) has emerged as a cornerstone in the rapidly evolving landscape of Intelligent Transportation Systems (ITS). As urban environments grow increasingly complex and the demand for smarter transportation solutions escalates, the significance of ATDR becomes ever more pronounced. Despite its growing prominence, real-world challenges, particularly the diminutive size of traffic signs in images, have hindered the performance of existing detection systems. Addressing this, we introduce a groundbreaking framework tailored to surmount these specific challenges. First, a transformer-enabled adaptive feature extractor is designed in the proposed network model to enhance the features of important areas and suppress the features of non-important areas through cross-space and cross-scale interactions of input features at each level. Subsequently, a convolutional feature fusion module is introduced, mitigating the semantic gaps that often exist between multi-scale features, streamlining the model by optimizing its parameters, reducing computational overhead, and ensuring that the dimensions align seamlessly with the input feature map. By constructing such a transformer-enabled adaptive spatial feature fusion module, small traffic signs can be effectively identified. Thorough evaluations on the TT100K and GTSDB datasets affirm the effectiveness of the proposed method, showcasing significant advancements in the detection of smaller traffic signs and marking a notable stride in ATDR research.
Panoramic imaging from a cone beam computed tomography (CBCT) dataset mimics an X-ray beam traveling along the dental arch, which represents the teeth and jaw bones. This study proposes a novel dental arch detection algorithm in three-dimensional (3D) CBCT space based on a parametric equation, in order to reconstruct panoramic images with clear visualization of jaw bones, teeth, and dental pulps. Our method involves two main steps: dental arch detection and panoramic imaging. The first step detects the dental arch using a parametric equation on a specially selected plane in 3D CBCT space; binary masks of the jaw bones and teeth are required to fit the parametric equation, and we employ deep learning for tooth segmentation and a traditional method for jaw bone segmentation. In the second step, maximum intensity projection and ray-sum projection are applied for panoramic imaging. In the experiments, 20 CBCT datasets are used to evaluate the proposed method and 78 CBCT datasets are used to train the tooth segmentation network. The results show that our proposed method detects the dental arch directly in 3D CBCT space and provides an accurate, effective, and robust solution for CBCT-based panoramic imaging.
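The projection step can be sketched by sampling the volume along the arch normals and collapsing each sampled sheet with either a maximum-intensity projection or a ray sum; the sampling geometry below (arch points on one axial plane with 2D normals) is a simplifying assumption.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def panoramic_projection(volume, arch_points, normals, thickness=20, mode="mip"):
    """Project a CBCT volume (Z x Y x X) onto a panoramic image: for every point on
    the detected dental arch (voxel coordinates on an axial plane) sample the volume
    along the arch normal over `thickness` voxels and collapse the samples with a
    maximum-intensity projection ('mip') or a ray sum ('sum')."""
    zs = np.arange(volume.shape[0])
    offsets = np.linspace(-thickness / 2, thickness / 2, thickness)
    pano = np.zeros((volume.shape[0], len(arch_points)))
    for j, ((y, x), (ny, nx)) in enumerate(zip(arch_points, normals)):
        ys, xs = y + offsets * ny, x + offsets * nx
        zz = np.repeat(zs, thickness)                 # sample a (Z, thickness) sheet
        yy = np.tile(ys, len(zs))
        xx = np.tile(xs, len(zs))
        sheet = map_coordinates(volume, [zz, yy, xx], order=1).reshape(len(zs), thickness)
        pano[:, j] = sheet.max(axis=1) if mode == "mip" else sheet.sum(axis=1)
    return pano

vol = np.random.default_rng(4).random((40, 64, 64))   # toy CBCT volume
arch = [(32.0, x) for x in np.linspace(5, 58, 50)]     # arch points on one axial plane
norms = [(1.0, 0.0)] * 50                              # toy normals
print(panoramic_projection(vol, arch, norms).shape)    # (40, 50)
```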