Many problems in data fusion require reasoning about heterogeneous data obtained from different sources (news media, first-hand reporters, etc.) and containing diverse modalities (image, video, text, audio, etc.). Constructing situation hypotheses, predicting future events, inferring event causes, identifying actors, and performing forensic analyses all require automated decision aids that can quickly and reliably extract relevant information and associate the sensory inputs. While humans can effectively fuse multiple sensory signals, machine learning-based methods cannot yet replicate these cognitive processes. The challenge stems from the heterogeneity gap in multi-modal, multi-source data, which produces inconsistent distributions and representations of the input signals. Researchers have attempted to solve this problem by modeling explicit cross-modal correlations and learning common joint representations of the heterogeneous data. However, only marginal successes have been achieved, and the domains studied have so far been limited to imagery/video-to-text association, with no clear path to generalizing to broader knowledge fusion.
In this paper, we explore the application of three recent modeling techniques to multi-source, multi-modal data fusion: adversarial generative modeling, variational inference, and inverse reinforcement learning. Generative adversarial networks (GANs) estimate a generative model through an adversarial training process, in which a generator that learns the data distribution competes against a discriminator that attempts to distinguish generated data from real data. The distributions learned by GANs can produce high-quality data samples, which is essential for making meaningful cross-modality association decisions. First, we use GANs to model the joint distribution over heterogeneous data of different sources and modalities, learning a common representation and improving the cross-modal correlation estimates. Our approach is based on generative modeling of explicit inter-modality correlation and intra-modality reconstruction, while the discriminative component judges the quality of generated samples both within the same modality and across different modalities. Second, we use variational inference to add classification reasoning to our model, making our solution capable of producing both high-quality cross-modal correlation decisions and classifications of the fused inputs as specific events or activities. Variational inference provides the approximation needed for reasoning over large-scale data by decomposing automated maximum likelihood estimation into perception and control, and it allows learning the disentangled representations that are essential for generalizing across domains. Finally, we use concepts from inverse reinforcement learning to update the parameters of the common joint multi-modal representation.
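The adversarial training process described above can be illustrated with a minimal sketch. This is not the model from the paper: it is a toy 1-D example in which a linear generator G(z) = a*z + b learns to match data drawn from an assumed N(4, 0.5) distribution, while a logistic discriminator D(x) = sigmoid(w*x + c) learns to tell real samples from generated ones. All parameter names, the toy data, and the learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a, b = 1.0, 0.0            # generator G(z) = a*z + b
w, c = 0.0, 0.0            # discriminator D(x) = sigmoid(w*x + c)
d_lr, g_lr, batch = 0.1, 0.1, 128

for outer in range(300):
    # Inner loop: fit the discriminator to the current generator by
    # descending the binary cross-entropy loss (real = 1, fake = 0).
    for _ in range(10):
        x_real = rng.normal(4.0, 0.5, batch)
        x_fake = a * rng.normal(0.0, 1.0, batch) + b
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w -= d_lr * np.mean((d_real - 1.0) * x_real + d_fake * x_fake)
        c -= d_lr * np.mean((d_real - 1.0) + d_fake)

    # One generator step on the non-saturating loss -log D(G(z)):
    # the generator moves its samples toward regions D calls "real".
    z = rng.normal(0.0, 1.0, batch)
    d_fake = sigmoid(w * (a * z + b) + c)
    a -= g_lr * np.mean((d_fake - 1.0) * w * z)
    b -= g_lr * np.mean((d_fake - 1.0) * w)

# After training, generated samples should cluster near the data mean of 4.
gen_mean = float(np.mean(a * rng.normal(0.0, 1.0, 2000) + b))
print(round(gen_mean, 2))
```

The inner discriminator loop keeps D close to its best response before each generator step, which stabilizes this simple simultaneous-gradient scheme; the same generator/discriminator competition, applied across modalities rather than within one, is the mechanism the paper builds on.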
We conclude this paper by studying the application of the proposed model to two use cases: construction of hypotheses and patterns of life from multimedia data whose modalities contain text and imagery artifacts; and distributed decision-making in a geospatial environment, where the input data can come from overlapping and potentially conflicting observations by distinct observers and contain information at different conceptual levels, such as entity movement, skills, goals, and reactions.
Georgiy M. Levchuk, "Variational adversarial generative models for information fusion with applications to cross-domain causal reasoning, hypothesis construction and identity learning (Conference Presentation)," Proc. SPIE 11018, Signal Processing, Sensor/Information Fusion, and Target Recognition XXVIII, 110180U (Presented at SPIE Defense + Commercial Sensing: April 17, 2019; Published: 14 May 2019); https://doi.org/10.1117/12.2522075.