30 April 2022 IVTEN: integration of visual-textual entities for temporal activity localization via language in video
Proceedings Volume 12177, International Workshop on Advanced Imaging Technology (IWAIT) 2022; 121772S (2022) https://doi.org/10.1117/12.2624682
Event: International Workshop on Advanced Imaging Technology 2022 (IWAIT 2022), 2022, Hong Kong, China
Abstract
This study addresses temporal activity localization via natural language (TALL) in untrimmed video. The task is difficult because a disordered query can mislead localization of the target temporal activity. Existing approaches use sliding windows, regression, and ranking to handle the query without grammar-based rules. When a query is out of sequence and cannot be correlated with the relevant activity, the performance of these approaches deteriorates. To address non-sequential queries, we introduce the concepts of visual, action, object, and connecting words. Our proposed architecture, the integration of visual-textual entities network (IVTEN), consists of three submodules: (1) a visual graph convolutional network (visual-GCN), (2) a textual graph convolutional network (textual-GCN), and (3) a compatible method for learning embeddings (CME). Visual nodes detect the activity, object, and actor, while textual nodes maintain word sequence using grammar-based rules. CME integrates the modalities (activity, query) and the trained grammar-based words into the same embedding space. We also include a stochastic latent variable in CME to align and retain the query sequence with the relevant activity. On three typical benchmark datasets, Charades-STA, TACoS, and ActivityNet-Captions, our IVTEN approach outperforms the state of the art.
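The abstract gives no equations, so the following is only an illustrative sketch of the general idea of two graph encoders feeding a shared embedding space, not the authors' implementation. It assumes the widely used symmetric normalization for a graph-convolution layer, ReLU(D^{-1/2}(A + I)D^{-1/2} H W), and it substitutes mean pooling plus cosine similarity for CME, whose details (including the stochastic latent variable) are not described here. All function names, graphs, and feature values are hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so every node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(A_norm @ feats @ weight)."""
    out = matmul(matmul(normalize_adjacency(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def mean_pool(nodes):
    """Average node embeddings into one graph-level vector."""
    return [sum(col) / len(nodes) for col in zip(*nodes)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical visual graph: actor, object, and activity nodes, fully connected.
visual_adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
visual_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical textual graph: three query words linked in sequence.
textual_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
textual_feats = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Assumption: both encoders project into the same 2-D space (identity weights).
shared_weight = [[1.0, 0.0], [0.0, 1.0]]

visual_vec = mean_pool(gcn_layer(visual_adj, visual_feats, shared_weight))
textual_vec = mean_pool(gcn_layer(textual_adj, textual_feats, shared_weight))
score = cosine(visual_vec, textual_vec)
print(f"video-query compatibility score: {score:.3f}")
```

In a localization setting, a score of this kind would be computed for each candidate video segment against the query, and the highest-scoring segment would be returned; how IVTEN actually scores and ranks segments is not stated in the abstract.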
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Hafiza Sadia Nawaz, Eric Rigall, Munaza Nawaz, Israel Mugunga, and Junyu Dong "IVTEN: integration of visual-textual entities for temporal activity localization via language in video", Proc. SPIE 12177, International Workshop on Advanced Imaging Technology (IWAIT) 2022, 121772S (30 April 2022); https://doi.org/10.1117/12.2624682
KEYWORDS
Visualization

Video

Computer programming

Information visualization

Feature extraction

Transformers

Finite element methods
