OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog

Published in the International Conference on Computational Linguistics (COLING), 2024