Multimodal Attention in Recurrent Neural Networks for Visual Question Answering

Keywords

visual question answering (VQA)
multimodal attention mechanism
convolutional neural networks (CNN)
recurrent neural networks (RNN)
long short-term memory (LSTM)

How to Cite

Lorena Kodra, & Elinda Kajo Mece. (2018). Multimodal Attention in Recurrent Neural Networks for Visual Question Answering. Global Journal of Computer Science and Technology, 17(D1), 1–8. Retrieved from https://gjcst.com/index.php/gjcst/article/view/639

Abstract

Visual Question Answering (VQA) is a task for evaluating image scene understanding abilities and shortcomings, and for measuring machine intelligence in the visual domain. Given an image and a natural-language question about the image, the system must ground the question in the image and return an accurate answer in natural language. Much progress has been made toward the challenges of this task by combining the latest advances in image representation and natural language processing. Several recently proposed solutions include attention mechanisms designed to support reasoning. These mechanisms allow models to focus on specific parts of the input in order to generate the answer and improve its accuracy. In this paper, we present a novel LSTM architecture for VQA that uses multimodal attention to focus on specific parts of the image as well as on specific question words when generating the answer. We evaluate our model on the VQA dataset and demonstrate that it performs better than the state of the art. We also present a qualitative analysis of the results, showing the abilities and shortcomings of our model.
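To make the idea of multimodal attention concrete, the following is a minimal sketch of an LSTM-based VQA model that attends both over image regions and over question words. It is not the authors' exact architecture; the layer sizes, the multiplicative fusion scheme, and all names below are illustrative assumptions.

```python
# Minimal sketch of multimodal (image + question-word) attention for VQA.
# NOT the paper's exact architecture: dimensions, the fusion scheme, and
# the classifier head are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAttentionVQA(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 img_feat_dim=2048, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project CNN region features into the LSTM hidden-state space.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # One attention scoring head per modality.
        self.img_att = nn.Linear(hidden_dim, 1)
        self.txt_att = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, question_tokens, img_feats):
        # question_tokens: (B, T) word indices
        # img_feats: (B, R, img_feat_dim) CNN features for R image regions
        h, _ = self.lstm(self.embed(question_tokens))     # (B, T, H)
        q = h[:, -1]                                      # question summary (B, H)

        v = torch.tanh(self.img_proj(img_feats))          # (B, R, H)
        # Question-guided attention over image regions.
        img_alpha = F.softmax(self.img_att(v * q.unsqueeze(1)), dim=1)
        v_att = (img_alpha * v).sum(dim=1)                # attended image (B, H)

        # Image-guided attention over question words.
        txt_alpha = F.softmax(self.txt_att(h * v_att.unsqueeze(1)), dim=1)
        q_att = (txt_alpha * h).sum(dim=1)                # attended question (B, H)

        # Fuse both attended representations and classify over the answer set.
        return self.classifier(torch.cat([v_att, q_att], dim=-1))
```

The key point the sketch illustrates is the coupling between modalities: the question summary weights the image regions, and the attended image representation in turn weights the question words, so each modality guides where the model looks in the other.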

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2017 Authors and Global Journals Private Limited