Visual Question Answering: An Analysis of Various AI Models and Datasets
Visual Question Answering (VQA) is one of the most recent advances in the field of Artificial Intelligence (AI). It is a unique task that combines three of the most important realms of AI, namely Computer Vision (CV), Natural Language Processing (NLP), and Knowledge Representation and Reasoning (KR), each of which is being researched extensively. Given an image and an open-ended natural language question about that image, a VQA model must produce an open-ended natural language answer. To achieve this, the model needs to develop an understanding of the different entities in both the image and the question, and of the dependencies between them; for this reason, VQA is often regarded as a true AI task. In this review, we detail the various algorithms proposed for building VQA models, classifying them by the mechanism used to extract the visual and natural language input features and map them into a common feature vector space. Finally, we analyze the accuracy of these models and propose some alternatives based on Capsule Networks (CapsNets) as future directions.
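The idea of projecting both modalities into a common feature vector space can be illustrated with a minimal sketch. The dimensions, the random "features" standing in for CNN and question-encoder outputs, and the element-wise (Hadamard) fusion are all illustrative assumptions, not the method of any specific model surveyed here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions for illustration only):
d_img, d_q, d_common, n_answers = 2048, 300, 512, 10

# Stand-ins for features from an image encoder (e.g. a CNN)
# and a question encoder (e.g. an RNN over word embeddings).
img_feat = rng.standard_normal(d_img)
q_feat = rng.standard_normal(d_q)

# Learned linear projections map both modalities into one space.
W_img = rng.standard_normal((d_common, d_img)) * 0.01
W_q = rng.standard_normal((d_common, d_q)) * 0.01

# Element-wise fusion of the two projected vectors: one common
# baseline way to form a joint image-question embedding.
joint = np.tanh(W_img @ img_feat) * np.tanh(W_q @ q_feat)

# Answering is then framed as classification over a fixed
# answer vocabulary via a softmax head.
W_ans = rng.standard_normal((n_answers, d_common)) * 0.01
logits = W_ans @ joint
probs = np.exp(logits - logits.max())
probs /= probs.sum()
answer_idx = int(np.argmax(probs))
```

Many of the models surveyed differ precisely in how this fusion step is realized, e.g. concatenation, bilinear pooling, or attention-weighted combinations instead of a plain element-wise product.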