Visual Question Answering using Transformer Architectures: Applying Transformer Models to Improve Performance in VQA Tasks
DOI:
https://doi.org/10.47363/JAICC/2022(1)E228
Keywords:
Visual Question Answering, Transformer Models, Vision Transformers, Multimodal AI, Natural Language Processing, Deep Learning, Assistive Technologies, Ethical AI, Human-AI Collaboration, Lightweight Transformers
Abstract
Visual Question Answering (VQA) is a field of computer science that seeks to develop systems capable of answering natural-language questions about displayed images. BERT, Vision Transformers (ViTs), and multimodal transformers have significantly improved the ability of VQA systems to model the relations between visual and textual data. This paper examines how these architectures, through their scalability, dynamic attention mechanisms, and multimodal pre-training, address the weaknesses of earlier CNN-RNN hybrid models. Use cases in assistive technologies, healthcare, retail, self-driving vehicles, and the creative industries illustrate how VQA might be deployed, and they highlight likely positive and negative societal impacts related to bias, privacy, and inclusion. VQA systems are gradually becoming central to accessibility solutions, for instance, one in which a visually impaired person asks the system to explain what a picture shows. However, current transformer-based VQA systems still face challenges, notably high computational complexity and limited reasoning capability. This paper reviews the current state of research, open issues, and directions for future development, arguing that more attention should be paid to lightweight models, multi-domain datasets, and the integration of human-generated data with AI. The findings indicate that VQA systems can become a key component of context-aware, inquiry-based solutions for advanced applications across many fields.
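To make the idea of transformers relating visual and textual data more concrete, the following is a minimal sketch, under stated assumptions, of scaled dot-product cross-attention in which question-token embeddings act as queries over image-patch embeddings. The function names and toy vectors are hypothetical and this is not the paper's method; it assumes identity query/key/value projections and a single head, whereas real multimodal transformers learn separate projection matrices, use multiple heads, and stack many layers.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens: Matrix, image_patches: Matrix) -> Tuple[Matrix, Matrix]:
    """Hypothetical helper: each question token (query) attends over image patches.

    Assumption: identity projections, so image patches serve directly as both
    keys and values. Returns the attended vectors and the attention weights.
    """
    d = len(image_patches[0])
    scale = math.sqrt(d)  # scaling by sqrt(d) keeps dot products in a stable range
    outputs: Matrix = []
    weights: Matrix = []
    for q in text_tokens:
        # Similarity between this question token and every image patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in image_patches]
        w = softmax(scores)
        # Weighted sum of patch vectors: the visual context this token "looks at".
        attended = [
            sum(w[j] * image_patches[j][i] for j in range(len(image_patches)))
            for i in range(d)
        ]
        outputs.append(attended)
        weights.append(w)
    return outputs, weights
```

For example, with `text_tokens = [[1.0, 0.0], [0.0, 1.0]]` and `image_patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]`, each row of the returned weights sums to 1, and the first token attends most strongly to the first patch. Returning the weights alongside the outputs makes it possible to inspect which image regions a question token relied on.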
License
Copyright (c) 2022 Journal of Artificial Intelligence & Cloud Computing

This work is licensed under a Creative Commons Attribution 4.0 International License.