🔍 Prompting Techniques: Explores zero- and few-shot VQA with models like BLIP2 and Kosmos2.
🔄 VQA Formats: Includes Standard, Caption, and Chain-of-Thought prompting formats to refine model responses (see the sketch after this list).
📊 Performance Boost: Demonstrates substantial improvements in VQA accuracy across benchmarks.
🔧 Key Models: BLIP2 Flan-T5 excels, with Kosmos2 and LLaVa showing variable gains.
📈 Results Summary: Key results are summarized below; see the paper for the complete tables.
📚 Built On: Leverages transformers and LLaVa, extending the toolkit for VQA research.
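A minimal sketch of the three prompting formats in Python. The template wording below is illustrative only, not necessarily the exact templates used in the paper, and `caption` is assumed to be precomputed by a captioning model.

```python
def build_prompt(question: str, fmt: str = "standard", caption: str = "") -> str:
    """Build a VQA prompt in one of the three formats listed above."""
    if fmt == "standard":
        # Standard: question plus a short-answer cue.
        return f"Question: {question} Short answer:"
    if fmt == "caption":
        # Caption: prepend a generated caption as an extra textual visual cue.
        return f"Context: {caption}\nQuestion: {question} Short answer:"
    if fmt == "cot":
        # Chain-of-Thought: elicit step-by-step reasoning before the answer.
        return f"Question: {question} Let's think step by step."
    raise ValueError(f"Unknown format: {fmt}")
```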
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contemporary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We find that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features. Surprisingly, this augmentation significantly improves performance in many cases, even though VLMs "see" the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning hurts performance, advanced methods like self-consistency can help recover it. Furthermore, we find that text-only few-shot examples improve VLMs' alignment with the task format, particularly benefiting models prone to verbose zero-shot answers. Lastly, to mitigate the challenges of evaluating free-form open-ended VQA responses with string-matching-based VQA metrics, we introduce a straightforward LLM-guided pre-processing technique that adapts model responses to the expected ground-truth answer distribution. In summary, our research sheds light on the intricacies of prompting strategies in VLMs for VQA, emphasizing the synergistic use of captions, templates, and pre-processing to enhance model efficacy.
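As a rough illustration of the LLM-guided pre-processing step, the sketch below rewrites a verbose free-form answer into a short, ground-truth-style answer before string-matching evaluation. The instruction wording and the `generate` callable are assumptions for illustration, not the exact prompt or model used in the paper.

```python
def normalize_answer(question: str, free_form_answer: str, generate) -> str:
    """Shorten a verbose VLM answer so string-matching VQA metrics can score it.

    `generate` is any text-generation callable (prompt -> str); the instruction
    wording below is illustrative, not the paper's exact prompt.
    """
    instruction = (
        "Rewrite the answer as a short phrase of at most a few words, "
        "in the style of typical VQA ground-truth answers.\n"
        f"Question: {question}\n"
        f"Answer: {free_form_answer}\n"
        "Short answer:"
    )
    return generate(instruction).strip().lower()


# Example usage with a Hugging Face text2text pipeline (model choice is illustrative):
# from transformers import pipeline
# llm = pipeline("text2text-generation", model="google/flan-t5-large")
# generate = lambda p: llm(p, max_new_tokens=10)[0]["generated_text"]
# normalize_answer("What color is the bus?", "The bus in the image appears to be red.", generate)
```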
We investigate how image captions of different qualities and types, provided as text-based visual cues, affect zero-shot VQA performance in VLMs. We report the baseline and best-setting results here; see the paper for the full set of results.
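For concreteness, here is a hedged sketch of caption-augmented zero-shot VQA with BLIP2 Flan-T5 through Hugging Face transformers: the image is first captioned, then the caption is injected into the question prompt. The prompt wording, generation settings, and example inputs are illustrative, not the exact configuration from the paper; a GPU is assumed.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda"  # a GPU is assumed; drop torch.float16 for CPU-only runs
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16
).to(device)

image = Image.open("example.jpg")  # placeholder path
question = "What is the person holding?"  # placeholder question

# 1) Caption the image (no text prompt, so BLIP-2 generates a caption).
inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()

# 2) Ask the question with the caption prepended as a textual visual cue.
prompt = f"Context: {caption}. Question: {question} Short answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
answer_ids = model.generate(**inputs, max_new_tokens=10)
answer = processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip()
print(caption, "->", answer)
```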
📊 Comparison of zero-shot VQA performance across datasets using different templates in the standard setting. All tested models exhibit sensitivity to template variations, as demonstrated by the varying performance improvements over the baseline ‘Null’ template. 📈
Our current findings build upon our previous work on closing gaps in compositional reasoning by improving the alignment between images and captions; that research shaped the methodology used in this project.
If you found this work useful in your own research, please consider citing the following:
@article{awal2023investigating,
  title={Investigating Prompting Techniques for Zero-and Few-Shot Visual Question Answering},
  author={Awal, Rabiul and Zhang, Le and Agrawal, Aishwarya},
  journal={arXiv preprint arXiv:2306.09996},
  year={2023}
}