VisMin: Visual Minimal-Change Understanding

Mila - Quebec AI Institute   University of Montreal
*indicates equal contribution
Accepted to NeurIPS 2024 🎉

Overview of our VisMin benchmark. VisMin consists of four types of minimal changes -- object, attribute, count, and spatial relation -- between two image-caption pairs. The evaluation task requires a model to predict the correct image-caption match given: 1) two images and one caption, 2) two captions and one image.

✨ Highlights ✨

🔍 VisMin: New benchmark for fine-grained visual understanding.

🔄 Pipeline: Automatic generation of visual minimal-change pairs.

📊 Findings: VLMs are good at object/attribute understanding, but struggle with counting/spatial relations.

🔧 Fine-tuning: We fine-tune CLIP and Idefics2 on our data.

📚 Our dataset is a robust training resource for VLMs.

Abstract

Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). To evaluate VLMs' fine-grained understanding, existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, our focus is on evaluating VLMs' capability to distinguish between two very similar images given a caption. To this end, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. Importantly, the image pair (as well as the caption pair) contains minimal changes, i.e., between the two images (as well as between the two captions), only one aspect changes at a time from among the following possible types of changes: object, attribute, count, and spatial relation. These four types of minimal changes are specifically designed to test the models' understanding of objects, attributes of objects (such as color, material, shape), counts of objects, and spatial relationships between objects. To curate our benchmark, we built an automatic pipeline using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. Furthermore, leveraging the automated nature of our data creation process, we generate a large-scale training dataset, which we use to fine-tune CLIP (a foundational VLM) and Idefics2 (a multimodal large language model). Our findings show that both these models benefit significantly from fine-tuning on this data, as evidenced by marked improvements in fine-grained understanding across a wide range of benchmarks. Additionally, such fine-tuning improves CLIP's general image-text alignment capabilities. All resources, including the benchmark, the training data, and the fine-tuned model checkpoints, will be released.

Minimal-Change Image-Text Dataset Creation

Generating captions automatically is feasible, but creating hard-negative examples for complex, COCO-like images is challenging! Our minimal-change pair synthesis framework is fully automated, built on LLMs and diffusion models, and designed to be controllable and scalable.


The pipeline includes three stages: Minimal-Change Pairs Synthesis, where we minimally edit image and text pairs; Automatic Filtering, which verifies the faithfulness of the texts and synthesized images; and Human Verification, a four-step process to ensure that only data meeting all quality criteria is included. We will discuss each stage in detail.
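To make the flow concrete, here is a minimal Python sketch of how one sample moves through the first two stages. The helper names and signatures (propose_edit, apply_edit, passes_filter) are illustrative assumptions, not the actual implementation; only the control flow mirrors the pipeline described above, with human verification applied afterwards to benchmark data.

      # Minimal sketch of the minimal-change synthesis + automatic filtering stages.
      # The concrete models (LLM editor, diffusion inpainter, faithfulness checker)
      # are assumptions and stand in for whatever the pipeline actually uses.
      from dataclasses import dataclass
      from typing import Callable, Optional, Tuple

      @dataclass
      class MinimalChangePair:
          source_image: str       # path to the original COCO image
          source_caption: str
          edited_image: str       # path to the diffusion-edited image
          edited_caption: str
          change_type: str        # "object" | "attribute" | "count" | "relation"

      def synthesize_pair(
          image_path: str,
          caption: str,
          propose_edit: Callable[[str], Tuple[str, str]],    # LLM: caption -> (edited caption, change type)
          apply_edit: Callable[[str, str, str], str],        # diffusion: (image, old cap, new cap) -> edited image
          passes_filter: Callable[[str, str], bool],         # automatic check: is the image faithful to the caption?
      ) -> Optional[MinimalChangePair]:
          """Stage 1 + Stage 2: synthesize one minimal-change pair and auto-filter it."""
          edited_caption, change_type = propose_edit(caption)
          edited_image = apply_edit(image_path, caption, edited_caption)
          # Keep the pair only if both image-caption sides are judged faithful.
          if passes_filter(image_path, caption) and passes_filter(edited_image, edited_caption):
              return MinimalChangePair(image_path, caption, edited_image, edited_caption, change_type)
          return None  # rejected pairs are discarded; surviving benchmark pairs go to human verification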

Training and Benchmark sets

This study aims to enhance fine-grained understanding in VLMs by creating both a training set and a benchmark set. The training data is automatically filtered and scalable, while the benchmark data undergoes rigorous human verification for quality. Training data is sourced from VSR and the COCO 2017 training split, while benchmark data comes from the COCO 2017 validation split, ensuring that benchmark images are unseen during training. The training set contains 64,392 samples and the VisMin benchmark 2,084 samples. We aimed for a balanced benchmark across categories, but attribute samples are relatively few due to limitations in the suggested edits. The paper provides an overview of the types of changes in the benchmark and detailed information on training-set subcategories.

VisMin Benchmark

🚀✨ Supercharge your multimodal LLM training with our 65K instruction pairs. Each VisMin Instruct-IT sample presents two images or two texts, one of which is the correct match for the given instruction. 🖼️📝


      # Query templates for the two VisMin Instruct-IT tasks. The placeholders
      # ({caption}, {candidate_texts[0]}, {candidate_texts[1]}) are filled in
      # per sample with str.format, as shown in the usage sketch below.
      image_task_query_templates = [
          'You are given two images. Which one better aligns with the description: "{caption}"? The first or the second image?',
          'Question: You are given two images. Which one better aligns with the description? "{caption}" Choices: First, Second.',
          'Based on the description: "{caption}", please select one of the given options that best matches the image. Choices: First, Second.',
      ]

      text_task_query_templates = [
          "Question: Does this image depict: (A) {candidate_texts[0]}, or (B) {candidate_texts[1]}? Choices: A, B.",
          "Does this image best represent: (A) {candidate_texts[0]}, or (B) {candidate_texts[1]}?",
          "Which of the following best matches the image: (A) {candidate_texts[0]}, or (B) {candidate_texts[1]}?",
      ]
        
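As a quick usage sketch, a query is built by filling one of the templates above with a sample's fields via str.format. The sample values below are placeholders for illustration, not real VisMin entries.

      import random

      # Placeholder values; in practice these come from a VisMin Instruct-IT record.
      caption = "a red mug to the left of a laptop"
      candidate_texts = ["a red mug to the left of a laptop",
                         "a red mug to the right of a laptop"]

      image_query = random.choice(image_task_query_templates).format(caption=caption)
      text_query = random.choice(text_task_query_templates).format(candidate_texts=candidate_texts)
      print(image_query)
      print(text_query)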

Zero-shot Eval on VisMin Benchmark

Setup. We evaluated a range of state-of-the-art vision-language models (VLMs) and multimodal large language models (MLLMs), including well-known foundational models like CLIP and SigLip, and emerging MLLMs like Idefics2 and GPT-4V. Foundational VLMs were evaluated with image-text matching: choosing the correct image from two candidates given a caption, and the correct caption from two candidates given an image. For MLLMs, we adapted the same tasks into visual-question-answering-style queries that assess alignment between captions and the paired images.
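For contrastive models like CLIP, the Winoground-style scores reported in the table below can be computed from the four similarities between the two captions and two images of each minimal-change example. Here is a small sketch of that scoring (assuming the model's similarity scores are already available; the example numbers are made up for illustration):

      # Winoground-style scoring for one minimal-change example.
      # s[c][i] is the similarity the model assigns to caption c and image i,
      # where index 0 is the original sample and index 1 is the edited one.

      def text_score(s):
          # For each image, the matching caption must score highest.
          return s[0][0] > s[1][0] and s[1][1] > s[0][1]

      def image_score(s):
          # For each caption, the matching image must score highest.
          return s[0][0] > s[0][1] and s[1][1] > s[1][0]

      def group_score(s):
          return text_score(s) and image_score(s)

      # Illustrative scores: captions are distinguished given each image,
      # but the two similar images are confused given the original caption.
      s = [[0.30, 0.31],
           [0.28, 0.33]]
      print(text_score(s), image_score(s), group_score(s))  # True False False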
Performance of foundational VLM variants and MLLMs across categories on the VisMin benchmark. Columns 'T', 'I', and 'G' denote Text, Image, and Group scores as defined in Winoground; 'AVG' denotes the average across columns.

| Model | Obj. T | Obj. I | Obj. G | Attr. T | Attr. I | Attr. G | S. Rel. T | S. Rel. I | S. Rel. G | Count T | Count I | Count G | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Chance | 25 | 25 | 16.67 | 25 | 25 | 16.67 | 25 | 25 | 16.67 | 25 | 25 | 16.67 | 22.22 |
| MTurk Human | 86.87 | 95.50 | 83.07 | 82.31 | 91.15 | 76.87 | 81.67 | 92.76 | 76.20 | 88.96 | 96.77 | 86.41 | 86.54 |
| Foundational VLMs | | | | | | | | | | | | | |
| CLIP (ViT-B/32) | 79.62 | 77.89 | 67.53 | 72.11 | 65.99 | 55.1 | 8.2 | 4.34 | 0.48 | 31.24 | 20.71 | 10.53 | 41.15 |
| CLIP (ViT-B/16) | 86.53 | 79.1 | 71.68 | 70.75 | 65.31 | 52.38 | 8.84 | 3.22 | 0.8 | 34.3 | 22.58 | 13.58 | 42.42 |
| SigLip (ViT-B/16) | 90.5 | 88.95 | 83.25 | 86.05 | 79.25 | 73.13 | 11.58 | 6.43 | 1.77 | 60.95 | 47.03 | 38.37 | 55.61 |
| SigLip (ViT-L/16) | 93.44 | 88.43 | 84.46 | 84.35 | 78.23 | 68.37 | 10.29 | 4.82 | 1.29 | 61.8 | 57.05 | 44.14 | 56.39 |
| BLIP | 92.4 | 92.57 | 87.05 | 88.44 | 86.73 | 78.57 | 11.25 | 4.98 | 2.09 | 52.97 | 46.01 | 33.28 | 56.36 |
| Coca | 84.97 | 81.52 | 73.58 | 78.57 | 66.33 | 57.82 | 11.25 | 5.95 | 1.77 | 60.1 | 35.82 | 28.52 | 48.85 |
| MLLMs | | | | | | | | | | | | | |
| LlaVa1.6 (7B) | 93.0 | 32.8 | 32.2 | 92.2 | 34.4 | 33.3 | 91.8 | 7.8 | 7.4 | 73.6 | 25.0 | 20.2 | 38.28 |
| Idefics2 | 95.4 | 69.4 | 67.6 | 89.1 | 71.4 | 67.0 | 18.6 | 18.8 | 4.8 | 72.2 | 50.6 | 47.0 | 55.99 |
| InternVL1.5 | 94.65 | 40.24 | 39.72 | 91.16 | 42.86 | 41.16 | 74.28 | 14.79 | 11.74 | 73.51 | 31.58 | 27.5 | 48.60 |
| Enterprise APIs | | | | | | | | | | | | | |
| Gemini 1.0 Pro | 94.99 | 79.97 | 78.76 | 91.84 | 74.83 | 72.45 | 52.57 | 15.43 | 9.81 | 67.74 | 44.14 | 37.52 | 49.63 |
| GPT-4o | 95.51 | 96.2 | 93.44 | 92.18 | 90.48 | 87.07 | 89.07 | 50.48 | 46.78 | 77.42 | 78.27 | 68.42 | 73.93 |

How do state-of-the-art VLMs perform on VisMin? Both CLIP-family models and multimodal LLMs (MLLMs) excel at object and attribute understanding but struggle with spatial relationships and counting. Larger CLIP models do not improve spatial understanding. Open-source MLLMs struggle more with the image score (distinguishing similar images given a caption) than the text score (distinguishing similar captions given an image). GPT-4o performs better on spatial tasks than the others, including Gemini 1.0 Pro, but remains far from human performance.

Fine-Tuned CLIP and Idefics2

Can we enhance fine-grained visual understanding by fine-tuning these models on our minimal-change data? Yes! We fine-tune CLIP-family models and Idefics2 (an MLLM) on our minimal-change data and conduct a comprehensive evaluation across both IID and OOD benchmarks.

Setup. We compare fine-tuning CLIP on our visual minimal-change data with recent approaches that also fine-tune CLIP on visual hard negatives: NegCLIP, which uses nearest-neighbor images; CounterCurate, which uses limited minimal-change data; and SPEC, which uses simplistic minimal-change data.
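A natural way to exploit minimal-change data when fine-tuning a CLIP-style model is to add each sample's edited caption and edited image as extra hard negatives in the contrastive loss. The PyTorch sketch below illustrates this general recipe; it is an assumption made for illustration, not necessarily the exact objective used to train VisMin-CLIP.

      # Sketch: CLIP-style contrastive loss with minimal-change hard negatives.
      import torch
      import torch.nn.functional as F

      def contrastive_loss_with_hard_negatives(img, txt, img_edit, txt_edit, temperature=0.07):
          """img, txt, img_edit, txt_edit: L2-normalized embeddings of shape (B, D)."""
          # Image-to-text logits: in-batch captions, plus each sample's minimally
          # edited caption appended as an extra hard-negative column.
          logits_i2t = img @ txt.t() / temperature                               # (B, B)
          hard_i2t = (img * txt_edit).sum(dim=-1, keepdim=True) / temperature    # (B, 1)
          logits_i2t = torch.cat([logits_i2t, hard_i2t], dim=1)                  # (B, B+1)

          # Text-to-image logits, with the minimally edited image as a hard negative.
          logits_t2i = txt @ img.t() / temperature
          hard_t2i = (txt * img_edit).sum(dim=-1, keepdim=True) / temperature
          logits_t2i = torch.cat([logits_t2i, hard_t2i], dim=1)

          targets = torch.arange(img.size(0), device=img.device)
          return 0.5 * (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets))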
Performance comparison of base CLIP and Idefics2 with their VisMin fine-tuned variants on the VisMin benchmark.

| Model | Obj. T | Obj. I | Obj. G | Attr. T | Attr. I | Attr. G | S. Rel. T | S. Rel. I | S. Rel. G | Count T | Count I | Count G | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | 87.56 | 83.59 | 78.07 | 74.49 | 69.73 | 57.82 | 9.16 | 4.66 | 1.45 | 37.01 | 30.56 | 18.17 | 46.02 |
| VisMin-CLIP | 91.54 | 91.19 | 86.36 | 85.03 | 83.67 | 75.85 | 11.9 | 3.38 | 1.29 | 82.34 | 79.97 | 72.33 | 63.74 |
| Idefics2 | 95.4 | 69.4 | 67.6 | 89.1 | 71.4 | 67.0 | 18.6 | 18.8 | 4.8 | 72.2 | 50.6 | 47.0 | 55.99 |
| Idefics2-VisMin | 96.5 | 95.7 | 93.3 | 91.2 | 91.8 | 86.7 | 83.0 | 76.0 | 69.3 | 85.4 | 87.8 | 80.5 | 86.43 |
Evaluation of fine-grained understanding on out-of-distribution (OOD) benchmarks. All models utilize ViT-L-14 as the vision encoder. 'CB' refers to CountBench, 'SG' to SugarCrepe, and 'IC' to ImageCode. VSR, CB, Valse, SG, Whatsup, and SPEC are single-image benchmarks; IC, MMVP, Winoground, and EQBEN are multi-image benchmarks. Our training set shows consistent improvements over the baseline model across OOD benchmarks.

| Model | VSR | CB | Valse | SG | Whatsup | SPEC I2T | SPEC T2I | IC | MMVP | Wino. T | Wino. I | Wino. G | EQBEN T | EQBEN I | EQBEN G |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP (ViT-L/14) | 58.33 | 33.65 | 69.1 | 73.0 | 37.7 | 32.85 | 30.86 | 61.47 | 19.26 | 27.5 | 11.0 | 8.5 | 35.71 | 33.57 | 21.43 |
| NegCLIP | 56.56 | 40.0 | 75.41 | 85.73 | 41.2 | 37.73 | 35.45 | 67.33 | 29.63 | 25.25 | 12.0 | 7.0 | 42.86 | 40.0 | 30.0 |
| CounterCurate-CLIP | 56.74 | 30.79 | 68.47 | 83.66 | 44.29 | 37.99 | 35.24 | 65.81 | 25.19 | 28.0 | 13.25 | 9.0 | 45.0 | 33.57 | 28.57 |
| SPEC-CLIP | 64.54 | 32.06 | 68.75 | 79.34 | 43.35 | 87.04 | 88.08 | 66.25 | 30.37 | 22.5 | 7.75 | 4.75 | 41.43 | 40.0 | 30.71 |
| VisMin-CLIP | 58.69 | 49.84 | 72.24 | 81.44 | 43.99 | 44.28 | 39.98 | 66.81 | 32.59 | 32.75 | 14.75 | 9.75 | 54.29 | 40.71 | 33.57 |
| Idefics2 | 77.3 | 91.11 | 88.91 | 90.45 | 68.04 | 74.37 | 60.5 | 64.4 | 48.15 | 47.25 | 33.75 | 22.5 | 62.88 | 33.33 | 25.76 |
| Idefics2-VisMin | 80.32 | 93.97 | 86.08 | 91.14 | 74.42 | 76.2 | 76.48 | 70.7 | 48.89 | 47.0 | 35.75 | 22.5 | 64.39 | 54.55 | 49.24 |

Observation #1: Enhanced Model Performance through Fine-Tuning. Fine-tuning significantly improves both CLIP-family models and MLLMs on objects, attributes, and counting. VisMin-CLIP leads on 11 of 18 tasks, while the remaining baselines lead on 3, 1, and 3 tasks, respectively.

Observation #2: Challenges in Spatial Relation Understanding. Despite improvements in several categories, CLIP shows limited gains on spatial relations, whereas Idefics2 (an MLLM) improves significantly.

Observation #3: Comparative Benefits. VisMin-CLIP achieves the best performance with the fewest fine-tuning samples (65K), compared to SPEC-CLIP's 637K.

Additional Findings

Scaling with VisMin data

⬆️ Scalability: larger models gain more from our training set. How does fine-tuning affect CLIP's general image-text alignment? It improves CLIP's image-text retrieval. ⬇️ Idefics2 shows a slight drop on standard VL tasks due to overfitting, which can be fixed by mixing in other datasets. 📈

Citation

If you found this work useful in your own research, please consider citing the following:


        @article{vismin2024,
          title={VisMin: Visual Minimal-Change Understanding},
          author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya},
          year={2024}
        }

        @article{zhang2023contrasting,
          title={Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding},
          author={Zhang, Le and Awal, Rabiul and Agrawal, Aishwarya},
          journal={arXiv preprint arXiv:2306.08832},
          year={2023}
        }