VisMin: Visual Minimal-Change Understanding

Mila - Quebec AI Institute   University of Montreal
*indicates equal contribution

Overview of our VisMin benchmark. VisMin consists of four types of minimal changes -- object, attribute, count, and spatial relation -- between two image-caption pairs. The evaluation task requires a model to predict the correct image-caption match given: 1) two images and one caption, and 2) two captions and one image.

Abstract

Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). To evaluate VLMs' fine-grained understanding, existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, our focus is on evaluating VLMs' capability to distinguish between two very similar images given a caption. To this end, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. Importantly, the image pair (as well as the caption pair) contains minimal changes, i.e., between the two images (as well as between the two captions), only one aspect changes at a time from among the following possible types of changes: object, attribute, count, and spatial relation. These four types of minimal changes are specifically designed to test the models' understanding of objects, attributes of objects (such as color, material, shape), counts of objects, and spatial relationships between objects. To curate our benchmark, we built an automatic pipeline using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. Furthermore, leveraging the automated nature of our data creation process, we generate a large-scale training dataset, which we use to finetune CLIP (a foundational VLM) and Idefics2 (a multimodal large language model). Our findings show that both these models benefit significantly from fine-tuning on this data, as evidenced by marked improvements in fine-grained understanding across a wide range of benchmarks. Additionally, such fine-tuning improves CLIP's general image-text alignment capabilities too. All resources including the benchmark, the training data, and the finetuned model checkpoints will be released.

Minimal-Change Image-Text Dataset Creation


Our dataset creation pipeline includes three stages:

  1. Image Synthesis: We develop methods for synthesizing images involving Objects & Attributes and Counting & Spatial Relations.
  2. Automatic Filtering: An LLM generates questions and answers from the captions, and a VQA model predicts answers from the images; examples whose predicted answers do not match are excluded (a minimal sketch of this check follows the list).
  3. Human Verification: The synthetic data undergoes a rigorous 4-step human verification, and only examples passing all stages are included in the benchmark.
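
In code terms, the automatic filter is a caption-image consistency check. The sketch below is illustrative only: `generate_qa_pairs` and `answer_vqa` are hypothetical stand-ins for the LLM and the VQA model, not functions from our released pipeline.

```python
from typing import List, Tuple

def generate_qa_pairs(caption: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: prompt an LLM to turn a caption into
    (question, expected_answer) pairs, e.g. ("What color is the cup?", "red")."""
    raise NotImplementedError

def answer_vqa(image_path: str, question: str) -> str:
    """Hypothetical helper: ask a VQA model to answer a question about the image."""
    raise NotImplementedError

def passes_auto_filter(image_path: str, caption: str) -> bool:
    """Keep a synthetic image only if the VQA answers agree with the answers
    implied by its (edited) caption for every generated question."""
    for question, expected in generate_qa_pairs(caption):
        predicted = answer_vqa(image_path, question)
        if predicted.strip().lower() != expected.strip().lower():
            return False  # any mismatch -> discard the example
    return True
```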

Training and Benchmark sets

This work aims to enhance fine-grained understanding in VLMs by creating both a training set and a benchmark. The training data is automatically filtered and therefore scalable, while the benchmark data undergoes rigorous human verification for quality. The training data is sourced from VSR and the COCO 2017 training split, while the benchmark data comes from the COCO 2017 validation split, ensuring that benchmark images are unseen during training. The training set contains 64,392 samples and the VisMin benchmark contains 2,084 samples. We aimed for a balanced benchmark across categories, but attribute samples are relatively few due to limitations in the suggested edits. The paper provides an overview of the types of changes in the benchmark and detailed statistics on the training-set subcategories. An illustrative record layout is sketched below.
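
For concreteness, each minimal-change example can be viewed as a pair of image-caption pairs plus a change-type label. The layout below is a hypothetical, illustrative record; the field names and paths are placeholders, not the released file format.

```python
# Illustrative record for one minimal-change example (hypothetical schema).
example = {
    "change_type": "attribute",  # one of: object, attribute, count, spatial relation
    "source": {
        "image": "coco/val2017/<image_id>.jpg",             # placeholder path
        "caption": "A red cup on a wooden table.",
    },
    "edited": {
        "image": "vismin/edited/<image_id>_attribute.jpg",  # placeholder path
        "caption": "A blue cup on a wooden table.",
    },
}
```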


Zero-shot Performance on the VisMin Benchmark

Setup. We evaluated a range of state-of-the-art vision-language models (VLMs) and multimodal large language models (MLLMs), including well-known foundational models like CLIP and SigLIP and emerging MLLMs like Idefics2 and GPT-4V. The evaluation covers image-text matching tasks in which a model must pick the correct caption for an image from two candidate captions, or the correct image for a caption from two candidate images; for MLLMs, these tasks are adapted to a visual-question-answering format that assesses alignment between captions and the paired images.
Performance of foundational VLMs and MLLMs across categories on the VisMin benchmark. Columns 'T', 'I', and 'G' denote the Text, Image, and Group scores from Winoground; 'AVG' denotes the average across columns (a sketch of this scoring follows the table). Due to slow inference and time constraints, closed-source models were assessed on 400 random samples for each of the object, spatial-relation, and count categories, and on all 294 attribute samples.
| Model | Object T | Object I | Object G | Attribute T | Attribute I | Attribute G | S. Relation T | S. Relation I | S. Relation G | Count T | Count I | Count G | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Chance | 25 | 25 | 16.67 | 25 | 25 | 16.67 | 25 | 25 | 16.67 | 25 | 25 | 16.67 | 22.22 |
| MTurk Human | 86.87 | 95.50 | 83.07 | 82.31 | 91.15 | 76.87 | 81.67 | 92.76 | 76.20 | 88.96 | 96.77 | 86.41 | 86.54 |
| Foundational VLMs | | | | | | | | | | | | | |
| CLIP (ViT-B/32) | 79.62 | 77.89 | 67.53 | 72.11 | 65.99 | 55.1 | 8.2 | 4.34 | 0.48 | 31.24 | 20.71 | 10.53 | 41.15 |
| CLIP (ViT-B/16) | 86.53 | 79.1 | 71.68 | 70.75 | 65.31 | 52.38 | 8.84 | 3.22 | 0.8 | 34.3 | 22.58 | 13.58 | 42.42 |
| SigLIP (ViT-B/16) | 90.5 | 88.95 | 83.25 | 86.05 | 79.25 | 73.13 | 11.58 | 6.43 | 1.77 | 60.95 | 47.03 | 38.37 | 55.61 |
| SigLIP (ViT-L/16) | 93.44 | 88.43 | 84.46 | 84.35 | 78.23 | 68.37 | 10.29 | 4.82 | 1.29 | 61.8 | 57.05 | 44.14 | 56.39 |
| BLIP | 92.4 | 92.57 | 87.05 | 88.44 | 86.73 | 78.57 | 11.25 | 4.98 | 2.09 | 52.97 | 46.01 | 33.28 | 56.36 |
| CoCa | 84.97 | 81.52 | 73.58 | 78.57 | 66.33 | 57.82 | 11.25 | 5.95 | 1.77 | 60.1 | 35.82 | 28.52 | 48.85 |
| MLLMs | | | | | | | | | | | | | |
| LLaVA-1.6 (7B) | 93.0 | 32.8 | 32.2 | 92.2 | 34.4 | 33.3 | 91.8 | 7.8 | 7.4 | 73.6 | 25.0 | 20.2 | 38.28 |
| Idefics2 | 95.4 | 69.4 | 67.6 | 89.1 | 71.4 | 67.0 | 18.6 | 18.8 | 4.8 | 72.2 | 50.6 | 47.0 | 55.99 |
| InternVL1.5 | 94.65 | 40.24 | 39.72 | 91.16 | 42.86 | 41.16 | 74.28 | 14.79 | 11.74 | 73.51 | 31.58 | 27.5 | 48.60 |
| Enterprise APIs | | | | | | | | | | | | | |
| Gemini 1.0 Pro | 94.99 | 79.97 | 78.76 | 91.84 | 74.83 | 72.45 | 52.57 | 15.43 | 9.81 | 67.74 | 44.14 | 37.52 | 49.63 |
| GPT-4o | 95.51 | 96.2 | 93.44 | 92.18 | 90.48 | 87.07 | 89.07 | 50.48 | 46.78 | 77.42 | 78.27 | 68.42 | 73.93 |
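
For contrastive models such as CLIP, the Text (T), Image (I), and Group (G) scores follow the Winoground protocol and can be computed from the 2x2 matrix of image-caption similarities of each minimal-change example; MLLMs are instead queried in the VQA-style format described above. The sketch below restates that standard scoring for reference and is not our exact evaluation code.

```python
def winoground_scores(sim):
    """sim[i][j] = similarity between image i and caption j for one
    minimal-change example, where (image_0, caption_0) and (image_1, caption_1)
    are the matching pairs. Returns per-example (text, image, group) scores in
    {0, 1}; averaging over the benchmark gives the reported T/I/G numbers."""
    # Text score: for each image, the matching caption must score higher.
    text_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Image score: for each caption, the matching image must score higher.
    image_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Group score: both conditions must hold for the same example.
    group_ok = text_ok and image_ok
    return int(text_ok), int(image_ok), int(group_ok)

# Usage with made-up similarities (rows = images, columns = captions):
print(winoground_scores([[0.31, 0.27],
                         [0.22, 0.29]]))  # -> (1, 1, 1)
```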

Our analysis reveals that models generally score higher on the text task than on the image task, indicating that matching a caption to the correct minimally-changed image remains the harder direction. In object and attribute recognition, models demonstrate strong capabilities, with foundational VLMs often outperforming MLLMs on the image-based tasks. However, both families of models struggle substantially with spatial reasoning, highlighting a crucial area for future work. Notably, human participants outperform all models, especially on complex scene comprehension, underscoring the remaining room for improvement.

Performance of Fine-Tuned CLIP and Idefics2

Setup. We fine-tuned a pretrained CLIP on our synthetic data, naming it Visual Minimal Enhanced CLIP (VisMin-CLIP), and compared its performance with NegCLIP, which trains the same CLIP using nearest-neighbor images and their associated captions as hard negatives and generates additional false captions through random word swaps. Both models use a ViT-L/14 backbone with the original CLIP loss and are initialized from OpenAI checkpoints.
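
Since both models keep the original CLIP objective, the training signal is the standard symmetric InfoNCE loss over the in-batch image-text similarity matrix; under the assumption that minimal-change counterparts are placed in the same batch, they act as hard in-batch negatives. The PyTorch sketch below restates that textbook loss for reference and is not the exact training script.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """Standard symmetric CLIP (InfoNCE) loss over a batch.
    image_embeds, text_embeds: (N, D) tensors where matched pairs share an index.
    Minimal-change counterparts in the same batch become very hard negatives."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = logit_scale * image_embeds @ text_embeds.t()  # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```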
Performance comparison of the VisMin variants of base CLIP and Idefics2 across categories, highlighting the effectiveness of our training set in improving fine-grained understanding in VLMs.
| Model | Object T | Object I | Object G | Attribute T | Attribute I | Attribute G | S. Relation T | S. Relation I | S. Relation G | Count T | Count I | Count G | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | 87.56 | 83.59 | 78.07 | 74.49 | 69.73 | 57.82 | 9.16 | 4.66 | 1.45 | 37.01 | 30.56 | 18.17 | 46.02 |
| VisMin-CLIP | 91.54 | 91.19 | 86.36 | 85.03 | 83.67 | 75.85 | 11.9 | 3.38 | 1.29 | 82.34 | 79.97 | 72.33 | 63.74 |
| Idefics2 | 95.4 | 69.4 | 67.6 | 89.1 | 71.4 | 67.0 | 18.6 | 18.8 | 4.8 | 72.2 | 50.6 | 47.0 | 55.99 |
| Idefics2-VisMin | 96.5 | 95.7 | 93.3 | 91.2 | 91.8 | 86.7 | 83.0 | 76.0 | 69.3 | 85.4 | 87.8 | 80.5 | 86.43 |
Evaluation of fine-grained understanding on out-of-distribution (OOD) benchmarks. All models use ViT-L/14 as the vision encoder. 'CB' refers to CountBench, 'SG' to SugarCrepe, and 'IC' to ImageCoDe; VSR through SPEC are single-image benchmarks, while IC through EQBEN are multi-image benchmarks. Our training set shows consistent improvements over the baseline model across OOD benchmarks, and models trained with minimally-changed images outperform NegCLIP, which uses nearest-neighbor images.
| Model | VSR | CB | Valse | SG | Whatsup | SPEC I2T | SPEC T2I | IC | MMVP | Winoground T | Winoground I | Winoground G | EQBEN T | EQBEN I | EQBEN G | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP (ViT-L/14) | 58.33 | 33.65 | 69.1 | 73.0 | 37.7 | 32.85 | 30.86 | 61.47 | 19.26 | 27.5 | 11.0 | 8.5 | 35.71 | 33.57 | 21.43 | 42.01 |
| NegCLIP | 56.56 | 40.0 | 75.41 | 85.73 | 41.2 | 37.73 | 35.45 | 67.33 | 29.63 | 25.25 | 12.0 | 7.0 | 42.86 | 40.0 | 30.0 | 47.4 |
| VisMin-CLIP | 58.69 | 49.84 | 72.24 | 81.44 | 43.99 | 44.28 | 39.98 | 66.81 | 32.59 | 32.75 | 14.75 | 9.75 | 54.29 | 40.71 | 33.57 | 50.16 |
| Idefics2 | 77.3 | 91.11 | 88.91 | 90.45 | 68.04 | 74.37 | 60.5 | 64.4 | 48.15 | 47.25 | 33.75 | 22.5 | 62.88 | 33.33 | 25.76 | 67.13 |
| Idefics2-VisMin | 80.32 | 93.97 | 86.08 | 91.14 | 74.42 | 76.2 | 76.48 | 70.7 | 48.89 | 47.0 | 35.75 | 22.5 | 64.39 | 54.55 | 49.24 | 71.77 |

Observation #1: Enhanced Model Performance through Fine-Tuning. Fine-tuning CLIP on our minimal-change synthetic data (VisMin-CLIP) significantly improves performance in the Object, Attribute, and Count categories, demonstrating that tailored, fine-grained synthetic data can boost a model's understanding in specific visual recognition tasks.

Observation #2: Challenges in Spatial Relation Understanding. Despite improvements in several categories, VisMin-CLIP exhibits a slight degradation in the Spatial Relations category. This indicates inherent difficulties that the CLIP model faces in mastering spatial relationships, suggesting a potential area for further research to enhance spatial reasoning in visual language models.

Observation #3: Comparative Benefits of Data Augmentation. The comparison between VisMin-CLIP and NegCLIP highlights the advantages of specific data augmentation strategies. While NegCLIP, which uses hard negatives and linguistic variations, shows benefits in single-image text retrieval tasks, VisMin-CLIP's focused synthetic data approach yields more substantial improvements in tasks requiring detailed understanding of counting and spatial relationships, underlining the importance of tailored augmentation methods for complex cognitive tasks.

Additional Findings

Scaling with VisMin data

Further experiments, summarized in the figures above, reveal two findings. Scalability: larger CLIP backbones benefit more from our synthetic data, especially on complex tasks such as understanding minimal changes; for example, the improvement on single-image benchmarks grows from 2.37 to 6.88 points when moving from ViT-B/32 to ViT-L/14. Enhanced Original Capabilities: training on our data also improves performance on standard retrieval tasks, suggesting that minimal-change training yields better general image-text alignment and that our data is applicable across various cross-modal tasks.

Conclusions and Limitations

Introduced Challenging Benchmark: We introduced VisMin, a benchmark for evaluating fine-grained visual understanding in VLMs such as CLIP, SigLIP, LLaVA, and Idefics2. While these models excel at object recognition, they struggle with counting and spatial relationships.

Addressing Challenges via Fine-tuning: To address these challenges, we finetuned CLIP and Idefics2 on our minimal-change dataset, which led to significant improvements in object, attribute, and counting understanding. However, spatial relations remain a challenge, with models like GPT-4V performing barely above random chance.

Training Data for Building VLMs: Despite some noise in the training data due to limitations of current diffusion models, our dataset shows potential as a robust training resource for VLMs.

Limitations: Our experiments used uniform, simple prompts for consistent evaluation, which may have influenced different models' performance to different degrees.

Citation

If you found this work useful in your own research, please consider citing the following:


        @article{vismin2024,
          title={VisMin: Visual Minimal-Change Understanding},
          author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya},
          year={2024}
        }

        @article{zhang2023contrasting,
          title={Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding},
          author={Zhang, Le and Awal, Rabiul and Agrawal, Aishwarya},
          journal={arXiv preprint arXiv:2306.08832},
          year={2023}
        }