Can Finetuned Vision Language Models Generalize to Different Tasks on the Same Image?
If we have a very detailed text caption for an image, we can use the caption to generate many, many questions about the image. Should we generate and train on as many such synthetic questions as possible? Or is training on a couple questions good enough?
The vision encoder in Vision Language Models (VLMs) is often pre-aligned with text via the CLIP objective: the vision encoder's final patchwise embeddings are pooled and contrasted against text embeddings from a text encoder (such as a BERT-style model). In my opinion, there are two key shortcomings:
- Pooling loses fine-grained information
- The text caption of an image rarely encodes all the visual details
To me, it's a small miracle that VLMs are very capable of OCR despite these shortcomings in the vision encoder. But on other tasks that require fine-grained visual understanding, such as chart understanding and spatial reasoning, VLMs show very poor out-of-the-box performance.
In this blog post, I wanted to explore whether VLMs effectively generalize to new tasks on the same set of images. If we finetune a VLM to count objects in a scene, does it learn to locate objects? If we finetune it to count objects of a certain color, does it learn to count objects of a certain shape?
Implications:
Let's introduce the notion of "maximally utilizing an image during training". For example, if we have a very detailed text caption for an image: should we generate and train on as many questions derived from it as possible, or is training on one or two questions good enough?
Synthetic Task Design
I generated 1000 images using Blender (script). Each image contains 3 to 8 objects with the following randomized parameters:
{
  "shapes": {
    "cube": "SmoothCube_v2",
    "sphere": "Sphere",
    "cylinder": "SmoothCylinder"
  },
  "colors": {
    "gray": [87, 87, 87],
    "red": [173, 35, 35],
    "blue": [42, 75, 215],
    "green": [29, 105, 20],
    "brown": [129, 74, 25],
    "purple": [129, 38, 192],
    "cyan": [41, 208, 208],
    "yellow": [255, 238, 51]
  },
  "materials": {
    "rubber": "Rubber",
    "metal": "MyMetal"
  },
  "sizes": {
    "large": 0.7,
    "small": 0.35
  }
}
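To make the setup concrete, here is a minimal sketch of how each scene spec might be sampled from these pools. The function name and position ranges are illustrative assumptions; the actual placement and rendering logic lives in the Blender script linked above.

```python
import random

# Attribute pools mirroring the JSON config above.
SHAPES = ["cube", "sphere", "cylinder"]
COLORS = ["gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"]
MATERIALS = ["rubber", "metal"]
SIZES = ["large", "small"]

def sample_scene(min_objects: int = 3, max_objects: int = 8) -> list[dict]:
    """Sample one scene spec: 3 to 8 objects with random attributes.

    The x/y coordinate ranges are placeholders; the real script places
    and renders the objects in Blender.
    """
    n = random.randint(min_objects, max_objects)
    return [
        {
            "shape": random.choice(SHAPES),
            "color": random.choice(COLORS),
            "material": random.choice(MATERIALS),
            "size": random.choice(SIZES),
            "x": random.uniform(-3, 3),  # left (-) to right (+)
            "y": random.uniform(-3, 3),  # closer (-) to farther (+) from camera
        }
        for _ in range(n)
    ]
```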
I designed 8 tasks:
- Extreme Material: What is the {left-most/right-most/farthest back/closest} item made out of?
- Extreme Color: …
- Extreme Shape: …
- Extreme All: Identify the object that is {leftmost/rightmost/farthest back/closest to the camera}. Respond in the format: [size] [color] [material] [shape].
- Count Material: Count the total number of {rubber/metal} objects.
- Count Color: …
- Count Shape: …
- Count Relative to: Count how many objects are {to the left of/to the right of/further back than/closer to the camera than} {given_object}. ("How many objects are to the right of the small green metal sphere?")
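Since every question is derived from the scene spec, the ground-truth answers come for free. Here is a hedged sketch of two of the eight templates, assuming the `sample_scene` format from the sketch above; the function names are hypothetical:

```python
def extreme_material_qa(scene: list[dict]) -> tuple[str, str]:
    """'Extreme Material' template: material of the left-most object."""
    target = min(scene, key=lambda o: o["x"])  # left-most = smallest x
    return "What is the left-most item made out of?", target["material"]

def count_relative_to_qa(scene: list[dict], ref: dict) -> tuple[str, str]:
    """'Count Relative to' template: objects to the right of a reference.

    `ref` must have a unique attribute combination in the scene,
    otherwise the question is ambiguous.
    """
    count = sum(1 for o in scene if o is not ref and o["x"] > ref["x"])
    desc = f"{ref['size']} {ref['color']} {ref['material']} {ref['shape']}"
    return f"How many objects are to the right of the {desc}?", str(count)
```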
I think the relative order of difficulty for humans would be:
Count Relative to > Count Material > Count Color = Count Shape > Extreme All > Extreme Color > Extreme Shape = Extreme Material
Finding objects at spatial extremes is super easy for humans; we can probably spot them at a glance. Counting is a bit harder. "Count Relative to" first requires us to find the given object before we begin counting.
For a VLM, the Extreme All task is likely harder: it must identify all 4 properties and format them correctly, while each of the other tasks only requires generating a single token (a material, color, shape, or count). Perhaps annotating the images with unique IDs overlaid on each object would make it easier for VLMs. I'll leave that for future work.
Data and Model
I generated 1000 images, 800 of which I use for training and 200 for testing. The number of QA pairs generated per image depends on the number of objects. Out of the ~13K QA pairs generated from the 800 training images, I sample 1500 per task (totaling 8*1500 = 12000).
- All-task training set 1: 8*1500 = 12000 examples
- All-task training set 2: 8*188 = 1504 examples
- Task {i} training set: 1500 examples of task {i} (8 such sets)
- Test set 1: ~2800 questions from the 800 training images (in-domain test set)
- Test set 2: ~2800 questions from the 200 held-out images (out-of-domain test set)
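In code, the split construction is simple. Here's a sketch assuming the QA pairs are already grouped by task; the seed and helper name are illustrative:

```python
import random

rng = random.Random(42)  # fixed seed so the splits are reproducible

def build_training_sets(qa_by_task: dict[str, list]) -> tuple[list, list, dict]:
    """qa_by_task maps task name -> QA pairs from the 800 training images."""
    all_tasks_1 = [qa for pool in qa_by_task.values()
                   for qa in rng.sample(pool, 1500)]   # 8*1500 = 12000
    all_tasks_2 = [qa for pool in qa_by_task.values()
                   for qa in rng.sample(pool, 188)]    # 8*188 = 1504
    one_task = {task: rng.sample(pool, 1500)           # 8 single-task sets
                for task, pool in qa_by_task.items()}
    return all_tasks_1, all_tasks_2, one_task
```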
I tested:
- Qwen2-VL-2B-Instruct-bnb-4bit
- Llama-3.2-11B-Vision-Instruct-bnb-4bit
I used the unsloth library and fine-tuned the VLMs for 2 epochs using QLoRA for simplicity. I kept the vision encoder frozen during training.
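For reference, the training setup looks roughly like this. It's a minimal sketch following unsloth's vision finetuning recipe; the hyperparameters and the `train_dataset` variable (images paired with chat-formatted QA) are assumptions, not the exact values I used, and import paths may differ across unsloth versions.

```python
from unsloth import FastVisionModel, is_bfloat16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

# Load the 4-bit quantized base model (QLoRA-style).
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-2B-Instruct-bnb-4bit",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Attach LoRA adapters to the language side only; the vision encoder stays frozen.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16, lora_alpha=16, lora_dropout=0,
)

FastVisionModel.for_training(model)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=train_dataset,  # VQA examples in chat format (assumed)
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=2,
        learning_rate=2e-4,
        bf16=is_bfloat16_supported(),
        fp16=not is_bfloat16_supported(),
        remove_unused_columns=False,  # required for vision inputs
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
        output_dir="outputs",
    ),
)
trainer.train()
```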
Results Summary
Main Result: VLMs do generalize to unseen tasks on the same or similar images.
For example, fine-tuning on the "Extreme Color" task boosts performance on all the other tasks, achieving 58% of the improvement from fine-tuning on all 8 tasks.
Other Results:
- Fine-tuning on all tasks together has the best performance.
- For Llama-3.2-11B, there is no noticeable difference between the in-domain and out-of-domain test sets, suggesting the VLM hasn't memorized the training images.
- Suppose we have the budget to finetune on ~1500 examples. Do we sample 188 examples from each of the 8 tasks, or ~1500 examples from a single task? Results show that fine-tuning on 1500 Extreme Color examples is about as good as fine-tuning on 1504 balanced examples.
- Qwen2-VL-2B fails completely. Maybe it lacks the base capabilities for spatial reasoning, material perception, and counting.
Detailed Results
In-domain Test Set
Baseline performance and finetuning on all tasks
Each cell shows accuracy
Finetuning on one specific task and testing on others
Each cell shows the fraction of full fine-tuning achieved: (oneTaskFT - noFT) / (allTaskFT - noFT) * 100
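In code, the normalization is just the following; the example values are rounded from the Llama-3.2-11B in-domain table further down.

```python
def transfer_score(one_task_ft: float, no_ft: float, all_task_ft: float) -> float:
    """Percent of the full-finetuning gain recovered by single-task finetuning."""
    return (one_task_ft - no_ft) / (all_task_ft - no_ft) * 100

# Finetuning on Extreme Color alone recovers ~63% of the full-finetuning
# gain on Extreme Material (0.5934 vs. baseline 0.0687 and full 0.9011).
print(transfer_score(0.5934, 0.0687, 0.9011))  # ≈ 63.0
```

Note that the score can be noisy, or even negative, when all-task finetuning barely improves on (or falls below) the no-finetuning baseline, as happens in several Qwen2-VL-2B rows.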
Out-of-Domain Test Set
Baseline performance and finetuning on all tasks
Each cell shows the accuracy
Finetuning on one specific task and testing on others
Each cell shows the fraction of full fine-tuning achieved: (oneTaskFT - noFT) / (allTaskFT - noFT) * 100
Highest Cross Task Transfer
Out-of-Domain Test Set
Yellow highlights the test task that benefitted the most from the train task (other than itself)
Full Results
Qwen2-VL-2B-Instruct-bnb-4bit
In-Domain Test Set (Same Images)
Task | # | No Finetuning | Finetuned (All tasks 1500*8) | Finetuned (All tasks 188*8) | Finetuned (Extreme Material) | Finetuned (Extreme Color) | Finetuned (Extreme Shape) | Finetuned (Extreme All) | Finetuned (Count Material) | Finetuned (Count Color) | Finetuned (Count Shape) | Finetuned (Count Relative To) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Extreme Material | 364 | 0.5137362637362637 | 0.47527472527472525 | 0.5357142857142857 | 0.4807692307692308 | 0.5576923076923077 | 0.5274725274725275 | 0.41208791208791207 | 0.554945054945055 | 0.510989010989011 | 0.5054945054945055 | 0.554945054945055 |
Extreme Color | 420 | 0.11190476190476191 | 0.1523809523809524 | 0.15 | 0.1380952380952381 | 0.15476190476190477 | 0.14047619047619048 | 0.11666666666666667 | 0.1738095238095238 | 0.14523809523809525 | 0.1619047619047619 | 0.14285714285714285 |
Extreme Shape | 378 | 0.31746031746031744 | 0.3412698412698413 | 0.30423280423280424 | 0.328042328042328 | 0.30952380952380953 | 0.31216931216931215 | 0.29894179894179895 | 0.3439153439153439 | 0.36243386243386244 | 0.34656084656084657 | 0.3862433862433862 |
Extreme All | 412 | 0.014563106796116505 | 0.009708737864077669 | 0.01699029126213592 | 0.11407766990291263 | 0.009708737864077669 | 0.0048543689320388345 | 0.0048543689320388345 | 0.007281553398058253 | 0.012135922330097087 | 0.009708737864077669 | 0.012135922330097087 |
Count Material | 184 | 0.2391304347826087 | 0.25 | 0.24456521739130435 | 0.010869565217391304 | 0.24456521739130435 | 0.25 | 0.25 | 0.2554347826086957 | 0.21739130434782608 | 0.24456521739130435 | 0.1956521739130435 |
Count Color | 386 | 0.27202072538860106 | 0.07512953367875648 | 0.07512953367875648 | 0.12435233160621761 | 0.04404145077720207 | 0.11658031088082901 | 0.38341968911917096 | 0.054404145077720206 | 0.5103626943005182 | 0.33678756476683935 | 0.0 |
Count Shape | 261 | 0.2835249042145594 | 0.22988505747126436 | 0.22988505747126436 | 0.23371647509578544 | 0.22988505747126436 | 0.26053639846743293 | 0.2950191570881226 | 0.22988505747126436 | 0.3333333333333333 | 0.3103448275862069 | 0.0038314176245210726 |
Count Relative To | 395 | 0.09367088607594937 | 0.16455696202531644 | 0.16455696202531644 | 0.015189873417721518 | 0.1670886075949367 | 0.1949367088607595 | 0.21012658227848102 | 0.17215189873417722 | 0.16455696202531644 | 0.15443037974683543 | 0.22025316455696203 |
Out-of-Domain Test Set (Different Images)
Task | # | No Finetuning | Finetuned (All tasks 1500*8) | Finetuned (All tasks 188*8) | Finetuned (Extreme Material) | Finetuned (Extreme Color) | Finetuned (Extreme Shape) | Finetuned (Extreme All) | Finetuned (Count Material) | Finetuned (Count Color) | Finetuned (Count Shape) | Finetuned (Count Relative To) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Extreme Material | 400 | 0.53 | 0.505 | 0.5225 | 0.5025 | 0.4825 | 0.4975 | 0.4475 | 0.515 | 0.5 | 0.4975 | 0.495 |
Extreme Color | 400 | 0.0825 | 0.1525 | 0.1275 | 0.1425 | 0.165 | 0.1525 | 0.065 | 0.1725 | 0.15 | 0.1575 | 0.15 |
Extreme Shape | 400 | 0.335 | 0.35 | 0.34 | 0.3525 | 0.3575 | 0.3375 | 0.31 | 0.3275 | 0.3225 | 0.345 | 0.3075 |
Extreme All | 400 | 0.015 | 0.0175 | 0.0075 | 0.11 | 0.0125 | 0.005 | 0.0075 | 0.02 | 0.025 | 0.0.01 | 0.015 |
Count Material | 190 | 0.24210526315789474 | 0.22631578947368422 | 0.22105263157894736 | 0.042105263157894736 | 0.22631578947368422 | 0.22631578947368422 | 0.24210526315789474 | 0.21052631578947367 | 0.23684210526315788 | 0.2578947368421053 | 0.1736842105263158 |
Count Color | 423 | 0.13947990543735225 | 0.2458628841607565 | 0.2458628841607565 | 0.11583924349881797 | 0.18439716312056736 | 0.13711583924349882 | 0.20803782505910165 | 0.09929078014184398 | 0.5791962174940898 | 0.34278959810874704 | 0.06619385342789598 |
Count Shape | 254 | 0.2440944881889764 | 0.2283464566929134 | 0.2283464566929134 | 0.2283464566929134 | 0.1968503937007874 | 0.21653543307086615 | 0.18503937007874016 | 0.22440944881889763 | 0.28346456692913385 | 0.33070866141732286 | 0.051181102362204724 |
Count Relative To | 400 | 0.035 | 0.1625 | 0.1725 | 0.0925 | 0.1375 | 0.175 | 0.0625 | 0.1475 | 0.2175 | 0.2075 | 0.21 |
Llama-3.2-11B-Vision-Instruct-bnb-4bit
In-Domain Test Set (Same Images)
Task | # | No Finetuning | Finetuned (All tasks 1500*8) | Finetuned (All tasks 188*8) | Finetuned (Extreme Material) | Finetuned (Extreme Color) | Finetuned (Extreme Shape) | Finetuned (Extreme All) | Finetuned (Count Material) | Finetuned (Count Color) | Finetuned (Count Shape) | Finetuned (Count Relative To) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Extreme Material | 364 | 0.06868131868131869 | 0.9010989010989011 | 0.6126373626373627 | 0.760989010989011 | 0.5934065934065934 | 0.5714285714285714 | 0.046703296703296704 | 0.510989010989011 | 0.4423076923076923 | 0.2774725274725275 | 0.5274725274725275 |
Extreme Color | 420 | 0.04285714285714286 | 0.9404761904761905 | 0.8071428571428572 | 0.10476190476190476 | 0.8357142857142857 | 0.5880952380952381 | 0.1 | 0.37142857142857144 | 0.4880952380952381 | 0.41904761904761906 | 0.45714285714285713 |
Extreme Shape | 378 | 0.2037037037037037 | 0.91005291005291 | 0.8174603174603174 | 0.6746031746031746 | 0.8042328042328042 | 0.8835978835978836 | 0.013227513227513227 | 0.5873015873015873 | 0.6798941798941799 | 0.6322751322751323 | 0.42328042328042326 |
Extreme All | 412 | 0.02912621359223301 | 0.8470873786407767 | 0.4684466019417476 | 0.0048543689320388345 | 0.09223300970873786 | 0.019417475728155338 | 0.5922330097087378 | 0.03398058252427184 | 0.07766990291262135 | 0.0412621359223301 | 0.18446601941747573 |
Count Material | 184 | 0.10869565217391304 | 0.532608695652174 | 0.3804347826086957 | 0.358695652173913 | 0.28804347826086957 | 0.07608695652173914 | 0.0 | 0.45108695652173914 | 0.266304347826087 | 0.266304347826087 | 0.2608695652173913 |
Count Color | 386 | 0.266839378238342 | 0.9145077720207254 | 0.7616580310880829 | 0.6295336787564767 | 0.7590673575129534 | 0.6528497409326425 | 0.0025906735751295338 | 0.5155440414507773 | 0.8523316062176166 | 0.7694300518134715 | 0.2772020725388601 |
Count Shape | 261 | 0.26053639846743293 | 0.7279693486590039 | 0.5172413793103449 | 0.4099616858237548 | 0.46360153256704983 | 0.25287356321839083 | 0.007662835249042145 | 0.3448275862068966 | 0.5517241379310345 | 0.6398467432950191 | 0.26053639846743293 |
Count Relative To | 395 | 0.12658227848101267 | 0.46329113924050636 | 0.21518987341772153 | 0.22784810126582278 | 0.22278481012658227 | 0.12658227848101267 | 0.0 | 0.22784810126582278 | 0.20506329113924052 | 0.18227848101265823 | 0.18734177215189873 |
Out-of-Domain Test Set (Different Images)
Task | # | No Finetuning | Finetuned (All tasks 1500*8) | Finetuned (All tasks 188*8) | Finetuned (Extreme Material) | Finetuned (Extreme Color) | Finetuned (Extreme Shape) | Finetuned (Extreme All) | Finetuned (Count Material) | Finetuned (Count Color) | Finetuned (Count Shape) | Finetuned (Count Relative To) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Extreme Material | 364 | 0.06 | 0.915 | 0.6225 | 0.8275 | 0.6075 | 0.6525 | 0.07 | 0.5425 | 0.5075 | 0.3675 | 0.5175 |
Extreme Color | 420 | 0.04 | 0.9075 | 0.8025 | 0.0975 | 0.8725 | 0.59 | 0.06 | 0.385 | 0.535 | 0.4675 | 0.4825 |
Extreme Shape | 378 | 0.195 | 0.9025 | 0.835 | 0.6875 | 0.82 | 0.91 | 0.015 | 0.665 | 0.68 | 0.6625 | 0.415 |
Extreme All | 412 | 0.035 | 0.8125 | 0.4375 | 0.005 | 0.0925 | 0.0175 | 0.56 | 0.0475 | 0.075 | 0.025 | 0.2 |
Count Material | 184 | 0.08421052631578947 | 0.45263157894736844 | 0.3473684210526316 | 0.2789473684210526 | 0.23157894736842105 | 0.11052631578947368 | 0.0 | 0.4368421052631579 | 0.2631578947368421 | 0.2789473684210526 | 0.22105263157894736 |
Count Color | 386 | 0.22695035460992907 | 0.9243498817966903 | 0.8297872340425532 | 0.6335697399527187 | 0.7825059101654847 | 0.7092198581560284 | 0.002364066193853428 | 0.4846335697399527 | 0.91725768321513 | 0.8297872340425532 | 0.2907801418439716 |
Count Shape | 261 | 0.2795275590551181 | 0.7559055118110236 | 0.5 | 0.44881889763779526 | 0.5039370078740157 | 0.24015748031496062 | 0.007874015748031496 | 0.3543307086614173 | 0.6377952755905512 | 0.6220472440944882 | 0.2795275590551181 |
Count Relative To | 395 | 0.1425 | 0.4275 | 0.2325 | 0.1925 | 0.15 | 0.17 | 0.0 | 0.205 | 0.22 | 0.2175 | 0.195 |