Can Finetuned Vision Language Models Generalize to Different Tasks on the Same Image?
Pilot Experiments, Blog, Riverside CA, USA
If we have a very detailed text caption for an image, we can use the caption to generate many, many questions about the image. Should we generate and train on as many such synthetic questions as possible? Or is training on a couple questions good enough?