Dec 18 2023

We introduce a language-grounded visual prompting method to adapt the visual
encoder of vision-language models for downstream tasks. By capitalizing on
language integration, we devise a parameter-efficient strategy to adjust the
input of the visual encoder, eliminating the need to modify or add to the
model's parameters. Because only the input is modified, our algorithm can
operate even in black-box scenarios, where access to the model's parameters
is restricted. We empirically demonstrate that,
compared to prior art, grounding visual prompts with language enhances both the
accuracy and speed of adaptation. Moreover, our algorithm excels in
base-to-novel class generalization, overcoming limitations of visual prompting
and exhibiting the capacity to generalize beyond seen classes. We evaluate
our method on a variety of image recognition datasets, such as EuroSAT,
UCF101, DTD, and CLEVR, across several learning settings, including few-shot
learning, base-to-novel class generalization, and transfer learning.
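
To make the idea of adapting a model purely through its input concrete, the
following is a minimal sketch of input-space visual prompting in Python with
NumPy. It is not the method described above: the abstract does not specify
the form of the prompt or how language grounds it, so the border-shaped
additive prompt, the 224x224 image size, the padding width of 30, and the
function names make_border_mask and apply_visual_prompt are all assumptions
chosen for illustration. The sketch only shows that the trainable quantity
lives in pixel space while the encoder stays untouched.

```python
import numpy as np


def make_border_mask(height, width, pad):
    """Return a (height, width, 1) mask: 1 on a border of width `pad`, 0 inside."""
    mask = np.ones((height, width, 1), dtype=np.float32)
    mask[pad:height - pad, pad:width - pad, :] = 0.0
    return mask


def apply_visual_prompt(image, prompt, mask):
    """Add the prompt to the image where the mask is 1, then clip to [0, 1]."""
    return np.clip(image + mask * prompt, 0.0, 1.0)


# Hypothetical sizes, assumed for illustration only.
H, W, C, PAD = 224, 224, 3, 30

rng = np.random.default_rng(0)
image = rng.random((H, W, C), dtype=np.float32)  # stand-in for a real image in [0, 1)

# The prompt is the only tensor that would be trained; the encoder's
# weights are never read or changed, which is what allows black-box use.
prompt = np.zeros((H, W, C), dtype=np.float32)
mask = make_border_mask(H, W, PAD)

prompted = apply_visual_prompt(image, prompt, mask)

# Number of prompt values that actually influence the encoder input.
num_prompt_params = int(mask.sum()) * C
```

In this sketch the prompted image would be passed to the frozen visual
encoder in place of the original one, and only the border values of the
prompt would be updated during adaptation.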