2404.01911

VLRM: Vision-Language Models act as Reward Models for Image Captioning

Authors: Maksim Dzabraev, Alexander Kunitsyn, Andrei Ivaniuta

In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning, with vision-language models such as CLIP and BLIP2-ITM serving as reward models. The RL-tuned model is able to generate longer and more comprehensive descriptions. Our model reaches a 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy Test Split. Weights are available at https://huggingface.co/sashakunitsyn/vlrm-blip2-opt-2.7b.
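For readers unfamiliar with the metric, the following is a minimal sketch of how an R@1 retrieval recall can be computed from embeddings. The abstract does not define the metric, so this assumes caption-to-image retrieval with cosine similarity, where caption i is correct if its nearest image is image i. The function names and the toy 2-D vectors are hypothetical stand-ins for CLIP text and image embeddings, not the authors' evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recall_at_1(caption_embs, image_embs):
    """Fraction of captions whose most similar image has the same index.

    Assumption: caption_embs[i] describes image_embs[i].
    """
    hits = 0
    for i, cap in enumerate(caption_embs):
        sims = [cosine(cap, img) for img in image_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(caption_embs)


# Toy stand-ins for CLIP embeddings (hypothetical values).
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
caption_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

# The third caption is closer to the first image than to its own,
# so two of three captions retrieve the right image.
score = recall_at_1(caption_embs, image_embs)
print(f"R@1 = {score:.2f}")
```

Under this reading, a more detailed caption tends to score higher because it distinguishes its image from similar ones in the test set.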

Subject: Computer Vision and Pattern Recognition

Publish: 2024-04-02 12:57:22 UTC