
#1 BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs

Authors: Zhantao Yang, Ruili Feng, Keyu Yan, Huangji Wang, Zhicai Wang, Shangwen Zhu, Han Zhang, Jie Xiao, Pingyu Wu, Kai Zhu, Jixuan Chen, Chen-Wei Xie, Yue Yang, Hongyang Zhang, Yu Liu, Fan Cheng

Advancements in large Vision-Language Models have brought precise, accurate image captioning, which is vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues, posing a significant barrier for models like GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions.

To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows for efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information.

We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained an LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V resources. Evaluations of overall quality, precision, and recall, as well as user studies, demonstrate that the resulting caption model consistently outperforms other state-of-the-art VLMs in generating high-quality captions.

Additionally, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51 times higher recall scores on open-vocabulary object detection tasks compared to leading methods.
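To make the idea of a disentangled, structured caption concrete, below is a minimal sketch of what a BACON-style JSON dictionary might look like. The field names and nesting (`theme`, `style`, `objects`, `relationships`) are illustrative assumptions based on the element types named in the abstract, not the paper's actual schema.

```python
# Hypothetical sketch of a BACON-style structured caption.
# Field names and nesting are assumptions for illustration;
# the paper's actual JSON schema may differ.
import json

bacon_caption = {
    "theme": "a quiet morning in a city park",
    "style": "natural-light photograph",
    "objects": [
        {"name": "bench", "attributes": ["wooden", "green"]},
        {"name": "dog", "attributes": ["small", "brown"]},
    ],
    "relationships": [
        {"subject": "dog", "relation": "sitting on", "object": "bench"},
    ],
}

# A downstream model without strong text encoding (e.g., an
# open-vocabulary detector) can read the fields it needs directly,
# such as bacon_caption["objects"], instead of parsing a dense caption.
print(json.dumps(bacon_caption, indent=2))
```

The point of the structure is that each downstream consumer accesses only the keys it can use, so no syntax analysis of a long free-form caption is required.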

Subject: CVPR.2025 - Poster