Taming Large Multimodal Agents for Ultra-low Bitrate Semantically Disentangled Image Compression

Authors: Juan Song, Lijie Yang, Mingtao Feng

It remains a significant challenge to compress images at ultra-low bitrates while achieving both semantic consistency and high perceptual quality. We propose a novel image compression framework, Semantically Disentangled Image Compression (SEDIC). SEDIC leverages large multimodal models (LMMs) to disentangle the image into several essential semantic components: an extremely compressed reference image, overall and object-level text descriptions, and semantic masks. A multi-stage semantic decoder is designed to progressively restore the transmitted reference image object by object, ultimately producing high-quality and perceptually consistent reconstructions. In each decoding stage, a pre-trained controllable diffusion model restores the object details on the reference image, conditioned on the text descriptions and semantic masks. Experimental results demonstrate that SEDIC significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at ultra-low bitrates ($\le$ 0.05 bpp). Our code is available at https://github.com/yang-xidian/SEDIC.
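To give a sense of what the 0.05 bpp ceiling means in practice, the following minimal sketch converts between a bits-per-pixel target and a byte budget for the transmitted components. It is an illustration only, not the authors' code: the function names, the 768x512 image size, and the per-component byte counts are all hypothetical values chosen for the arithmetic.

```python
import math


def bits_per_pixel(payload_bytes: int, width: int, height: int) -> float:
    """Bitrate in bits per pixel for a payload describing a width x height image."""
    return payload_bytes * 8 / (width * height)


def byte_budget(target_bpp: float, width: int, height: int) -> int:
    """Largest whole number of bytes that stays at or below target_bpp."""
    return math.floor(target_bpp * width * height / 8)


# Hypothetical payload split (assumed numbers, not taken from the paper):
# the compressed reference image, the text descriptions, and the semantic masks.
payload = {
    "reference_image": 1800,
    "text_descriptions": 400,
    "semantic_masks": 200,
}

width, height = 768, 512  # assumed image size for illustration
total_bytes = sum(payload.values())

print(f"budget at 0.05 bpp: {byte_budget(0.05, width, height)} bytes")
print(f"payload: {total_bytes} bytes = {bits_per_pixel(total_bytes, width, height):.4f} bpp")
```

Under these assumed numbers, everything sent to the decoder for a 768x512 image has to fit in roughly 2.4 KB, which is why the reference image must be compressed so aggressively and the remaining detail is left to the diffusion-based decoder.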

Subject: Computer Vision and Pattern Recognition

Publish: 2025-03-01 08:27:11 UTC

2503.00399
