While CLIP has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatially invariant semantic features and constrained resolution. While previous adaptations addressed spatial invariance by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field. In addition, we propose a refinement strategy that transforms CLIP's coarse segmentation outputs into prompts for SAM. Trident achieves a significant improvement in mIoU across eight popular benchmarks compared with the current SOTA. Furthermore, it can also be used to generate visual prompts that enhance the performance of Large Vision-Language Models (LVLMs). Code is available at https://github.com/YuHengsss/Trident.
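To make the splice-then-segment idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: per-window CLIP/DINO token maps are spliced into one global feature map, a correlation matrix computed from SAM-encoder tokens aggregates features across the whole image, and class logits are obtained by similarity with CLIP text embeddings. All tensor shapes, the function `splice_then_segment`, and the row-major window layout are illustrative assumptions; real features would come from the respective encoders.

```python
# Hedged sketch of splice-then-segment aggregation (illustrative, not Trident's code).
import math
import torch
import torch.nn.functional as F

def splice_then_segment(clip_win_feats, sam_feats, text_embeds, grid_hw):
    """
    clip_win_feats: list of per-window CLIP/DINO token maps, each (C, h, w),
                    assumed to be laid out row-major over the window grid.
    sam_feats:      (N, D) SAM-encoder tokens for the full image, N = total tokens.
    text_embeds:    (K, C) CLIP text embeddings for K class names.
    grid_hw:        (rows, cols) of the sliding-window grid.
    Returns per-token class logits (N, K) after global aggregation.
    """
    rows, cols = grid_hw
    C, h, w = clip_win_feats[0].shape

    # Splice: place each window's token map into a single global feature map.
    global_map = torch.zeros(C, rows * h, cols * w)
    for idx, feat in enumerate(clip_win_feats):
        r, c = divmod(idx, cols)
        global_map[:, r * h:(r + 1) * h, c * w:(c + 1) * w] = feat
    tokens = global_map.flatten(1).t()                      # (N, C)

    # Correlation matrix from SAM features -> global receptive field.
    attn = sam_feats @ sam_feats.t() / math.sqrt(sam_feats.shape[1])
    attn = F.softmax(attn, dim=-1)                          # (N, N)
    aggregated = attn @ tokens                              # (N, C)

    # Open-vocabulary logits via cosine similarity with text embeddings.
    aggregated = F.normalize(aggregated, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return aggregated @ text_embeds.t()                     # (N, K)

# Toy usage with random tensors (shapes only; real features come from the encoders).
if __name__ == "__main__":
    rows, cols, h, w, C, D, K = 2, 2, 14, 14, 512, 256, 8
    wins = [torch.randn(C, h, w) for _ in range(rows * cols)]
    sam = torch.randn(rows * h * cols * w, D)
    txt = torch.randn(K, C)
    print(splice_then_segment(wins, sam, txt, (rows, cols)).shape)  # torch.Size([784, 8])
```

The per-token logits can then be reshaped back to the spatial grid to form a coarse segmentation map, which, as described above, could in turn be converted into prompts for SAM for mask refinement.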