Existing methods for adapting pre-trained vision-language models such as CLIP typically rely on base-class samples during fine-tuning, which introduces systematic biases that distort decision boundaries and degrade performance on novel classes. In this work, we propose a hierarchical divide-and-conquer framework that addresses this classification bias at its root. Our method first partitions the label space into base and novel subspaces to enforce domain separation. It then applies text-embedding clustering within each subspace to decompose ambiguous intra-domain classes into disentangled, fine-grained clusters. This two-stage grouping strategy not only alleviates class confusion but also enables domain-specific training in isolated subspaces, fostering specialized learning without overfitting to base categories. Experiments on three classification benchmarks show that our approach achieves state-of-the-art performance, surpassing the second-best method by 10% in average accuracy.
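
For concreteness, the two-stage grouping can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes precomputed CLIP text embeddings and a known list of base-class names, and uses k-means as a stand-in for the clustering step; the function name and parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_label_space(class_names, text_embeddings, base_classes, n_clusters=4):
    """Two-stage grouping (illustrative sketch).

    Stage 1: split the label space into base and novel subspaces.
    Stage 2: cluster the text embeddings within each subspace so that
             ambiguous intra-domain classes fall into fine-grained groups.
    """
    groups = {}
    for domain in ("base", "novel"):
        # Stage 1: indices of classes belonging to this subspace.
        idx = [i for i, name in enumerate(class_names)
               if (name in base_classes) == (domain == "base")]
        emb = text_embeddings[idx]
        # L2-normalize so Euclidean k-means approximates cosine similarity,
        # the metric CLIP text embeddings are usually compared with.
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        # Stage 2: k-means clustering within the subspace.
        k = min(n_clusters, len(idx))
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
        groups[domain] = {class_names[i]: int(c) for i, c in zip(idx, labels)}
    return groups
```

Each resulting cluster could then be handled by its own domain-specific model, so that base and novel categories are never trained against one another within a shared decision boundary.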