Large language models (LLMs) have demonstrated remarkable performance on multimodal tasks even with frozen LLM blocks and only a few trainable parameters. However, the underlying mechanisms by which LLMs enhance multimodal performance remain unclear. In this work, we focus on the phenomenon that ``merely concatenating a frozen LLM block to the Vision Transformer (ViT) encoder can yield significant performance enhancements; moreover, the choice of LLM block and insertion position can have a substantial impact, leading to varying degrees of improvement''. We analyze the optimization of the training process from the perspective of gradient dynamics and find that frozen LLM blocks act as gradient coherence rectifiers, aligning the gradients of different samples more closely during training. Furthermore, we demonstrate that the representation similarity between the inserted LLM block and the adjacent ViT block influences performance, with greater similarity tending to yield larger positive gains. These findings justify the selection of suitable LLM blocks and insertion positions. They also suggest a practical strategy: introducing additional gradient backpropagation paths through frozen LLM blocks during training can improve a vanilla ViT via the gradient-coherence rectification effect, without adding the LLM blocks at inference time. Our experiments demonstrate the effectiveness of this strategy, making the practical application of the gradient rectification effect feasible.
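To make the setup concrete, the following is a minimal sketch, not the paper's implementation: it assumes a generic ViT encoder that outputs token embeddings and a generic frozen transformer block taken from an LLM, wired together with a trainable linear adapter, plus one simple way to quantify gradient coherence as the average pairwise cosine similarity of gradients computed on different batches. All class and function names (ViTWithFrozenLLMBlock, gradient_coherence) are illustrative assumptions.

```python
# Sketch only: frozen LLM block appended after a ViT encoder, and a simple
# gradient-coherence metric. Names and interfaces here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViTWithFrozenLLMBlock(nn.Module):
    def __init__(self, vit, llm_block, vit_dim, llm_dim, num_classes):
        super().__init__()
        self.vit = vit                                # trainable ViT encoder: (B, N, vit_dim)
        self.proj_in = nn.Linear(vit_dim, llm_dim)    # trainable adapter into the LLM width
        self.llm_block = llm_block                    # frozen LLM transformer block: (B, N, llm_dim) -> (B, N, llm_dim)
        for p in self.llm_block.parameters():
            p.requires_grad_(False)                   # frozen, but gradients still flow through it
        self.head = nn.Linear(llm_dim, num_classes)   # trainable classifier

    def forward(self, x):
        tokens = self.vit(x)                          # token embeddings from the ViT
        h = self.llm_block(self.proj_in(tokens))      # pass through the frozen LLM block
        return self.head(h.mean(dim=1))               # mean-pool tokens, then classify


def gradient_coherence(model, batches, loss_fn):
    """Average pairwise cosine similarity of gradients over trainable parameters,
    computed on different batches -- one possible proxy for gradient alignment."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads.append(torch.cat([p.grad.flatten() for p in params]))
    sims = [F.cosine_similarity(grads[i], grads[j], dim=0)
            for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return torch.stack(sims).mean()
```

Under this sketch, comparing `gradient_coherence` for the plain ViT versus the ViT with the appended frozen block is one way to probe the rectification effect described above; the specific pooling and adapter choices are placeholders rather than the paper's design.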