The weight matrices of Large Language Models (LLMs) can often be expressed in low-rank format, with the potential to relax memory and compute resource requirements. Unlike prior works that pivot around developing novel matrix decomposition algorithms, in this work we study the emerging non-uniform low-rank properties across weight matrices in LLMs through the lens of stabilizing gradient subspaces. \textit{Firstly,} we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. \textit{Secondly,} we empirically establish a consequential relationship between the gradient dynamics and the low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating non-uniform rank reduction across them to minimize the performance drop due to compression. In view of that, we present \textit{Weight Low-Rank Projection} \textbf{(WeLore)}, which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. Going beyond serving only as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to be expressed in low-rank form. Our gradient-dynamics perspective illustrates that \textit{LRCs tend to have better fine-tuning capabilities}, and their standalone fine-tuning can closely mimic (and sometimes outperform) the training loss trajectory and performance of full fine-tuning with a notable reduction in memory and compute footprint. All code and checkpoints will be released.
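To make the LRC/N-LRC categorization concrete, the following is a minimal, illustrative sketch (not the authors' released implementation): each weight matrix is ranked by how quickly its singular values decay, fast-decaying matrices are kept as Low-rank Components (LRCs) compressed via truncated SVD, and the rest remain dense as N-LRCs. The threshold \texttt{energy\_ratio} and the helper names are assumptions for illustration only.

\begin{verbatim}
import torch

def decompose_matrix(W: torch.Tensor, energy_ratio: float = 0.9):
    """Return (A, B) with W ~= A @ B if W is well expressed in low rank, else None."""
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
    # Smallest rank r whose top-r singular values capture `energy_ratio`
    # of the total singular-value mass.
    cumulative = torch.cumsum(S, dim=0) / S.sum()
    r = int((cumulative < energy_ratio).sum().item()) + 1
    # Treat the matrix as an LRC only if the low-rank form actually saves parameters.
    if r * (W.shape[0] + W.shape[1]) >= W.numel():
        return None  # N-LRC: keep dense
    A = U[:, :r] * S[:r]   # (m, r)
    B = Vh[:r, :]          # (r, n)
    return A, B

def categorize(model: torch.nn.Module, energy_ratio: float = 0.9):
    """Split all 2-D parameters of `model` into LRCs (with factors) and N-LRCs."""
    lrcs, nlrcs = {}, {}
    for name, p in model.named_parameters():
        if p.ndim != 2:
            continue
        factors = decompose_matrix(p.data, energy_ratio)
        (lrcs if factors is not None else nlrcs)[name] = factors
    return lrcs, nlrcs
\end{verbatim}

Under this sketch, memory-efficient fine-tuning would update only the factor pairs stored for LRCs while freezing the dense N-LRCs; the specific rank-selection criterion used by WeLore may differ from the illustrative threshold shown here.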