qxRQ9f9zMv@OpenReview

#1 A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

Authors: Yeqi Gao, Zhao Song, Weixin Wang, Junze Yin

Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on providing a provable guarantee for the one-layer attention network objective function: given input matrices of a layer, $A_1, A_2, A_3 \in \mathbb{R}^{n \times d}$, our goal is to minimize the loss function
$$L(X, Y) = \sum_{j_0=1}^{n} \sum_{i_0=1}^{d} \Big( \big\langle \langle \exp(\mathsf{A}_{j_0} x), \mathbf{1}_n \rangle^{-1} \exp(\mathsf{A}_{j_0} x), \ A_3 Y_{*, i_0} \big\rangle - b_{j_0, i_0} \Big)^2,$$

where $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of the Kronecker product of $A_1$ and $A_2$. The matrix $B \in \mathbb{R}^{n \times d}$ represents the output of a layer, and $b_{j_0, i_0} \in \mathbb{R}$ is the $(j_0, i_0)$-th entry of $B$. $Y_{*, i_0} \in \mathbb{R}^{d}$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In self-attention, $Q, K, V \in \mathbb{R}^{d \times d}$ represent the weight matrices for the query, key, and value, respectively. Unlike prior works that relied on simplified and single-variable versions of the attention computation problem, our multivariate loss function analyzes a complete and unsimplified attention layer, treating all these weights as variables, where $X = QK^\top \in \mathbb{R}^{d \times d}$ and $Y = V \in \mathbb{R}^{d \times d}$. We propose an iterative greedy algorithm to train a neural network using the loss function $L(X, Y)$, achieving an error bound of $\epsilon \in (0, 0.1)$. The algorithm runs in $O\big((\mathcal{T}_{\mathrm{mat}}(n, n, d) + \mathcal{T}_{\mathrm{mat}}(n, d, d) + d^{2\omega}) \log(1/\epsilon)\big)$ time, where $\mathcal{T}_{\mathrm{mat}}(a, b, c)$ denotes the time complexity of multiplying an $a \times b$ matrix with a $b \times c$ matrix, and $\omega \approx 2.37$ is the exponent of matrix multiplication.
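
To make the objective concrete, the following is a minimal NumPy sketch that evaluates $L(X, Y)$ by brute force, enumerating the $n$ Kronecker blocks explicitly. It illustrates the loss as defined above, not the paper's solver: the function name `attention_loss`, the toy dimensions, and the row-major vectorization of $X$ are assumptions made here for readability.

```python
import numpy as np


def attention_loss(A1, A2, A3, B, X, Y):
    """Directly evaluate the one-layer attention regression loss L(X, Y).

    A1, A2, A3, B are (n, d) arrays; X, Y are (d, d) arrays.
    For each j0, the block A_{j0} in R^{n x d^2} is the j0-th row block of the
    Kronecker product A1 (x) A2, and x = vec(X) in R^{d^2}.
    NOTE: NumPy's row-major vectorization of X is used here; the paper's vec
    convention may differ (which amounts to transposing X).
    """
    n, d = A1.shape
    x = X.reshape(d * d)                      # vec(X)
    loss = 0.0
    for j0 in range(n):
        A_j0 = np.kron(A1[j0:j0 + 1, :], A2)  # j0-th block of A1 (x) A2, shape (n, d^2)
        u = np.exp(A_j0 @ x)                  # exp(A_{j0} x)
        s = u / u.sum()                       # <exp(A_{j0} x), 1_n>^{-1} exp(A_{j0} x)
        r = s @ (A3 @ Y) - B[j0, :]           # <s, A3 Y_{*, i0}> - b_{j0, i0} for all i0 at once
        loss += float(np.sum(r ** 2))
    return loss


# toy usage on random data (dimensions chosen arbitrarily for illustration)
rng = np.random.default_rng(0)
n, d = 8, 4
A1, A2, A3, B = [0.1 * rng.standard_normal((n, d)) for _ in range(4)]
X = 0.1 * rng.standard_normal((d, d))
Y = 0.1 * rng.standard_normal((d, d))
print(attention_loss(A1, A2, A3, B, X, Y))
```

Each call of this direct evaluation forms $n$ blocks of size $n \times d^2$, so it scales far worse than the stated $O\big((\mathcal{T}_{\mathrm{mat}}(n, n, d) + \mathcal{T}_{\mathrm{mat}}(n, d, d) + d^{2\omega}) \log(1/\epsilon)\big)$ bound; the sketch is only meant to pin down what the regression objective computes.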

Subject: UAI.2025 - Poster