Authors: Yeqi Gao, Zhao Song, Weixin Wang, Junze Yin
Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on providing a provable guarantee for the one-layer attention network objective function: given input matrices of a layer, $A_1, A_2, A_3 \in \mathbb{R}^{n \times d}$, our goal is to minimize the loss function
$$
L(X,Y) = \sum_{j_0=1}^{n} \sum_{i_0=1}^{d} \Big( \big\langle \langle \exp(\mathsf{A}_{j_0} x), \mathbf{1}_n \rangle^{-1} \cdot \exp(\mathsf{A}_{j_0} x), \; A_3 Y_{*,i_0} \big\rangle - b_{j_0,i_0} \Big)^2,
$$
where $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of the Kronecker product of $A_1$ and $A_2$. The matrix $B \in \mathbb{R}^{n \times d}$ represents the output of a layer, and $b_{j_0,i_0} \in \mathbb{R}$ is the $(j_0,i_0)$-th entry of $B$. $Y_{*,i_0} \in \mathbb{R}^{d}$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In self-attention, $Q, K, V \in \mathbb{R}^{d \times d}$ represent the weight matrices for the query, key, and value, respectively. Unlike prior works that relied on simplified and single-variable versions of the attention computation problem, our multivariate loss function analyzes a complete and unsimplified attention layer, treating all of these weights as variables, where $X = QK^\top \in \mathbb{R}^{d \times d}$ and $Y = V \in \mathbb{R}^{d \times d}$. We propose an iterative greedy algorithm to train a neural network using the loss function $L(X,Y)$, achieving an error bound of $\epsilon \in (0, 0.1)$. The algorithm runs in $O\big((\mathcal{T}_{\mathrm{mat}}(n,n,d) + \mathcal{T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon)\big)$ time, where $\mathcal{T}_{\mathrm{mat}}(a,b,c)$ denotes the time complexity of multiplying an $a \times b$ matrix with a $b \times c$ matrix, and $\omega \approx 2.37$ is the exponent of matrix multiplication.
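To make the notation concrete, the snippet below is a minimal numpy sketch that evaluates $L(X,Y)$ by brute force; it is not the paper's iterative algorithm. It assumes a row-major vectorization of $X$ and the standard row-block ordering of the Kronecker product, under which $\exp(\mathsf{A}_{j_0} x)$ coincides with the $j_0$-th row of $\exp(A_1 X A_2^\top)$; the function name and the random test are illustrative.

```python
import numpy as np

def attention_loss(A1, A2, A3, B, X, Y):
    """Evaluate L(X, Y) = sum_{j0, i0} (<softmax(A_{j0} x), A3 Y[:, i0]> - B[j0, i0])^2.

    A1, A2, A3: (n, d) input matrices of the layer.
    B:          (n, d) target output of the layer.
    X, Y:       (d, d) variables (X = Q K^T, Y = V).
    Assumes row-major vec(X), so A_{j0} x equals the j0-th row of A1 X A2^T.
    """
    n, d = A1.shape
    x = X.reshape(-1)                      # vec(X), row-major (assumed convention)
    loss = 0.0
    for j0 in range(n):
        # j0-th (n x d^2) block of the Kronecker product A1 ⊗ A2
        A_j0 = np.kron(A1[j0:j0 + 1, :], A2)
        u = np.exp(A_j0 @ x)               # exp(A_{j0} x) in R^n
        f = u / u.sum()                    # <exp(A_{j0} x), 1_n>^{-1} * exp(A_{j0} x)
        pred = f @ (A3 @ Y)                # entries <f, A3 Y[:, i0]> for i0 = 1..d
        loss += np.sum((pred - B[j0]) ** 2)
    return loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 5, 3
    A1, A2, A3, B = (rng.standard_normal((n, d)) for _ in range(4))
    X, Y = rng.standard_normal((d, d)), rng.standard_normal((d, d))

    # Equivalent matrix form: row-wise softmax of A1 X A2^T, followed by a
    # squared Frobenius-norm regression against B.
    S = np.exp(A1 @ X @ A2.T)
    Att = S / S.sum(axis=1, keepdims=True)
    assert np.isclose(attention_loss(A1, A2, A3, B, X, Y),
                      np.linalg.norm(Att @ A3 @ Y - B) ** 2)
```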
Subject: UAI.2025 - Poster