As efficient alternatives to softmax attention, linear state space models (SSMs) achieve constant memory and linear compute, but they maintain only a lossy, fading summary of the past, which often leads to inferior performance in recall-oriented settings. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA is inspired by the Kalman filter and solves ridge regression problems online at test time, with constant memory and time linear in the sequence length. A key insight is that the standard Kalman filter equations are numerically unstable in low-precision environments (e.g., bfloat16) and difficult to parallelize on modern GPUs. We address both challenges via two innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the problem, ensuring numerical stability and balancing memory retention; and (2) the use of Chebyshev iteration in place of conventional iterative solvers, which we show to be more stable in low-precision settings. To improve scalability, we implement Chebyshev iteration in a hardware-aware, chunk-wise manner, along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. On short-context tasks, GKA shows strong language understanding capabilities and outperforms existing SSMs (e.g., Mamba2 and Gated DeltaNet). On long-context tasks, GKA excels at real-world RAG and LongQA tasks of up to 128k tokens, with more than 10% relative improvement over baselines. Finally, we show that GKA outperforms Mamba when extended to ImageNet classification.
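To make the idea of solving ridge regression online with a constant-size state more concrete, the following is a minimal sketch and not the GKA implementation. It assumes a fixed regularizer `lam` (the abstract describes an adaptive one), an optional scalar forget factor `gamma` standing in for the input-dependent gating, dense float64 NumPy arrays rather than chunk-wise low-precision kernels, and a simple trace-based upper bound on the largest eigenvalue. The names `chebyshev_solve`, `OnlineRidgeState`, `lam`, and `gamma` are hypothetical and chosen only for illustration.

```python
import numpy as np


def chebyshev_solve(A, b, lam_min, lam_max, num_iters=100):
    """Approximately solve A x = b with Chebyshev iteration.

    Assumes A is symmetric positive definite with all eigenvalues
    inside [lam_min, lam_max]. Unlike conjugate gradients, the
    iteration uses no inner products, only matrix-vector products.
    """
    theta = 0.5 * (lam_max + lam_min)  # centre of the eigenvalue interval
    delta = 0.5 * (lam_max - lam_min)  # half-width of the interval
    r = b.astype(float)                # residual b - A x for x = 0
    if delta == 0.0:
        # Degenerate interval: A is theta * I, so the solve is exact.
        return r / theta
    x = np.zeros_like(r)
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    d = r / theta                      # first update direction
    for _ in range(num_iters):
        x = x + d
        r = r - A @ d
        rho_next = 1.0 / (2.0 * sigma1 - rho)
        d = rho_next * rho * d + (2.0 * rho_next / delta) * r
        rho = rho_next
    return x


class OnlineRidgeState:
    """Constant-size summary of all (key, value) pairs seen so far."""

    def __init__(self, d_k, d_v, lam=1.0):
        self.H = np.zeros((d_k, d_k))  # running sum of k k^T
        self.M = np.zeros((d_v, d_k))  # running sum of v k^T
        self.lam = lam                 # ridge regularizer (fixed here)

    def update(self, k, v, gamma=1.0):
        # gamma < 1 down-weights older pairs (hypothetical scalar gate).
        self.H = gamma * self.H + np.outer(k, k)
        self.M = gamma * self.M + np.outer(v, k)

    def read(self, q, num_iters=100):
        # Output M (H + lam I)^{-1} q, i.e. the ridge-regression
        # prediction for query q, without forming an explicit inverse.
        A = self.H + self.lam * np.eye(self.H.shape[0])
        lam_max = self.lam + np.trace(self.H)  # trace bounds the top eigenvalue
        x = chebyshev_solve(A, q, self.lam, lam_max, num_iters)
        return self.M @ x


# Usage: stream 32 unit-norm keys, then compare with batch ridge regression.
rng = np.random.default_rng(0)
d_k, d_v, T = 8, 4, 32
K = rng.standard_normal((T, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.standard_normal((T, d_v))

state = OnlineRidgeState(d_k, d_v, lam=1.0)
for k, v in zip(K, V):
    state.update(k, v)

q = K[0]
out = state.read(q)
ref = V.T @ K @ np.linalg.solve(K.T @ K + np.eye(d_k), q)
```

In this sketch the state size depends only on the key and value dimensions, not on the number of tokens processed, and `lam` directly sets the smallest eigenvalue of the system being solved, which is one way to see why the regularizer controls the condition number and hence how many Chebyshev iterations are needed.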