Convolutional neural networks (ConvNets) with a large effective receptive field (ERF), still in their early stages, have demonstrated promising effectiveness but are constrained by high parameter and FLOPs costs and by a disrupted asymptotically Gaussian distribution (AGD) of the ERF. This paper proposes an alternative paradigm: rather than merely employing an extremely large ERF, it is more effective and efficient to expand the ERF while maintaining its AGD through a proper combination of smaller kernels, such as 7×7, 9×9, and 11×11. This paper introduces a Three-layer Receptive Field Aggregator and designs a Layer Operator as the fundamental operator from the perspective of the receptive field. By stacking the proposed modules, the ERF can be expanded to the level of existing large-kernel ConvNets while maintaining its AGD. Using these designs, we propose a universal ConvNet, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and ViTs across various vision recognition tasks, for both lightweight and large-scale models, at comparable throughput. Surprisingly, UniConvNet-T achieves 84.2% ImageNet top-1 accuracy with 30M parameters and 5.1G FLOPs. UniConvNet-XL also shows competitive scalability to big data and large models, achieving 88.4% top-1 accuracy on ImageNet.
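
As a rough illustration of why combining smaller kernels can enlarge the receptive field while keeping a center-peaked, bell-like profile, the sketch below considers an idealized 1D case. It is not the Three-layer Receptive Field Aggregator itself: it assumes, purely for illustration, that kernels of size 7, 9, and 11 are applied in sequence with stride 1, no dilation, and uniform weights. The helper names (`stacked_receptive_field`, `convolve_1d`, `stacked_profile`) are hypothetical.

```python
def stacked_receptive_field(kernel_sizes):
    """Theoretical receptive field of stride-1, dilation-1 convolutions
    applied one after another (an assumption of this sketch)."""
    return 1 + sum(k - 1 for k in kernel_sizes)


def convolve_1d(a, b):
    """Full discrete convolution of two 1D sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def stacked_profile(kernel_sizes):
    """Contribution of each input position to one output position when
    every kernel has uniform weights (assumed here, not learned)."""
    profile = [1.0]
    for k in kernel_sizes:
        profile = convolve_1d(profile, [1.0 / k] * k)
    return profile


kernels = [7, 9, 11]  # kernel sizes mentioned in the abstract
rf = stacked_receptive_field(kernels)
profile = stacked_profile(kernels)

print("theoretical receptive field:", rf)
print("peak position:", profile.index(max(profile)))
```

Under these assumptions the three kernels together cover 25 input positions, and the contribution profile is symmetric with a single peak at the center and decaying tails, which is the qualitative shape an asymptotically Gaussian ERF is expected to have. Learned weights, nonlinearities, and the actual module design would change the exact profile.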