LXILcnYTSl@OpenReview


The Underlying Universal Statistical Structure of Natural Datasets

Authors: Noam Levi, Yaron Oz

We study universal properties in real-world complex and synthetically generated datasets. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. Examining the local and global eigenvalue statistics of feature-feature covariance matrices, we find: (i) bulk eigenvalue power-law scaling differs vastly between uncorrelated Gaussian and real-world data, (ii) this power-law behavior is reproducible using Gaussian data with long-range correlations, (iii) all dataset types exhibit chaotic RMT universality, (iv) RMT statistics emerge at smaller dataset sizes than typical training sets, correlating with power-law convergence, (v) Shannon entropy correlates with RMT structure and requires fewer samples in strongly correlated datasets. These results suggest natural image Gram matrices can be approximated by Wishart random matrices with simple covariance structure, enabling rigorous analysis of neural network behavior.
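
As an illustration of the kind of measurement the abstract describes, the following is a minimal sketch and not the authors' code. It draws correlated Gaussian data, forms the feature-feature covariance matrix, fits a power-law exponent to the bulk eigenvalues (a global statistic), and computes the mean consecutive-spacing ratio (a local statistic, roughly 0.53 for GOE-like chaotic spectra and roughly 0.39 for uncorrelated Poisson levels). Everything specific here is an assumption made for illustration: the population spectrum `i^{-(1 + alpha)}`, the random orthogonal basis, the rank window used for the fit, the choice of the spacing ratio as the local diagnostic, and all function and parameter names.

```python
import numpy as np


def sample_correlated_gaussian(n_samples, dim, alpha, rng):
    """Draw Gaussian data with an assumed power-law population spectrum.

    Assumption (illustrative only): population eigenvalues are i^{-(1 + alpha)},
    expressed in a random orthogonal basis so the features are correlated.
    """
    pop_eigs = np.arange(1, dim + 1, dtype=float) ** (-(1.0 + alpha))
    basis, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    z = rng.standard_normal((n_samples, dim))
    return (z * np.sqrt(pop_eigs)) @ basis.T


def covariance_eigenvalues(x):
    """Eigenvalues of the (dim x dim) feature-feature covariance, descending."""
    cov = x.T @ x / x.shape[0]
    return np.linalg.eigvalsh(cov)[::-1]


def bulk_power_law_exponent(eigs, lo=10, hi=100):
    """Least-squares slope of log(eigenvalue) vs log(rank) on ranks lo+1..hi.

    The rank window is a hypothetical choice that skips the largest eigenvalues
    and the far tail; returns the positive exponent of lambda_i ~ i^{-exponent}.
    """
    ranks = np.arange(1, len(eigs) + 1)
    slope, _ = np.polyfit(np.log(ranks[lo:hi]), np.log(eigs[lo:hi]), 1)
    return -slope


def mean_spacing_ratio(eigs):
    """Mean of min(s_i, s_{i+1}) / max(s_i, s_{i+1}) over consecutive spacings."""
    s = np.abs(np.diff(eigs))
    return float(np.mean(np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = sample_correlated_gaussian(n_samples=800, dim=400, alpha=0.5, rng=rng)
    eigs = covariance_eigenvalues(x)
    print("fitted bulk exponent:", bulk_power_law_exponent(eigs))
    print("mean spacing ratio:  ", mean_spacing_ratio(eigs))
```

The spacing ratio is used here because it needs no spectral unfolding, which keeps the sketch short; a fuller analysis of local statistics would typically unfold the spectrum and inspect the whole spacing distribution.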

Subject: ICML.2025 - Poster