Total: 1
Identifying minimal and informative feature sets is a central challenge in data analysis, particularly when few data points are available. Here we present a theoretical analysis of an unsupervised feature selection pipeline based on the Differentiable Information Imbalance (DII). We consider the specific case of structural and physico-chemical features describing a set of proteins. We show that if one considers the features as coordinates of a (hypothetical) statistical physics model, this model undergoes a phase transition as a function of the number of retained features. For physico-chemical descriptors, this transition is between a glass-like phase when the features are few and a liquid-like phase. The glass-like phase exhibits bimodal order-parameter distributions and Binder cumulant minima. In contrast, for structural descriptors the transition is less sharp. Remarkably, for physico-chemical descriptors the critical number of features identified from the DII coincides with the saturation of downstream binary classification performance. These results provide a principled, unsupervised criterion for minimal feature sets in protein classification and reveal distinct mechanisms of criticality across different feature types.