Chemical Physics

Date: Thu, 9 May 2024 | Total: 10

#1 Data-Error Scaling in Machine Learning on Natural Discrete Combinatorial Mutation-prone Sets: Case Studies on Peptides and Small Molecules

Authors: Vanni Doffini ; O. Anatole von Lilienfeld ; Michael A. Nash

We investigate trends in the data-error scaling behavior of machine learning (ML) models trained on discrete combinatorial spaces that are prone to mutation, such as proteins or organic small molecules. We trained and evaluated kernel ridge regression machines using variable amounts of computationally generated training data. Our synthetic datasets comprise i) two naïve functions based on many-body theory; ii) binding energy estimates between a protein and a mutagenised peptide; and iii) solvation energies of two 6-heavy atom structural graphs. In contrast to typical data-error scaling, our results showed discontinuous monotonic phase transitions during learning, observed as rapid drops in the test error at particular thresholds of training data. We observed two learning regimes, which we call saturated and asymptotic decay, and found that they are conditioned by the level of complexity (i.e. number of mutations) enclosed in the training set. We show that during training on this class of problems, the predictions were clustered by the ML models employed in the calibration plots. Furthermore, we present an alternative strategy to normalize learning curves (LCs) and the concept of mutant-based shuffling. This work has implications for machine learning on mutagenisable discrete spaces such as chemical properties or protein phenotype prediction, and improves basic understanding of concepts in statistical learning theory.

#2 Range separation of the interaction potential in intermolecular and intramolecular symmetry-adapted perturbation theory

Authors: Du Luu ; Clemence Corminboeuf ; Konrad Patkowski

Symmetry-adapted perturbation theory (SAPT) is a popular and versatile tool to compute and decompose noncovalent interaction energies between molecules. The intramolecular SAPT (ISAPT) variant provides a similar energy decomposition between two nonbonded fragments of the same molecule, covalently connected by a third fragment. In this work, we explore an alternative approach where the noncovalent interaction is singled out by a range separation of the Coulomb potential. We investigate two common splittings of the $1/r$ potential into long-range and short-range parts based on the Gaussian and error functions, and approximate either the entire intermolecular/interfragment interaction or only its attractive terms by the long-range contribution. These range separation schemes are tested for a number of intermolecular and intramolecular complexes. We find that the energy corrections from range-separated SAPT or ISAPT are in reasonable agreement with complete SAPT/ISAPT data. This result should be contrasted with the inability of the long-range multipole expansion to describe crucial short-range charge penetration and exchange effects; it shows that the long-range interaction potential does not just recover the asymptotic interaction energy but also provides a useful account of short-range terms. The best consistency is attained for the error-function separation applied to all interaction terms, both attractive and repulsive. This study is the first step towards a fragmentation-free decomposition of intramolecular nonbonded energy.
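The error-function splitting mentioned above is commonly written as $1/r = \mathrm{erf}(\omega r)/r + \mathrm{erfc}(\omega r)/r$. The following is a minimal sketch of that identity only, not the authors' implementation; the function name `split_coulomb` and the range-separation parameter `omega` (with its value) are assumptions introduced for illustration.

```python
import math


def split_coulomb(r: float, omega: float) -> tuple[float, float]:
    """Split 1/r into (long_range, short_range) parts via the error function.

    `omega` is a hypothetical range-separation parameter (inverse length);
    the abstract does not state which values were used.
    """
    long_range = math.erf(omega * r) / r    # smooth and finite as r -> 0
    short_range = math.erfc(omega * r) / r  # decays rapidly at large r
    return long_range, short_range


# The two parts always sum back to the full Coulomb potential 1/r.
lr, sr = split_coulomb(2.0, 0.5)
print(lr + sr)
```

Unlike a multipole expansion, the long-range part stays finite at short distance (it tends to $2\omega/\sqrt{\pi}$ as $r \to 0$), which is consistent with the abstract's remark that it can still account for short-range terms.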

#3 Predicting the binding of small molecules to proteins through invariant representation of the molecular structure

Authors: R. Beccaria ; A. Lazzeri ; G. Tiana

We present a computational scheme for predicting the ligands that bind to a pocket of known structure. It is based on the generation of a general abstract representation of the molecules, which is invariant to rotations, translations and permutations of atoms, and has some degree of isometry with the space of conformations. We use these representations to train a non-deep machine learning algorithm to classify the binding between pockets and molecule pairs, and show that this approach has a better generalization capability than existing methods.

#4 Basis set extrapolation from the vanishing counterpoise correction condition

Authors: Vladimir Fishman ; Emmanouil Semidalas ; Jan M. L. Martin

Basis set extrapolations are typically rationalized either from analytical arguments involving the partial-wave or principal expansions of the correlation energy in helium-like systems, or from fitting extrapolation parameters to reference energetics for a small(ish) training set. Seeking to avoid both, we explore a third alternative: extracting extrapolation parameters from the requirement that the BSSE (basis set superposition error) should vanish at the complete basis set limit. We find this to be a viable approach provided that the underlying basis sets are not too small and reasonably well balanced. For basis sets not augmented by diffuse functions, BSSE minimization and energy fitting yield quite similar parameters.
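As a rough illustration of the idea, assume the basis set error decays as $A L^{-\alpha}$ with cardinal number $L$, so that a two-point extrapolation reads $E_\infty = E_L + (E_L - E_{L-1}) / [(L/(L-1))^{\alpha} - 1]$. Requiring the extrapolated BSSE to vanish then fixes $\alpha$. The sketch below works under that assumed functional form; it is not the authors' procedure, and all function names and numbers are hypothetical.

```python
import math


def extrapolate(e_small: float, e_large: float,
                l_small: int, l_large: int, alpha: float) -> float:
    """Two-point extrapolation assuming an error of the form A * L**(-alpha)."""
    factor = (l_large / l_small) ** alpha
    return e_large + (e_large - e_small) / (factor - 1.0)


def alpha_from_vanishing_bsse(bsse_small: float, bsse_large: float,
                              l_small: int, l_large: int) -> float:
    """Exponent for which the extrapolated BSSE is exactly zero.

    Setting extrapolate(bsse_small, bsse_large, ...) = 0 gives
    (l_large / l_small)**alpha = bsse_small / bsse_large.
    """
    ratio = bsse_small / bsse_large
    if ratio <= 0.0:
        raise ValueError("BSSE values must be nonzero and share the same sign")
    return math.log(ratio) / math.log(l_large / l_small)


# Made-up BSSE values for L = 3 and L = 4 that follow an exact L**-3 decay.
alpha = alpha_from_vanishing_bsse(1 / 27, 1 / 64, 3, 4)
print(alpha)
```

The exponent obtained this way could then be reused in `extrapolate` for the energies themselves, which mirrors the abstract's comparison between BSSE minimization and energy fitting.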

#5 Mean-Field Ring Polymer Rates Using a Population Dividing Surface

Authors: Nathan London ; Siyu Bu ; Britta Ann Johnson ; Nandini Ananth

Mean-field Ring Polymer Molecular Dynamics (MF-RPMD) offers a computationally efficient method for the simulation of reaction rates in multi-level systems. Previous work has established that, to model a nonadiabatic state-to-state reaction accurately, the dividing surface must be chosen to explicitly sample kinked ring polymer configurations where at least one bead is in a different electronic state than the others. Building on this, we introduce a population difference coordinate and a kink-constrained dividing surface, and we test the accuracy of the resulting mean-field rate theory on a series of linear vibronic coupling model systems as well as spin-boson models. We demonstrate that this new MF-RPMD rate approach is efficient to implement and quantitatively accurate for models over a wide range of driving forces, coupling strengths, and temperatures.

#6 Chemistry Beyond Exact Solutions on a Quantum-Centric Supercomputer

Authors: Javier Robledo-Moreno ; Mario Motta ; Holger Haas ; Ali Javadi-Abhari ; Petar Jurcevic ; William Kirby ; Simon Martiel ; Kunal Sharma ; Sandeep Sharma ; Tomonori Shirakawa ; Iskandar Sitdikov ; Rong-Yang Sun ; Kevin J. Sung ; Maika Takita ; Minh C. Tran ; Seiji Yunoki ; Antonio Mezzacapo

A universal quantum computer can be used as a simulator capable of predicting properties of diverse quantum systems. Electronic structure problems in chemistry offer practical use cases around the hundred-qubit mark. This appears promising since current quantum processors have reached these sizes. However, mapping these use cases onto quantum computers yields deep circuits, and for pre-fault-tolerant quantum processors, the large number of measurements to estimate molecular energies leads to prohibitive runtimes. As a result, realistic chemistry is out of reach of current quantum computers in isolation. A natural question is whether classical distributed computation can relieve quantum processors from parsing all but a core, intrinsically quantum component of a chemistry workflow. Here, we incorporate quantum computations of chemistry in a quantum-centric supercomputing architecture, using up to 6400 nodes of the supercomputer Fugaku to assist a Heron superconducting quantum processor. We simulate the N$_2$ triple bond breaking in a correlation-consistent cc-pVDZ basis set, and the active-space electronic structure of [2Fe-2S] and [4Fe-4S] clusters, using 58, 45 and 77 qubits respectively, with quantum circuits of up to 10570 (3590 2-qubit) quantum gates. We obtain our results using a class of quantum circuits that approximates molecular eigenstates, and a hybrid estimator. The estimator processes quantum samples, produces upper bounds to the ground-state energy and wavefunctions supported on a polynomial number of states. This guarantees an unconditional quality metric for quantum advantage, certifiable by classical computers at polynomial cost. For current error rates, our results show that classical distributed computing coupled to quantum processors can produce good approximate solutions for practical problems beyond sizes amenable to exact diagonalization.

#7 GP-MoLFormer: A Foundation Model For Molecular Generation

Authors: Jerret Ross ; Brian Belgodere ; Samuel C. Hoffman ; Vijil Chenthamarakshan ; Youssef Mroueh ; Payel Das

Transformer-based models trained on large and general-purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure-property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B chemical SMILES. GP-MoLFormer uses a 46.8M parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. We explore the utility of GP-MoLFormer in generating novel, valid, and unique SMILES. Impressively, we find GP-MoLFormer is able to generate a significant fraction of novel, valid, and unique SMILES even when the number of generated molecules is in the 10 billion range and the reference set is over a billion. We also find strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We evaluate GP-MoLFormer's utility and compare it with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better than or comparably to baselines across all three tasks, demonstrating its general utility.

#8 Temperature and Solvent Viscosity Tune the Intermediates During the Collapse of a Polymer

Authors: Suman Majumder ; Henrik Christiansen ; Wolfhard Janke

The dynamics of a polymer chain in solution are significantly affected by the temperature and the frictional forces arising due to solvent viscosity. Here, using an explicit solvent framework for polymer simulation with the liberty to tune the solvent viscosity, we study the nonequilibrium dynamics of a flexible homopolymer when it is suddenly quenched from an extended coil state in good solvent to poor solvent conditions. Results from our extensive simulations reveal that depending on the temperature $T$ and solvent viscosity, one encounters long-lived sausage-like intermediates following the usual pearl-necklace intermediates. Use of shape factors of polymers allows us to disentangle these two distinct stages of the overall collapse process, and the corresponding relaxation times. The relaxation time $\tau_s$ of the sausage stage, which is the rate-limiting stage of the overall collapse process, follows an anti-Arrhenius behavior in the high-$T$ limit, and the Arrhenius behavior in the low-$T$ limit. Furthermore, the variation of $\tau_s$ with the solvent viscosity provides evidence of internal friction of the polymer, which modulates the overall collapse significantly, analogous to what is observed for relaxation rates of proteins during their folding. This suggests that the origin of internal friction in proteins is plausibly intrinsic to their polymeric backbone rather than other specifications.

#9 Lipid-mediated hydrophobic gating in the BK potassium channel

Authors: Lucia Coronel ; Giovanni Di Muccio ; Brad Rothberg ; Alberto Giacomello ; Vincenzo Carnevale

The large-conductance, calcium-activated potassium (BK) channel lacks the typical intracellular bundle-crossing gate present in most ion channels of the 6TM family. This observation, initially inferred from Ca$^{2+}$-free-pore accessibility experiments and recently corroborated by a CryoEM structure of the non-conductive state, raises a puzzling question: how can gating occur in the absence of steric hindrance? To answer this question, we carried out molecular simulations and accurate free energy calculations to obtain a microscopic picture of the sequence of events that, starting from a Ca$^{2+}$-free state, leads to ion conduction upon Ca$^{2+}$ binding. Our results highlight an unexpected role for annular lipids, which turn out to be an integral part of the gating machinery. Due to the presence of fenestrations, the "closed" Ca$^{2+}$-free pore can be occupied by the methyl groups from the lipid alkyl chains. This dynamic occupancy triggers and stabilizes the nucleation of a vapor bubble into the inner pore cavity, thus hindering ion conduction. By contrast, Ca$^{2+}$ binding results in a displacement of these lipids outside the inner cavity, lowering the hydrophobicity of this region and thus allowing for pore hydration and conduction. This lipid-mediated hydrophobic gating rationalizes several seemingly problematic experimental observations, including the state-dependent pore accessibility of blockers.

#10 Navigating Chemical Space with Latent Flows

Authors: Guanghao Wei ; Yining Huang ; Chenru Duan ; Yue Song ; Yuanqi Du

Recent progress of deep generative models in the vision and language domains has stimulated significant interest in more structured data generation such as molecules. However, beyond generating new random molecules, efficient exploration and a comprehensive understanding of the vast chemical space are of great importance to molecular science and applications in drug design and materials discovery. In this paper, we propose a new framework, ChemFlow, to traverse chemical space by navigating, through flows, the latent space learned by molecule generative models. We introduce a dynamical system perspective that formulates the problem as learning a vector field that transports the mass of the molecular distribution to the region with desired molecular properties or structure diversity. Under this framework, we unify previous approaches on molecule latent space traversal and optimization and propose alternative competing methods incorporating different physical priors. We validate the efficacy of ChemFlow on molecule manipulation and single- and multi-objective molecule optimization tasks under both supervised and unsupervised molecular discovery settings. Code and demos are publicly available on GitHub at https://github.com/garywei944/ChemFlow.