Biomolecules

Date: Thu, 9 May 2024 | Total: 5

#1 GP-MoLFormer: A Foundation Model For Molecular Generation

Authors: Jerret Ross ; Brian Belgodere ; Samuel C. Hoffman ; Vijil Chenthamarakshan ; Youssef Mroueh ; Payel Das

Transformer-based models trained on large and general-purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure-property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B chemical SMILES. GP-MoLFormer uses a 46.8M parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. We explore the utility of GP-MoLFormer in generating novel, valid, and unique SMILES. Impressively, we find GP-MoLFormer is able to generate a significant fraction of novel, valid, and unique SMILES even when the number of generated molecules is in the 10 billion range and the reference set is over a billion. We also find strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We evaluate GP-MoLFormer's utility and compare it with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show that GP-MoLFormer performs better than or comparably to baselines across all three tasks, demonstrating its general utility.
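
The validity, uniqueness, and novelty fractions mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not define these metrics, so the definitions below are one common convention and not necessarily the authors'. The function name `generation_metrics` and the placeholder `is_valid` predicate are hypothetical; a real pipeline would canonicalize SMILES and check validity with a cheminformatics toolkit before comparing strings.

```python
from typing import Callable, Dict, Iterable


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool] = lambda s: bool(s),
) -> Dict[str, float]:
    """Compute validity, uniqueness, and novelty fractions for generated strings.

    Assumed definitions (not taken from the abstract):
      validity   = valid strings / all generated strings
      uniqueness = distinct valid strings / valid strings
      novelty    = distinct valid strings absent from training / distinct valid strings
    The default `is_valid` is a placeholder that only rejects empty strings.
    """
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


if __name__ == "__main__":
    # Toy strings for illustration only; they are not results from the paper.
    samples = ["CCO", "CCO", "c1ccccc1", "", "CCN"]
    print(generation_metrics(samples, training={"CCO"}))
```

Under these definitions, duplicated training strings that the model reproduces lower the novelty fraction, which is the trade-off the abstract describes between memorization and novelty.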

#2 Lipid-mediated hydrophobic gating in the BK potassium channel

Authors: Lucia Coronel ; Giovanni Di Muccio ; Brad Rothberg ; Alberto Giacomello ; Vincenzo Carnevale

The large-conductance, calcium-activated potassium (BK) channel lacks the typical intracellular bundle-crossing gate present in most ion channels of the 6TM family. This observation, initially inferred from Ca$^{2+}$-free-pore accessibility experiments and recently corroborated by a CryoEM structure of the non-conductive state, raises a puzzling question: how can gating occur in the absence of steric hindrance? To answer this question, we carried out molecular simulations and accurate free energy calculations to obtain a microscopic picture of the sequence of events that, starting from a Ca$^{2+}$-free state, leads to ion conduction upon Ca$^{2+}$ binding. Our results highlight an unexpected role for annular lipids, which turn out to be an integral part of the gating machinery. Due to the presence of fenestrations, the "closed" Ca$^{2+}$-free pore can be occupied by the methyl groups from the lipid alkyl chains. This dynamic occupancy triggers and stabilizes the nucleation of a vapor bubble in the inner pore cavity, thus hindering ion conduction. By contrast, Ca$^{2+}$ binding results in the displacement of these lipids outside the inner cavity, lowering the hydrophobicity of this region and thus allowing for pore hydration and conduction. This lipid-mediated hydrophobic gating rationalizes several seemingly problematic experimental observations, including the state-dependent pore accessibility of blockers.

#3 Impact of phylogeny on the inference of functional sectors from protein sequence data

Authors: Nicola Dietler ; Alia Abbara ; Subham Choudhury ; Anne-Florence Bitbol

Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated with functional properties. Modeling showed that natural selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e., from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use this data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data is available. We show that in this natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they respectively focus on conservation and on correlations. Thus, their joint use can reveal complementary information.
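
To make the notion of per-site conservation concrete, here is a minimal sketch that scores each alignment column by how far its Shannon entropy falls below the maximum possible. This is one common entropy-based convention, assumed here for illustration; the abstract does not specify the conservation measure used in the paper, and ICOD itself is not shown. The function name `column_conservation` and the alphabet size of 21 (20 amino acids plus a gap symbol) are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def column_conservation(msa: Sequence[str], alphabet_size: int = 21) -> List[float]:
    """Return one conservation score per alignment column.

    Score = log2(alphabet_size) - H(column), where H is the Shannon entropy
    (in bits) of the symbol frequencies in that column. A fully conserved
    column gets the maximum score; a uniformly mixed column scores near zero.
    """
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(alphabet_size)
    scores = []
    for i in range(length):
        counts = Counter(seq[i] for seq in msa)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append(max_entropy - entropy)
    return scores


if __name__ == "__main__":
    # Toy alignment for illustration only.
    toy_msa = ["ACD", "ACE", "ACF", "AGG"]
    print(column_conservation(toy_msa))
```

Because this score looks at each column independently, it is blind to correlations between sites, which is consistent with the abstract's point that conservation and correlation-based methods rest on different premises.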

#4 Predicting the binding of small molecules to proteins through invariant representation of the molecular structure

Authors: R. Beccaria ; A. Lazzeri ; G. Tiana

We present a computational scheme for predicting the ligands that bind to a pocket of known structure. It is based on the generation of a general abstract representation of the molecules, which is invariant to rotations, translations, and permutations of atoms, and has some degree of isometry with the space of conformations. We use these representations to train a non-deep machine learning algorithm to classify binding for pocket-molecule pairs, and show that this approach has better generalization capability than existing methods.
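
As a generic illustration of what invariance to rotations, translations, and atom permutations means, the sketch below builds a descriptor from the sorted list of all pairwise interatomic distances. This is not the representation proposed in the paper, whose construction the abstract does not describe; it is only a simple, well-known example of a descriptor with these three invariances. The function name `sorted_distance_descriptor` is hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def sorted_distance_descriptor(coords: Sequence[Point]) -> List[float]:
    """Return the sorted list of all pairwise Euclidean distances.

    Distances are unchanged by rigid rotations and translations, and sorting
    removes any dependence on the order in which atoms are listed.
    """
    return sorted(math.dist(a, b) for a, b in combinations(coords, 2))


if __name__ == "__main__":
    # Toy coordinates for illustration only.
    original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
    # Same points translated by (5, 5, 1) and listed in a different order.
    moved = [(5.0, 7.0, 1.0), (5.0, 5.0, 1.0), (6.0, 5.0, 1.0)]
    print(sorted_distance_descriptor(original))
    print(sorted_distance_descriptor(moved))
```

A descriptor like this discards atom types and chirality, so it is far less informative than a representation designed for binding prediction; it only demonstrates the invariance properties named in the abstract.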

#5 ACEGEN: Reinforcement learning of generative chemical agents for drug discovery

Authors: Albert Bou ; Morgan Thomas ; Sebastian Dittert ; Carles Navarro Ramírez ; Maciej Majewski ; Ye Wang ; Shivam Patel ; Gary Tresadern ; Mazen Ahmad ; Vincent Moens ; Woody Sherman ; Simone Sciabola ; Gianni De Fabritiis

In recent years, reinforcement learning (RL) has emerged as a valuable tool in drug design, offering the potential to propose and optimize molecules with desired properties. However, striking a balance between capability, flexibility, and reliability remains challenging due to the complexity of advanced RL algorithms and the significant reliance on specialized code. In this work, we introduce ACEGEN, a comprehensive and streamlined toolkit tailored for generative drug design, built using TorchRL, a modern decision-making library that offers efficient and thoroughly tested reusable components. ACEGEN provides a robust, flexible, and efficient platform for molecular design. We validate its effectiveness by benchmarking it across various algorithms and conducting multiple drug discovery case studies. ACEGEN is accessible at https://github.com/acellera/acegen-open.