Advanced Learning Strategies for Potential Energy Surfaces Applied to Organic Electrolytes (Holm, Kästner)
We develop methods and software tools to use machine learning techniques to obtain high-quality inter-atomic potentials fit on a minimal set of data using accurate ab-initio methods. The work will be based on the Gaussian moment neural network (GM-NN) approach developed in the Kästner group.
An essential prerequisite for constructing highly accurate and robust machine-learned interatomic potentials is detecting extrapolative configurations and performing ab-initio calculations on them on the fly, e.g., during a molecular dynamics (MD) simulation or offline on an unlabeled data set. Unfortunately, the application of most active learning algorithms known in the community in an on-the-fly approach is somewhat hindered by the high correlation of data sampled during, e.g., an MD simulation. Therefore, one of our goals is to develop greedy algorithms based on the last-layer uncertainty such that the most informative batch of uncorrelated configurations can be selected.
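The batch-selection idea can be illustrated with a minimal sketch. It is not the GM-NN implementation: it swaps in plain greedy farthest-point selection over feature vectors, and the names `greedy_batch` and `features` are hypothetical. It assumes each candidate configuration is summarized by a last-layer feature vector and uses the vector norm as a crude stand-in for an uncertainty estimate.

```python
import math


def greedy_batch(features, batch_size):
    """Greedy farthest-point selection (illustrative stand-in, not GM-NN code).

    features: list of equal-length lists of floats, one per configuration
              (assumed here to be last-layer features of the network).
    Returns the indices of the selected configurations.
    """
    if not features or batch_size <= 0:
        return []
    # Assumption: the feature norm serves as a crude uncertainty proxy,
    # so the first pick is the configuration with the largest norm.
    first = max(range(len(features)),
                key=lambda i: math.fsum(x * x for x in features[i]))
    selected = [first]
    # min_dist[i] is the distance from configuration i to its nearest selected one.
    min_dist = [math.dist(f, features[first]) for f in features]
    while len(selected) < min(batch_size, len(features)):
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0.0:
            break  # only duplicates of already selected configurations remain
        selected.append(nxt)
        for i, f in enumerate(features):
            d = math.dist(f, features[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Because each pick maximizes the distance to everything already chosen, near-duplicate frames from a single MD trajectory are unlikely to end up in the same batch.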
In a step towards explainable machine-learned potentials, we apply layer-wise relevance propagation. We aim to select the most representative Gaussian-moment features in a data-driven manner to speed up the training and inference.
Developing machine-learned inter-atomic potentials for room-temperature ionic liquids is challenging because of their complicated atomic structure and thermodynamic properties. The difficulty lies in their complex potential energy surface, which, owing to the typically high viscosity of ionic liquids, takes long simulations to map adequately with ab-initio methods. Therefore, the development of these potentials requires optimization of each stage of the machine learning workflow, including data generation, selection, and representation, ML algorithm deployment, and active learning approaches for model refinement. The methods developed here will subsequently support other applications in chemistry, catalysis, surface science, and beyond.
Our developments will enter the MDSuite and MLSuite in the Holm group and the GM-NN (https://gitlab.com/zaverkin_v/gmnn) suite in the Kästner group.
AI-Empowered Universal Workflow for Molecular Design of Performant Photoswitches (Hecht, Reuter)
The design of molecules with tailored properties and functions is central to a wide range of applications from agriculture and medicine to materials, energy, and information. This design process is traditionally rather an Edisonian-type discovery in which incomplete human understanding and working hypotheses are used to prioritize restricted search spaces. Thanks to high-throughput experimentation and descriptor-based predictive-quality computations these search spaces could gradually be extended. However, they are still insignificant compared to the vastness of the chemical space.
This project therefore aims at leveraging modern artificial intelligence (AI) methodology to establish a purposeful, actual design process. Provided with sufficient amounts of data, generative models can implicitly learn the underlying structure-property relationships and then directly propose innovative molecular designs that fulfill targeted properties. A fundamental challenge is the huge amount of data generally required by such deep learning. To this end, we will exploit transfer learning concepts to reduce the amount of domain-specific data needed, develop computationally most efficient descriptors to increase the availability of synthetic data, and generate extended experimental chemical libraries by adopting one-pot strategies and automatized workflows. Visualization and explainable AI analysis tools will finally be employed to convert the implicitly learned structure-property relationships into chemically interpretable knowledge. This will not least increase trust and acceptance, will provide important validation and feedback for the established AI framework, and establish transferable insight into governing principles that can be applied across a wider space of molecules and functionalities.
As a highly challenging, but equally rewarding and pressing design problem, we specifically endeavor to design high-performance photoswitchable molecules. Optimizing often contradictory performance parameters like addressability, efficiency and robustness, the corresponding development of molecular photoswitches is a non-trivial and hitherto slow empirical process that thus constitutes a prototypical application case that will heavily benefit from an AI-empowered molecular design.
Development and Application of Improved Ligand Descriptors and Representations for Inverse Catalyst Design (Däschlein-Gessner, Gensch)
This research project aims at inverse catalyst design with the help of machine learning methods. New descriptors for an improved description of the ligands and their properties will be developed and verified by experimental studies. The applicability of the new descriptors for the prediction of new ligand structures will be demonstrated in selected test reactions. Based on experimental data, machine learning methods will be used to determine ideal ligand substituents and hence to predict new catalyst structures.
Elucidating Fingerprints - Towards a Holistic Explanatory Toolbox for Molecular Machine Learning (Glorius, Jiang)
In most chemical sciences Molecular Machine Learning (MML) is on the rise and different approaches show broad applicability in academia and industry. While so far black-box models and encrypted representations dominate the field, the need is high to understand and comprehend MML. The central point of this project is the development of interpretable and explainable MML methods on a structural level and the implementation of an out-of-the-box software. To this aim, universally applicable molecular representations will be developed, adapted, and used to train highly robust but accurate models.
Starting from these models, an open-source software pipeline will be developed and employed to map feature parameters (e.g. importance) back to the molecular structure, which gives trained chemists a plain handle for molecular and reaction design. This tool can also be used to investigate and improve the underlying datasets, while methods for adaptive visualization and statistical evaluation shall lead to an excellent user experience. Finally, the goal is to extract chemical rules by means of this XAI platform, which will be validated in different labs.
Exploiting and Developing Machine Learning for Molecular Applications - Molecular Machine Learning (Glorius)
The goal of this priority programme is to develop and apply modern ML algorithms in their entire range to molecular problems. The coordination project and the underlying concepts will help to bring together the individual research groups, to foster strong and beneficial relationships and collaborations, to train and enable the doctoral students, to connect the PP with the international community and also to reach out to the general public. It will help SPP 2363 to become a success story and a scientific highlight.
Exploring Tailored Ru-Triphos Catalysts for Hydrogenation Reactions by Combination of Experimental, Computational, and Machine Learning Techniques (Bannwarth, Klankermayer)
The homogeneously catalyzed hydrogenation of carbon dioxide and different unsaturated organic compounds could already be demonstrated with multiple modified ruthenium triphos complexes. Due to the expansion of the chemical space caused by modifications of the phosphine substituents and backbone of the triphos ligand, exploring this space by means of experimental as well as quantum chemical methods quickly becomes infeasible.
In this project, modern statistical and computational models are combined with experimental evidence to construct a scaffold that captures the reactivity of tailored ruthenium triphos complexes, with the aim of predicting the reactivity of such complexes. This is achieved with machine learning methods based on descriptors from cost-efficient semiempirical electronic structure theory calculations.
In order to couple the computational methods with the experimental data, detailed investigations of tailored ruthenium triphos complexes in the hydrogenation of carbon dioxide and levulinic acid are carried out. Based on our work, a machine-learned structure-activity relationship model is developed with the aim of predicting the reactivity of previously unknown complexes on a substrate-specific basis.
Fourth-Generation Neural Network Potentials for Molecular Chemistry (Behler, Goedecker)
Machine learning potentials (MLP) have become an important tool for performing large-scale atomistic simulations with the accuracy of electronic structure methods at a small fraction of the computational costs. Most machine learning potentials, however, are local in that they rely on atomic energies and charges depending on the local atomic environments only, preventing their application to systems with non-local charge transfer, which is important in many molecular systems. This limitation can be overcome by the recently introduced fourth generation of machine learning potentials.
In this project, fourth-generation high-dimensional neural network potentials will be further developed and applied to molecular chemistry with the purpose to establish a tool for accurate simulations of chemistry in solution.
Machine Learning Approaches for Faster Discovery and Adaptation of Enzymes for Difficult Chemical Reactions. Phase I: Providing Solutions for Regioselective Oxygenations by 2OGD-oxidases (MacBioSyn) (Davari)
Biocatalytic synthesis of chemicals is considered a keystone for future green and sustainable chemistry. The use of enzymes in biocatalysis has found numerous applications in various fields as an alternative to chemical catalysis, especially to make chiral compounds for pharmaceuticals and the flavors and fragrance industry. In this regard, identifying new representatives of a large enzyme family, e.g., 2OGD oxidases, with increased substrate scope can offer a new range of biocatalytic routes to, e.g., natural products. However, a common challenge for enzyme development is the prediction of activity by exploring the large biodiversity through genome mining. Machine learning (ML) can capitalize on large and diverse enzyme datasets to predict function and activity and explore the biodiversity to identify advanced biocatalysts.
In the MacBioSyn project, we aim to develop an ML-based framework that predicts the activity of enzymes and their substrate/reaction scope. We will implement a new in silico framework to analyze enzyme sequences/substrates relationships based on ML models trained on large experimental data. In essence, our framework will provide a solution for activity/substrate scope prediction for biocatalyst discovery in general. Our synergistic approach will provide methodologies that enable the power of ML methods to accelerate the discovery of improved enzymes. The new fundamental design principles learned for 2OGD enzymes will broaden their applications in the biocatalytic production of valuable natural products and beyond.
Machine Learning for Developing and Understanding Novel, Asymmetric 3d Metal-catalyzed C–H Activations (Ackermann)
Transition metal-catalyzed C–H activation constitutes a powerful strategy for molecular syntheses. Throughout the years, C–H activation has been successfully applied to a vast number of chemical transformations mostly employing precious 4d and 5d transition metal catalysts, while the use of cost-efficient 3d metal catalysts in enantioselective C–H activation for the synthesis of valuable chiral targets of interest to inter alia medicinal chemistry remains scarce. This project seeks to implement ML in the area of asymmetric C–H activation. We will investigate the influence of chiral ligands on the reactivity and selectivity of challenging asymmetric Earth-abundant 3d metal-catalyzed C–H hydroarylation reactions. For this purpose, a diverse series of ligands will be explored in combination with non-toxic and environmentally friendly cobalt and iron sources. In this project we will make use of the synergy between experimental chemistry and ML to achieve a breakthrough in the benchmarking and development of accurate predictive models for synthetically demanding and highly relevant transformations, such as in asymmetric catalysis. This will provide in-depth insight into the properties and variables that drive and influence the reaction outcome, leading to a better understanding of the reaction mechanism and to highly efficient optimization towards ideal reaction parameters. The developed systems will prove instrumental beyond the targeted reactions and will be further applied in the development of novel reactions.
Machine Learning for Hierarchical Ultrafast Molecular Force Fields (Wenzel)
In an essentially unlimited and ever-growing number of applications, the complex dynamic environment determines the overall properties. Chemical reactions in solution and in heterogeneous catalysis depend strongly and non-trivially on the molecular environment, in particular in interaction with external stimuli, such as light. In principle, quantum mechanics nowadays offers workable approximations, mostly based on density functional theory, for many of these problems, but the timescales reachable in ab-initio quantum chemistry calculations are so short that only model systems can be adequately addressed.
Molecular mechanics methods, in particular molecular dynamics, permit the treatment of systems on timescales which are up to 100,000 times longer but often lack adequate representations of the system. There has been a decade-long effort to use machine learning methods to go beyond the manual parameterization of the force fields used in molecular mechanics simulations. This effort has demonstrated that accurate force fields can be parameterized with machine learning, but the numerical effort of these methods is still comparable to fast quantum methods, rather than to standard molecular dynamics methods.
Here we will develop a novel approach that decouples the molecular representation of the system, in terms of its stoichiometric composition, from the treatment of the molecular conformation. This enables hierarchical machine learning of molecular force fields, where a computationally efficient force field is parameterized by a complex ML protocol, which needs to be evaluated only once before the actual simulation starts. Both components, i.e. the functional form of the force field and its parameters, can be learned based on highly accurate quantum mechanical data. In this project we will demonstrate the viability of this approach for model reactions of small organic molecules in solution and for heterogeneous catalysis. In cooperation with other projects, the force fields will be applied to the growth of metal organic frameworks, excited state chemistry in molecular organic materials and battery applications. The force fields will be implemented in standard molecular dynamics codes and made available to the community. The project will participate in a community-wide effort to generate and curate training data for molecular force fields.
Machine Learning-guided Chemical Space Exploration: Automatic Creation and Navigation of Ultra-large Open-source Molecular Libraries (Kolb)
The “chemical space” formed by all drug-like molecules contains an estimated 10 to the power of 60 compounds, a number too large to ever synthesise one of each. In this project we will tackle two challenges. First, how can we discover concretely which molecules are contained in chemical space or at least a therapeutically relevant portion thereof? Second, how can we search such large spaces with protein-structure-based in silico methods?
Our strategy is based on our database of virtually synthesised compounds, SCUBIDOO, and we will develop algorithms to identify novel robust and broadly applicable chemical reactions as well as filters to increase synthesis success rates. This will substantially increase the size of publicly available easily accessible chemical space. For navigating this huge space, we will develop evolutionary algorithms that will help us identify promising ligands in an efficient way.
Moreover, we will develop a deep-learning based method in order to store the opinion of an expert about the fit of each potential ligand in a protein binding pocket. In this way, we will be able to preserve knowledge and also apply it to molecule numbers that are out of reach for a single human being. Both arms of the project together will open the door for fast and comprehensive chemical space exploration.
Molecular Descriptors in Matrix Completion Methods (Hasse, Jirasek, Leitte)
Matrix completion methods (MCMs), which are established in recommender systems, are also promising for the prediction of fluid properties of mixtures. These MCMs can be trained in a completely data-driven way on sparse mixture data, whereby they uncover structure and similarities among the components. The goal of this project is twofold. First, we will, based on a systematic analysis of mixture data using visual data analytics, explore which molecular descriptors are key for modeling mixture properties, which are mainly determined by pair-interactions. And second, we will exploit the obtained insights for extending the purely data-driven MCMs to hybrid models by incorporating the most relevant molecular descriptors in the training.
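The purely data-driven starting point can be sketched as a low-rank matrix factorization fitted only to the observed entries of a sparse matrix. The sketch below is a generic illustration, not the project's model: the function names, the learning rate, the rank and the plain stochastic gradient descent fit are all assumptions, and rows and columns are simply taken to index the two components of a binary mixture.

```python
import random


def complete_matrix(observed, n_rows, n_cols, rank=2, lr=0.01, epochs=3000, seed=0):
    """Fit U (n_rows x rank) and V (n_cols x rank) so that U[i] . V[j]
    approximates the observed entries (hypothetical helper).

    observed: dict mapping (i, j) -> measured value; all other entries are unknown.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    entries = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for (i, j), value in entries:
            err = value - sum(u * v for u, v in zip(U[i], V[j]))
            for k in range(rank):
                u_ik, v_jk = U[i][k], V[j][k]
                # Gradient step on the squared error of this single entry.
                U[i][k] += lr * err * v_jk
                V[j][k] += lr * err * u_ik
    return U, V


def predict(U, V, i, j):
    """Predicted value for the (possibly unobserved) entry (i, j)."""
    return sum(u * v for u, v in zip(U[i], V[j]))
```

The learned rows of `U` and `V` play the role of latent component features. A hybrid model in the sense described above would, under this reading, supplement or constrain them with molecular descriptors.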
Molecular Machine Learning for Asymmetric (Organo-)Catalysis (Schreiner)
Asymmetric catalysis plays a pivotal role in the synthesis of pharmaceutically active compounds. Organocatalysis aims to enable such transformations without the use of potentially toxic metals. Its biggest challenge, however, lies in the design of potent organocatalysts and prediction of activity: In most cases, the correlation between a catalyst's molecular structure and its activity is poorly understood. Thus, we will develop and apply machine learning (ML) techniques to thiourea and oligopeptide catalyst libraries: We aim to determine the most viable candidates for organocatalytic transformations of pharmaceutical interest, such as for the synthesis of anti-malarial agents. By using explainable AI techniques on our ML models, we plan to reveal the molecular features responsible for catalyst activity to move further towards the goal of true de novo catalyst design.
Multi-fidelity, Active Learning Strategies for Exciton Transfer Among Adsorbed Molecules (Kleinekathöfer, Zaspel)
New materials for photochemical applications are essential, e.g., for the further development of renewable energy devices. The development of such material is nowadays tackled by experiments and by computer-driven molecular simulations. Ideally, the full design process including material screening and optimizations could be done in-silico. This, however, requires time-efficient, high-accuracy and easy-to-use software for the analysis of photochemical properties of molecular aggregates or more precisely their excitonic properties. The long-term goal of this project is to develop methods that will make such an analysis feasible, noting that current molecular simulations by means of quantum mechanics / molecular mechanics (QM/MM) methods are prohibitively expensive.
A promising tool to overcome the computational challenges is the use of cheap-to-evaluate machine learning models that replace expensive quantum chemical calculations in the simulation pipeline. However, the practical long-term success of this tool can only be guaranteed if such machine learning models indeed achieve high-accuracy predictions at moderate costs for the generation of the quantum chemical training data and can be constructed in a (semi-)automatic way. In this project, we develop a multi-fidelity, active learning approach for exciton transfer within molecular aggregates. Multi-fidelity machine learning promises to strongly reduce the number of required highly accurate and thereby computationally expensive training samples by using hierarchies of training data obtained at different quantum chemical theory levels, basis set sizes, etc. Further technical improvements will be achieved in the automatic selection of the best possible training calculations (active learning) and the construction of bi-molecular models, i.e. machine learning models for properties that depend on two molecules.
The overall approach is applied to the analysis of a light-harvesting material based on a molecular aggregate. As an example of such an aggregate, we focus on porphyrin molecules adsorbed on clay surfaces, which have been shown experimentally to possess interesting light-harvesting properties. While this model application will certainly gain from our novel contributions, our interest is to further share our expertise and tools on multi-fidelity molecular machine learning and on QM/MM simulations within the priority program and beyond.
Neural Fingerprints as Structure and Activity-sensitive Molecular Representations (Koch, Risse)
The project aim can be summarised as the development of a robust neural network architecture for the training of generic or domain-specific neural fingerprints. These neural fingerprints can be used as structure- and activity-sensitive molecular representations for e.g. virtual screening. In addition, we will integrate Explainable Artificial Intelligence techniques that will provide a better understanding of the training of the molecular representation and that can be used to analyse the important structural features learned by the neural network. This will allow a basic interpretability of the molecular representations created. Furthermore, we will develop different databases for the training of generic and important domain-specific neural fingerprints and develop a uniform benchmark framework for evaluating and comparing neural fingerprints with respect to their functionality in virtual screening approaches.
Quantum Chemical Molecular Representations for Machine Learning (Grimme)
The project aims to develop new molecular representations for machine learning based on efficient tight-binding (TB) quantum chemistry ('quantum features') and to connect those representations to various new network architectures. The models will be applied to predict chemically relevant properties of pharmaceutical-type molecules, like conformational and tautomerization energies, pKa values, solubility or partition coefficients.
The project is supported by the science and technology company Merck with established competence in leveraging extensive chemical data.
For computing the quantum features, a new model Hamiltonian (gTB) in an extended AO basis set will be developed that is able to reproduce accurately the density matrix and various derived properties (atomic charge, shell population, bond order, dipole moment, polarizability) of a high-quality reference RSH-DFT calculation. gTB is generally applicable to the whole periodic table including organometallic systems and accounts, for the first time in a semiempirical context, for fundamental physical effects like orbital contraction and electronic polarization. The combination with possibly ML-boosted continuum solvation theories to model solvated molecules is straightforward. Further main aspects of the proposal are the optimization of the neural network architecture based on gTB features, the development of feature representations, the automated generation of molecular training data sets, and state-of-the-art multitask learning inspired by image recognition algorithms. Initially we follow a Delta-ML strategy where a correction term to a fast QC calculation (typically the established GFN-xTB or GFN-FF methods) based on the available features is computed by the network. An 'inverse' strategy is to compute properties which are very difficult to obtain with gTB (e.g., atomic forces or total energy) by ML. This entire approach is supposed to provide efficiency and accuracy for a potentially wide range of chemical properties.
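The Delta-ML strategy mentioned above can be made concrete with a small sketch. It is only an illustration under strong simplifying assumptions: a one-dimensional least-squares line stands in for the neural network, a single scalar `feature` stands in for the quantum features, and all function names are hypothetical. The model is trained on the difference between a reference value and the fast baseline value, and the prediction adds the learned correction back onto the baseline.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_delta_model(features, baseline, reference):
    """Learn the correction (reference - baseline) as a function of the feature."""
    residuals = [ref - base for ref, base in zip(reference, baseline)]
    return fit_linear(features, residuals)


def predict_delta(model, feature, baseline_value):
    """Baseline result plus the learned correction term."""
    slope, intercept = model
    return baseline_value + slope * feature + intercept
```

The appeal of this design is that the model only has to learn the typically smaller and smoother residual between the two levels of theory rather than the full property.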
SAFE: Synthetically Accessible Fragment Expansion Based on Machine Learning Approaches (Glorius, Rarey)
A key question arising today in drug discovery, materials design and also synthetic chemistry is how to precisely map the space of synthetically accessible organic compounds with reasonable effort. Large pharmaceutical companies and compound vendors have addressed this question by defining synthetically accessible fragment spaces. Since fragment spaces are reaction-pattern driven, the extension problem can be broken down to the prediction of building blocks compatible with a certain reaction and compatible with each other in a reaction. To solve this problem, expertise in chemoinformatics, reaction screening and machine learning from the Rarey and Glorius groups will be combined to identify tolerated reactants for selected reaction schemes.
In the SAFE (Synthetically Accessible Fragment Space Extensions by Machine Learning-Based Approaches) project a chemoinformatics framework and problem-specific molecular descriptors will be developed for the extraction and utilization of data on reaction schemes and reactants from fragment spaces. For the targeted improvement of the prediction performance, we aim to combine statistical analysis with new convolutional screening techniques to generate highly informative experimental data and validate our in silico approach. Software tools for transferring productive reactions into fragment space will enable the translation of reactivity predictions into synthetically accessible fragment space extensions.
Understanding the Interaction of Organic Molecules and Metal Ions by Robot-based High-throughput Experimentation and Molecular Machine Learning (Gräfe, Schubert)
The interaction of transition metal ions and organic molecules in solution will be investigated using a machine-learning approach. So far, openly reported, systematic, large-scale data on these systems are sparse, preventing an efficient use of machine-learning approaches. Within this project, we address this challenge by generating high-throughput data, both experimentally, employing modern robot-based approaches, and theoretically, by utilizing DFT calculations on fast GPU-based DFT programs.
Within this project, we will generate large amounts of data, both experimentally and theoretically. These data sets can be analyzed individually with machine-learning methods to identify correlations and, moreover, can be cross-correlated between theory and experiment.
The aim of this systematic study is to predict the interaction of an organic molecule and a metal ion by just using the chemical structure of the molecule and the sort of metal ion. These results could therefore be highly interesting for the development of new drugs, catalysts or energy conversion moieties.
Virtual Drug Screening in the Chemical Space Accessible by Chemical Synthesis (Meiler, Stadler)
Modern approaches to drug development start with the identification of a target and virtual screening of drug-like organic molecules. This entails two intertwined challenges: first, the ligand must be functional and second, it must be synthetically accessible. We propose a computational approach to virtual drug screening that combines modern techniques of machine learning for functional evaluation of molecules, rule-based modelling of chemical reactions to maintain synthetic plausibility, and an Evolutionary Algorithm-inspired optimization approach. We aim at the development of a comprehensive framework and accompanying software for virtual drug screening.
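How the three ingredients could interlock is sketched below as a toy evolutionary loop. Everything in it is a placeholder: candidates are tuples of building-block identifiers, `score` stands in for an ML-based functional evaluation, and the `allowed_blocks` list is a deliberately crude substitute for rule-based reaction modelling. None of these names comes from the project itself.

```python
import random


def evolve(score, allowed_blocks, n_slots, pop_size=20, generations=30, seed=0):
    """Toy evolutionary search over tuples of building-block ids.

    score:          callable mapping a candidate tuple to a number (higher is better).
    allowed_blocks: ids that the (hypothetical) reaction rules permit.
    n_slots:        number of building blocks per candidate, at least 2.
    """
    rng = random.Random(seed)
    population = [tuple(rng.choice(allowed_blocks) for _ in range(n_slots))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]  # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_slots)  # one-point crossover
            child = list(a[:cut] + b[cut:])
            # Mutation draws only from allowed blocks, so every child obeys the rules.
            child[rng.randrange(n_slots)] = rng.choice(allowed_blocks)
            children.append(tuple(child))
        population = parents + children
    return max(population, key=score)
```

Restricting crossover and mutation to rule-conforming building blocks means the search never leaves the synthetically plausible set, rather than filtering implausible candidates after scoring.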