MARVEL Junior Seminar — October 2017

October 12, 2017, EPFL, MED 0 1418

Welcome to the ninth MARVEL Junior Seminar on Thursday, October 12, 2017, at 12:15 pm, EPFL, room MED 0 1418.

Nicholas John Browning (EPFL, LCBC) gives a presentation on Genetic Optimization of Training Sets for Improved Machine Learning Models of Molecular Properties and Bing Huang (University of Basel) on The “DNA” of chemistry: Scalable quantum machine learning with “amons”. The seminar is facilitated by Francesco Ambrosio.

The MARVEL Junior Seminars aim to intensify interactions between the MARVEL Junior scientists belonging to different research groups located at EPFL. The EPFL community interested in MARVEL research topics is very welcome to attend. We believe that these events will be central for establishing a vibrant community.

Each seminar consists of two 25-minute presentations, each long enough to cover a scientific question in depth and each followed by 10 minutes of discussion. The discussion is facilitated and timed by the chairperson of the day, whose mission is to ensure active, lively interaction between the audience and the speakers.

Pizza is served from 11:45 in the MED hall (floor 0), and after the seminar, at 13:30, you are cordially invited for coffee and dessert to continue the discussion with the speakers.

MARVEL Junior Seminar Organizing Committee — Ariadni Boziki, Francesco Ambrosio, Davide Campi, Sandip De, Gloria Capano, Michele Pizzochero, Quang Van Nguyen, Kun-Han Lin, Francesco Maresca and Nathalie Jongen



Abstract Genetic Optimization of Training Sets for Improved Machine Learning Models of Molecular Properties - Nicholas John Browning

The training of molecular models of quantum mechanical properties based on statistical machine learning requires large data sets which exemplify the map from chemical structure to molecular property. Intelligent a priori selection of training examples is often difficult or impossible to achieve, as prior knowledge may be unavailable. Ordinarily, representative selection of training molecules from such data sets is achieved through random sampling. We use genetic algorithms to optimize the composition of training sets drawn from tens of thousands of small organic molecules. The resulting machine learning models are considerably more accurate: in the limit of small training sets, mean absolute errors for out-of-sample predictions are reduced by up to ∼75%. We discuss and present optimized training sets consisting of 10 molecular classes for all molecular properties studied. We show that these classes can be used to design improved training sets for the generation of machine learning models of the same properties in similar but unrelated molecular sets.
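As an illustration only, the idea of evolving a training set can be sketched on a toy problem. The sketch below is not the speaker's implementation: the Gaussian kernel ridge regression model, the synthetic sine data, the truncation selection, the crossover and mutation operators, and every name and parameter (`krr_mae`, `optimize_training_set`, `sigma`, `lam`, `pop_size`, and so on) are assumptions chosen for brevity.

```python
import numpy as np


def krr_mae(X_train, y_train, X_val, y_val, sigma=0.5, lam=1e-3):
    """Fit Gaussian-kernel ridge regression on a subset; return validation MAE."""
    def kernel(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    K = kernel(X_train, X_train) + lam * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train)
    pred = kernel(X_val, X_train) @ alpha
    return float(np.mean(np.abs(pred - y_val)))


def optimize_training_set(X, y, X_val, y_val, k=10, pop_size=20,
                          generations=30, mutation_rate=0.2, seed=0):
    """Evolve index subsets of size k that minimize the validation MAE."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Each individual is one candidate training set: k distinct pool indices.
    population = [rng.choice(n, size=k, replace=False) for _ in range(pop_size)]

    def error(ind):
        return krr_mae(X[ind], y[ind], X_val, y_val)

    history = []
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        # Truncation selection: the better half survives unchanged (elitism).
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            # Crossover: draw k distinct indices from the union of two parents.
            pool = np.union1d(parents[i], parents[j])
            child = rng.choice(pool, size=k, replace=False)
            # Mutation: swap one member for an index not yet in the child.
            if rng.random() < mutation_rate:
                unused = np.setdiff1d(np.arange(n), child)
                child[rng.integers(k)] = rng.choice(unused)
            children.append(child)
        population = parents + children

    population.sort(key=error)
    history.append(error(population[0]))
    return population[0], history


# Toy data (an assumption): learn sin(x) from a pool of 200 candidate points.
data_rng = np.random.default_rng(1)
X = data_rng.uniform(0.0, 2.0 * np.pi, size=(200, 1))
y = np.sin(X[:, 0])
X_val = data_rng.uniform(0.0, 2.0 * np.pi, size=(50, 1))
y_val = np.sin(X_val[:, 0])

best, history = optimize_training_set(X, y, X_val, y_val)
print("best random subset MAE:", history[0])
print("best evolved subset MAE:", history[-1])
```

Because the fitness here is measured on the same validation points that guide the search, any claim about out-of-sample accuracy would need a separate held-out test set.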

Abstract The “DNA” of chemistry: Scalable quantum machine learning with “amons” - Bing Huang

Given sufficient examples, recently introduced machine learning models enable rapid, yet accurate, predictions of properties of new molecules. Extrapolation to larger molecules with differing composition is prohibitive due to all the specific chemistries which would be required for training.
We address this problem by exploiting redundancies due to chemical similarity of repeating building blocks, each represented by effective atoms in a molecule: the “am-on”. In analogy to the DNA sequence in a gene encoding its function, the constituent amons encode a query molecule’s properties.
The use of amons affords highly accurate machine learning predictions of quantum properties of arbitrary query molecules in real time. We investigate this approach for predicting energies of various covalently and non-covalently bonded systems. After training on the few amons detected, very low prediction errors can be reached, on par with experimental uncertainty. Systems studied include two dozen large biomolecules, eleven thousand medium-sized organic molecules, large common polymers, water clusters, doped hBN sheets, bulk silicon, and Watson-Crick DNA base pairs. Conceptually, the amons extend Mendeleev’s table to account for the chemical environments of elements. They represent an important stepping stone to machine-learning-based virtual chemical space exploration campaigns.