cell2mol: a boon to the use of crystallographic repositories in molecular and materials design

This was published on September 13, 2022

Applying quantum chemistry (QC) approaches to the high-throughput screening of crystallographic data repositories could enable huge advances in molecular and materials design. The challenge, however, particularly for transition metal complexes, is to retrieve all the information needed to perform QC computations. In the paper "cell2mol: encoding chemistry to interpret crystallographic data," recently published in Nature Publishing's journal npj Computational Materials, Sergi Vela and the team led by Professor Clémence Corminboeuf, head of the Computational Molecular Design Laboratory at EPFL, present a novel solution. A fully automatic pipeline for characterizing molecular crystals, cell2mol greatly simplifies the construction of QC-ready datasets. The authors show that the code can interpret the large chemical diversity and structural complexity contained in crystallographic repositories simply by encoding the chemist's view when visualizing a crystal structure. From this perspective, the unit cell is understood as a hierarchy of molecular fragments, each with its Lewis structure and formal charges. The code, as well as reliable QC-ready databases of transition metal complexes with unmatched chemical diversity, is now available.

by Carey Sargent, EPFL, NCCR MARVEL

Crystallographic data repositories such as the Cambridge Structural Database (CSD) or the Crystallography Open Database (COD) play a key role in computational chemistry. Hosting what is likely the most diverse collection of synthesizable molecules, such databases could, if properly mined, form the basis of extensive exploration into new molecules and materials.

Ensuring the availability, interoperability and storage of this data is critical both to developing cost-efficient machine learning (ML) methods and to supporting the use of quantum chemistry (QC) computations for the high-throughput screening of molecules and materials. Mining data from crystallographic repositories is difficult, however: the structural data must be enriched to retrieve the structure, R, and the molecular charge, Q, both of which are essential to run electronic structure computations. Achieving this while preserving enough chemical diversity is a challenging task.

This prompted researchers led by Professor Clémence Corminboeuf, head of the Computational Molecular Design Laboratory at EPFL, to develop cell2mol. The software is a fully automatic pipeline for characterizing molecular crystals and also enables the construction of QC-ready datasets with large chemical diversity. The software and its core principles are described at length in the paper "cell2mol: encoding chemistry to interpret crystallographic data," recently published in Nature Publishing's journal npj Computational Materials.

Credit: Raimon Fabregat

cell2mol, available on GitHub, works by encoding the chemical concepts and rules that chemists apply when interpreting crystallographic data, so as to extract comprehensive information about the individual molecules contained in unit cells. The code first identifies the chemical species present in the crystal structure and then, for each of them, proposes a plausible Lewis structure that defines the bond network as well as the molecular and atomic formal charges.
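To make the link between a Lewis structure and formal charges concrete, here is a minimal sketch of the standard textbook rule (formal charge = valence electrons, minus nonbonding electrons, minus the sum of bond orders). It is not cell2mol's implementation: the function names, the tuple layout and the two hand-written Lewis structures are assumptions made purely for illustration.

```python
# Valence electron counts for a few main-group elements (standard values).
VALENCE_ELECTRONS = {"H": 1, "C": 4, "N": 5, "O": 6}


def formal_charge(element, nonbonding_electrons, bond_order_sum):
    """Textbook formal charge of one atom in a Lewis structure."""
    return VALENCE_ELECTRONS[element] - nonbonding_electrons - bond_order_sum


def molecular_charge(atoms):
    """Sum atomic formal charges over (element, nonbonding, bond orders) tuples."""
    return sum(formal_charge(*atom) for atom in atoms)


# Hypothetical, hand-written Lewis structures (not cell2mol output).
ammonium = [("N", 0, 4)] + [("H", 0, 1)] * 4
acetate = [
    ("C", 0, 4),                          # methyl carbon: three C-H, one C-C
    ("H", 0, 1), ("H", 0, 1), ("H", 0, 1),
    ("C", 0, 4),                          # carboxylate carbon: C-C, C=O, C-O
    ("O", 4, 2),                          # carbonyl oxygen
    ("O", 6, 1),                          # singly bonded oxygen, carries the charge
]

print("ammonium:", molecular_charge(ammonium))
print("acetate:", molecular_charge(acetate))
```

Once a bond network and electron assignment have been proposed, the molecular charge follows directly as the sum of the atomic formal charges, which is why a plausible Lewis structure is enough to recover the charge needed for a QC computation.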

The authors demonstrate that the code excels at characterizing purely organic crystals and, more critically, that it is particularly useful in characterizing crystals with transition metal complexes, which pose a bigger challenge because of their structural complexity and the multiple oxidation states (OS) of the metal ions. Indeed, determining metal OS from structural data is the subject of much research, as illustrated by oxiMACHINE, a tool developed by another NCCR MARVEL team.
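As a rough illustration of why ligand charges matter for the metal OS, the usual charge-balance bookkeeping for a mononuclear complex can be sketched as follows. This is a generic textbook calculation under the ionic counting convention, not the procedure used by cell2mol or oxiMACHINE, and the function name is hypothetical.

```python
def metal_oxidation_state(complex_charge, ligand_charges):
    """Oxidation state of a single metal center by charge balance.

    Assumes one metal atom and ligand charges that are already known;
    both are simplifying assumptions for this sketch.
    """
    return complex_charge - sum(ligand_charges)


# [Fe(CN)6]4-: six cyanide ligands, each with charge -1.
print("Fe in [Fe(CN)6]4-:", metal_oxidation_state(-4, [-1] * 6))

# [Cu(NH3)4]2+: four neutral ammonia ligands.
print("Cu in [Cu(NH3)4]2+:", metal_oxidation_state(2, [0] * 4))
```

The calculation is trivial once every ligand charge and the overall charge are known. The hard part, and the reason the problem attracts so much research, is obtaining those inputs reliably from structural data alone.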

To demonstrate its capabilities, the team evaluated the performance of cell2mol in interpreting more than 40,000 molecular crystals contained in the CSD. They found that the approach can successfully interpret about 75% of those entries, with a reliability of more than 95%. The approach is therefore capable not only of reliably assigning metal OS, but also of providing a full chemical interpretation of all molecules contained in the unit cell. Such performance enables the generation of top-down and bottom-up QC-ready databases that benefit from the unmatched chemical diversity of crystallographic data repositories, the product of decades of creativity in synthetic chemistry.

The authors went on to build a publicly available database of 31,019 interpreted crystal structures containing transition metal complexes of eight different metal ions (Cr, Mn, Fe, Co, Ni, Cu, Ru, Re), as well as a separate database of 13,819 unique constituent ligands that can be rearranged to generate billions of realistic new chemical structures. All of the content is fully searchable and can be analyzed and used with chemoinformatics software such as RDKit and SMILES-based tools.
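A back-of-the-envelope count shows how quickly a ligand library of this size reaches the billions. The sketch below simply counts unordered selections of ligands with repetition allowed. It ignores denticity, coordination geometry, isomerism, charge balance and chemical realism, so it is a naive illustration of the scale rather than the authors' estimate, and the function name is an assumption.

```python
from math import comb

N_LIGANDS = 13_819   # unique ligands reported in the article
N_METALS = 8         # Cr, Mn, Fe, Co, Ni, Cu, Ru, Re


def ligand_multisets(n_ligands, n_sites):
    """Number of unordered ligand selections with repetition (multisets)."""
    return comb(n_ligands + n_sites - 1, n_sites)


for sites in (2, 3, 4):
    per_metal = ligand_multisets(N_LIGANDS, sites)
    print(f"{sites} ligand sites: {per_metal:.3e} selections per metal, "
          f"{per_metal * N_METALS:.3e} across {N_METALS} metals")
```

Under these simplifying assumptions, two ligand sites give tens of millions of selections per metal, and three sites already exceed a billion.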

"We expect that cell2mol, with possible subsequent developments, will pave the way towards making all crystallographic repositories entirely usable for molecular and materials design purposes," the authors said in the paper. 

Reference: S. Vela, R. Laplaza, Y. Cho, and C. Corminboeuf. cell2mol: encoding chemistry to interpret crystallographic data. npj Computational Materials 8, 188 (2022).
https://doi.org/10.1038/s41524-022-00874-9

