cell2mol: a boon to the use of crystallographic repositories in molecular and materials design
by Carey Sargent, EPFL, NCCR MARVEL
Crystallographic data repositories such as the Cambridge Structural Database (CSD) or the Crystallography Open Database (COD) play a key role in computational chemistry. Hosting what is likely the most diverse collection of synthesizable molecules, such databases could, if properly mined, form the basis of extensive exploration into new molecules and materials.
Assuring the availability, interoperability and storage of this data is critical both to developing cost-efficient Machine Learning (ML) methods and to supporting the use of Quantum Chemistry (QC) computations for the high-throughput screening of molecules and materials. Mining data from crystallographic repositories is difficult, however, and the structural data must be enriched to retrieve the structure, R, and the molecular charge, Q, both of which are essential to run electronic structure computations. Achieving this while preserving enough chemical diversity is a challenging task.
This prompted researchers led by Professor Clemence Corminboeuf, head of the Computational Molecular Design Laboratory at EPFL, to develop cell2mol. The software is a fully automatic pipeline for characterizing molecular crystals that also enables the construction of QC-ready datasets with large chemical diversity. The software and its core principles are described at length in the paper "cell2mol: encoding chemistry to interpret crystallographic data," recently published in the journal npj Computational Materials.
cell2mol, available on GitHub, encodes the chemical concepts and rules that chemists apply when interpreting crystallographic data, and uses them to extract comprehensive information about the individual molecules contained in unit cells. It first identifies the chemical species present in the crystal structure and then, for each of them, proposes a plausible Lewis structure that defines the bond network as well as the molecular and atomic formal charges.
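The first of these steps, splitting a set of atoms into individual chemical species, can be pictured with a minimal sketch. It is not cell2mol's actual implementation: the radii table, the 0.4 Å tolerance and the function name `split_into_molecules` are assumptions made for this example, and periodic boundary conditions are ignored. Two atoms are treated as bonded when they sit closer than the sum of their covalent radii plus the tolerance, and each connected group of atoms is reported as one molecule.

```python
from itertools import combinations
from math import dist

# Approximate covalent radii in angstroms (illustrative values, assumed here).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def split_into_molecules(atoms, tolerance=0.4):
    """Group atoms into molecules using distance-based connectivity.

    atoms: list of (element, (x, y, z)) tuples, Cartesian coordinates in angstroms.
    Returns a list of atom-index lists, one per connected component.
    """
    parent = list(range(len(atoms)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(atoms)), 2):
        (el_i, pos_i), (el_j, pos_j) = atoms[i], atoms[j]
        cutoff = COVALENT_RADII[el_i] + COVALENT_RADII[el_j] + tolerance
        if dist(pos_i, pos_j) <= cutoff:
            # Merge the two components that this bond connects.
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(atoms)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two water molecules placed 5 angstroms apart.
atoms = [
    ("O", (0.00, 0.00, 0.0)), ("H", (0.96, 0.00, 0.0)), ("H", (-0.24, 0.93, 0.0)),
    ("O", (5.00, 0.00, 0.0)), ("H", (5.96, 0.00, 0.0)), ("H", (4.76, 0.93, 0.0)),
]
molecules = split_into_molecules(atoms)
print(molecules)
```

A real crystal would additionally require handling of periodic images, disorder and metal-ligand contacts, which is where the chemical rules encoded in cell2mol come in.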
The authors demonstrate that the code excels at characterizing purely organic crystals, and, more critically, that it is particularly useful in characterizing crystals with transition metal complexes, which pose a bigger challenge because of their structural complexity and the multiple oxidation states (OS) of the metal ions. Indeed, trying to determine metal OS from structural data is the subject of much research, as illustrated by oxiMACHINE, a tool developed by another NCCR MARVEL team.
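The link between ligand charges and the metal oxidation state can be illustrated with simple charge balance. The sketch below is an assumption-laden toy, not the procedure used by cell2mol or oxiMACHINE: the function name `metal_oxidation_state` is hypothetical, and it only covers a complex with a single metal center whose total charge and ligand charges are already known.

```python
def metal_oxidation_state(complex_charge, ligand_charges):
    """Formal oxidation state of the single metal center in a complex.

    Charge balance: complex_charge = oxidation_state + sum(ligand_charges).
    """
    return complex_charge - sum(ligand_charges)


# Hexacyanoferrate(II), [Fe(CN)6]4-: six cyanide ligands, each with charge -1.
fe_os = metal_oxidation_state(-4, [-1] * 6)
print(fe_os)
```

The hard part in practice is obtaining reliable ligand and molecular charges from the crystal structure in the first place, which is the information a Lewis-structure assignment provides.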
To demonstrate its capabilities, the team evaluated the performance of cell2mol in interpreting more than 40,000 molecular crystals contained in the CSD. They found that the approach can successfully interpret about 75% of those entries, with a reliability of more than 95%. The approach is therefore not only capable of reliably assigning metal OS, but also of providing a full chemical interpretation of all molecules contained in the unit cell. Such performance enables the generation of top-down and bottom-up QC-ready databases that benefit from the unmatched chemical diversity of crystallographic data repositories, the product of decades of creativity in synthetic chemistry.
The authors went on to build a publicly available database of 31,019 interpreted crystal structures containing transition metal complexes of eight different metal ions (Cr, Mn, Fe, Co, Ni, Cu, Ru, Re), as well as a separate database of 13,819 unique constituent ligands that can be rearranged to generate billions of realistic new chemical structures. All of the content is fully searchable and can be analyzed and used with chemoinformatics software such as RDKit and SMILES-based tools.
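A back-of-the-envelope count gives a feel for how quickly a ligand library of this size grows into a very large design space. The sketch below simply counts unordered selections of ligands with repetition allowed. It is an illustrative assumption, not the authors' enumeration: it ignores denticity, charge, geometry and isomerism, and the function name `count_ligand_sets` is hypothetical.

```python
from math import comb


def count_ligand_sets(n_ligands, n_sites):
    """Number of unordered ligand selections (repetition allowed) for n_sites positions."""
    return comb(n_ligands + n_sites - 1, n_sites)


N_LIGANDS = 13_819  # size of the ligand database reported in the article

for n_sites in (2, 3, 4):
    print(n_sites, count_ligand_sets(N_LIGANDS, n_sites))
```

Under these simplifying assumptions, selections of three ligands already number in the hundreds of billions, before the choice of metal is taken into account.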
"We expect that cell2mol, with possible subsequent developments, will pave the way towards making all crystallographic repositories entirely usable for molecular and materials design purposes," the authors said in the paper.
Reference: S. Vela, R. Laplaza, Y. Cho, and C. Corminboeuf. cell2mol: encoding chemistry to interpret crystallographic data. npj Computational Materials 8, 188 (2022).