Learning the grammars of molecules to build them in the laboratory
Researchers generate molecular structures using machine learning algorithms, trained on smaller datasets
Molecules occur throughout nature, and large macromolecules underlie life itself. The 20th century gave us new materials synthesized in the laboratory. We can now pursue “designer molecules”: we draw up a wish list of properties for a material (for example, a desired tensile strength together with flexibility) and seek not only to discover but also to “build” molecules that exhibit those properties. Generating molecules computationally relies on artificial intelligence (AI) and machine learning algorithms that need large datasets for training. In addition, molecules designed in this way can be difficult to synthesize. The challenge, therefore, is to overcome these shortcomings.
Now, researchers from the Massachusetts Institute of Technology (MIT) and International Business Machines (IBM) have together devised a method for computationally generating molecules that combines the power of machine learning with so-called graph grammars. This approach requires much smaller datasets (for example, around 100 samples instead of 81,000, as the researchers report) and constructs the molecules bottom-up. The group demonstrated the method on the naphthalene diisocyanate molecule in a paper that was reviewed and accepted for presentation at the International Conference on Learning Representations (ICLR 2022).
Structure generation
Artificial intelligence (AI) techniques, in particular machine learning algorithms, are widely used today to find new molecular structures. These methods need tens of thousands of samples to train their neural networks. Moreover, the molecules they design may not be physically synthesizable. Ensuring synthesizability in these methods may require incorporating chemical knowledge, and extracting that knowledge from datasets is a significant challenge.
Chemical datasets with the required properties can be very small. For example, researchers reported in 2019 that datasets for polyurethane property prediction contained only 20 samples.
Even if we overcome all these challenges, typical machine learning algorithms have another problem: we cannot explain their results. In other words, after discovering a molecule, we cannot understand how it was found. The implication is that if we change the desired properties slightly, we may have to start the search again. Explainable AI is considered one of the great challenges of contemporary AI research.
Grammars of molecules
An alternative to these deep learning methods is the use of formal grammars. A grammar, in the context of languages, provides rules for how sentences can be constructed from words. In the same way, we can design chemical grammars that specify rules for building molecules from atoms. In recent years, several research teams have built such “grammars”. While this approach is promising, it requires deep expertise in chemistry, and once the grammar is built, it is difficult to incorporate properties from datasets or to optimize for them.
Here, researchers use mathematical objects called graph grammars for this purpose.
Graph grammars
What mathematicians call graphs are networks or webs with nodes and edges between them. In this approach, a molecule is represented as a graph in which each node is a string of atoms and each edge is a chemical bond. A grammar for such structures tells us how to replace the string in a node with an entire molecular structure. Parsing a structure therefore means contracting a substructure into a single node; we repeat this until only one node remains.
The model uses machine learning techniques to learn graph grammars from datasets. The algorithm takes as input a set of molecular structures and a set of evaluation metrics (e.g., synthesizability).
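The contraction idea can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the researchers' implementation: the adjacency-dictionary representation, the `contract` function, and the node labels `X1` and `X2` are all hypothetical names chosen for illustration. Each contraction records which substructure was replaced, which is roughly the information a grammar rule would carry.

```python
from typing import Dict, Set, Tuple

Graph = Dict[str, Set[str]]  # node label -> labels of bonded neighbours


def contract(graph: Graph, group: Set[str], new_node: str) -> Tuple[Graph, dict]:
    """Replace the nodes in `group` by one node and return (new graph, toy rule)."""
    # Bonds lying entirely inside the group form the rule's right-hand side.
    inner = sorted({tuple(sorted((a, b))) for a in group for b in graph[a] if b in group})
    rule = {"lhs": new_node, "rhs_nodes": sorted(group), "rhs_bonds": inner}

    contracted: Graph = {new_node: set()}
    for node, neighbours in graph.items():
        if node in group:
            # Bonds leaving the group are re-attached to the new node.
            contracted[new_node] |= neighbours - group
        else:
            contracted[node] = {new_node if n in group else n for n in neighbours}
    return contracted, rule


# Toy input (an assumption for illustration): the heavy-atom skeleton
# of ethanol, C-C-O, with hydrogens omitted.
ethanol: Graph = {"C1": {"C2"}, "C2": {"C1", "O"}, "O": {"C2"}}

step1, rule1 = contract(ethanol, {"C2", "O"}, "X1")   # C1 - X1
step2, rule2 = contract(step1, {"C1", "X1"}, "X2")    # single node X2
```

Read in reverse, `rule2` followed by `rule1` rebuilds the original skeleton from the single node `X2`, which is the sense in which contractions yield generative rules.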
Beyond chemistry
The grammar is built from the bottom up, with each contraction producing a rule. The choice of which structures to contract is made by the learning component, a neural network that relies on chemical information. The algorithm performs several random searches simultaneously to obtain several candidate grammars. These candidates must then be evaluated, and this is done using the input metrics.
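The search-and-evaluate loop might be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than details from the paper: a uniform random choice stands in for the neural network, the searches run one after another instead of simultaneously, and `toy_metric` is a placeholder with no chemical meaning. The names `random_grammar`, `search`, and the `#k` node labels are hypothetical.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]            # node label -> bonded neighbours
Rule = Tuple[str, Tuple[str, str]]     # (new node, the pair of nodes it replaced)


def random_grammar(graph: Graph, rng: random.Random) -> List[Rule]:
    """Contract one randomly chosen bond at a time until a single node remains.

    Assumes the input graph is connected and that no atom label starts with '#'.
    """
    g = {n: set(nb) for n, nb in graph.items()}  # work on a copy
    rules: List[Rule] = []
    while len(g) > 1:
        edges = sorted({tuple(sorted((a, b))) for a in g for b in g[a]})
        a, b = rng.choice(edges)  # stand-in for the learned choice
        new = f"#{len(rules) + 1}"
        merged = (g[a] | g[b]) - {a, b}
        del g[a], g[b]
        for n in merged:
            g[n] = (g[n] - {a, b}) | {new}
        g[new] = merged
        rules.append((new, (a, b)))
    return rules


def toy_metric(rules: List[Rule]) -> int:
    """Placeholder score: number of rules that merge two original atoms."""
    return sum(1 for _, pair in rules if not any(n.startswith("#") for n in pair))


def search(graph: Graph, metric: Callable[[List[Rule]], float],
           n_searches: int = 5, seed: int = 0) -> List[Rule]:
    """Run several random searches and keep the best-scoring candidate."""
    rng = random.Random(seed)
    candidates = [random_grammar(graph, rng) for _ in range(n_searches)]
    return max(candidates, key=metric)


# Toy input: a four-node chain C-C-C-O (hydrogens omitted).
chain: Graph = {"C1": {"C2"}, "C2": {"C1", "C3"}, "C3": {"C2", "O"}, "O": {"C3"}}
best = search(chain, toy_metric)
```

In a real system the metric would score the molecules a grammar can generate, for example by synthesizability, rather than the rules themselves.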
While the method has been demonstrated for use in building molecules, the applications could be far-reaching, beyond chemistry.
(The author is a computer scientist, formerly at the Institute of Mathematical Sciences, Chennai, and currently a visiting professor at Azim Premji University, Bengaluru.)