Associate Professor Xia Kelin is a faculty member in the School of Physical and Mathematical Sciences (SPMS) at Nanyang Technological University, Singapore. In this article, he introduces his cutting-edge research topic of using AI for molecular data analysis.
Artificial Intelligence (AI) is currently having a revolutionary impact on our world. Within the past few years, AI models have achieved unprecedented capabilities in the analysis, manipulation, and generation of many forms of data, including images, text, audio, and video. These successes are the result of a confluence of factors: the accumulation of gigantic amounts of data, ever-increasing computational power, and, not least, the development of efficient and effective AI algorithms.
AI advances are also having profound effects on scientific research. In 2007, well before the latest round of AI breakthroughs, the computer scientist Jim Gray heralded “data-driven science” as the “fourth paradigm” of science. From the start of the scientific revolution in the 16th century, scientists had sought to understand the world through successive paradigms of investigation – empirical, theoretical, and computational. Now data-driven science, supercharged by AI models, may take the scientific endeavor to the next level, fundamentally changing our everyday lives and even the nature of our society.
As an example of the promise AI holds for science, consider AlphaFold2, a scientific AI system developed by Google DeepMind. It has achieved great success in predicting the three-dimensional structures of proteins, the core of the protein-folding problem. This notoriously difficult task in computational chemistry had long impeded progress on drug design and discovery, materials discovery, and chemical synthesis. With the advent of AlphaFold2, and other AI systems like ChatGPT, these scientific fields stand on the brink of a new era.
Yet despite all this understandable excitement, there are still significant challenges to overcome. In molecular data analysis, one of the key problems AI researchers are grappling with is called molecular representation and featurization.
Molecules can be extremely complicated objects, far too complex to be passed directly into any AI model. Instead, one must construct a stripped-down mathematical representation of each molecule that reflects its structure, as well as its other quantifiable physical and chemical features (e.g., hydrophobicity, steric properties, and electronic properties). Over the years, chemists have developed a bewildering number (over 5000) of so-called “molecular descriptors,” and picking the right ones can dramatically affect the performance of a model.
Mathematicians have a special role to play in this line of research. When choosing the representation and featurization of a molecule, one inspects its properties from various mathematical viewpoints (geometry, topology, mathematical analysis, etc.). In many cases, there are powerful methods to summarize complicated features in succinct but meaningful forms, known to mathematicians as “topological invariants,” “geometric invariants,” “combinatorial invariants,” and so forth.
Unlike traditional molecular descriptors, mathematically inspired molecular descriptors can capture deeper properties of molecules, giving a critical boost to AI models.
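To make the idea of a topological invariant concrete, here is a minimal Python sketch that computes two of the simplest ones, the Betti numbers of a molecular graph: the number of connected pieces (b0) and the number of independent rings (b1). It is only an illustration and not the descriptors used in the research described in this article. The function name `betti_numbers` and the toy carbon skeletons are assumptions made for this example, and hydrogen atoms and bond orders are ignored.

```python
def betti_numbers(num_atoms, bonds):
    """Return (b0, b1) for a molecular graph.

    b0 = number of connected components, b1 = number of independent rings.
    Assumes `bonds` is a list of distinct (i, j) index pairs with i != j.
    """
    parent = list(range(num_atoms))

    def find(x):
        # Follow parent links to the representative of x's component.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    components = num_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # this bond merges two components
            components -= 1

    b0 = components
    # Euler characteristic of a graph: V - E = b0 - b1.
    b1 = len(bonds) - num_atoms + b0
    return b0, b1


# Toy carbon skeletons (illustrative only).
PROPANE_CHAIN = [(0, 1), (1, 2)]
BENZENE_RING = [(i, (i + 1) % 6) for i in range(6)]
# Naphthalene: a 10-atom perimeter plus the bond shared by the two rings.
NAPHTHALENE = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5)]

for name, n, bonds in [
    ("propane", 3, PROPANE_CHAIN),
    ("benzene", 6, BENZENE_RING),
    ("naphthalene", 10, NAPHTHALENE),
]:
    print(name, betti_numbers(n, bonds))
```

The two numbers stay the same however the molecule is drawn or how its atoms are numbered, which is what makes invariants of this kind attractive as compact features for a learning model.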
Recently, my research group has successfully implemented mathematically inspired AI models in two different areas: drug design and the analysis of perovskite materials.
In drug design, one of the key issues is the prediction of protein-ligand binding affinity, which directly determines the performance of a candidate molecule. We have developed a technique called “persistent spectral-based machine learning,” which outperforms all existing machine learning models on the most commonly used benchmark datasets in this area (the PDBbind datasets) [1].
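The general flavor of spectral features tracked across scales can be sketched in a few lines of Python. The example below is a heavily simplified stand-in, not the published PerSpect ML method: it uses only ordinary graph Laplacians built from a distance cutoff, whereas the published work is described in reference 1. The coordinates, the cutoffs, and the function names `graph_laplacian` and `spectral_features` are all hypothetical, chosen purely for illustration, and NumPy is assumed to be available.

```python
import itertools
import math

import numpy as np


def graph_laplacian(points, cutoff):
    """Laplacian of the graph joining every pair of points within `cutoff`."""
    n = len(points)
    lap = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        if math.dist(points[i], points[j]) <= cutoff:
            lap[i, j] = lap[j, i] = -1.0  # off-diagonal entry for an edge
            lap[i, i] += 1.0              # diagonal entries hold vertex degrees
            lap[j, j] += 1.0
    return lap


def spectral_features(points, cutoffs, tol=1e-8):
    """Summarize the Laplacian spectrum at each cutoff of the filtration."""
    features = []
    for cutoff in cutoffs:
        eigenvalues = np.linalg.eigvalsh(graph_laplacian(points, cutoff))
        nonzero = eigenvalues[eigenvalues > tol]
        features.append({
            "cutoff": cutoff,
            # The number of zero eigenvalues equals the number of components.
            "zero_count": int(np.sum(eigenvalues <= tol)),
            "smallest_nonzero": float(nonzero.min()) if nonzero.size else 0.0,
            "mean_nonzero": float(nonzero.mean()) if nonzero.size else 0.0,
        })
    return features


# Hypothetical toy coordinates, not real molecular data.
TOY_POINTS = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 0.0, 0.0)]
TOY_CUTOFFS = [0.5, 1.2, 1.5, 6.0]

for row in spectral_features(TOY_POINTS, TOY_CUTOFFS):
    print(row)
```

As the cutoff grows, isolated points merge into larger connected pieces and the spectrum changes with them. Collecting such summaries over many cutoffs gives a fixed-length feature vector that a standard machine learning model can consume.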
In the second project, we applied mathematically inspired AI models to a complex family of materials known as 2D halide perovskites. Our AI model was more accurate than traditional models at predicting the band gap of these materials, which is the key physical property determining their usefulness for next-generation solar cells, LEDs, and other devices [2].
Readers interested in this exciting area of research are welcome to read our research papers, listed below.
References:
1. Zhenyu Meng and Kelin Xia, “Persistent spectral–based machine learning (PerSpect ML) for protein-ligand binding affinity prediction,” Science Advances, 7 (19), eabc5329 (2021).
2. Vijay Anand, Qiang Xu, Junjie Wee, Kelin Xia, and Tze Chien Sum, “Topological feature engineering for machine learning based halide perovskite materials design,” npj Computational Materials, 8 (203) (2022).