I develop scalable, data-driven AI models that learn from large-scale biological data to decode the language of life and uncover novel fundamental principles of biological systems.
I am a Postdoctoral Researcher at the Wellcome Sanger Institute in Cambridge, UK, working with Dr. Mo Lotfollahi on scalable, generalizable foundation models for large-scale biological data. Since January 2026, I have also been a Junior Research Fellow at Wolfson College, University of Cambridge. My work spans sequence modeling (proteins, genomes) and structural modeling (3D molecular structures), with the goal of an integrated understanding across biological modalities. During my Ph.D., I studied foundation models for molecular modeling. That work formed the core of my doctoral thesis, "Research on Molecular Modeling Based on Pre-trained Models", which won the ACM Beijing Doctoral Dissertation Award.
My research builds a multi-modal, multi-scale suite of foundation models for diverse large-scale datasets and practical applications, organized around four key aspects:
Understands multi-scale molecular data (e.g. drug molecules and proteins), achieving state-of-the-art results on protein–molecule tasks.
Utilizes large-scale 3D molecular data to achieve high accuracy in molecular property prediction through robust structural understanding.
Uses edit-style pretraining to model fragment-level molecular semantics, enabling molecular property prediction and retrosynthesis inference.
Enhances language models’ ability to capture long-range dependencies, yielding strong performance on NLU and molecular classification tasks.
[MASK] Tokens in Masked Language Models. ICML 2025.
Feel free to reach out for collaboration, discussion, or to learn more about me.