Machine learning-guided directed evolution

The design of synthetic proteins with desired functions is a long-standing goal in biomolecular science, with broad applications in biochemical engineering, agriculture, medicine, and public health. Deep generative models have established a powerful new paradigm for learning sequence-function mappings and using them to guide and accelerate synthetic protein design campaigns. We have pioneered unsupervised, semi-supervised, and self-supervised deep learning architectures to guide experimental gene synthesis and assays within a virtuous design-build-test cycle. A model based on variational autoencoders exposed correlated patterns of mutations underpinning phylogeny and function in SH3 domains. It also designed synthetic proteins that bind ligand with affinities comparable to or stronger than wild type, that rescue in vivo osmosensing function in S. cerevisiae, and that diverge by as much as 30% in sequence from natural Sho1 domains. A more sophisticated model employing a dilated convolution encoder and an autoregressive WaveNet decoder enabled the design of variable-length proteins. In computational benchmarks, this model achieved predictive accuracies competitive with or superior to those of state-of-the-art large language models employing an order of magnitude more parameters, and experimental testing demonstrated the model's capacity to introduce osmosensing function into SH3 paralogs that evolved to perform alternative biological tasks.
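
To illustrate why an autoregressive decoder lends itself to variable-length design, the following minimal Python sketch samples residues one at a time until a stop token is drawn. It is a toy under stated assumptions, not the published model: the stop token `*`, the scalar latent `z`, and the function `toy_next_token_probs` are hypothetical placeholders standing in for a trained decoder that would supply p(x_i | x_<i, z).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STOP = "*"  # hypothetical end-of-sequence token (an assumption for this sketch)


def toy_next_token_probs(prefix, z):
    """Placeholder for a trained decoder p(x_i | x_<i, z).

    Hypothetical rule: the stop probability grows linearly with the prefix
    length, and the scalar latent z in [0, 1] sets the maximum length.
    """
    z = min(1.0, max(0.0, z))
    max_len = int(40 + 40 * z)
    p_stop = min(1.0, len(prefix) / max_len)
    probs = {STOP: p_stop}  # STOP is listed first on purpose
    for aa in AMINO_ACIDS:
        probs[aa] = (1.0 - p_stop) / len(AMINO_ACIDS)
    return probs


def sample_sequence(z, rng, hard_cap=500):
    """Draw residues one at a time until the stop token is sampled."""
    seq = []
    while len(seq) < hard_cap:
        probs = toy_next_token_probs(seq, z)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        token = rng.choices(tokens, weights=weights)[0]
        if token == STOP:
            break
        seq.append(token)
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        s = sample_sequence(0.5, rng)
        print(len(s), s)
```

Because the sequence ends whenever the stop token is drawn, its length is an outcome of sampling rather than a fixed input, which is the property that fixed-length decoders lack.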

We are pursuing the following projects in this theme:

  • Application to antibody CDR and T-cell receptor engineering
  • Integration of large protein language models (pLM) with latent space learning and control tag conditioning
  • Efficient training routines for large language models
  • Discrete denoising diffusion probabilistic models for sequence generation
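
As a rough illustration of the last item, the sketch below shows the forward (noising) half of a discrete diffusion model acting on a protein sequence. It assumes a uniform transition kernel, a common choice in the discrete diffusion literature but not one specified here: at each step a residue is kept with probability 1 - beta and otherwise resampled uniformly from the 20 amino acids. The function names, the noise schedule, and the example sequence are illustrative only, and the learned reverse (denoising) model is omitted.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def corrupt_step(seq, beta, rng):
    """One forward step with a uniform transition kernel (assumed here).

    Each residue is kept with probability 1 - beta; otherwise it is
    resampled uniformly from ALPHABET (which may redraw the same residue).
    """
    return "".join(
        rng.choice(ALPHABET) if rng.random() < beta else aa for aa in seq
    )


def forward_trajectory(seq, betas, rng):
    """Apply successive noising steps and return every intermediate sequence."""
    trajectory = [seq]
    for beta in betas:
        trajectory.append(corrupt_step(trajectory[-1], beta, rng))
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    start = "MKVLAAGIVGLSPEDW"  # arbitrary example sequence, not real data
    for t, s in enumerate(forward_trajectory(start, [0.1] * 5, rng)):
        print(t, s)
```

A generative model of this family would be trained to invert these steps, recovering plausible sequences from heavily corrupted ones.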

Representative Publications

– N. Praljak, X. Lian, R. Ranganathan, and A.L. Ferguson* “ProtWave-VAE: Integrating autoregressive sampling with latent-based inference for data-driven protein design” (submitted, 2023) [ https://biorxiv.org/cgi/content/short/2023.04.23.537971v1 ]

– X. Lian, N. Praljak, S. Subramanian, S. Wasinger, R. Ranganathan, and A.L. Ferguson* “Deep learning-enabled design of synthetic orthologs of a signaling protein” (submitted, 2022) [ https://doi.org/10.1101/2022.12.21.521443 ]

83.     A.L. Ferguson* and R. Ranganathan “100th Anniversary of Macromolecular Science Viewpoint: Data-driven protein design” ACS Macro. Lett. 10 327-340 (2021) [ https://dx.doi.org/10.1021/acsmacrolett.0c00885 ]

→ Invited Viewpoint article for 2020 special collection 100th Anniversary of Macromolecular Science
→ Selected for front cover art of ACS Macro. Lett. vol. 10, issue 4 (April 20, 2021)
→ Featured in editorial review M. Müller “Selection of advances in theory and simulation during the first decade of ACS Macro Letters” ACS Macro Lett. 10 1629-1635 (2021) [ https://doi.org/10.1021/acsmacrolett.1c00750 ]