A trimodal protein language model enables advanced protein searches

  • Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

  • Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).

  • Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).

  • Van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).

  • Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).

  • Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. 40, 932–937 (2022).

  • Gane, A. et al. ProtNLM: model-based natural language protein annotation. Preprint at https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/protnlm_preprint_draft.pdf (2022).

  • Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).

  • Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 1–23 (2019).

  • Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221–227 (2013).

  • Liu, W. et al. PLMSearch: protein language model powers accurate and fast sequence search for remote homology. Nat. Commun. 15, 2775 (2024).

  • Hong, L. et al. Fast, sensitive detection of protein homologs using deep dense retrieval. Nat. Biotechnol. 43, 983–995 (2025).

  • Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

  • Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).

  • Touvron, H. et al. LLaMA 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

  • Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at https://arxiv.org/abs/2501.12948 (2025).

  • Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).

  • Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

  • Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2021).

  • Zhou, X. et al. Decoding the molecular language of proteins with Evolla. Preprint at bioRxiv https://doi.org/10.1101/2025.01.05.630192 (2025).

  • Peng, F. Z. et al. PTM-Mamba: a PTM-aware protein language model with bidirectional gated Mamba blocks. Nat. Methods 22, 945–949 (2025).

  • Su, J. et al. SaProt: protein language modeling with structure-aware vocabulary. In Proc. 12th International Conference on Learning Representations (ICLR, 2024); https://openreview.net/forum?id=6MRm3G4NiU

  • Su, J. et al. SaprotHub: making protein modeling accessible to all biologists. Preprint at bioRxiv https://doi.org/10.1101/2024.05.24.595648 (2024).

  • Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).

  • Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).

  • Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).

  • UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).

  • Koehler Leman, J. et al. Sequence–structure–function relationships in the microbial protein universe. Nat. Commun. 14, 2351 (2023).

  • Todd, A. E., Orengo, C. A. & Thornton, J. M. Evolution of protein function, from a structural perspective. Curr. Opin. Chem. Biol. 3, 548–556 (1999).

  • Douze, M. et al. The Faiss library. Preprint at https://arxiv.org/abs/2401.08281 (2024).

  • Liu, S. et al. A text-guided protein design framework. Nat. Mach. Intell. 7, 580–591 (2025).

  • Xu, M., Yuan, X., Miret, S. & Tang, J. ProtST: multi-modality learning of protein sequences and biomedical texts. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 38749–38767 (PMLR, 2023).

  • Chen, J. et al. Global marine microbial diversity and its potential in bioprospecting. Nature 633, 371–379 (2024).

  • Hu, Z. et al. Discovery and engineering of small SlugCas9 with broad targeting range and high specificity and activity. Nucleic Acids Res. 49, 4008–4019 (2021).

  • Yu, T. et al. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).

  • Kweon, J. et al. Efficient DNA base editing via an optimized DYW-like deaminase. Preprint at bioRxiv https://doi.org/10.1101/2024.05.15.594452 (2024).

  • Gherardini, P. F., Wass, M. N., Helmer-Citterich, M. & Sternberg, M. J. E. Convergent evolution of enzyme active sites is not a rare phenomenon. J. Mol. Biol. 372, 817–845 (2007).

  • Doolittle, R. F. Convergent evolution: the need to be explicit. Trends Biochem. Sci. 19, 15–18 (1994).

  • Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).

  • Pomaznoy, M., Ha, B. & Peters, B. GOnet: a tool for interactive Gene Ontology analysis. BMC Bioinformatics 19, 470 (2018).

  • Dauparas, J. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).

  • He, Y. et al. Protein language models-assisted optimization of a uracil-N-glycosylase variant enables programmable T-to-G and T-to-C base editing. Mol. Cell 84, 1257–1270 (2024).

  • Tong, H. et al. Development of deaminase-free T-to-S base editor and C-to-G base editor by engineered human uracil DNA glycosylase. Nat. Commun. 15, 4897 (2024).

  • Ye, L. et al. Glycosylase-based base editors for efficient T-to-G and C-to-G editing in mammalian cells. Nat. Biotechnol. 42, 1538–1547 (2024).

  • Cornman, A. et al. The OMG dataset: an Open MetaGenomic corpus for mixed-modality genomic language modeling. In Proc. 13th International Conference on Learning Representations (ICLR, 2025); https://openreview.net/forum?id=jlzNb1iWs3

  • Kavli, B. et al. Excision of cytosine and thymine from DNA by mutants of human uracil-DNA glycosylase. EMBO J. 15, 3442–3447 (1996).

  • Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025).

  • Burley, S. K. et al. RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Res. 47, D464–D474 (2019).

  • Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).

  • Pruitt, K. D., Tatusova, T., Brown, G. R. & Maglott, D. R. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 40, D130–D135 (2012).

  • Dai, F. et al. Toward de novo protein design from natural language. Preprint at bioRxiv https://doi.org/10.1101/2024.08.01.606258 (2024).

  • Liu, N. et al. Protein design with dynamic protein vocabulary. Preprint at https://arxiv.org/abs/2505.18966 (2025).

  • Kuang, J., Liu, N., Sun, C., Ji, T. & Wu, Y. PDFBench: a benchmark for de novo protein design from function. Preprint at https://arxiv.org/abs/2505.20346 (2025).

  • Ko, Y. S. Using ProTrek for protein binder design. Twitter https://x.com/youngsuko9/status/1865845977673834595 (2024).

  • Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1827760237194920435 (2024).

  • Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1813427191000035330 (2024).

  • Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1882642214624678193 (2025).

  • Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).

  • van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).

  • Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 4171–4186 (Association for Computational Linguistics, 2019).

  • Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  • Rasley, J., Rajbhandari, S., Ruwase, O. & He, Y. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Gupta, R. & Liu, Y.) 3505–3506 (Association for Computing Machinery, 2020).

  • Loshchilov, I. & Hutter, F. Fixing weight decay regularization in Adam. OpenReview.net https://openreview.net/forum?id=rk6qdGgCZ (2018).

  • Loshchilov, I. & Hutter, F. SGDR: stochastic gradient descent with warm restarts. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=Skq89Scxx

  • Xu, J. et al. Protein inverse folding from structure feedback. Preprint at https://arxiv.org/abs/2506.03028 (2025).

  • Enzyme Nomenclature (Nomenclature Committee of the International Union of Biochemistry and Molecular Biology, 2024); https://iubmb.qmul.ac.uk/enzyme/

  • Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).

  • Kucera, T., Oliver, C., Chen, D. & Borgwardt, K. ProteinShake: building datasets and benchmarks for deep learning on protein structures. In Advances in Neural Information Processing Systems 36 (eds Oh, A. et al.) (NeurIPS, 2023).

