A trimodal protein language model enables advanced protein searches

Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

Article
CAS
PubMed
PubMed Central

Google Scholar

Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).

Article
CAS
PubMed

Google Scholar

Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).

Article
CAS
PubMed
PubMed Central

Google Scholar

Van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).

Article
PubMed

Google Scholar

Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).

Article
CAS
PubMed

Google Scholar

Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. 40, 932–937 (2022).

Article
CAS
PubMed

Google Scholar

Gane, A. et al. ProtNLM: model-based natural language protein annotation. Preprint at https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/protnlm_preprint_draft.pdf (2022).

Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).

Article
PubMed
PubMed Central

Google Scholar

Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 1–23 (2019).

Article

Google Scholar

Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221–227 (2013).

Article
CAS
PubMed
PubMed Central

Google Scholar

Liu, W. et al. PLMSearch: protein language model powers accurate and fast sequence search for remote homology. Nat. Commun. 15, 2775 (2024).

Article
CAS
PubMed
PubMed Central

Google Scholar

Hong, L. et al. Fast, sensitive detection of protein homologs using deep dense retrieval. Nat. Biotechnol. 43, 983–995 (2025).

Article
CAS
PubMed

Google Scholar

Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).

Touvron, H. et al. LLaMA 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in llms via reinforcement learning. Preprint at https://arxiv.org/abs/2501.12948 (2025).

Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar

Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

Article
CAS
PubMed

Google Scholar

Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2021).

Article

Google Scholar

Zhou, X. et al. Decoding the molecular language of proteins with Evolla. Preprint at bioRxiv https://doi.org/10.1101/2025.01.05.630192 (2025).

Peng, F. Z. et al. PTM-Mamba: a PTM-aware protein language model with bidirectional gated Mamba blocks. Nat. Methods 22, 945–949 (2025).

Article
CAS
PubMed
PubMed Central

Google Scholar

Su, J. et al. SaProt: protein language modeling with structure-aware vocabulary. In Proc. 12th International Conference on Learning Representations (ICLR, 2024); https://openreview.net/forum?id=6MRm3G4NiU

Su, J. et al. SaprotHub: making protein modeling accessible to all biologists. Preprint at bioRxiv https://doi.org/10.1101/2024.05.24.595648 (2024).

Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).

Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).

Article
CAS

Google Scholar

Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).

Article
CAS
PubMed
PubMed Central

Google Scholar

UniProt Consortium UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).

Article

Google Scholar

Koehler Leman, J. et al. Sequence–structure–function relationships in the microbial protein universe. Nat. Commun. 14, 2351 (2023).

Article
CAS
PubMed
PubMed Central

Google Scholar

Todd, A. E., Orengo, C. A. & Thornton, J. M. Evolution of protein function, from a structural perspective. Curr. Opin. Chem. Biol. 3, 548–556 (1999).

Article
CAS
PubMed

Google Scholar

Douze, M. et al. The Faiss library. Preprint at https://arxiv.org/abs/2401.08281 (2024).

Liu, S. et al. A text-guided protein design framework. Nat. Mach. Intell. 7, 580–591 (2025).

Article

Google Scholar

Xu, M., Yuan, X., Miret, S. & Tang, J. ProtST: multi-modality learning of protein sequences and biomedical texts. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 38749–38767 (PMLR, 2023).

Chen, J. et al. Global marine microbial diversity and its potential in bioprospecting. Nature 633, 371–379 (2024).

Article
CAS
PubMed
PubMed Central

Google Scholar

Hu, Z. et al. Discovery and engineering of small SlugCas9 with broad targeting range and high specificity and activity. Nucleic Acids Res. 49, 4008–4019 (2021).

Article
CAS
PubMed
PubMed Central

Google Scholar

Yu, T. et al. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).

Article
CAS
PubMed

Google Scholar

Kweon, J. et al. Efficient DNA base editing via an optimized DYW-like deaminase. Preprint at bioRxiv https://doi.org/10.1101/2024.05.15.594452 (2024).

Gherardini, P. F., Wass, M. N., Helmer-Citterich, M. & Sternberg, M. J. E. Convergent evolution of enzyme active sites is not a rare phenomenon. J. Mol. Biol. 372, 817–845 (2007).

Article
CAS
PubMed

Google Scholar

Doolittle, R. F. Convergent evolution: the need to be explicit. Trends Biochem. Sci. 19, 15–18 (1994).

Article
CAS
PubMed

Google Scholar

Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).

Article
CAS
PubMed
PubMed Central

Google Scholar

Pomaznoy, M., Ha, B. & Peters, B. GOnet: a tool for interactive Gene Ontology analysis. BMC Bioinformatics 19, 470 (2018).

Article
CAS
PubMed
PubMed Central

Google Scholar

Dauparas, J. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar

He, Y. et al. Protein language models-assisted optimization of a uracil-N-glycosylase variant enables programmable T-to-G and T-to-C base editing. Mol. Cell 84, 1257–1270 (2024).

Article
CAS
PubMed

Google Scholar

Tong, H. et al. Development of deaminase-free T-to-S base editor and C-to-G base editor by engineered human uracil DNA glycosylase. Nat. Commun. 15, 4897 (2024).

Article
CAS
PubMed
PubMed Central

Google Scholar

Ye, L. et al. Glycosylase-based base editors for efficient T-to-G and C-to-G editing in mammalian cells. Nat. Biotechnol. 42, 1538–1547 (2024).

Article
CAS
PubMed

Google Scholar

Cornman, A. et al. The OMG dataset: an Open MetaGenomic corpus for mixed-modality genomic language modeling. In Proc. 13th International Conference on Learning Representations (ICLR, 2025); https://openreview.net/forum?id=jlzNb1iWs3

Kavli, B. et al. Excision of cytosine and thymine from DNA by mutants of human uracil-DNA glycosylase. EMBO J. 15, 3442–3447 (1996).

Article
CAS
PubMed
PubMed Central

Google Scholar

Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025).

Article
CAS
PubMed

Google Scholar

Burley, S. K. et al. RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Res. 47, D464–D474 (2019).

Article
CAS
PubMed

Google Scholar

Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).

Article
CAS
PubMed

Google Scholar

Pruitt, K. D., Tatusova, T., Brown, G. R. & Maglott, D. R. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 40, D130–D135 (2012).

Article
CAS
PubMed

Google Scholar

Dai, F. et al. Toward de novo protein design from natural language. Preprint at bioRxiv https://doi.org/10.1101/2024.08.01.606258 (2024).

Liu, N. et al. Protein design with dynamic protein vocabulary. Preprint at https://arxiv.org/abs/2505.18966 (2025).

Kuang, J., Liu, N., Sun, C., Ji, T. & Wu, Y. PDFBench: a benchmark for de novo protein design from function. Preprint at https://arxiv.org/abs/2505.20346 (2025).

Ko, Young Su. Using ProTrek for protein binder design. Twitter https://x.com/youngsuko9/status/1865845977673834595 (2024).

Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1827760237194920435 (2024).

Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1813427191000035330 (2024).

Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1882642214624678193 (2025).

Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).

Article
CAS
PubMed

Google Scholar

van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 4171–4186 (Association for Computational Linguistics, 2019).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

Article
CAS
PubMed
PubMed Central

Google Scholar

Rasley, J., Rajbhandari, S., Ruwase, O. & He, Y. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Gupta, R. & Liu, Y.) 3505–3506 (Association for Computing Machinery, 2020).

Loshchilov, I. and Hutter, F. Fixing weight decay regularization in Adam. OpenReview.net https://openreview.net/forum?id=rk6qdGgCZ (2018).

Loshchilov, I. & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=Skq89Scxx

Xu, J. et al. Protein inverse folding from structure feedback. Preprint at https://arxiv.org/abs/2506.03028 (2025).

Enzyme Nomenclature (Nomenclature Committee of the International Union of Biochemistry and Molecular Biology, 2024); https://iubmb.qmul.ac.uk/enzyme/

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).

Article
CAS
PubMed

Google Scholar

Kucera, T., Oliver, C., Chen, D., and Borgwardt, K. ProteinShake: building datasets and benchmarks for deep learning on protein structures. In Advances in Neural Information Processing Systems 36 (eds Oh, A. et al.) (NeurIPS, 2023).

Source link

Cyptea Daily News

A trimodal protein language model enables advanced protein searches

Related Articles

Leave a Reply Cancel reply

A trimodal protein language model enables advanced protein searches

Related Articles

Federal Workers’ Union Sues Administration Over Partisan Email Language – The New York Times

Delaware vs. Western Kentucky odds, spread, time: 2025 college football predictions from proven model

Have agencies violated the Hatch Act by using Democrat-blaming language online?

Leave a Reply Cancel reply