
A research team from Arc Institute, NVIDIA, Stanford University, the University of California, Berkeley, and the University of California, San Francisco has jointly released a large-scale biological foundation model called Evo 2.
Evo 2 is currently the largest publicly available AI model for biology: the full version has 40 billion parameters and was trained on 9.3 trillion nucleotides from 128,000 species.
The model integrates the core languages of biology (DNA, RNA, and proteins) and can process sequences of up to one million nucleotides at once.
As a genomic foundation model, Evo 2 can generate complete genomes, predict the effects of mutations, and interpret noncoding DNA, with broad applications in biomolecular research, precision medicine, drug development, and synthetic biology.
1. Background of the article and research motivation
The article discussed here was published by a research team from Stanford University and Arc Institute and mainly explores a multimodal genomic foundation model called Evo.
(The team also includes scientists from institutions such as the University of California, Berkeley, with deep research backgrounds and extensive experience in genomics, bioinformatics, machine learning, and related fields.
This interdisciplinary collaboration has effectively driven the model's development!)

Technical Architecture and Methods
Evo 2 adopts a hybrid architecture, StripedHyena 2, which models sequences at single-nucleotide resolution and captures patterns in long sequences through multiple layers of convolution operators combined with rotary position embeddings.
Its training data comprises more than 128,000 genomes totaling 9.3 trillion nucleotides, which gives the model higher accuracy and versatility when processing complex biological information.
Its predecessor, Evo, aims to achieve sequence modeling and design from the molecular level to the whole-genome level through deep learning.
Evo has 7 billion parameters and can process long sequences and generate high-quality genome sequences.
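To make the rotary position embedding idea less abstract, here is a minimal sketch in plain Python. It illustrates the general technique only and is not code from Evo 2 or StripedHyena: the function names `rotary_embed` and `dot`, the `base` value of 10000, and the toy vectors are all assumptions chosen for the example. What it shows is that, after rotation, the dot product between two vectors depends only on how far apart their positions are.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair has its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors (illustrative values only).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Both pairs of positions are 3 apart, at very different absolute offsets.
score_near = dot(rotary_embed(q, 5), rotary_embed(k, 2))
score_far = dot(rotary_embed(q, 1005), rotary_embed(k, 1002))
```

Here `score_near` and `score_far` agree up to floating-point error, because both position pairs are separated by the same distance. That relative-position property is one reason rotary embeddings are a common choice in long-sequence models.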
2. Core features of the Evo model
Multimodality: Evo can process DNA, RNA, and protein sequences together as a unified information flow, which helps the model better capture the complexity of the genome.
Long-sequence processing: Evo uses a byte-level, single-nucleotide tokenizer and can handle long sequences exceeding 1 megabase (Mb), which was difficult for previous methods. (Much as large language models like Kimi can handle long texts and PDFs, this is a huge leap in capability.)
Generation and prediction tasks: Evo can not only generate high-quality genome sequences but also predict how gene mutations affect protein function and how regulatory elements such as transcription factor binding sites affect gene expression. (These abilities are learned from sequence data alone, without task-specific training.)
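As a rough illustration of two ideas from the list above, the sketch below shows byte-level tokenization (one token per nucleotide) and a common way to score a mutation with a sequence model: compare the log-likelihood of the mutated sequence with that of the original. Everything here is an assumption made for the example. `ToyNucleotideModel` is a tiny Markov model standing in for a real genomic language model, and `tokenize` and `mutation_score` are hypothetical helper names, not Evo's API.

```python
import math
from collections import defaultdict


def tokenize(seq):
    """Byte-level tokenization: one token per nucleotide, token id = ASCII byte."""
    return list(seq.upper().encode("ascii"))


class ToyNucleotideModel:
    """Order-k Markov model over byte tokens (a stand-in, not a real genomic LM)."""

    ALPHABET = tokenize("ACGT")

    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, seq):
        tokens = tokenize(seq)
        for i in range(self.order, len(tokens)):
            context = tuple(tokens[i - self.order:i])
            self.counts[context][tokens[i]] += 1

    def log_likelihood(self, seq):
        tokens = tokenize(seq)
        total = 0.0
        for i in range(self.order, len(tokens)):
            context = tuple(tokens[i - self.order:i])
            seen = self.counts.get(context, {})
            # Add-one smoothing keeps unseen nucleotides at non-zero probability.
            prob = (seen.get(tokens[i], 0) + 1) / (sum(seen.values()) + len(self.ALPHABET))
            total += math.log(prob)
        return total


def mutation_score(model, wild_type, position, alt):
    """Log-likelihood of the mutant minus the wild type (negative = less likely)."""
    mutant = wild_type[:position] + alt + wild_type[position + 1:]
    return model.log_likelihood(mutant) - model.log_likelihood(wild_type)


model = ToyNucleotideModel(order=2)
model.fit("ATGGCG" * 50)  # toy "training genome" with a strong repeat pattern
wild_type = "ATGGCGATGGCGATGGCG"
score = mutation_score(model, wild_type, position=7, alt="C")  # T -> C at index 7
```

Because the substitution breaks the repeat pattern the toy model was fitted on, `score` comes out negative, meaning the mutant looks less likely than the wild type. With a real genomic model, the same likelihood comparison is the basic idea behind predicting mutation effects without task-specific training.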
3. That may all sound a bit abstract. What can Evo actually do? The main points are as follows.
First, gene function prediction: in bacterial and phage genomes, the accuracy of Evo's gene function predictions improves as the context window grows.
Second, CRISPR-Cas system design: Evo performed strongly in synthetic biology experiments on CRISPR-Cas systems, generating diverse Cas locus samples whose differences from the training data were analyzed with profile hidden Markov models (pHMMs).
Third, multi-gene system design: Evo successfully generated multi-gene system sequences that closely resemble natural genomes in genome organization and coding density.
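Coding density, mentioned in the third point, is simply the fraction of a genome covered by genes. The sketch below shows one way to compute it from gene coordinates, merging overlapping genes so that no base is counted twice. The function name `coding_density` and the example coordinates are hypothetical; in practice the coordinates would come from a gene-prediction tool run on the generated sequence.

```python
def coding_density(genome_length, genes):
    """Fraction of the genome covered by at least one gene.

    `genes` is a list of (start, end) pairs, 0-based with `end` exclusive.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: close it and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


# Hypothetical gene coordinates on a 1,000-base sequence; the first two overlap.
genes = [(0, 300), (250, 600), (700, 900)]
density = coding_density(1000, genes)
```

In this example the merged genes cover 800 of 1,000 bases, so `density` is 0.8. Comparing such a value between generated and natural sequences is one simple check of how genome-like a generated sequence is.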

Looking ahead, Evo 2 offers a powerful tool for rapidly interpreting complex genetic information, supporting genetic engineering and the development of new treatments.
Meanwhile, the potential of Evo 2's multimodal generation capability is not limited to biology; it may also be applicable in fields such as chemistry and materials science.
