Life science and biomedicine are entering the digital 3.0 era, and AI is steering the field toward faster, more accurate, safer, more economical and more inclusive development.
On the afternoon of September 26, the 2021 World Internet Conference was held in Wuzhen. At the Data and Algorithm Forum, Academician Zhang Yaqin, president of the Institute for AI Industry Research (AIR) at Tsinghua University, spoke on the theme "Artificial Intelligence Enables Life Science". He introduced the new wave of digitization and intelligence in the biological world and shared AIR's new plans for interdisciplinary work between artificial intelligence and life and health. The report was prepared jointly by President Zhang Yaqin and team members Ma Weiying, Lan Yanyan and Huang Tingting.
With advances in gene sequencing, high-throughput biological experiments, sensors and related technologies, life science and biomedicine are entering the digital 3.0 era, and digitalization and automation are accelerating. Health computing, a new model of intelligent scientific computing, belongs to the fourth research paradigm, which has artificial intelligence and data-driven methods at its core. It will greatly help people explore and solve problems of life and health.
Since the 1950s, artificial intelligence has produced many different algorithms, notably the early deep learning architectures represented by RNN, LSTM and CNN and, more recently, GANs, Transformer-based models such as BERT and GPT-3, and pre-trained models in general. In perception tasks such as speech recognition, face recognition and object classification, AI can be said to have reached human level. But large gaps remain in natural language understanding, knowledge reasoning, video semantics and generalization. In addition, there are still major challenges in algorithmic transparency, interpretability, causality, security, privacy and ethics.
There have been many recent advances in trustworthy AI computing. One example is federated learning, which is also an important research topic at Tsinghua University's Institute for AI Industry Research (AIR). There are two main schemes. Horizontal federated learning targets scenarios in which different sources hold data with the same features and the same model, and it protects the privacy of each source's same-modality data. Vertical federated learning handles sources that hold different features and different models, and it can protect the privacy of multi-modal data.
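The horizontal scheme can be illustrated with a small sketch. It assumes federated averaging (FedAvg), a common aggregation rule that the talk does not name, and it uses synthetic data; the function names, the three clients and the linear model are all hypothetical. Each client trains on its own private samples, which share the same features, and only model weights reach the server.

```python
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few gradient-descent epochs for y ~ w*x + b on one client's private data."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum((w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by each client's sample count."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def make_client_data(n, rng):
    """Synthetic private data for one client (assumption: y = 2x + 1 plus small noise)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.01)))
    return data


def train(rounds=200, seed=0):
    """Horizontal setting: same features on every client, different samples."""
    rng = random.Random(seed)
    clients = [make_client_data(n, rng) for n in (20, 50, 30)]  # hypothetical sites
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Raw data never leaves a client; only updated weights are sent back.
        updates = [local_update(global_weights, data) for data in clients]
        global_weights = federated_average(updates, [len(d) for d in clients])
    return global_weights


if __name__ == "__main__":
    print(train())
```

Keeping raw data on each client is only the first step. Shared weights can still leak information, so practical systems usually combine this loop with techniques such as secure aggregation or differential privacy, which the sketch leaves out.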
We have seen that AI is steering life and health and biomedicine toward faster, more accurate, safer, more economical and more inclusive development. Specifically, AI research in protein structure prediction, CRISPR gene editing, the research and development of antibodies, TCRs and personalized vaccines, precision medicine, AI-assisted drug design and related areas has become a strategic research frontier internationally.
Given these trends in the discipline and in industry, Tsinghua University's Institute for AI Industry Research (AIR) has set out four research directions under "AI+ life and health": "AI-enhanced personal health management and public health", "AI+ medical and life sciences", "AI-assisted drug research and development" and "AI+ gene analysis and editing".
Because this is cross-disciplinary research and application, AIR recognizes that there is a large knowledge gap between artificial intelligence and the life science and biomedical fields. Data sets, AI platforms, core algorithms and computing engines for biological computing are lacking, and interdisciplinary talent is very scarce. In response to these challenges, AIR proposed the "AI+ Life Science Breaking the Wall Plan". Its goals are to define the core frontier research tasks in AI+ life science, bridge the gap between life and health and artificial intelligence, break down barriers, promote deep integration of AI and life science, and accelerate scientific discovery.
To this end, we need to build artificial intelligence infrastructure, data platforms and core algorithm engines for life science to support cutting-edge research tasks in the field. At the same time, we will create a flagship open data set, organize algorithm challenges, build a crowd-intelligence platform for AI+ life science, cultivate interdisciplinary talent and build an industrial ecosystem.
AlphaFold2 is a classic success story for AI+ life science. Its success rests on two factors. The first is the nature of the task. Protein structure prediction can be regarded as a one-to-one mapping from sequence to three-dimensional structure, so it is a well-defined AI problem. This is exactly the goal of the Breaking the Wall Plan: to find significant research tasks in the life sciences that can be abstracted into a form suitable for AI. The second is the strength of the model. Long-term research in the life sciences has accumulated large-scale protein structure data, and the AlphaFold2 architecture makes full use of that data through a data-driven, end-to-end deep learning model. This combination of big data and deep models is the typical characteristic of the fourth paradigm. The lesson of AlphaFold2 is therefore that research in AI+ life science should attend both to breaking the wall and to the fourth paradigm.
Clearly, AlphaFold2 is just the beginning, and its success is opening a new paradigm. Accurate protein structure prediction gives life scientists an efficient computational tool and makes major AI-based discoveries in life science possible. In the future, epitope prediction for antibodies and antigens, precision therapy for tumors, and the design and optimization of TCRs and personalized vaccines will become important research topics. We expect breakthroughs under the new AI-driven computing model, and the golden age of AI+ macromolecular drugs will truly arrive.
Along the way, many new scientific challenges will arise, and they herald new computing paradigms, such as a closed-loop computing framework that integrates dry (computational) and wet (laboratory) experiments. On the one hand, AI models will become more capable through closed-loop verification and the additional data supplied by high-throughput, multi-round wet experiments. On the other hand, through active learning or reinforcement learning, AI will actively plan automated wet experiments, forming a dry-wet closed loop of verification that iteratively accelerates life science discovery and industrial application. We foresee that once the dry-wet loop is closed, life science research and the biomedical industry will usher in a new research paradigm and a new industrial model.
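The dry-wet loop described above can be sketched as a toy active learning routine. Everything here is an illustrative assumption rather than AIR's framework: `wet_lab_assay` is a hypothetical stand-in for a real experiment, the dry model is a simple nearest-neighbour predictor, and the selection rule is an optimistic score (prediction plus an uncertainty bonus).

```python
def wet_lab_assay(x):
    """Hypothetical wet experiment: scores design x; the best design is x == 14."""
    return -(x - 14) ** 2


def predict(x, measured):
    """Dry model: value of the nearest measured design, with distance as uncertainty.

    Ties in distance are broken in favour of the better-scoring neighbour.
    """
    nearest = min(measured, key=lambda m: (abs(m - x), -measured[m]))
    return measured[nearest], abs(nearest - x)


def closed_loop(candidates, rounds=5, batch_size=2, kappa=10):
    """Alternate dry planning and wet measurement for a fixed number of rounds."""
    # Seed the loop with one measured design.
    measured = {candidates[0]: wet_lab_assay(candidates[0])}
    for _ in range(rounds):
        pool = [x for x in candidates if x not in measured]
        if not pool:
            break

        def acquisition(x):
            mean, uncertainty = predict(x, measured)
            return mean + kappa * uncertainty  # optimistic score: exploit + explore

        # Dry step: the model plans the next batch of experiments.
        batch = sorted(pool, key=acquisition, reverse=True)[:batch_size]
        # Wet step: run the experiments and feed the results back to the model.
        for x in batch:
            measured[x] = wet_lab_assay(x)
    best = max(measured, key=measured.get)
    return best, measured


if __name__ == "__main__":
    best, measured = closed_loop(list(range(21)))
    print(best, len(measured))
```

The `kappa` parameter trades exploration against exploitation: a larger value sends experiments to unexplored designs, a smaller one concentrates them near the best result so far. In a real setting the nearest-neighbour predictor would be replaced by a learned model retrained after every round.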
AIR has already made initial progress in representing and predicting genetic data. Recently, the GeneBERT team led by Professor Lan Yanyan of Tsinghua University's Institute for AI Industry Research (AIR) designed a novel gene pre-training model. By constructing a two-dimensional matrix relating sequences to transcription factors, the team built a multi-modal gene pre-training model and obtained an effective representation of genetic data. In particular, the model mines the value of data in non-coding regions, which has greatly improved performance on downstream tasks: promoter prediction, transcription factor binding site prediction, and gene screening for Hirschsprung's disease. We believe that the continued, in-depth application of cutting-edge AI techniques such as pre-training will further unlock the value of genetic data, help us crack the human genetic code, and contribute to important problems such as the precision treatment of cancer.
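To make the idea of a two-dimensional matrix between sequences and transcription factors concrete, here is a minimal sketch. It is not the GeneBERT pipeline: the motif strings, the names `TF_A` and `TF_B`, the fixed region size and the use of exact motif counts are all assumptions chosen for illustration. Rows are consecutive regions of a DNA sequence, columns are transcription factors, and each entry counts motif hits in that region.

```python
def count_motif(region, motif):
    """Count (possibly overlapping) exact occurrences of motif in region."""
    width = len(motif)
    return sum(
        1 for i in range(len(region) - width + 1) if region[i:i + width] == motif
    )


def build_region_tf_matrix(sequence, motifs, region_size):
    """Return (tf_names, matrix), where matrix[r][t] is the hit count of TF t in region r."""
    regions = [
        sequence[i:i + region_size] for i in range(0, len(sequence), region_size)
    ]
    tf_names = sorted(motifs)  # fixed column order
    matrix = [
        [count_motif(region, motifs[tf]) for tf in tf_names] for region in regions
    ]
    return tf_names, matrix


if __name__ == "__main__":
    # Toy inputs: hypothetical motifs, not curated binding profiles.
    toy_motifs = {"TF_A": "TATAAA", "TF_B": "GCGC"}
    toy_sequence = "TATAAAGGGCGCAATTTATAAAGC"
    names, matrix = build_region_tf_matrix(toy_sequence, toy_motifs, region_size=8)
    print(names)
    for row in matrix:
        print(row)
```

A matrix like this gives a model a second view of the genome alongside the raw one-dimensional sequence, which is the sense in which such pre-training is multi-modal. Motifs that straddle a region boundary are missed by this simple split; a real pipeline would need to handle that case.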
In summary, we believe that the biological world is undergoing a new revolution of digitalization, automation and intelligent scientific computing. Using computational methods, namely the fourth research paradigm driven by artificial intelligence and data, to help people explore and solve problems of life and health has become an important research direction. In the future, academia and industry need to work together to move life science, biomedicine, genetic engineering and personal health from isolated, open-loop development to collaborative, closed-loop development, and to achieve faster, more accurate, safer, more economical and more inclusive innovation in life science and biomedicine. This represents a huge new opportunity for scientific development and industrial innovation in the next decade.