
Bulletin of Chinese Academy of Sciences (Chinese Version)

Keywords

scientific paradigm, big-data, life science

Document Type

High Ground of Science and Innovation

Abstract

Life science is evolving rapidly, driven by advances in experimental techniques and by the vast volume of biological big data, which plays an increasingly important role in research. First, biological big data is diverse and complex, encompassing genomic, epigenomic, proteomic, and other data types. These data give researchers more comprehensive information and help reveal the laws underlying life phenomena. Second, new data-driven developments and applications in the life sciences span fields such as gene editing, precision medicine, and drug development, opening unprecedented possibilities for human health and quality of life. However, life science research in the big-data era also faces challenges in data storage, sharing, and privacy protection, as well as in transforming massive data into reliable scientific discoveries. This paper briefly reviews how biological data have driven the development of the life sciences, outlines the composition, characteristics, and sources of biological big data, and discusses the common problems and challenges China faces under the new paradigm of data-driven life science research.

First page

862

Last Page

871

Language

Chinese

Publisher

Bulletin of Chinese Academy of Sciences

