
Bulletin of Chinese Academy of Sciences (Chinese Version)

Keywords

artificial intelligence; large models; corpus; data bottleneck

Document Type

Policy & Management Research

Abstract

At present, competition within the global artificial intelligence (AI) large model industry is intensifying, and corpus resources are emerging as a critical determinant of the technical performance and practical efficacy of AI systems. Nevertheless, China’s corpus development faces dual challenges in quantity and quality, and struggles to meet the escalating training demands of the rapidly evolving AI large model sector. Internationally, nations are ramping up efforts to develop their corpus infrastructure, with particular priority on creating and deploying high-quality linguistic datasets. In this context, through comparative analysis of international benchmarks and domestic conditions, this study proposes a strategic framework for establishing a national corpus management platform. The proposal covers four dimensions: platform orientation, architectural design, governing entities, and key functional components.

First Page

522

Last Page

529

Language

Chinese

Publisher

Bulletin of Chinese Academy of Sciences

