Bulletin of Chinese Academy of Sciences (Chinese Version)
Keywords
artificial intelligence; large models; corpus; data bottleneck
Document Type
Policy & Management Research
Abstract
Competition in the global artificial intelligence (AI) large model industry is intensifying, and corpus resources are emerging as a critical determinant of the technical performance and practical efficacy of AI systems. China’s corpus development, however, faces challenges in both quantity and quality and is struggling to meet the escalating training demands of the rapidly evolving AI large model sector. Internationally, nations are ramping up efforts to develop their corpus infrastructures, prioritizing in particular the creation and deployment of high-quality linguistic datasets. In this context, through a comparative analysis of international benchmarks and domestic conditions, this study proposes a strategic framework for establishing a national corpus management platform. The proposal covers four dimensions: platform orientation, architectural design, governing entities, and key functional components.
First Page
522
Last Page
529
Language
Chinese
Publisher
Bulletin of Chinese Academy of Sciences
References
1 王文. 全球科技竞争进入“高科技冷战时代”. 中国科学院院刊, 2024, 39(1): 112-120. Wang W. Global technological competition enters high-tech cold war era. Bulletin of Chinese Academy of Sciences, 2024, 39(1): 112-120. (in Chinese)
2 Villalobos P, Ho A, Sevilla J, et al. Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv, 2022, doi: 10.48550/arXiv.2211.04325.
3 Villalobos P, Ho A, Sevilla J, et al. Position: Will we run out of data? Limits of LLM scaling based on human-generated data. (2024-03-02)[2025-03-06] https://openreview.net/forum?id=ViZcgDQjyG.
4 《求是》杂志评论员. 深刻认识和加快发展新质生产力. 求是, 2024, (5): 39-41. Commentator for Qiushi Magazine. Deeply understanding and accelerating the development of New Qualitative Productivity. Qiushi, 2024, (5): 39-41. (in Chinese)
Recommended Citation
LI, Xingteng; FENG, Feng; and HUANG, Liqiang (2024) "Breaking through “data bottleneck” of AI large models—Reflections on building a national corpus operation platform," Bulletin of Chinese Academy of Sciences (Chinese Version): Vol. 40: Iss. 3, Article 16.
DOI: https://doi.org/10.16418/j.issn.1000-3045.20240510001
Available at: https://bulletinofcas.researchcommons.org/journal/vol40/iss3/16
Included in
Artificial Intelligence and Robotics Commons, Defense and Security Studies Commons, Information Security Commons, Science and Technology Policy Commons