Bulletin of Chinese Academy of Sciences (Chinese Version)

Breaking through “data bottleneck” of AI large models—Reflections on building a national corpus operation platform

Xingteng LI, School of Public Affairs, Zhejiang University, Hangzhou 310058, China
Feng FENG, School of Management, University of Science and Technology of China, Hefei 230026, China
Liqiang HUANG, School of Management, Zhejiang University, Hangzhou 310058, China

Keywords

artificial intelligence; large models; corpus; data bottleneck

Abstract

Competition within the global artificial intelligence (AI) large model industry is intensifying, and corpus resources are emerging as a critical determinant of the technical performance and practical efficacy of AI systems. China's corpus development, however, faces challenges in both quantity and quality, and struggles to meet the escalating training demands of the rapidly evolving AI large model sector. Internationally, nations are ramping up efforts to develop their corpus infrastructures, prioritizing in particular the creation and deployment of high-quality linguistic datasets. Against this background, and through a comparative analysis of international benchmarks and domestic conditions, this study proposes a strategic framework for establishing a national corpus operation platform. The proposal covers four dimensions: platform orientation, architectural design, governing entities, and key functional components.

First Page

522

Last Page

529

Language

Chinese

Publisher

Bulletin of Chinese Academy of Sciences

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Recommended Citation

LI, Xingteng; FENG, Feng; and HUANG, Liqiang (2024) "Breaking through “data bottleneck” of AI large models—Reflections on building a national corpus operation platform," Bulletin of Chinese Academy of Sciences (Chinese Version): Vol. 40: Iss. 3, Article 16.
DOI: https://doi.org/10.16418/j.issn.1000-3045.20240510001
Available at: https://bulletinofcas.researchcommons.org/journal/vol40/iss3/16
