Bulletin of Chinese Academy of Sciences (Chinese Version)
Keywords
deep learning processor chip, fundamental research, processor architecture
Abstract
Deep learning processor chips serve as an important computational foundation for the artificial intelligence industry and have therefore attracted widespread attention from the international community. The Institute of Computing Technology, Chinese Academy of Sciences took the lead in initiating fundamental research on deep learning processor chips in 2008. At that time, this research direction had not yet entered the mainstream of the international community. Its academic value and development patterns had yet to gain widespread recognition, and it faced three fundamental scientific questions concerning its significance, principles, and architecture. To address these questions, the research team carried out a systematic exploration, made initial breakthroughs in all three aspects, and constructed a new paradigm for deep learning processor chips. The team also developed Cambricon-1, the world's first deep learning processor chip, thereby completing an initial transition from theory to engineering. These efforts have contributed, in a modest but meaningful way, to the establishment and growth of deep learning processor chip research as an internationally recognized direction. The related works have been widely cited in academia and adopted in industry by nearly 2,000 institutions across 80 countries on six continents. They also predate several dedicated deep learning acceleration products, including NVIDIA's Tensor Core, Google's TPU, and Huawei's NPU. This study systematically summarizes the practical experience of, and the deeper lessons drawn from, fundamental research on deep learning processor chips from the dual perspectives of researchers and research managers. It aims to provide a useful reference for fundamental research in China seeking to advance international academic and industrial progress.
First Page
1276
Last Page
1288
Language
Chinese
Publisher
Bulletin of Chinese Academy of Sciences
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Recommended Citation
CHEN, Yunji (2026) "Deep learning processor: Road to fundamental research on AI chip," Bulletin of Chinese Academy of Sciences (Chinese Version): Vol. 41: Iss. 6, Article 21.
DOI: https://doi.org/10.3724/j.issn.1000-3045.20260410006
Available at: https://bulletinofcas.researchcommons.org/journal/vol41/iss6/21