AI Summary
This work addresses the energy-efficiency and speed bottlenecks of conventional electronic computing in high-throughput AI tasks by proposing a general matrix multiplication (GEMM) architecture based on coherent homodyne photonic integrated circuits. The system integrates a record-scale 256×256 computational array within a single reticle, using time-division multiplexing and on-chip optical fan-out to reduce the number of modulators from O(N²) to O(N). Combining wafer-scale-fabricated thin-film lithium niobate high-speed modulators with silicon/silicon-nitride photonic circuits and foundry-available packaging, the platform achieves up to 7-bit accuracy across 8×8 channels at 120 Gbaud and 6-bit accuracy across 256×100 channels at 20-128 Gbaud, delivering 1,000-6,000 TOPS of throughput at an energy efficiency of 330 TOPS/W. The system is benchmarked with the Qwen2.5-0.5B model, which generates accurate tokens.
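The modulator saving from time-division multiplexing and fan-out can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes an outer-product schedule in which, at each time step, one bank of modulators carries a column of the first matrix, a second bank carries a row of the second matrix, and every unit in the array multiplies its pair of inputs and accumulates the result. The function names and the per-unit baseline are hypothetical and chosen only for illustration.

```python
def time_multiplexed_gemm(A, B):
    """Compute C = A @ B one time step at a time (illustrative model only).

    A is N x K and B is K x M, given as lists of lists. At time step k the
    N "row" modulators carry column k of A and the M "column" modulators
    carry row k of B. Fan-out delivers each pair (i, j) to its own unit,
    which multiplies the two values and accumulates the product.
    """
    n, k_steps, m = len(A), len(B), len(B[0])
    acc = [[0.0] * m for _ in range(n)]  # one accumulator per unit
    for k in range(k_steps):             # one symbol period per step
        row_signal = [A[i][k] for i in range(n)]  # N modulators
        col_signal = B[k]                         # M modulators
        for i in range(n):
            for j in range(m):
                acc[i][j] += row_signal[i] * col_signal[j]
    return acc


def modulator_counts(n, m):
    """Return (shared, per_unit) modulator counts for an n x m array.

    shared:   n + m modulators, reused by all units through fan-out.
    per_unit: n * m modulators, a hypothetical baseline with one
              dedicated modulator per unit.
    """
    return n + m, n * m


if __name__ == "__main__":
    print(time_multiplexed_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(modulator_counts(256, 256))
```

Under these assumptions a 256×256 array needs 512 shared modulators instead of 65,536 dedicated ones, which is the O(N²) to O(N) reduction described above.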
Abstract
High-performance computing underpins modern artificial intelligence (AI), enabling foundation models, real-time inference and perception in autonomous systems, and data-intensive scientific simulations. Recent advances in quantization techniques, which use low-precision computation without degrading model accuracy, create new opportunities for analog photonic computing, which offers ultra-high clock rates and low energy consumption. Here we propose and demonstrate a coherent homodyne integrated circuit capable of general matrix multiplication (GEMM) with an aggregate throughput exceeding 1,000 TOPS (tera-operations per second), enabled by massive on-chip optical fan-out and parallelism. Time multiplexing reduces the required modulator count from $O(N^2)$ to $O(N)$, allowing dense integration of record-scale 256 $\times$ 256 homodyne units (each <0.0064 mm$^2$) within a single reticle. We employ 64 wafer-scale-fabricated thin-film lithium niobate (TFLN) transmitters (each with over 40 GHz of bandwidth and a propagation loss of 0.2 dB/cm) to encode data; these are chip-to-chip coupled to Si/SiN computing circuits (64 channels). Our system achieves up to 7-bit computational accuracy across 8 $\times$ 8 parallel channels at a record computing clock rate of 120 Gbaud, and 6-bit statistical accuracy across 256 $\times$ 100 channels at 20-128 Gbaud, representing a total throughput of 1,000-6,000 TOPS. Massive parallelism amortizes the cost of optoelectronic (OE) conversion, allowing an efficiency of 330 TOPS/W using foundry-available packaging technology. The system throughput is benchmarked with Qwen2.5 0.5-billion-parameter models, which generate accurate tokens. High throughput and energy efficiency establish a near-term pathway toward light-based accelerators for large-scale training and low-latency inference from datacenters to the edge, accelerating new models toward artificial general intelligence.
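The quoted throughput range can be checked with back-of-the-envelope arithmetic. The sketch below assumes that each homodyne unit performs one multiply-accumulate per symbol and that a multiply-accumulate counts as two operations; the abstract does not state this counting convention, so treat it as an assumption. The function name is hypothetical.

```python
OPS_PER_MAC = 2  # assumption: one multiply plus one add per unit per symbol


def throughput_tops(rows, cols, gbaud, ops_per_mac=OPS_PER_MAC):
    """Aggregate throughput in TOPS for rows x cols units clocked at gbaud.

    Operations per second = rows * cols * ops_per_mac * gbaud * 1e9.
    Dividing by 1e12 to convert to TOPS leaves a net factor of 1/1000.
    """
    return rows * cols * ops_per_mac * gbaud / 1000


if __name__ == "__main__":
    for gbaud in (20, 128):
        tops = throughput_tops(256, 100, gbaud)
        print(f"256 x 100 at {gbaud} Gbaud: {tops:,.0f} TOPS")
```

With this convention, 256 × 100 channels give about 1,024 TOPS at 20 Gbaud and about 6,554 TOPS at 128 Gbaud, in line with the stated range of 1,000-6,000 TOPS.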