Silicon-based optoelectronics for general-purpose matrix computation: a review

Abstract. Conventional electronic processors, which are the mainstream and almost invincible hardware for computation, are approaching their limits in both computational power and energy efficiency, especially in large-scale matrix computation. By combining electronic, photonic, and optoelectronic devices and circuits together, silicon-based optoelectronic matrix computation has been demonstrating great capabilities and feasibilities. Matrix computation is one of the few general-purpose computations that have the potential to exceed the computation performance of digital logic circuits in energy efficiency, computational power, and latency. Moreover, electronic processors also suffer from the tremendous energy consumption of the digital transceiver circuits during high-capacity data interconnections. We review the recent progress in photonic matrix computation, including matrix-vector multiplication, convolution, and multiply–accumulate operations in artificial neural networks, quantum information processing, combinatorial optimization, and compressed sensing, with particular attention paid to energy consumption. We also summarize the advantages of silicon-based optoelectronic matrix computation in data interconnections and photonic-electronic integration over conventional optical computing processors. Looking toward the future of silicon-based optoelectronic matrix computations, we believe that silicon-based optoelectronics is a promising and comprehensive platform for disruptively improving general-purpose matrix computation performance in the post-Moore’s law era.


Introduction
Silicon-based optoelectronics is a rapidly developing technology that aims to heterogeneously integrate photonic, optoelectronic, and electronic devices and circuits on a silicon substrate (photonic-electronic integration) to form a large-scale comprehensive on-chip system. 1 Since the modulation bandwidth of silicon optical modulators exceeded 1 GHz in 2004, 2 the transmission data rate has increased continuously. Owing to its advantages in manufacturing cost and mass production, silicon-based optoelectronics is becoming one of the mainstream solutions in both high-speed telecommunications and data center interconnections. 3-6 IEEE P802.3bs 400GbE 7 (and even 800 GbE and 1.6 TbE) high-speed optical transceivers have attracted wide interest from the optical communications industry.
In past decades, the artificial neural network (ANN) has become a popular model for image classification, pattern recognition, and prediction in many disciplines. Unlike neuromorphic computing (which builds neural dynamics models, mimics natural neural networks, trains the plasticity of synapses, and aims at brain-like artificial intelligence with lower energy consumption), ANN research and development follows an aggressive accuracy-driven strategy in software. Innovative ANN models, such as convolution (CONV) neural networks (AlexNet, 12 VGG, 13 ResNet, 14 etc.) and recurrent neural networks (long short-term memory 15 ), have been proposed to achieve more accurate results. Although ANN models have made revolutionary progress in artificial intelligence, the overall floating-point operations (FLOPs) of ANN models have been increasing exponentially. These high-accuracy models generally have billions (or even trillions) of parameters. Training an ANN model also requires a large number of matrix computations, which usually take several weeks in cloud data centers with large amounts of data, thereby lengthening the software development life cycle. Notably, due to the slowdown of complementary metal-oxide-semiconductor (CMOS) technology scaling, Moore's law no longer seems to apply, and the switching energy of a single transistor deviates from the law's expectations 16 [in Fig. 1(a)]. It is becoming increasingly difficult to reduce the minimum feature size of transistors and to improve single-core performance, which is limited by the clock speed and energy efficiency of digital logic circuits, whereas accuracy-driven ANN models place ever higher demands on processor performance. Figure 1(b) shows the model accuracy versus normalized energy consumption of typical ANN models. To further increase classification accuracy, model training and execution tend to consume exponentially more electricity. 17 In recent decades [in Fig. 1(c)], multicore parallel processing electronic processors, 18 including temporal architectures [e.g., the graphics processing unit (GPU) based on the single-instruction multiple-data execution model] and spatial architectures [e.g., the tensor processing unit (TPU) based on systolic arrays 21 ], have become the mainstream solutions for accelerating large-scale matrix computations in ANNs, combinatorial optimization, compressed sensing, and digital signal processing 17 [in Fig. 1(d)]. However, as suggested by Amdahl's law, 22 the overall performance gain from multicore parallelization is limited by diminishing returns; electronic processors also suffer from the tremendous energy consumption of the digital transceiver circuits during high-capacity communication with memory, storage, and peripheral hardware. In Von Neumann architecture 23 processors, data and instructions need to be sent to the processor via input/output (I/O) connections during computation. The energy consumption of the digital transmitter and receiver circuits is comparable to, or much greater than, the energy consumption of the transistors used for computation in digital logic circuits. The energy consumption of the data connections reaches the 100 fJ/bit level (depending on the length of the copper connections) and accounts for 30% to 50% of total energy use in heavy-duty matrix computations. From the perspective of electronics, the effective solution to this energy consumption is to optimize the processor architectures, reduce unnecessary data movement, enable higher-density integration, and increase the data transmission efficiency. However, electronic processors are inevitably reaching performance limitations in computational power, energy efficiency, and I/O connections. 24
Fig. 1 Development of processors for matrix computation. (a) Moore's law no longer seems applicable. 16 (b) Exponential growth of energy consumption for more accurate ANN models. 17 (c) Development trends of processors. 18 (d) Temporal and spatial architectures for multicore parallelization. 17 (e) Memristor crossbar arrays in post-Moore's law era. 19 (f) Integrated waveguide meshes for general-purpose matrix computation. 20
Since electronic processors are approaching their limits, emerging analog computation paradigms, such as in-memory computing 25 and optical computing, are being considered in the post-Moore's law era to overcome the performance bottleneck of electronic processors. 26 For example, memristor crossbar arrays [in Fig. 1(e)] are a typical example of the in-memory computing paradigm, 19 and silicon-based optoelectronic matrix computation based on integrated waveguide meshes is a typical example of optical computing [in Fig. 1(f)]. Silicon-based optoelectronic matrix computation has the following distinct characteristics and is showing great capability and feasibility, which we detail in the upcoming sections.
1. I/O connections: Conventional optical computing products based on discrete optics have been available in the past few decades, but it is difficult to effectively transmit high-capacity computational data through the I/O connections. Optical I/O connections promise to achieve lower than attojoule/bit-level data connections, and silicon-based optoelectronics will be the platform to solve this problem.
2. Lower latency: Optical signals propagate in on-chip large-scale integrated waveguide meshes at a speed close to the speed of light. Therefore, matrix multiplication can be completed on the order of picoseconds (two to three orders of magnitude faster than what electronic processors are capable of), which can dramatically reduce the algorithm time-complexity of matrix multiplication. Wideband optoelectronic devices (analog bandwidth up to tens of gigahertz, which is one order of magnitude faster than that of electronic processors) can also boost information processing, expand the capability of I/O connections, and reduce latency.
3. Lower energy consumption: Coherent detection [equivalent to a series of multiply-accumulate (MAC) operations] in integrated waveguide meshes is a thermodynamically reversible process that consumes little energy. However, the MAC operations in digital logic circuits require sufficient energy (irreversible digital computation) to switch the binary states of the transistors in digital logic circuits. 26

Recent Progress in Silicon-Based Optoelectronic Matrix Computation
Electronic processors are the mainstream and almost invincible hardware for general-purpose computation. Most of the recently published research in optical computing is easily outperformed by digital logic circuits in terms of energy efficiency, manufacturing cost, and reliability. Even a smartphone can run complex artificial intelligence applications with extremely low power consumption. 27 Unlike versatile, multipurpose electronic processors, optical computing has difficulty achieving complex and diverse functionalities by simply arranging and combining basic logical units (as digital logic circuits do with billions of transistors). Optical computing usually needs to take advantage of specific characteristics of light waves, 28 such as optical field transformation and coherent detection. By combining electronic, photonic, and optoelectronic devices and circuits on a silicon substrate, silicon-based optoelectronic matrix computation is one of the few general-purpose computations that have the potential to surpass the computation performance of digital logic circuits in terms of energy efficiency, computational power, latency, and maintainability. In this section, we review recent studies in photonic matrix computation, including matrix-vector multiplication (MVM), CONV, and MAC operations. These computations are closely interrelated: both MVM and CONV can be achieved by a series of MAC operations, and the CONV between filters and kernel can be deployed in an MVM processor (in Fig. 2).
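The relationship between CONV, MVM, and MAC described above can be illustrated with a short numerical sketch. It is not taken from any of the reviewed processors: the function names, the 5 × 5 input, and the 3 × 3 kernel are illustrative assumptions, and the "CONV" is the sliding-window product used in ANNs (no kernel flip). The sketch rearranges the input into patches so that the whole CONV becomes a single MVM, and checks the result against explicit element-wise MAC operations.

```python
import numpy as np

def conv2d_as_mvm(image, kernel):
    """Sliding-window CONV (valid region) written as one matrix-vector product."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    # Each row of `patches` is one flattened kh x kw window of the image.
    patches = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            patches[i * ow + j] = image[i:i + kh, j:j + kw].ravel()
    # One MVM: every output element is a dot product of kh * kw MACs.
    return (patches @ kernel.ravel()).reshape(oh, ow)

def conv2d_as_macs(image, kernel):
    """Same result computed with explicit element-wise MAC operations."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += image[i + u, j + v] * kernel[u, v]  # one MAC
            out[i, j] = acc
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((5, 5))    # hypothetical input
kernel = rng.standard_normal((3, 3))   # hypothetical 3 x 3 kernel
mvm_result = conv2d_as_mvm(image, kernel)
mac_result = conv2d_as_macs(image, kernel)
```

In this arrangement the kernel plays the role of the input vector and the patch matrix plays the role of the weight matrix; the opposite assignment is equally possible and depends on which operand the hardware can reprogram faster.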

Integrated Waveguide Meshes for MVM
Although it is now possible to achieve quantum information processing (QIP) with up to tens of qubits, there are still some inconveniences (e.g., a large amount of space, many discrete optical components, and low operating temperatures) in applying large-scale unitary transformations to quantum states (i.e., programming). In 2007, the first photonic integrated two-qubit control-NOT (CNOT) gate [in Fig. 3(b)] for QIP was demonstrated on a silicon chip. 30 Compared with the bulk-optical setup [in Fig. 3(a)], 29 integrated waveguide meshes are more robust for practical applications. Photonic QIP chips have made great progress and are widely employed in quantum encrypted communication, 41 quantum teleportation, and quantum … Meanwhile, with the development of silicon-based optoelectronics technology, programmable linear processors have been developed for further applications, such as tunable filters, microwave photonics, and all-optical switching. 45 For example, optical switching networks [in Fig. 3(f)] with different kinds of network topologies can be realized with programmable linear processors in data centers. 34 In addition, a hexagonal cell chip [in Fig. 3(g)] for implementing arbitrary unitary transformations and signal processing, 35,46 a microwave photonic signal processor [in Fig. 3(h)] for continuous radiofrequency filtering and processing, and a self-configuring linear processor [in Fig. 3(i)] with feedback from in-circuit optical power monitors 47-50 have also been reported.
Theoretically, the transfer matrix of a lossless integrated waveguide mesh is a unitary matrix, 51,52 so the mesh can perform unitary MVM. 53 In recent years, integrated waveguide meshes have become a feasible architecture for large-scale general-purpose MVM computation in the post-Moore's law era. The first ANN proof-of-concept experiment was performed on a mesh of 56 cascaded programmable Mach-Zehnder interferometers (MZIs) in 2017 [in Fig. 3(j)]; because of the limited hardware capability, only a simple vowel recognition task was demonstrated. In 2020, large-scale 64 × 64 photonic-electronic copackaged MVM processors [in Fig. 3(k)] were reported by Lightmatter, achieving 99% accuracy in ResNet-50 ImageNet classification. 37 To enhance the energy consumption advantage of the matrix computation, the matrix configuration of the photonic MVM processor has recently been scaled up to 256 × 256, which also brings about precision problems in computation. Fabrication inconsistency in minuscule MZI devices results in the accumulation of computation errors. Mesh optimizations have been suggested to reduce these errors and obtain more accurate results. For example, the FFTNet architecture [in Fig. 3(l)] 39 and a redundant architecture [in Fig. 3(m)] 40 have both been numerically investigated to enhance robustness and overcome fabrication imperfections.
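The statement that a lossless mesh implements a unitary transfer matrix can be checked numerically. The following is a minimal sketch under stated assumptions: an idealized MZI model (two ideal 50:50 couplers and two phase shifters), a simple alternating layout of MZIs, and hypothetical helper names (`mzi`, `embed`, `random_mesh`). It is not the layout of any specific processor reviewed here. The last lines illustrate one common way to handle a general real matrix, namely a singular value decomposition into two unitary stages and a diagonal scaling stage.

```python
import numpy as np

def phase(a):
    """Phase shifter acting on the upper arm only."""
    return np.diag([np.exp(1j * a), 1.0])

def mzi(theta, phi):
    """Ideal lossless MZI: phase shifter, coupler, phase shifter, coupler."""
    coupler = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)  # ideal 50:50 coupler
    return coupler @ phase(theta) @ coupler @ phase(phi)

def embed(block, n, k):
    """Place a 2 x 2 block on waveguides k and k + 1 of an n-port mesh."""
    t = np.eye(n, dtype=complex)
    t[k:k + 2, k:k + 2] = block
    return t

def random_mesh(n, rng):
    """n columns of MZIs acting on alternating neighboring waveguide pairs."""
    u = np.eye(n, dtype=complex)
    for col in range(n):
        for k in range(col % 2, n - 1, 2):
            theta, phi = rng.uniform(0, 2 * np.pi, size=2)
            u = embed(mzi(theta, phi), n, k) @ u
    return u

rng = np.random.default_rng(1)
n = 4
U = random_mesh(n, rng)                   # transfer matrix of the mesh
x = rng.standard_normal(n)                # input optical amplitudes

# A general real matrix M split into unitary, diagonal, and unitary stages.
M = rng.standard_normal((n, n))
left, sing, right_h = np.linalg.svd(M)
y_staged = left @ (sing * (right_h @ x))  # three stages applied in sequence
```

Because every block is unitary, the product `U` is unitary as well, so the total optical power of the input vector is preserved at the output in this idealized, lossless model.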

Multiple Light Source MVM
In addition to single light source schemes with a single coherent laser source, multiple light source schemes (implemented with an optical frequency comb or multiple-wavelength laser arrays) have been proposed in recently published studies. These are emerging methods for improving the signal-to-noise ratio of light energy and avoiding the influence of laser phase jitter. For example, in Fig. 4(a), a microring modulator MVM scheme with $8 \times 10^{7}$ MAC/s computational power was experimentally demonstrated at a clock rate of 10 MHz. 54 Figure 4(b) shows the on-chip photorefractive interaction scheme, in which the input optical signals are sent to the photorefractive interaction region and then diffract on the photorefractive grating to perform the MVM operation. 55,56 Recently, like memristor crossbar arrays, photonic TPUs and photonic crossbar arrays have become hot topics; both are potential architectures for multiple light source MVM computations. For example, Fig. 4(c) shows a simulated photonic tensor core constituted by 16 fundamental dot product engines (each performing row-by-column pointwise MAC) to perform MVM. 57 Figure 4(d) shows a phase change material (PCM) assisted photonic crossbar array, in which the matrix elements are inscribed in the state of the PCM patches on the waveguides; with the laser array providing the vector input, the MVM is then performed. 58 However, to obtain higher energy efficiency than electronic processors, the scale of the matrix configuration must be sufficiently large. In our opinion, the multiple light source MVM scheme may face difficult challenges in scaling up the matrix configuration due to the lossy photonic crossbar.
Fig. 3 … 30 (c) Programmable quantum processor in 2016. 31 (d) Large-scale photonic processor for arbitrary two-qubit operations. 32 (e) Large-scale photonic processor for multidimensional quantum entanglement. 33 (f) Schematic of optical switch topologies in the data center. 34 (g) Reconfigurable hexagonal mesh for programmable signal processing. 35 (h) Photonic "FPGA" for programmable radiofrequency signal processing. 36 (i) Self-configuring 4 × 4-port linear processor. (j) First optical computing processor for vowel recognition. 20 (k) Large scale 64 × 64 MVM processor. 37,38 (l) FFTNet architecture for better fault tolerance against imprecise components. 39 (m) Redundant architecture to overcome fabrication errors. 40
… can offer four to five orders of magnitude of enhanced processing speed due to the minuscule footprint of the device. 61 Moreover, Fig. 5(e) shows an on-chip Cooley-Tukey method FT that executes the CONV on a timescale as short as tens of picoseconds. 62 Once the FT device is realized, the "4f" CONV system can be realized by cascading two photonic FT devices with a phase and amplitude filter mask in between. The limitation of FT-based CONV is that FT devices typically take up a large on-chip area due to the need for free-space propagation. Furthermore, the insertion loss of the FT-based CONV also restricts the matrix configuration and the overall energy efficiency.
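The "4f" principle mentioned above rests on the convolution theorem: a Fourier transform, a multiplicative mask in the Fourier plane, and a second transform together yield a convolution. The sketch below is a discrete, one-dimensional numerical analog under simplifying assumptions (circular boundary conditions, and an inverse FFT standing in for the second transform stage, whereas a physical 4f system applies a second forward transform and produces a spatially inverted output). The function names are hypothetical.

```python
import numpy as np

def conv_4f(signal, kernel):
    """Circular convolution as FT -> Fourier-plane mask -> inverse FT."""
    n = len(signal)
    mask = np.fft.fft(kernel, n)         # phase and amplitude filter mask
    spectrum = np.fft.fft(signal)        # first FT stage
    return np.fft.ifft(spectrum * mask)  # second FT stage

def conv_direct(signal, kernel):
    """Reference circular convolution computed with explicit MACs."""
    n = len(signal)
    out = np.zeros(n, dtype=complex)
    for i in range(n):
        for j, w in enumerate(kernel):
            out[i] += w * signal[(i - j) % n]
    return out

rng = np.random.default_rng(3)
signal = rng.standard_normal(16)          # hypothetical input samples
kernel = np.array([0.25, 0.5, 0.25])      # hypothetical smoothing kernel
via_ft = conv_4f(signal, kernel)
via_mac = conv_direct(signal, kernel)
```

The mask only has to be computed once per kernel, which mirrors the optical case where the filter mask is static while the input data stream changes.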

Element-wise MAC
Theoretically, both MVM and CONV can be realized by a series of element-wise MAC operations. For example, a basic 3 × 3 CONV [in Fig. 6(a)] is equivalent to nine MAC operations or 18 FLOPs. In digital logic circuits, the matrix computation is sequentially triggered by input clock signals (commonly <5 GHz). Generally, element-wise MAC schemes aim to employ wideband optoelectronic devices (e.g., microring modulators or Mach-Zehnder modulators) to achieve higher speeds of up to tens of gigahertz. Although optoelectronic devices generally consume more energy than digital logic circuits, higher-speed MAC operations can break through the limited clock rates of electronic processors, thereby improving the "single-core performance" of computations and reducing the latency. Alternatively, analog element-wise MAC operations can use a small number of photons to break the energy limit of the digital computation paradigm.
Element-wise MAC operations can be realized by balanced homodyne detection, microring modulators, and cascaded modulator arrays. For balanced homodyne detection [in Fig. 6(b)], input data are optically fanned out to channels, and each detector functions as a photoelectric multiplier, calculating the homodyne product and accumulating the multiplication results. The theoretical equivalent energy consumption of analog MAC can break through Landauer's limit in the digital paradigm (∼2.7 aJ/MAC at 300 K) and reach as low as the 50 zJ/MAC level. 63 For the cascaded acousto-optic modulator arrays [in Fig. 6(c)], with high-linearity (∼30 dBc signal-to-distortion ratio) acousto-optic modulation, the Fashion-MNIST classification task was performed with an accuracy similar to that of a 64-bit computer, at a modulation speed below the megahertz level. 64 Although the acousto-optic modulation is limited in bandwidth, the microring scheme with electro-optic
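The role of the detector as a "photoelectric multiplier" in balanced homodyne detection can be made concrete with a few lines of arithmetic. The sketch below assumes real-valued, noiseless field amplitudes and an ideal 50:50 beam splitter; the function name `homodyne_mac` is hypothetical. The difference between the two detected intensities is proportional to the product of the two input amplitudes, and integrating that difference over time accumulates the dot product.

```python
import numpy as np

def homodyne_mac(x, w):
    """Accumulate sum(x_i * w_i) from the difference of two detected intensities."""
    charge = 0.0
    for xi, wi in zip(x, w):              # one time step per vector element
        plus = (xi + wi) / np.sqrt(2)     # first output port of the beam splitter
        minus = (xi - wi) / np.sqrt(2)    # second output port
        # Balanced detection: |plus|^2 - |minus|^2 = 2 * x_i * w_i
        charge += abs(plus) ** 2 - abs(minus) ** 2
    return charge / 2

rng = np.random.default_rng(4)
x = rng.standard_normal(9)   # e.g., the nine inputs of a 3 x 3 CONV window
w = rng.standard_normal(9)   # the nine kernel weights
result = homodyne_mac(x, w)
```

Signed values come for free in this scheme because the sign is carried by the relative phase of the two fields, while the subtraction of the two photocurrents removes the common intensity terms.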

Dispersion-based MAC
Photonic matrix computation can be achieved by dispersion-based MAC, in which the dispersion manipulation is usually conducted with linearly dispersive photonic waveguides (or optical fibers) and a broad-spectrum laser source (or ultrashort laser pulses). For example, the reconfigurable time-wavelength plane manipulation scheme [in Fig. 7(a)] was proposed by employing a 1.1-km long dispersion fiber and an 18-GHz FSR optical frequency comb to realize 118 GigaMAC/s MAC operation, which is equivalent to 2.69 GigaMAC/s 4 × 4 MVM and 0.5 GigaMAC/s 32 × 32 CONV operation. 66 Similarly, a photonic perceptron [in Fig. 7(b)], built with a 13-km spool of standard single-mode fiber and 49-wavelength, 49-GHz FSR soliton crystal microcombs, was proposed to achieve 11.9 GigaFLOPs (8 bit/FLOP) MAC operations. 67 Figure 7(c) shows a time-stretch architecture employing mode-locked ultrashort pulses: the ultrashort pulses are first broadened by a dispersive fiber, then modulated, and finally compressed by another fiber with the opposite dispersion to realize the MAC operations. 68 Figure 7(d) shows a temporal MAC operation with a potential bandwidth of 400 GHz, implemented in a Hydex spiral waveguide with linear group dispersion; with a dispersive photonic waveguide, temporal CONV can be realized with 200-ps operating time and 300-fs resolution. 69
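The common idea behind these dispersion-based schemes can be sketched as a time-wavelength interleaving model. The assumptions are loud ones: the same data stream is copied onto every wavelength channel, each channel carries one weight, the dispersive medium delays channel k by exactly k symbol slots, and a single photodetector sums all wavelengths in each slot. Weights and data are taken as non-negative intensities. The function name `dispersive_mac` and the numbers are illustrative only.

```python
import numpy as np

def dispersive_mac(x, w):
    """Time-wavelength plane model: weight per channel, delay per channel, sum."""
    n, k = len(x), len(w)
    plane = np.zeros((k, n + k - 1))          # rows: wavelengths, columns: time slots
    for ch in range(k):
        plane[ch, ch:ch + n] = w[ch] * x      # weight, then delay by ch slots
    return plane.sum(axis=0)                  # photodetector sums over wavelengths

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # hypothetical input symbols
w = np.array([0.2, 0.5, 0.3])                 # hypothetical channel weights
detected = dispersive_mac(x, w)
reference = np.convolve(x, w)                 # sliding MAC computed digitally
```

Each detected time slot is one completed dot product between the weights and a window of the input, so the MAC rate in this model is set by the symbol rate times the number of wavelength channels.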

Optical Interconnections in Computation Hardware
In Von Neumann architecture processors, the memory-processor interconnections are one of the major factors influencing the overall performance [in Fig. 8(a)]. 70 Large-scale matrix multiplication is decomposed into small-scale matrix multiplications while processing [in Fig. 8(b)], and memory circuits are always needed for storing temporary results. 71 However, in past decades, a growing gap between processor performance and memory bandwidth, i.e., the "memory wall" problem, has hindered high-performance computation. 73 Electronic processors also suffer from the tremendous energy consumption of digital transceiver circuits during massive data I/O connections. Increasing the memory-processor bandwidth and the energy efficiency of interconnections is an effective way to alleviate the data movement problem. 74 For example, in cloud data centers where GPUs are the mainstream hardware for ANN acceleration, NVIDIA developed NVLink connections to increase the interface bandwidth of GPU interconnections (up to Tb/s). 75 However, when the bandwidth exceeds 10 Tb/s, the energy budget of electrical interconnections becomes unacceptably large. 76-78 Instead of power-hungry electronic transceiver circuits, on-chip optical transceivers are good alternatives for low-energy-budget interconnections and for boosting the data movement among the processors, memory, and peripheral hardware 72 [in Fig. 8(c)]. Therefore, it is necessary to heterogeneously integrate photonic, optoelectronic, and electronic devices and circuits on the silicon substrate. Recently, many studies have been performed to realize optical interconnections on a photonic-electronic integrated platform. For example, in 2015, photonic-electronic integration was demonstrated on a silicon chip that integrated over 70 million transistors and 850 photonic components working together to demonstrate aggregated 55 Gb/s memory bandwidth interconnections. 79 In 2018, an optimized (with polycrystalline silicon) monolithic photonic-electronic integrated system on a silicon chip led to potential bandwidth densities of >2 (Tb/s)/mm$^2$, and the total electrical energy consumptions of the optical transmitter and receiver were 100 and 500 fJ/bit, respectively. 80

Photonic-Electronic Integration
Optical computing has a history of nearly 70 years, 81 and optical computing products and studies with bulk optical systems have shown great potential in matrix computations. For example, Fig. 9(a) shows an MVM processor released in 2003 with a 256 × 256 pixel resolution spatial light modulator (SLM), which has 8000 GigaMAC/s equivalent computational power; 82 Fig. 9(b) shows a bulk-optical 4f system that could be adapted to implement CONV by placing a phase mask in the Fourier plane; 83 Fig. 9(c) shows a diffractive deep neural network for image classification with 3D-printed multilayer phase plates. 84 However, optical computing processors are not as widely used as electronic processors. Bulk optical systems have many problems in terms of computation precision, maintainability, and mass production. For example, when a bulk-optical system encounters slight vibration, the optical components may become misaligned and cause devastating problems in computation precision.
Silicon-based optoelectronics is a photonic-electronic integrated platform that avoids the inconvenience of discrete optics; mature CMOS manufacturing processes and packaging enable mass production, which is an advantage that conventional optical computing does not have. Photonic-electronic copackaging [in Fig. 9(d)] is an emerging technology for building comprehensive computation hardware systems and enhancing the interaction between the photonic core and electronic application-specific integrated circuits (ASICs). 34 For example, a large-scale 64 × 64 copackaged MVM processor has been reported with 14-nm process ASIC chips and a 90-nm process photonic core, which achieves high-performance MVM computing. 37,38 By exploiting the advantages of light in linear matrix computations, the photonic core excels at disruptively improving the computing performance, while the electronic circuits are necessary for the remaining functions, such as driver circuits, arithmetic and logic, data storage, and nonlinear activation functions. In addition, thousands of on-chip photonic and optoelectronic devices, such as optical modulators, photodetectors, and MZIs, need to be precisely controlled and assisted by electronic circuits, such as modulator drivers, trans-impedance amplifiers (TIAs), serial-parallel converters, and analog-digital converters.

Larger Scale General Purpose Matrix Computation
When encountering a computation problem, before considering optical computing, one should ask whether it will be faster, more economical, more energy-efficient, or more reliable than using existing electronic processors or designing new dedicated digital logic circuits. Silicon-based optoelectronic matrix computation processors must compete directly with their rivals, namely multicore electronic processors (such as existing GPUs, TPUs, and ASICs). 85,86 By combining electronic, photonic, and optoelectronic devices and circuits, silicon-based optoelectronic matrix computation is one of the few general-purpose computations that have the potential to surpass the computation performance of electronic processors. In digital electronic MVM processors, energy consumption increases in proportion to the area of the matrix (i.e., the total number of elements in the matrix) as the matrix configuration grows, and the energy consumption per MAC with FP16 or bfloat16 precision is about $10^{-12}$ J [in Fig. 10(b)]. 87 In contrast, a distinctive feature of photonic matrix computation is that it is realized by thermodynamically reversible coherent detection, which in principle consumes almost no energy, so the total energy consumption of the matrix computation processor is merely proportional to the matrix side length (i.e., the number of elements in a column/row, or the number of input optical modulators).
Fig. 8 Interconnections in processors, memory, and peripheral hardware. (a) The memory-processor interconnections are one of the major factors influencing the overall performance and the memory wall problem that has hindered high-performance computing. 70 (b) Large-scale matrix multiplication is decomposed into small-scale matrix multiplications while processing. 71 (c) On-chip optical transceivers are good alternatives for low-energy-budget interconnections and boosting the data movement among the computation hardware. 72
Because the total energy scales with the side length while the number of MAC operations scales with the matrix area, the equivalent energy consumption per MAC operation decreases in inverse proportion to the side length. 63,88,89 With a larger matrix configuration, the advantages in total computational power, energy efficiency, and latency are further enhanced. Although thermally maintaining the static photonic matrix consumes additional energy, this static energy consumption can be mitigated; e.g., silicon substrate removal can improve the thermal modulation efficiency and reduce the static energy consumption. Considering the entire computation system, including photonic, optoelectronic, and electronic devices and circuits, some empirical evaluation results indicate that silicon-based optoelectronic matrix computation will outperform digital logic circuits in terms of energy efficiency when the matrix configuration exceeds 128 × 128 [in Fig. 10(a)]. Recently, the matrix configuration of the silicon-based optoelectronic matrix computation processor has been scaled up to 256 × 256. The manufacturing and packaging of larger-scale chips are the major challenges for photonic matrix computation.
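The scaling argument can be written as a toy energy model. The digital figure of about 1 pJ per MAC is the one quoted in the text; the per-port cost of the photonic processor (`e_port`) is a purely hypothetical value, chosen here only so that the crossover lands at the 128 × 128 configuration mentioned above. The model ignores static power, laser efficiency, and memory access, so it illustrates the trend rather than predicting any measured result.

```python
import math

def digital_energy(n, e_mac=1e-12):
    """N x N MVM in digital logic: N^2 MACs at about 1 pJ each (value from the text)."""
    return n * n * e_mac

def photonic_energy(n, e_port=128e-12):
    """N x N photonic MVM: energy proportional to the N ports.

    e_port is a hypothetical per-port cost (modulator, detector, converters),
    picked so that the crossover with the digital model occurs at N = 128.
    """
    return n * e_port

def photonic_energy_per_mac(n, e_port=128e-12):
    """Equivalent energy per MAC, which falls as 1 / N."""
    return photonic_energy(n, e_port) / (n * n)

for n in (16, 64, 128, 256):
    ratio = digital_energy(n) / photonic_energy(n)
    print(f"N = {n:3d}: digital / photonic energy ratio = {ratio:.3f}")
```

Under these assumptions the ratio grows linearly with the side length, which is why scaling up the matrix configuration matters more for the photonic approach than for the digital one.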

Improve Computation Density and Computation Precision
In silicon-based optoelectronic matrix computation processors, increasing the modulation bandwidth is an intuitive way to further improve the computational power density per unit area. Although on-chip optoelectronic devices can reach modulation speeds of tens of gigahertz, passive components have a limited dynamic response (<1-MHz bandwidth). Modulation speed mismatch is therefore inevitable, and high-speed modulation is not adopted, so as to avoid high insertion loss in a large-scale passive photonic matrix. From our perspective, distributed computing may meet the requirements of larger-scale matrix multiplication: by increasing the number of small photonic matrices, large-scale matrix computation can be realized without increasing the modulation rate of the optical matrices. Furthermore, integrated waveguide meshes are mostly constructed from individual MZI devices, and the footprint of an MZI is commonly about 2500 μm$^2$. Reducing the MZI footprint will help increase the computation density, scale up the matrix configuration, and reduce the production cost of the photonic core. For example, improving the modulation efficiency of the MZI phase shifter can reduce the length of the MZI arms; employing surface plasmon polariton or hybrid plasmon polariton devices can break through the diffraction limit of light and reduce the area of the fundamental devices. 90,91 Computation precision plays an important role in analog computation. With a larger-scale matrix configuration (e.g., from 64 × 64 and 128 × 128 to 256 × 256), the optical intensity in the integrated waveguide mesh is gradually diluted. Moreover, fabrication inconsistency in minuscule devices results in the accumulation of computation errors. Recently, silicon-based optoelectronic matrix computation processors tend to perform only unitary matrix multiplication (singular value multiplication can be performed in electronic circuits).
In unitary matrix multiplications, the energy of light is almost conserved without excess energy loss, which is beneficial for improving the signal-to-noise ratio and computation accuracy of matrix multiplication. Moreover, a specific redundant architecture or mesh architecture can be employed to overcome fabrication imperfections and achieve more robust matrix computation. The computation precision needs to be enhanced through further research and development.
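The distributed approach suggested in this section, in which a large matrix is served by many small photonic matrices, corresponds to ordinary block (tiled) matrix multiplication. The sketch below is an illustration under assumptions: each `block @ segment` stands for one pass through a small matrix core, the partial sums are accumulated electronically, and the tile size of 4 and the function name `tiled_mvm` are hypothetical.

```python
import numpy as np

def tiled_mvm(matrix, vector, tile=4):
    """Large MVM assembled from small tile-sized MVMs with accumulated partial sums."""
    n_rows, n_cols = matrix.shape
    result = np.zeros(n_rows)
    for r in range(0, n_rows, tile):
        for c in range(0, n_cols, tile):
            block = matrix[r:r + tile, c:c + tile]   # weights loaded onto one small core
            segment = vector[c:c + tile]             # matching slice of the input vector
            result[r:r + tile] += block @ segment    # partial sum stored and accumulated
    return result

rng = np.random.default_rng(5)
big_matrix = rng.standard_normal((10, 14))   # deliberately not a multiple of the tile size
big_vector = rng.standard_normal(14)
tiled = tiled_mvm(big_matrix, big_vector)
```

The trade-off is that every partial sum has to be read out, stored, and added, so the data movement between the small cores and memory grows as the tiles shrink.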

Matrix Computation for Lower-Precision-Requirement Applications
Hardware development (processor design and production) and software development (algorithms and applications) are generally carried out separately, and matrix computation processors usually expose a standard application programming interface for software development. Higher precision is a long-standing pursuit for computation hardware. It is straightforward for an electronic processor to achieve 64-bit double-precision arithmetic, but such high precision is practically impossible in analog processors. At the current stage, precision problems remain in silicon-based optoelectronic matrix computation processors, as computation errors are inevitable in analog computation paradigms. We therefore need to find applications with lower precision requirements that can run on general-purpose processors. Although it is difficult for silicon-based optoelectronic matrix computation processors to solve high-precision arithmetic or global optimization problems, heuristic algorithms can be developed to search effectively for a near-optimal solution at a reasonable compute cost on a lower-precision processor. Silicon-based optoelectronic matrix computation processors are feasible for solving some difficult problems and reducing their time complexity, such as nondeterministic polynomial (NP) time decidable/solvable problems. A certain degree of computation error can be tolerated in heuristic algorithms, and slight computation inaccuracy does not affect the result. For example, Ising models are NP-complete problems in combinatorial optimization, and finding a minimal energy state of the Ising model (i.e., annealing) is NP-hard. Commonly, the minimal energy state of Ising models can be solved in digital processors (with heuristic algorithms) or quantum computers (with quantum annealing). Bulk-optical computing systems (such as optical fiber loops 92 and spatial light modulators 93 ) have been invented and developed to accelerate the annealing of the Ising model.
A Hopfield neural network is a recurrent neural network, in which the MVM (between the binary state vector and weight matrix) can be effectively accelerated in an MVM processor with lower time complexity. By parametrically designing the evolution dynamics of the Hopfield neural network and mimicking the interactions within the nodes [in Fig. 11(a)], 94 the Ising model can spontaneously evolve to an acceptable low-energy state.
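How the MVM between the binary state vector and the weight matrix drives the evolution toward a low-energy state can be sketched with a noiseless Hopfield update rule. This is a generic textbook-style illustration rather than the scheme of Ref. 94: the coupling matrix is random, the update is a plain sign rule without the noise or parametric design used in practical annealers, and the names `ising_energy` and `hopfield_descent` are hypothetical. With a symmetric coupling matrix and zero diagonal, each single-spin update cannot increase the Ising energy.

```python
import numpy as np

def ising_energy(J, s):
    """Ising energy E = -1/2 * s^T J s for spins s in {-1, +1}."""
    return -0.5 * s @ J @ s

def hopfield_descent(J, s0, sweeps=20):
    """Asynchronous sign updates; the MVM J @ s is the accelerated step."""
    s = s0.copy()
    for _ in range(sweeps):
        changed = False
        for i in range(len(s)):
            field = J @ s                      # MVM between weights and state vector
            if field[i] != 0 and np.sign(field[i]) != s[i]:
                s[i] = np.sign(field[i])       # align spin i with its local field
                changed = True
        if not changed:                        # reached a local energy minimum
            break
    return s

rng = np.random.default_rng(6)
n = 12
A = rng.standard_normal((n, n))
J = (A + A.T) / 2                              # symmetric couplings (hypothetical)
np.fill_diagonal(J, 0.0)                       # no self-interaction
s_init = rng.choice([-1.0, 1.0], size=n)
s_final = hopfield_descent(J, s_init)
e_init, e_final = ising_energy(J, s_init), ising_energy(J, s_final)
```

Because only the sign of each local field matters, moderate analog errors in `J @ s` change the trajectory but not the nature of the search, which is the sense in which such heuristics tolerate lower precision.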
In compressed sensing [in Fig. 11(b)] applications, with a known measurement value $y$ and measurement matrix $A$, the underdetermined equations $y = A\Theta$ need to be solved to obtain the original $K$-sparse coefficient vector $\Theta$. The sparse coefficient vector $\Theta$ can be reconstructed by $\ell_0$ norm minimization, i.e., $\min \|\Theta\|_0$ subject to $A\Theta = y$. The $\ell_0$ norm minimization is also NP-hard, and the number of linear measurements (matrix multiplications) in $K$-sparse signal reconstruction is $O(K \log N)$, which needs to be accelerated with photonic matrix computation. 95 Similarly, the discrete FT (DFT) [in Fig. 11(c)] is a frequently used operation in digital signal processing and speech recognition. 96 Normally, the time complexity of the DFT algorithm by unitary matrix multiplication is $O(N^2)$, and the time complexity of the Cooley-Tukey method DFT is $O(N \log N)$. However, the time complexity of the photonic DFT matrix multiplication is only $O(1)$, which is of great significance for reducing energy consumption and latency. 97
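The DFT-as-unitary-MVM view can be verified directly. The sketch below builds the N × N DFT matrix with the unitary (1/√N) normalization, applies it as a single matrix-vector product, which costs N² MACs on a digital processor, and compares it with NumPy's FFT using the matching `norm="ortho"` convention. The size N = 8 and the function name `dft_matrix` are illustrative choices.

```python
import numpy as np

def dft_matrix(n):
    """Unitary N x N DFT matrix with entries exp(-2*pi*i*j*k/N) / sqrt(N)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

n = 8
F = dft_matrix(n)
x = np.random.default_rng(7).standard_normal(n)   # hypothetical input samples
y_matrix = F @ x                                  # DFT as one MVM (N^2 MACs)
y_fft = np.fft.fft(x, norm="ortho")               # Cooley-Tukey style FFT
```

Since `F` is unitary, it is exactly the kind of matrix a lossless waveguide mesh can represent, and the single pass of light through such a mesh is what the O(1) figure in the text refers to.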

Summary
We reviewed the recent research on silicon-based optoelectronic matrix computations, including MVM, CONV, and MAC operations. Conventional electronic processors, which are based on digital logic circuits, are still the mainstream and almost invincible hardware for computation. When designing new optical computing processors (or coprocessors), the computation performance needs to surpass that of digital logic circuits in terms of computational power, energy efficiency, I/O connections, and latency. Although computation errors are inevitable in analog computation paradigms, applications with lower precision requirements (e.g., ANN, combinatorial optimization, compressed sensing, digital signal processing, and quantum information processing) can be run on general-purpose matrix computation processors. Looking forward to the future of large-scale matrix computation in specific applications, the silicon-based optoelectronic platform can not only heterogeneously integrate photonic (e.g., integrated waveguide mesh, free space propagation region, and dispersive waveguides), optoelectronic (e.g., high-speed modulators and photodetectors), and electronic (e.g., memory circuits, driver circuits, TIAs, serial-parallel converters, and analog-digital converters) devices and circuits on a silicon substrate to fulfill the requirements of large-scale matrix computation, but can also boost low-energy-budget data movement among the processors, memories, and peripheral hardware. We believe that silicon-based optoelectronics is a promising and comprehensive platform for general-purpose matrix computation in the post-Moore's law era.