Neuromorphic silicon photonics with 50 GHz tiled matrix multiplication for deep-learning applications

Abstract. The explosive volume growth of deep-learning (DL) applications has triggered a new era in computing, with neuromorphic photonic platforms promising to merge ultra-high speed and energy efficiency credentials with brain-inspired computing primitives. The transfer of deep neural networks (DNNs) onto silicon photonic (SiPho) architectures requires, however, an analog computing engine that can perform tiled matrix multiplication (TMM) at line rate to support DL applications with a large number of trainable parameters, similar to the approach followed by state-of-the-art electronic graphics processing units. Herein, we demonstrate an analog SiPho computing engine that relies on a coherent architecture and can perform optical TMM at the record-high speed of 50 GHz. Its potential to support DL applications, where the number of trainable parameters exceeds the available hardware dimensions, is highlighted through a photonic DNN that can reliably detect distributed denial-of-service attacks within a data center with a Cohen’s kappa score-based accuracy of 0.636.


Introduction
During the last decade, deep neural networks (DNNs) have become increasingly important for the resolution of numerous practical problems. 1 With the amount of computing power required to train such DNNs doubling every 3.5 months, 2 academic and industrial researchers started gravitating toward new technologies and hardware accelerators to keep pace with this growth. Highly parallelized computing solutions, including graphics processing units (GPUs), 3 field programmable gate arrays, 4 tensor processing units (TPUs), 5 and application-specific integrated circuits, 6-8 have been developed to accelerate the matrix-vector multiplication (MVM) operations, which form the most time- and power-consuming computational task in DNNs. 9 Yet, as transistor scaling is stagnating, 10 a large number of alternative emerging technologies have been investigated toward boosting energy efficiency and performance scaling, e.g., optoelectronic memristors, 11-15 nanophotonics, 16,17 and spintronics, 18,19 with brain-inspired photonic accelerators forming one of the key candidate platforms for future AI computing engines due to their inherent credentials to support time-of-flight latencies and terahertz bandwidths. 20,21 Remarkable progress has been witnessed during the last five years in the field of neuromorphic photonics across all necessary constituent technology blocks, including MVM photonic architectures, 17,22-28 individual photonic computational elements, 29-32 nonlinear activations, 33-36 and photonic hardware-aware training models. 37,38 All these demonstrations have highlighted the potential for energy-efficient and high-speed DNNs by utilizing low-speed weight encoding technologies and a rather small number of neurons, validating their credentials to support inference within small-scale neural network (NN) topologies that can fit in a practical silicon photonic (SiPho) chip.
However, typical NN layouts used for benchmarking purposes, such as ResNet152 and AlexNet, 39 require a total number of 25 and 62 million trainable parameters, respectively, that can hardly fit as hardware-coded information even into the available number of computational elements supported by current top-class GPU and TPU platforms. This has turned tiled matrix multiplication (TMM) into the mainstream processing paradigm in today's AI engines, 40,41 where both the input and the weighting values have to be updated at line rate through time division multiplexing (TDM) approaches until all matrix tiles are processed. To this end, the upgrade of neuromorphic photonics into a versatile AI processing platform has to proceed along the paradigm of today's TPU and GPU computational engines, where a limited amount of hardware resources can execute DNNs with significantly higher dimensions. This would necessitate, however, the use of photonic architectures and technologies that can support dynamic reconfiguration of both the NN input and weight parameters, with the existing demonstrations being incapable of meeting these requirements, as they mostly rely on low-speed weight encoding technology, such as thermo-optic (TO) devices 17,26 and phase change materials. 22
In this paper, we present a compact SiPho computing engine that supports both input and weight update rates at a record-high 50 GHz clock frequency, reporting for the first time, to the best of our knowledge, on high-speed TMM directly in the optical domain that allows for DNN implementations over limited-scale photonic hardware. The photonic accelerator comprises a two-input coherent linear neuron (COLN) layout with high-speed SiGe electro-absorption modulators (EAMs) used both for input and weight imprinting. We experimentally demonstrate its credentials to implement TMM and support DNNs with higher dimensions through its deployment in data center (DC) traffic inspection for network security applications, employing the photonic engine for the identification of distributed denial-of-service (DDoS) attack patterns via the classification of reconnaissance attacks (RAs). The DNN comprises 10 neurons and 64 trainable parameters and was successfully executed via the COLN, revealing high experimental accuracy values with a Cohen's kappa score (κ-score) 42 of 0.636 at 50 GHz. Finally, the scaling perspectives of the EAM-based two-input COLN into a higher dimension N × N coherent photonic crossbar (Xbar) are presented, providing the practical framework for the deployment of optical TMM operations in a layer-scale layout and for higher-dimension tiles.

Neuromorphic Processor for Tiled Matrix Multiplication
The TMM concept is illustrated in Figs. 1(a)-1(c), showing an example where three different steps are required for calculating the products between two rows of a 6 × 6 matrix and a six-element input vector, when 2 × 2 matrix tiles are used. The 2 × 2 matrix tile starts from the top-left position of the matrix and gets multiplied with the first two input vector values, with the respective products being stored at the first two entries of the six-element output vector, as shown in Fig. 1(a). Then, the 2 × 2 matrix tile shifts to the right and the two-element input vector tile shifts down [Fig. 1(b)] to incorporate the next entries of the first two matrix rows and the input vector, respectively, producing in this way two new partial weighted input sums through the multiplication of the tile with the corresponding values of the input vector. This process continues with the 2 × 2 tile shifting to the right until the whole horizontal dimension of the 6 × 6 matrix has been scanned, as illustrated in Fig. 1(c). The realization of TMM in the optical domain can be accomplished through a photonic MVM engine where inputs and weights can be updated at line rate, supported by an electronic circuitry for storing the matrix values, loading the necessary tiles to the photonic MVM and storing the partial output sums. This visionary architecture is pictorially represented by Fig. 1(d), showcasing all key building blocks of a neuromorphic photonic processor. The MVM linear operations are executed via the photonic MVM processor in the analog domain, utilizing an integrated or external laser source for "lighting up" the processor. The input and weight values are stored at an electronic memory unit and are loaded onto the photonic MVM processor using digital-to-analog converters. The photonic MVM output is connected to an array of photodiodes that transforms the computed signals back to the electronic domain, exploiting an analog-to-digital converter array for the digitization of the data so that they can be stored at the electronic memory. Additionally, an electronic control circuit is needed for data flow synchronization, orchestration, and communication between the memory block and the photonic MVM unit.
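The tile-by-tile scan described above can be sketched in a few lines of Python. This is only an illustrative software model, not the photonic implementation: the function name `tiled_mvm`, the example matrix, and the choice to accumulate partial sums in an ordinary list (standing in for the electronic memory) are assumptions made for the sketch.

```python
def tiled_mvm(weights, x, tile=2):
    """Matrix-vector product computed tile by tile with accumulated partial sums.

    weights: list of rows (rows x cols), x: list of length cols.
    Assumes both matrix dimensions are multiples of the tile size.
    """
    rows, cols = len(weights), len(weights[0])
    if rows % tile or cols % tile or len(x) != cols:
        raise ValueError("dimensions must match and be multiples of the tile size")
    y = [0.0] * rows
    for r in range(0, rows, tile):        # move the tile down to the next rows
        for c in range(0, cols, tile):    # shift the tile to the right
            for i in range(tile):
                # One tile row times the matching input tile gives a
                # partial weighted input sum for output entry r + i.
                partial = sum(weights[r + i][c + j] * x[c + j] for j in range(tile))
                y[r + i] += partial
    return y


# Illustrative 6 x 6 matrix and six-element input vector (arbitrary values).
W = [[(i + 1) * (j + 1) for j in range(6)] for i in range(6)]
x = [1, 0, 2, 0, 1, 1]
reference = [sum(W[i][j] * x[j] for j in range(6)) for i in range(6)]
result = tiled_mvm(W, x)
print(result == reference)
```

With 2 × 2 tiles, each pair of matrix rows is covered in three tile positions, matching the three steps of the 6 × 6 example.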
Yet, with the NN depth and size increasing with problem complexity, the total number of NN-trainable parameters will reach values well beyond the matrix dimensions supported by the photonic MVM engine, implying that the photonic MVM hardware has to be shared among a larger number of parameters through inter/intralayer or intraneuron TDM techniques (see Supplementary Material). The implementation of the above requires either the continuous update of the emerging partial sums during the multiplication of a whole input vector with a weight matrix tile, as shown in Figs. 1(a)-1(c), or the storage of all partial sums in different registers, from which they are fed again to the photonic MVM engine for further addition.
A pictorial example of the intraneuron TDM approach for TMM can be visualized in Fig. 1(e), illustrating how an elementary 2:1 neuron can carry out the linear summations of a five-axon neuron. This corresponds to the dot product between a 1 × 5 row vector that contains the weights of the neuron and a 5 × 1 column vector that includes the input values, executed through the use of 1 × 2 and 2 × 1 row and column vector tiles, respectively. The five-input neuron is unrolled into four 2:1 virtual neurons whose linear summation operations can be performed within three phases. More specifically, during the first phase, the 2:1 hardware is utilized in three time slots, calculating the linear summations $\sum_{i=1}^{2} x_i w_i$, $\sum_{i=3}^{4} x_i w_i$, and $\sum_{i=5}^{6} x_i w_i$, with $x_6$ and $w_6$ being zero- and one-padded input and weight values, respectively. Afterwards, these three partial weighted input sums need to be summed in order to provide the required weighted summation of the five inputs of the neuron.
Considering that the addition of the partial sums is carried out again in the optical domain, the summation operation can be performed on-chip by applying weight values equal to 1. Because the hardware can imprint two input values at a time slot, the remaining summations would be performed in two more phases, as depicted in Fig. 1(e). It can be derived that a photonic neuron with $N_{\text{axons}}$ axons is capable of calculating the linear operations of a layer that comprises neurons of $N_{\text{inputs}}$ inputs each in $N_{\text{phases}} = \log_{N_{\text{axons}}}(N_{\text{inputs}})$ phases, rounded up to the next integer when the logarithm is not an integer, as in the five-input example above ($\log_2 5 \approx 2.32$, i.e., three phases).
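The intraneuron TDM scheme can be mimicked in software as a minimal sketch. The names `phases_needed` and `tdm_dot` are hypothetical, the input and weight values are arbitrary, and padding any odd leftover partial sum with a zero input and a unit weight is an assumption of this sketch rather than a detail stated for the hardware.

```python
import math


def phases_needed(n_axons, n_inputs):
    """Number of TDM phases, counted with integer arithmetic (rounded up)."""
    phases, remaining = 0, n_inputs
    while remaining > 1:
        remaining = math.ceil(remaining / n_axons)
        phases += 1
    return phases


def tdm_dot(inputs, weights, n_axons=2):
    """Dot product computed by reusing one n_axons:1 neuron over several phases.

    Phase 1 applies the real weights; later phases add partial sums with
    weights equal to 1. Signed weights are plain numbers here, whereas the
    hardware applies the sign separately.
    """
    values, gains = list(inputs), list(weights)
    phases = 0
    while True:
        # Pad the last tile: zero input, unit weight (sketch assumption).
        pad = (-len(values)) % n_axons
        values += [0.0] * pad
        gains += [1.0] * pad
        partial_sums = [
            sum(v * g for v, g in zip(values[s:s + n_axons], gains[s:s + n_axons]))
            for s in range(0, len(values), n_axons)   # one time slot per tile
        ]
        phases += 1
        if len(partial_sums) == 1:
            return partial_sums[0], phases
        values, gains = partial_sums, [1.0] * len(partial_sums)


x = [0.5, 0.25, 1.0, 0.75, 0.5]      # five illustrative inputs
w = [1.0, -1.0, 0.5, 2.0, -0.5]      # five illustrative weights
total, used_phases = tdm_dot(x, w)
print(total, used_phases, phases_needed(2, 5))
```

For a 2:1 neuron, the same phase count of three is obtained for five, six, and eight inputs per neuron.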
The execution of the MVM product required for an entire neural layer via the same 2:1 NN hardware requires both intraneuron and intralayer TDM, with intraneuron and intralayer TDM corresponding to the use of multiple tiles across a single matrix row and a matrix column, respectively. Assuming, for example, a fully connected layer with $N_{\text{neurons}}$ neurons and $N_{\text{inputs/neuron}}$ inputs per neuron, Fig. 1(f) illustrates the MVM between a weight matrix with dimensions $N_{\text{neurons}} \times N_{\text{inputs/neuron}}$ and an input vector with dimensions $N_{\text{inputs/neuron}} \times 1 \times N_{\text{samples}}$, with $N_{\text{samples}}$ being equal to the inference batch size $b$. The 2:1 NN hardware encodes two elements of the weight matrix and two elements of the input vectors at a time slot, highlighted by the red, green, and yellow rectangles. This is depicted in more detail in Fig. 1(g), where the NN hardware hosts the weight values of the rectangle A in its weighting modules $w_a$ and $w_b$, along with the input values $x_{1,1}$ and $x_{2,1}$, in the respective input modules $x_a$ and $x_b$ during the first time slot. Both input and weight values will be updated during the second time slot, with the weight values of the matrix tile B being loaded onto the weighting modules $w_a$ and $w_b$ and the input values $x_{3,1}$ and $x_{4,1}$ onto the respective input $x_a$ and $x_b$ stages. This process continues until all partial weighted input sums of the entire matrix and the first-sample input vector are calculated, i.e., until the last matrix row that comprises matrix tiles G, H, and I gets also multiplied with the input vector from the first sample, completing in this way phase #1 of the process. Subsequently, the partial sums will be sequentially employed at the input modules $x_a$ and $x_b$ for their addition until they form the complete weighted input sum that corresponds to the product between a matrix row and the input vector of the first sample. This process is completed within the subsequent phases #2 and #3 of Fig. 1(g), utilizing weighting values equal to one. After completing phase #3 of the first sample, the multiplication of the weight matrix with the input vector in the second sample is initiated, following again the same TMM scheme and repeating all three phases. The MVM operation will be completed once the entire inference batch size $b$ has been processed.
Figure 2(a) depicts the SiPho processor that was fabricated for direct on-chip and high-speed mapping of both the input and the weight elements of an NN, following the COLN architecture 43 that can implement a dot-product operation. The SiPho chip comprises a coherent neuromorphic architecture that implements a two-input COLN capable of executing multiply-accumulate (MAC) operations, i.e., the weighted summation of its input data. It exploits the interference capabilities of Mach-Zehnder interferometers (MZIs), complemented by a bias branch that safeguards the retention of the sign of the weighted summation (see Supplementary Material). A visualization of the SiPho COLN and the experimental setup established for its evaluation are depicted in Fig. 2(b). Specifically, the SiPho processor comprises five compact and high-bandwidth SiGe EAMs (orange boxes), with two EAMs used in cascade at each MZI branch for on-chip input data and weight imprinting, respectively, and one EAM employed in the bias branch. The selection of the SiGe EAMs allows for a high compute rate, while retaining the energy consumption and the footprint at low values. 44,45 The normalized electro-optic $|S_{21}|$ response of an EAM biased at −1.5 V is presented in Fig. 2(c), revealing a 1-dB bandwidth higher than 50 GHz. Finally, three TO phase shifters (PSs) [blue cylinders in Fig. 2(b)], one at the bias branch and one at each MZI arm, are employed for the application of the sign of the weighted inputs. Figure 2(d) illustrates the optical loss of $\mathrm{PS}_a$ with respect to the applied driving power, showcasing that ∼4 mW is required for a π phase shift. Similar behavior was observed in $\mathrm{PS}_b$ and $\mathrm{PS}_{\mathrm{bias}}$.

Experimental Classification of Benign and Malicious Reconnaissance Attacks Using Photonic TMM at 50 GHz
The NN trained for the RA classification follows the topology shown in Fig. 3(a). The six features of the port-scanned traffic comprise the six inputs of the NN, followed by a fully connected hidden layer (Layer #1) of eight neurons and a two-neuron output layer. The Sigmoid and the SoftMax activation functions were applied to the hidden and output layers, respectively.
Figure 3(h) presents the normalized mean squared errors (MSEs) of the experimentally captured signals per inference phase and per layer. The MSEs of the 16 and 50 Gbaud summations at the last phase of Layer #1 equal ∼3% and ∼4.5%, respectively, while the respective MSE values after the summations of the first phase of Layer #2 are reduced to <1% and ∼2%. The MSE is always higher at 50 GHz compared with the 16 GHz operational mode and increases as the process moves from the first to the last phase within the same layer, being the result of the noise accumulation that is associated with the reuse of the photonic processor and the higher noise bandwidth. Yet, the interlayer transition reduces the deviation between the experimental and the reference waveform, decreasing the amount of noise upon entering Layer #2 compared with the noise that was accumulated through all Layer #1 phases. This is the result of the Sigmoid activation function employed at the Layer #1 output, which takes advantage of its high nonlinearity at its boundaries to compress the edge values of the samples.
The inference classification performance of our SiPho prototype when operating on 500 samples of the generated traffic was quantified by calculating the κ-score, which comprises a statistical metric for the evaluation of the inference accuracy when imbalanced data sets are classified (see Supplementary Material), with the corresponding confusion matrices depicted in Figs. 4(a)-4(c). The software-acquired κ-score was calculated to be equal to 0.70, and the respective values of the experimental classification at 16 and 50 Gbaud were measured equal to 0.688 and 0.636, as depicted in Fig. 4(d). Finally, the signal-to-noise ratio (SNR) values of the linear summations emerging from the photonic NN (PNN) at 16 and 50 Gbaud were measured equal to 14.1 and 11.2 dB, respectively.
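For readers unfamiliar with the metric, Cohen's kappa can be computed from a confusion matrix as sketched below. The function name `cohens_kappa` and both confusion matrices are hypothetical illustrations; they are not the matrices of Figs. 4(a)-4(c), and the values are chosen only to show how the metric behaves.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, cols: predicted).

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the marginals.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical balanced example: 80% accuracy.
balanced = [[45, 5],
            [15, 35]]

# Hypothetical imbalanced example: always predicting the majority class
# gives 90% accuracy but no agreement beyond chance.
majority_only = [[90, 0],
                 [10, 0]]

print(cohens_kappa(balanced), cohens_kappa(majority_only))
```

The second matrix illustrates why a chance-corrected score is preferred over plain accuracy for imbalanced data sets.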

Discussion
The successful proof-of-principle experimental validation of the optical TMM at 16 and 50 GHz and the classification of the RA traffic requires a total of six TDM phases, which is the result of the 2:1 PNN that, inevitably, necessitates the use of 2:1 matrix tiles. The number of TDM phases required and the associated time overhead can be reduced by scaling the PNN chip into a higher-dimension layout that can host a larger number of on-chip input and weight modulation elements. Figure 5(a) illustrates how the 2:1 COLN can scale to an N:1 neuron by following the layout that has been already mathematically validated and simulated in Ref. 43 and experimentally validated in Refs. 26 and 46. This is based on the introduction of a 1:N splitter followed by a stack of parallel waveguides, where every waveguide incorporates a high-speed amplitude modulator for the input signal, followed by a high-speed PS for sign update and a high-speed amplitude modulator for weight update. All these parallel waveguides recombine via an N:1 combiner, forming in this way a multibranch interferometer. Extending this architecture into a 2D N × M matrix that can support N × M matrix tiles and further reduce the MVM latency can be realized by adopting an N × M coherent linear photonic Xbar architecture that follows the principles reported in Refs. 47 and 48 and uses EAMs as its input and weight modulation circuitry. The N × M Xbar layout is depicted in Fig. 5 with the green rectangle illustrating that the 2:1 COLN utilized in our PNN chip [Fig. 2(a)] comprises a subblock within the N × M design. The N × M Xbar architecture can host M neurons with N axons per neuron simultaneously, allowing for the use of N × M matrix tiles within the TMM process.
The credentials of this architecture to support high-dimension matrix tiles within practical total loss values can be verified through a quantitative theoretical insertion loss (IL) analysis using experimentally measured specifications, assuming a symmetric N × N Xbar that employs SiGe EAMs both for the input and the weight values (see Supplementary Material). As shown in Fig. 6, the total IL of the Xbar architecture increases with increasing matrix dimensions but retains a reasonable value of less than 30 dB, even for a 32 × 32 layout, which supports a total of 1024 MAC operations. This can scale to higher total MAC capabilities by combining wavelength division multiplexing (WDM) with the coherent Xbar scheme, following the design reported in Refs. 47 and 49, and can support k × N × N tensor tiles, with k representing the number of wavelengths employed. The extension of the 2:1 COLN into an N × M coherent Xbar design retains all its additional benefits with respect to flexibility, robustness, and energy and footprint efficiency 50,51 (see also Supplementary Material), as it allows for one-to-one and high-fidelity single-step mapping of the NN parameters onto the PNN hardware 48 and the deployment of high-speed node technology.
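The effect of the tile size on the TMM workload can be illustrated with simple bookkeeping. The helper names `macs_per_time_slot` and `tiles_per_layer` are hypothetical, the wavelength count k = 4 is an arbitrary example, and the tile count assumes zero-padded edge tiles and ignores any extra phases needed to add partial sums.

```python
import math


def macs_per_time_slot(n, k=1):
    """MAC operations executed per time slot by a k-wavelength N x N tile."""
    return k * n * n


def tiles_per_layer(n_neurons, n_inputs, n):
    """Number of N x N tiles needed to cover an n_neurons x n_inputs matrix.

    Edge tiles are assumed to be zero padded (sketch assumption).
    """
    return math.ceil(n_neurons / n) * math.ceil(n_inputs / n)


# Hidden layer of the demonstrated NN: eight neurons with six inputs each.
for n in (2, 8, 32):
    print(n, macs_per_time_slot(n), tiles_per_layer(8, 6, n))

# A 32 x 32 Xbar with four wavelengths (k = 4 is an arbitrary example).
print(macs_per_time_slot(32, k=4))
```

Under these assumptions, an 8 × 8 or larger Xbar covers the hidden layer with a single tile, whereas 2 × 2 tiles need twelve positions.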

Conclusion
Recent advances in SiPho technology have enabled the exploitation of light for computing by accelerating the execution of deep-learning algorithms. In this work, we demonstrate a SiPho processor that is capable of performing the linear algebra operations of NN layers of any dimension towards classifying, at record-high speeds, DDoS attacks within DC server packets. Specifically, by employing the TMM method, we were able to accelerate the MAC operations that take place in the AI processor up to a rate of 50 GHz, successfully detecting benign and malicious attacks with a κ-score of 0.636. Finally, towards minimizing the computing steps and maximizing the classification speeds, we provide a dimension scaling analysis of the demonstrated prototype into a space division multiplexed Xbar architecture capable of supporting layer-scale linear algebra operations.