Decision-making and control with diffractive optical networks

The ultimate goal of artificial intelligence is to mimic the human brain to perform decision-making and control directly from high-dimensional sensory input. Diffractive optical networks provide a promising solution for implementing artificial intelligence with high speed and low power consumption. Most of the reported diffractive optical networks focus on single or multiple tasks that do not involve environmental interaction, such as object recognition and image classification. In contrast, networks capable of performing decision-making and control have not yet been developed to our knowledge. Here, we propose using deep reinforcement learning to implement diffractive optical networks that imitate human-level decision-making and control capability. Such networks, which take advantage of a residual architecture, find optimal control policies through interaction with the environment and can be readily implemented with existing optical devices. The superior performance of these networks is verified on three classic games: Tic-Tac-Toe, Super Mario Bros., and Car Racing. Finally, we present an experimental demonstration of playing Tic-Tac-Toe with diffractive optical networks based on a spatial light modulator. Our work represents a solid step forward in advancing diffractive optical networks, which promises a fundamental shift from the target-driven control of a pre-designed state for simple recognition or classification tasks to the high-level sensory capability of artificial intelligence. It may find exciting applications in autonomous driving, intelligent robots, and intelligent manufacturing.


Introduction
Artificial intelligence (AI) aims to imitate the functions of neurons in performing decision-making by creating hierarchical artificial neural networks. It has found many exciting applications in computer vision [1,2], natural language processing [3,4], and data mining [5]. Beyond electronics and computer science, artificial neural networks have been applied to optimize the design of photonic devices, including metamaterials and metasurfaces, significantly improving the performance of photonic devices beyond conventional inverse design strategies [6][7][8][9][10][11][12][13].
Recently, optical neural networks have drawn tremendous attention because they provide a compelling route to processing information at the speed of light [14][15][16][17][18][19], with low energy consumption and massive parallelism compared to electronic-circuit-based neural networks. In the pioneering work of Lin et al. [20], diffractive optical networks (DONs, also known as diffractive deep neural networks, D²NNs), consisting of multiple layers of three-dimensionally printed diffractive optical elements operating at terahertz frequencies, were first proposed for inference and prediction through parallel computation and dense interconnection at the speed of light. Later, DONs were extended to various nanostructures for implementation. Such architectures have been effectively validated in performing specific inference functions, such as image classification [21][22][23][24], saliency detection [25], and logic operations [26]. More recently, a reconfigurable DON based on an optoelectronic fused computing architecture has been proposed [27], which can implement different neural networks and achieve a high model complexity with millions of neurons. Although DONs have witnessed significant progress in the past few years, their functions mainly focus on image classification and object recognition without involving any interaction with the environment. To our knowledge, human-level AI based on DONs that can perform decision-making and control has not yet been developed.
In this work, we bring the capability of decision-making and control directly from high-dimensional sensory inputs to the DON. The networks build upon deep reinforcement learning, interacting with a simulated environment to find optimal control policies. The policy is trained solely by deep reinforcement learning from self-play, without a dataset or guidance. Each layer of the DON is characterized by a phase profile and thus can be immediately implemented by optical modulation devices. The effectiveness of the proposed DON is validated with three typical games: Tic-Tac-Toe, Super Mario Bros., and Car Racing. We also provide a direct experimental demonstration of such a DON playing Tic-Tac-Toe. Excellent agreement is found between theoretical prediction and experimental measurements. This work enables a fundamental shift from the target-driven control of a pre-designed state for simple recognition or classification tasks to human-imitative AI, revealing the potential of optoelectronic AI systems to solve complex real-world problems. We envision that such DONs will find promising applications in autonomous driving, industrial robots, and intelligent manufacturing, aiming to enhance human life in every aspect.

The network for decision-making and control
The working principle of the DON for decision-making and control is illustrated in Fig. 1a-c, using the example of playing Nintendo's classic video game Super Mario Bros. In general, a human player goes through seeing, understanding, and making a decision at each step, and these perception and control behaviors loop until the game is over. To play games in a human-like manner, the network needs the sensory capability to capture continuous, high-dimensional state spaces and the ability to execute sequences of different behaviors in a controlled way. The DON shown in Fig. 1b has a specific free-space configuration: an input layer with images encoded using an optical modulation device, multiple hidden layers encoding the phases of the transmitted waves, and an output layer onto which the computational results are imaged. More importantly, the proposed framework for decision-making and control integrates deep reinforcement learning and the DON into one training procedure, allowing interaction between the game and the agent to learn control policies that can be implemented on the optical computing platform. The method observes each state within the game environment and chooses a particular action through a learned control policy for each situation. Then, the changed environment generates an observation of the new state, the next action is taken, and the control policy is continuously updated in the loop. Unlike in previous optical networks, the input images from each video game frame are continuous, high-dimensional sensory data. Furthermore, the execution procedure, such as playing games, is essentially a type of interactive control rather than one-way recognition of a single objective, such as written digits or fashion items.
Fig. 1 The DON for decision-making and control. a-c The proposed network plays the video game of Super Mario Bros. in a human-like manner. In the network architecture, an input layer captures continuous, high-dimensional game snapshots (seeing), a series of diffractive layers choose a particular action through a learned control policy for each situation faced (making a decision), and an output layer maps the intensity distribution into preset action regions to generate the control signals in the games (controlling). d Training framework of policy and network. Deep reinforcement learning, in which an agent interacts with a simulated environment, finds a near-optimal control policy represented by a CNN, which is employed as the ground truth to update the DON by the error backpropagation algorithm. e The experimental setup of the DON for decision-making and control. f The building block of the DON.
To address the complexity of imitating human players on the optical platform, we develop the training framework of policy and network shown in Fig. 1d, using a combination of novel and existing general-purpose techniques for neural network architectures. As shown in the middle block of Fig. 1d, central to the architecture is a control policy $\pi_\theta(a \mid s)$, represented by a convolutional neural network (CNN) with parameters θ, which takes states s as inputs and gives actions a as outputs, and which is optimized for the reward obtained in games of self-play. Note that deep reinforcement learning requires markedly more training epochs than the DON because the policy is trained starting from entirely random behavior. Thus, we split the training process into two main phases to eliminate unnecessary computations. First, deep reinforcement learning, in which an agent interacts with a simulated game environment, finds a near-optimal control policy that meets the specified goals. Second, the control policy is used to update the DON by the error backpropagation algorithm.
In the first phase, a deep reinforcement learning algorithm collects data through interaction with the game environment to find a control policy with respect to the specified reward function, thereby achieving the desired outcome. The states of these games need to satisfy the Markov property, that is, a given state contains all relevant information about the history. Thus, it is possible to perform an action in the current state and move to the next state without considering the previous states. The agent interacts with the environment through a sequence of observations, actions, and rewards. At each step of interaction, the agent observes the state of the environment, decides on an action to take, and then receives a reward based on the game result. The neural network decides the best action for each step based on the reward. The policy is continuously updated using proximal policy optimization (PPO) [29] to find the optimal action. In testing, the trained policies all complete their respective games. Compared with previous studies, the algorithm requires only the game rules, without human data, guidance, or domain knowledge, so its performance does not depend on the quality of a dataset.
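As a rough illustration of the policy update, the clipped surrogate objective of PPO [29] can be sketched in NumPy as follows. The function name, the clip range of 0.2, and the toy numbers are assumptions made for illustration; they are not values reported in this work.

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (a quantity to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for those actions.
    clip_eps: clip range; 0.2 is an assumed, commonly used value.
    """
    ratio = np.exp(new_logp - old_logp)                 # probability ratio r_t
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Keep the pessimistic (smaller) of the unclipped and clipped terms.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy batch of three transitions (made-up numbers).
old_logp = np.log(np.array([0.5, 0.2, 0.3]))
new_logp = np.log(np.array([0.75, 0.2, 0.15]))
adv = np.array([1.0, -1.0, 2.0])
objective = ppo_clip_objective(new_logp, old_logp, adv)
```

The clipping removes the incentive to move the probability ratio far from 1 in a single update, which is what keeps each policy step close to the data-collecting policy.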
In the second phase, the control policy is transferred onto the DON. The optimal control policy modeled using the CNN is utilized as the ground truth during the learning procedure. Meanwhile, following the forward propagation model based on Huygens' principle and Rayleigh-Sommerfeld diffraction, the encoded input light can be directed to any desired location at the output layer via the learnable transmission coefficients, that is, the phase profiles of the hidden layers in the network. The energy distribution clustered in the target detection regions indicates the prediction result. The transmission coefficients at each diffractive layer are trained via the error backpropagation algorithm with a mean square error (MSE) loss function, which evaluates the discrepancy between the output intensities and the ground truth target. Adaptive moment estimation (Adam) [30], an algorithm for first-order gradient-based optimization of stochastic objective functions, is adopted to reduce the loss function. The gradient of the loss function with respect to all the trainable network variables is then backpropagated to iteratively update the network during each cycle of the training phase until the network converges.
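A minimal numerical sketch of such a forward model and loss is given below. It makes several assumptions: the angular spectrum method is swapped in for the free-space propagation step, the layers are cascaded all-optically with equal spacing, and the pixel pitch (8 µm), layer spacing (0.1 m), grid size, and function names are hypothetical. Only the 632.8 nm wavelength comes from the experiment described later. Gradient-based training of the phases would need an automatic differentiation framework and is not shown.

```python
import numpy as np

def angular_spectrum(field, wavelength, pitch, distance):
    """Propagate a sampled square complex field over `distance` in free space."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pitch)                      # spatial frequencies (1/m)
    fxx, fyy = np.meshgrid(fx, fx, indexing="ij")
    arg = 1.0 / wavelength**2 - fxx**2 - fyy**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    # Unit-modulus transfer function; evanescent components are dropped.
    transfer = np.where(arg > 0, np.exp(1j * kz * distance), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)

def don_forward(amplitude, phases, wavelength, pitch, distance):
    """Amplitude-encoded input -> phase-only layers -> detector intensity."""
    field = amplitude.astype(complex)
    for phi in phases:
        field = angular_spectrum(field, wavelength, pitch, distance)
        field = field * np.exp(1j * phi)                 # learnable phase profile
    field = angular_spectrum(field, wavelength, pitch, distance)
    return np.abs(field) ** 2

def mse_loss(intensity, target):
    """Mean square error between output intensity and ground truth target."""
    return np.mean((intensity - target) ** 2)

rng = np.random.default_rng(0)
amp = rng.uniform(0, 1, (64, 64))                        # stand-in input image
phases = [rng.uniform(0, 2 * np.pi, (64, 64)) for _ in range(3)]
out = don_forward(amp, phases, wavelength=632.8e-9, pitch=8e-6, distance=0.1)
target = np.zeros_like(out)
target[24:40, 24:40] = out.sum() / 256                   # all energy in one region
loss = mse_loss(out, target)
```

Because the phase masks and the propagation step only redistribute energy, the total detected intensity equals the input energy in this sketch; training changes where that energy lands, not how much of it there is.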
Once the training is completed, the target phase profiles of the diffractive layers are determined and are ready to connect the physical and digital worlds for optical neuromorphic computing. Here, we choose an approach similar to the diffractive processing unit [27] to build the network because of its reconfigurability and its ability to support millions of neurons for computation. The experimental setup of the DON is shown in Fig. 1e. The entire computing process is primarily optical, except for the dataflow control. The light modulation devices are very fast and therefore allow for real-time computation. Such an experimental system allows for a deep residual framework that can overcome the vanishing gradient problem by introducing shortcut connections between layers, an architecture that has become one of the cornerstones of neural networks [31]. Fig. 1f shows a building block of the DON. First, when there is an angle between the polarization direction of the incident light and the extraordinary axis of the liquid crystal of the spatial light modulator (SLM), some light is not modulated and is reflected directly to the camera, thus creating a shortcut connection. Formally, the incident light is denoted as X, the diffraction computation is denoted as F(X), and the original mapping can be recast into $F(\alpha X) + (1 - \alpha)X$, where α is the modulation ratio of the SLM, which can be fine-tuned by rotating the laser and polarizer to change the polarization direction (or by adding a half-wave plate). Compared to previous research [32], this approach does not require additional optical devices, so the improvement comes at no extra hardware cost. In addition, the approach relaxes the requirement on the polarization state of the light, so partially polarized light can be used in the network. Then, we use the photoelectric effect occurring at each image sensor pixel to implement the activation function of the diffractive neurons, denoted as $|\tilde{E}|^2$. In addition, to some extent, the exposure of the camera and the differences in resolution between various devices can be regarded as analogous to the layer normalization and downsampling operations of neural networks, respectively. Unlike previous studies that used complex network structures, we simply stack this block to build the DON.
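The residual block can be sketched numerically as follows, assuming that the modulated and unmodulated paths add coherently before detection. The stand-in for the diffraction computation F (a fixed random phase mask followed by a unitary 2D FFT) and all names are hypothetical and chosen only to keep the example self-contained.

```python
import numpy as np

def residual_block(x, diffract, alpha):
    """One block: |F(alpha * X) + (1 - alpha) * X|^2 as recorded by the camera."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    field = diffract(alpha * x) + (1.0 - alpha) * x   # modulated path + shortcut
    return np.abs(field) ** 2                         # intensity detection

# Hypothetical stand-in for F: random phase mask, then a unitary 2D FFT.
rng = np.random.default_rng(0)
mask = np.exp(1j * rng.uniform(0, 2 * np.pi, (32, 32)))

def diffract(field):
    return np.fft.fft2(field * mask, norm="ortho")

x = rng.uniform(0, 1, (32, 32)).astype(complex)
only_shortcut = residual_block(x, diffract, alpha=0.0)     # input passes through
only_diffraction = residual_block(x, diffract, alpha=1.0)  # no shortcut
mixed = residual_block(x, diffract, alpha=0.7)
```

Setting alpha to 1 removes the shortcut entirely, which matches the setting used later for the inverse prediction.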

Playing Tic-Tac-Toe
In our first implementation, we perform decision-making and control for Tic-Tac-Toe. This classic game is played on a 3×3 grid of cells where each player places their mark, an X or an O, in an empty cell. The first player to place three of their marks in a row vertically, horizontally, or diagonally wins the game. If all cells are filled and neither player has three marks in a row, the game is declared a draw. There are 255,168 possible ways to play this game, and we use the proposed network architecture to capture effective policies that make the optimal move in every possible situation.
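The number of possible games can be checked by brute-force enumeration of all legal move sequences. The short sketch below is only a sanity check of that count; it is not part of the training pipeline, and the function names are illustrative.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def has_won(board, player):
    return any(all(board[i] == player for i in line) for line in WIN_LINES)

def count_games(board, player):
    """Count distinct move sequences from `board` with `player` to move."""
    total = 0
    for cell in range(9):
        if board[cell] is not None:
            continue
        board[cell] = player
        if has_won(board, player) or all(c is not None for c in board):
            total += 1                          # game ends with a win or a draw
        else:
            total += count_games(board, "O" if player == "X" else "X")
        board[cell] = None                      # undo the move
    return total

n_games = count_games([None] * 9, "X")
print(n_games)  # 255168
```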
To play this game, a network composed of three diffractive blocks is designed with the above training algorithm. The input images carrying the information of the current states are encoded into the amplitude of the input field to the network. The network is trained to map the incident energy into nine cells corresponding to the grid (labeled by the numbers 1-9), where the received energy distribution at each region reveals the current state and predicts the probability of the player's next move, as shown in Fig. 2a. Since the observed state and the action are both discrete in this game, Tic-Tac-Toe demonstrates our method for the class of tasks with discrete state and action spaces.
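Decoding the output plane into a move can be sketched as follows. The sketch assumes that the nine detection regions tile the whole output image and are labeled 1-9 in row-major order; the actual region layout, the labeling order, and the function names are assumptions.

```python
import numpy as np

def cell_energies(intensity):
    """Sum the detected intensity inside each cell of a 3x3 detection grid."""
    h, w = intensity.shape
    rows = np.array_split(np.arange(h), 3)
    cols = np.array_split(np.arange(w), 3)
    return np.array([[intensity[np.ix_(r, c)].sum() for c in cols] for r in rows])

def predict_move(intensity):
    """Return the cell label (1-9, row-major) that receives the most energy."""
    return int(np.argmax(cell_energies(intensity))) + 1

# Synthetic output image with a bright spot in the centre cell.
image = np.full((90, 90), 0.01)
image[40:50, 40:50] = 1.0
move = predict_move(image)   # centre cell, label 5
```

No mask for occupied cells is applied here, so a prediction that lands on an occupied cell remains possible and would count as a playing error.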
Note that the first player (X) and the second player (O) have different control policies; specifi- of the network is greatly improved when changing from 2 layers to 3 layers because, if there are not enough layers in the network, the shortcut connections between layers may not be fully computed, thus affecting the results. However, the accuracy does not show a noticeable change when the number of layers increases beyond 3, which may be due to the following reasons. First, the DON is unsuitable for predicting states with high similarity [33]; see Supplementary Note 4 for a detailed derivation. In addition, DONs have a global perceptual property similar to that of a multilayer perceptron (MLP), which can capture features at given spatial locations but has difficulty capturing features between different spatial locations [34]. We will discuss this point later in the paper.

Playing Super Mario Bros.
In our second implementation, world 1-1 of the original Super Mario Bros. game is used to demonstrate the validity of the DON. Unlike Tic-Tac-Toe on a square-divided board, Super Mario Bros. is a video game with continuous high-dimensional state inputs. The gameplay consists of moving the player-controlled character, Mario, through two-dimensional levels to get to the level's end, traversing it from left to right, avoiding obstacles and enemies, and interacting with game objects.
In the game, the player controls Mario by taking the discrete actions run, jump, and crouch. This game therefore serves as an example of a continuous state space and a discrete action space for testing the proposed network. Fig. 3a illustrates the DON for playing Super Mario Bros. The network consists of an input layer carrying the optical field encoded from each video game frame, hidden layers composed of three cascaded diffractive blocks trained by the same algorithm, and an output layer mapping the intensity distribution into preset regions. The input images from the game scene, which consist of moving backgrounds and different objects, are clearly more complex than the regular patterns of Tic-Tac-Toe. In addition, because the game is played on a side-scrolling platform, adjacent frames are similar to each other yet constantly changing, which challenges the DON in processing highly similar input states to choose optimal actions.
After training with the control policy, the network makes decisions for Mario's optimal action. It achieves accurate control all the way to the end of the level, where Mario takes down the flag raised above the castle, as shown in Supplementary Video 1. Specifically, at any given state, the optimal action for Mario is predicted by the maximum action signal. In the examples of Fig. 3b,c, we take some snapshots from Supplementary Video 1 to analyze the decision-making and control of Mario's actions in complex, time-varying configurations. Since the goal of our network is to finish the level successfully and as quickly as possible, Mario should maintain the run action until the end while choosing to jump or crouch to overcome the challenges at certain states. Thus, the output intensity of run stays high throughout the game, while the intensities of jump and crouch show smaller fluctuations, as verified by Fig. 3b,c. Although a dominant jump or crouch intensity triggers the prediction only at a particular frame, the control signal is intentionally set to last for 20 frames to ensure that Mario finishes the entire action. It is worth noting that the intensity-frame curve remains relatively stable from the 516th to the 530th frame, which can be understood from the static and high-contrast background images after Mario enters the pipe, as shown in Fig. 3c.
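The action selection with a 20-frame hold can be sketched as a small state machine. The latching logic below is an assumption: a jump or crouch that wins the maximum-intensity comparison is held for 20 frames, and run is the default otherwise. Only the maximum-signal rule and the 20-frame duration come from the text; the intensity values are made up.

```python
ACTIONS = ("run", "jump", "crouch")
HOLD_FRAMES = 20  # duration of the control signal stated in the text

def control_sequence(intensity_frames):
    """Map per-frame (run, jump, crouch) intensities to the applied actions.

    Assumed logic: when jump or crouch has the largest intensity, that action
    is latched for HOLD_FRAMES frames; otherwise run is applied.
    """
    applied, held, remaining = [], "run", 0
    for intensities in intensity_frames:
        if remaining == 0:
            best = ACTIONS[max(range(len(ACTIONS)), key=lambda i: intensities[i])]
            if best != "run":
                held, remaining = best, HOLD_FRAMES
        if remaining > 0:
            applied.append(held)
            remaining -= 1
        else:
            applied.append("run")
    return applied

# Made-up intensities: jump wins only at frame index 3.
frames = [(0.9, 0.2, 0.1)] * 3 + [(0.6, 0.8, 0.1)] + [(0.9, 0.2, 0.1)] * 30
sequence = control_sequence(frames)
```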
To gain an insight into how the DON makes decisions, we investigate the network's perception capability, employing inverse prediction in Fig. 3d.
We demonstrate what the network has learned from the high-dimensional sensory input to perform the crouch action corresponding to the 501st frame image. We use the error backpropagation algorithm in a retrained network to inversely predict the input image at this moment, where α = 1 in the network to avoid the effect of the residual structure; see Supplementary Note 5 for a detailed derivation. The inversely predicted image matches the original input image of the 501st frame, especially the background, such as the clouds and grass. When humans play the game, they may ignore these backgrounds and focus only on the critical parts, such as Mario, enemies, and pipes. The inverse prediction of the whole scene highlights the capability of the network to extract global features instead of local ones. This property is the same as that of an MLP and further verifies the perception capability of the network in capturing global features to make decisions.

Playing Car Racing
In our third implementation, we demonstrate the capability of the proposed network in Car Racing, which requires perceiving the game environment from continuous high-dimensional inputs and making decisions to control the car by performing continuous steering actions. The game's control policy is trained on the rule of keeping the car within the track by controlling its rotation, and the car is set to increase its speed continuously once the game starts. The DON architecture, shown in Fig. 4a, is similar to that of the previous examples. The input energy of the optical field is redistributed through three diffractive blocks into two designated regions on the left and right of the output layer. The difference between the two intensities at the current state controls the steering direction and angle of the car, as shown in Fig. 4b. In addition, just as with the steering dead zone in real vehicles, a slight difference value does not lead to a steering action, which avoids disturbance.
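The mapping from the two detector intensities to a steering command can be sketched as below. The sign convention (negative values mean a left turn) follows the text; the normalization by the total intensity, the dead-zone width of 0.05, and the function name are assumptions for illustration.

```python
def steering_command(left, right, dead_zone=0.05, max_angle=1.0):
    """Convert left/right detector intensities into a steering value.

    Returns a value in [-max_angle, max_angle]; negative means turn left.
    The normalization and the dead-zone width are assumed, not reported values.
    """
    total = left + right
    if total <= 0:
        return 0.0
    diff = (right - left) / total        # normalized difference in [-1, 1]
    if abs(diff) < dead_zone:
        return 0.0                       # inside the dead zone: keep heading
    return max_angle * diff

turn_left = steering_command(0.6, 0.4)    # left region brighter
straight = steering_command(0.51, 0.49)   # difference too small to act on
hard_right = steering_command(0.0, 1.0)   # all energy in the right region
```

A larger absolute difference gives a sharper turn, while small differences are suppressed so that the car does not react to fluctuations in the output intensities.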
The successful network implementation in Car Racing is illustrated in Supplementary Video 2, where the car is kept near the center of the track for almost the whole lap. For the two basic actions of left and right turns, some exemplary snapshots are provided in Fig. 4c,e. Specifically, the negative difference values in Fig. 4c predict a left turn of the car wheel, and larger absolute values indicate sharper turns. It is also observed that the difference values sometimes approach zero, in which case a rotation angle of 0 is predicted to keep the car moving in its current direction. Due to the larger turning angle of the track, the intensity difference for the left turn shows a more drastic change. It is also intriguing that, although the steering of the car in the left turn is somewhat unsmooth so that the car is not in the middle of the track in certain states, the control action updated in the following state still leads to successful gameplay. This real-time feedback and updating feature shows the great potential of the architecture for challenging autonomous-driving tasks, such as dealing with sudden obstacles, almost at the speed of light [35].
To validate the anti-disturbance ability of the proposed approach, we introduce two random disturbances to the frame images of the game and then test the network performance in controlling Car Racing. With the same network previously trained, Gaussian blur and Gaussian noise are added to the frames, and the control results are shown in Fig. 4d and Fig. 4f, respectively. Although the disturbances degrade the quality of the input images, the car still maintains accurate and effective control and successfully completes the game, as verified by Supplementary Videos 3 and 4. Compared with the normal cases in Fig. 4c,e, the output intensity curves in Fig. 4d,f show similar trends in controlling the left or right turning actions. However, the curves are less smooth, with more amplitude fluctuations, indicating less smooth control of the steering angle. The successful control in the cases with random disturbances reveals the network's strong perception of the game environment, especially its full access to global features.
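The two disturbances can be reproduced on a grayscale frame with a few lines of NumPy. The blur width, noise standard deviation, frame size, and function names below are assumptions, since the disturbance strengths are not specified here.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(frame, sigma=1.5):
    """Separable Gaussian blur of a 2D grayscale frame with edge padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(frame, r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def add_gaussian_noise(frame, std=0.1, rng=None):
    """Add zero-mean Gaussian noise and clip back to the [0, 1] range."""
    rng = np.random.default_rng() if rng is None else rng
    return np.clip(frame + rng.normal(0.0, std, frame.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
frame = rng.uniform(0, 1, (96, 96))      # stand-in for a game frame in [0, 1]
blurred = gaussian_blur(frame)
noisy = add_gaussian_noise(frame, rng=rng)
```

Feeding `blurred` or `noisy` frames to the same trained network, without retraining, corresponds to the robustness test described above.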

Experimental demonstration of playing Tic-Tac-Toe
Finally, to test the real experimental performance of the DON, we built an experimental system using off-the-shelf optical modulation devices and tested it by playing Tic-Tac-Toe, as shown in Fig. 5a. A laser beam with a working wavelength of 632.8 nm is expanded using a microscope objective and a lens, and a linear polarizer can be inserted to adjust the incident light intensity; the beam is then projected onto a digital micromirror device (DMD). The input image data is optically encoded and modulated by the DMD, followed by two relay lenses that adjust the image to the appropriate size and project it onto the SLM for phase modulation. An optical iris is used to filter out high-order diffraction and stray light. The diffraction pattern is imaged onto the camera; then, the output image is fed back to the DMD for the next diffractive layer until the end of the network computation. After that, the optical intensities in the predefined detection zones are extracted from the output image, and the predicted results are decoded to generate the control signals in the games. The new frame image of the video game then triggers a new round of processing, and the updated results control the game until the end. In addition, because the DMD is a binary device, the training process needs to simulate the fast switching of the micromirrors when displaying grayscale images to make the training results more practical. We adapt the previously trained phase profiles for the DMD, as detailed in Supplementary Note 6. We first test our proposed residual architecture. Fig. 5b shows its effect on the output of the first layer for the sample in Fig. 2a. It can be seen that the value of α varies with the polarization direction of the incident light. This shows that the proposed residual architecture is valid and can be flexibly adapted to the task.
After that, we tested the same two games as in Fig. 2b,c, and the experimental results are presented in Fig. 5c,d. It can be seen that the intensity distribution of the output changes as the input game state changes. Due to unavoidable physical errors in the experimental system, the experimental results differ from the simulated ones, but the overall intensity changes are very similar. The maximum intensities occur at the same positions, and the same games are successfully completed.

Discussion
We have demonstrated DONs for decision-making and control. The technique is enabled by an optimal control policy obtained through a combination of deep reinforcement learning and the DON architecture. Based solely on reinforcement learning from self-play, the control policy of the training algorithm is flexible, as demonstrated by successfully learning to play the three types of classic games. In addition, we further exploit the potential of the optoelectronic DON by introducing a residual architecture that requires no additional hardware and achieves excellent performance with the simplest network structure.
It is worth noting that Tic-Tac-Toe does not achieve results as perfect as those of Super Mario Bros. and Car Racing, despite its definite rules and optimal control policy. There are several possible reasons for this result. Playing Tic-Tac-Toe requires strategically handling different states and a larger number of output signals. The gameplay of Tic-Tac-Toe requires a correct prediction at each state, while the other two games show better error tolerance, and accidental mistakes do not necessarily affect the results. In addition, using the intensity difference as the mechanism to trigger actions also improves the network's performance in Car Racing to some extent. Finally, since the DON is not good at extracting local features, the differences in intensity distributions between adjacent input board images are challenging to detect for Tic-Tac-Toe.
By testing our proposed DON on the challenging domain of classic games, we demonstrate its ability to master difficult control policies for playing games, which, to our knowledge, has not been shown before on an optical platform. This work bridges the gap between optical and digital neural networks aiming to achieve human-level AI. The most important aspect is that the decision-making and control process is implemented in optical devices at the speed of light by imitating human competence. Another ideal platform for implementing DONs is the metasurface. Metasurfaces provide an unprecedented ability to manipulate the wavefront of light and are widely used to implement sophisticated functions such as holography and computational imaging [36][37][38][39]. Driven by the demand for all-optical on-chip integration of AI systems, some recent studies have introduced optical metasurfaces consisting of arrays of subwavelength meta-atoms to replace bulky diffractive optical devices for high-density integration [22,23,40,41]. The working mechanism and design principle of our proposed DONs are universal and thus can be generalized to nanostructures. We have also implemented the above network on metasurfaces; see Supplementary Note 7 for details. Therefore, a metasurface-based DON can be envisaged and will serve as a very promising candidate for photonic integrated circuits.
Despite the exciting results of playing games, the DON currently has limitations in handling more complex tasks. First, because of the computational requirements of optical forward propagation, we deploy a two-phase training architecture in this work, obtaining the policy model before iterating the DON, instead of end-to-end learning. Combining the two steps may reduce errors and make the method easier to use. Second, ideally, the last layer of the network should not have a shortcut connection, which can be addressed by modifying the experimental system. In addition, given the similar properties of the DON and the MLP, the introduction of MLP-based attention mechanisms [34,42] into the field of optics could be considered. Moreover, the inference and control capability of DONs could also be improved in the future by introducing methods such as nonlinear optical effects [43][44][45][46], multichannel structures [28], and Fourier-space processing [25], leading to a variety of new applications. While preliminary, this research suggests that the DON has great potential for processing complex visual inputs and tasks. It could provide a promising avenue toward optical computing systems for decision-making and control, which would be a fruitful area for next-generation AI.

Fig. 2 Playing Tic-Tac-Toe. a The schematic illustration of the DON composed of an input layer, hidden layers of three cascaded diffractive blocks, and an output layer for playing Tic-Tac-Toe. b,c The sequential control of the DON in performing gameplay tasks for X and O, respectively. d The accuracy rate of playing Tic-Tac-Toe. There is a collection of 87 games utilized for predicting the X, obtaining 81 wins and 6 draws in these games. In the rest of the 583 games, the O obtains 454 wins, 74 draws, and 21 losses. When previous moves have occupied the predicted position at a turn, such a case is counted as a playing error and occurs 34 times. e Dependence of the prediction accuracy on the number of hidden layers.

Fig. 3 Playing Super Mario Bros. a The layout of the designed network for playing Super Mario Bros. b,c Snapshots of Mario's jumping and crouching actions, obtained by comparing the output intensities of the actions. The output intensity of jump is maximum at the 201st frame, so the predicted action is jump, and Mario is controlled to act accordingly, as shown in b. A similar series of prediction and control for a crouch action can be observed in c. d The inverse prediction result. Considering that the predicted crouch at the current state is crucial for updating Mario's action, we use the maximized output intensity of the crouch as input, ignoring the simultaneous output of other actions.

Fig. 4 Playing Car Racing. a The layout of the designed network for playing Car Racing. b The control of the steering direction and angle of the car with respect to the difference value between the intensities at the current state, normalized between −1 and 1. c-f Snapshots of controlling the car steering. When the car is facing a left-turn track in c, the output intensity on the left stays greater than the right intensity, allowing continuous control in updating the rotation angle of the left-turn action. A similar control process can also be performed for the right-turn track in e. In addition, the anti-disturbance ability of the network is validated by introducing Gaussian blur (d) and Gaussian noise (f) to the game images, respectively.

Fig. 5 Experimental demonstration of the DON for Tic-Tac-Toe. a The photo of the experimental system, where the unlabeled devices are lenses, a spatial filter is used to remove the unwanted multiple-order energy peaks, and a filter is mounted on the camera. b The output of the first layer of the sample in Fig. 2a; the red arrows represent the polarization direction of the incident light. c,d The sequential control of the DON in playing the same two games as in Fig. 2b,c, respectively. The experimental results are normalized based on the simulation results. Sim., simulation; Exp., experimental.