[1]\fnmMinho \surChoi \equalcontThese authors contributed equally to this work.
These authors contributed equally to this work.
[1,4]\fnmEli \surShlizerman
[1,2]\fnmArka \surMajumdar
1]\orgdivDepartment of Electrical and Computer Engineering, \orgnameUniversity of Washington, \orgaddress\citySeattle, \postcode98103, \stateWA, \countryUSA
2]\orgdivDepartment of Physics, \orgnameUniversity of Washington, \orgaddress\citySeattle, \postcode98103, \stateWA, \countryUSA
3]\orgdivDepartment of Computer Science and Engineering, \orgnamePohang University of Science and Technology, \orgaddress\cityPohang, \postcode37673, \stateGyeongbuk, \countryRepublic of Korea
4]\orgdivDepartment of Applied Mathematics, \orgnameUniversity of Washington, \orgaddress\citySeattle, \postcode98103, \stateWA, \countryUSA
Transferable polychromatic optical encoder for neural networks
Abstract
Artificial neural networks (ANNs) have fundamentally transformed the field of computer vision, providing unprecedented performance. However, these ANNs for image processing demand substantial computational resources, often hindering real-time operation. In this paper, we demonstrate an optical encoder that can perform convolution simultaneously in three color channels during image capture, effectively implementing several initial convolutional layers of an ANN. Such optical encoding results in a reduction in computational operations, with state-of-the-art classification accuracy () for a free-space optical system. In addition, our analog optical encoder, trained for CIFAR-10 data, can be transferred to the ImageNet subset High-10 without any modifications and still exhibits moderate accuracy. Our results evidence the potential of hybrid optical/digital computer vision systems in which the optical frontend pre-processes an ambient scene to reduce the energy and latency of the whole computer vision system.
keywords:
Neural networks, Meta-optics, Object detection, Image classification

1 Introduction
Visual information plays a crucial role in human response, particularly in situations where reaction time is limited to a few tens to hundreds of milliseconds [1, 2]. Though the human brain has an efficiency far exceeding that of any human-made computing system, it still cannot process all of the collected visual data because of the massive amount of information. Most likely, our brain performs early visual processing to extract essential features for efficient and rapid interpretation without handling the entire visual data stream [3, 4, 5].
With the dramatic development of artificial intelligence (AI), computers can process visual information much like the human brain, thanks to artificial neural networks (ANNs), enabling computer/machine vision [6, 7, 8, 9, 10]. Despite impressive progress, real-time inference with limited computational resources remains very challenging, even with more efficient algorithms. For example, in flying platforms (e.g., habitat-monitoring drones [11]), on-site data processing is plagued by severe heating, battery capacity, and weight constraints. Utilizing cloud-based systems poses challenges associated with data security and additional data-transfer latency [12, 13].
Optical neural networks have emerged as a potential platform to circumvent these trade-offs, since an optical system can process multidimensional information with large spatio-temporal bandwidth [14]. Recently, integrated photonics and free-space or fiber optics have been employed to implement parts of an ANN for image compression/encryption [15, 16] and classification [17, 18, 19, 20, 21]. However, most of these systems are restricted to relatively simple gray-scale datasets (e.g., MNIST and Fashion-MNIST), and only a couple of systems have demonstrated implementations for more complicated multichannel datasets (e.g., CIFAR-10 and ImageNet) [17, 19]. For these complex datasets, the optical system often becomes extremely large (with multiple stacks of photonic circuits) [19]; otherwise, the classification accuracy remains low ( accuracy for CIFAR-10 classification tasks) [17, 22, 23]. In addition, the most successful ANN architectures utilize nonlinear activation functions that are challenging to implement optically. Proposed solutions, including atomic vapor cells [24, 25] and image intensifiers [26], introduce significant experimental complexity and additional power consumption.
To leverage the strengths of both optical and digital computing systems, an encoder-decoder inspired hybrid optical/digital architecture is a promising approach [8, 27, 10, 28]. Specifically, an analog linear optical frontend (denoted as the optical encoder) performs the bulk of the linear computational tasks, while the digital backend implements the nonlinear operations. One intriguing possibility is to employ a static optical frontend, which is data agnostic, whereas the backend is trained and reconfigured. This resolves the usual issues of modulation speed, errors, and system size in all-optical systems. An optical encoder is particularly suitable for convolutional neural network (CNN) architectures, where convolutional layers act as feature extractors, encoding high-dimensional images into low-dimensional features [29]. In fact, every free-space optic inherently performs a two-dimensional convolution during imaging under incoherent light: the captured image is a convolution of the scene and the optic's incoherent point-spread-function (PSF) [30]. Thus, by engineering the PSF, an optical encoder can perform the desired convolution and replace the initial layers of a CNN.
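To illustrate the imaging model underlying this idea, the short sketch below convolves each color channel of a scene with a per-channel incoherent PSF; the scene, PSFs, and array sizes are toy placeholders rather than the paper's data.

```python
# Minimal sketch (not the authors' code): incoherent image formation as a
# per-channel 2D convolution of the scene with the optic's intensity PSF.
import numpy as np
from scipy.signal import fftconvolve

def capture(scene_rgb, psf_rgb):
    """scene_rgb: (H, W, 3) scene intensity; psf_rgb: (h, w, 3) incoherent PSFs."""
    channels = []
    for c in range(3):
        psf = psf_rgb[..., c]
        psf = psf / psf.sum()                       # energy-normalize each PSF
        channels.append(fftconvolve(scene_rgb[..., c], psf, mode="same"))
    return np.stack(channels, axis=-1)

scene = np.random.rand(64, 64, 3)                   # toy scene (placeholder)
psf = np.random.rand(7, 7, 3)                       # toy per-channel PSFs (placeholder)
measurement = capture(scene, psf)                   # what the color sensor records
```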
Recently, a PSF-engineered optical encoder was employed to classify the MNIST hand-written dataset, demonstrating a reasonable classification accuracy at a much lower computational cost than AlexNet [31]. We note, however, that MNIST images are monochrome and almost linearly separable ( loss without any nonlinearity [32]). The monochrome nature of the images makes the PSF-engineering approach wavelength agnostic. On the other hand, datasets such as CIFAR-10 [33] or the ImageNet subset High-10 [34, 29] are not separable by linear layers. Moreover, they consist of colored images, where the actual color information is exploited in classification.
Here, we demonstrate a polychromatic optical encoder with PSF-engineered meta-optics to classify the CIFAR-10 dataset. We first compressed the architecture into a single convolutional layer and two fully-connected layers using knowledge distillation. Then, we physically realized the convolutional layer using an array of metasurfaces, where each metasurface, thanks to its inherent chromaticity, performs a separate convolution for each color channel. As a result, the hybrid CNN with an optical encoder reduces the total number of multiply-accumulate (MAC) operations at the digital backend by a factor of . The reduction in the number of MAC operations directly corresponds to a reduction in computational cost, i.e., power and latency [35]. It is worth noting that we always require an imaging system (i.e., a lens and a camera) to capture the image under ambient illumination before delivering the image data to the computational backend. Hence, with a single meta-optical encoder, we are not adding any additional optics, but simply replacing a conventional lens with PSF-engineered meta-optics. This makes our optical system compact and fully compatible with conventional optical imaging systems, whereas other systems, such as integrated-photonic systems, require pre-processing of the data [19], and in-sensor computing needs a customized sensor design [36].
Furthermore, we apply the same meta-optics (optical convolutional layer), optimized for the CIFAR-10 dataset, to the High-10 dataset to explore the generality of optical encoders. In practice, a static optical encoder should be applicable to any scene. While one approach is to employ a reconfigurable frontend, e.g., based on non-volatile phase-change materials [37] or liquid crystals [38], the performance of these reconfigurable frontends in terms of individual pixel control, power consumption, and operating speed is still insufficient for practical deployment. Remarkably, with the same passive optical encoder (optimized for the CIFAR-10 dataset), we achieved a high classification accuracy on the High-10 dataset by fine-tuning the digital backend with an additional fully-connected layer (via a transfer learning approach). This ability to generalize the frontend is crucial for any ANN, as it enhances versatility, efficiency, and robustness. A network that generalizes well can be applied to different tasks without extensive re-training, saving time, reducing the cost of metasurface fabrication, and conserving computational resources for real-world applications.
2 Results
Our optical encoder concept is described in Figure 1. The original CNN, i.e., AlexNet, has five convolutional layers and three max-pooling layers at the front, followed by three fully-connected layers at the end, with the nonlinear activation function "ReLU" applied in each layer. Replacing all five convolutional layers with five sequential optical elements is extremely difficult because of misalignment, large system size, lack of nonlinearity, and low signal-to-noise ratio, issues that compound with an increasing number of optical elements. Therefore, we compressed AlexNet to a single convolutional layer and two fully-connected layers using the knowledge distillation method [39], which reduces the complexity of the architecture with a minimal compromise in accuracy.

While the compression of the original CNN is essential for realizing the optical encoder scheme, there are practical trade-offs to consider. On one hand, the physical sizes of the sensor and the meta-optics limit the number of kernels and the kernel size. On the other hand, too small a kernel size or too few kernels fail to classify the data effectively. We empirically searched for the optimal number and size of the kernels while compressing the original CNN. For the CIFAR-10 classification task, we design 16 kernels of size 7 × 7 (see details in Supplementary Materials). The training and testing accuracies of the compressed all-digital CNN are and , respectively.
Since the CIFAR-10 images have three channels of information, corresponding to red (R), green (G), and blue (B), we have 16 individual kernels for each color, i.e., a total of 48 kernels. With meta-optics, it is possible to design a single meta-optic that produces three different PSFs (i.e., convolutional kernels) at the RGB wavelengths (more details in Methods). It is difficult to simultaneously create positive and negative weights on the camera in terms of light intensity, so we separate each kernel into its positive and negative parts and design an optic for each [40], yielding a total of 32 meta-optics corresponding to 16 positive and 16 negative polychromatic kernels.
Another important design parameter is how many pixels on the camera represent one pixel of the PSF. We term this the enlargement factor. For example, when the enlargement factor is , a ground-truth PSF of pixels will correspond to pixels on the camera. While a large enlargement factor reduces alignment errors, the signal intensity on each camera pixel will be lower, resulting in a low signal-to-noise ratio. To determine the optimal enlargement factor, we experimentally tested several meta-optics with different enlargement factors for a particular kernel (see details in Supplementary Materials) and obtained an optimal enlargement factor of for a meta-optic made of scatterers.
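To make the mapping concrete, the toy sketch below spreads each pixel of a kernel over an n × n block of camera pixels; the kernel size and the enlargement factor of 3 are hypothetical values used only for illustration.

```python
# Sketch of the enlargement factor: each PSF pixel is spread over an n x n
# block of camera pixels (values here are illustrative, not the paper's).
import numpy as np

def enlarge(psf, factor):
    """Replicate every PSF pixel into a factor x factor block of camera pixels."""
    return np.kron(psf, np.ones((factor, factor)) / factor**2)  # energy preserved

kernel = np.random.rand(7, 7)      # stand-in for a ground-truth kernel
on_camera = enlarge(kernel, 3)     # hypothetical enlargement factor of 3
print(on_camera.shape)             # (21, 21): footprint of the PSF in camera pixels
```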

The metasurfaces were made of silicon nitride on a quartz substrate to ensure high transparency at visible wavelengths (Figure 2a). Figure 2b shows the transmission coefficients and phase shifts of silicon nitride pillars at the RGB wavelengths as a function of the pillar width $w$, at a fixed height of 800 nm, obtained by rigorous coupled-wave analysis (RCWA). We target wavelengths of 635 nm, 532 nm, and 450 nm for the R, G, and B channels, respectively, based on the availability of laser diodes.
In order to effectively model the wavelength-dependent behavior of the meta-optics in a gradient-descent-based optimization, a fast and differentiable function is required to map the pillar width to the imparted phase. We define a proxy function inspired by the approximate phase shift of a dielectric waveguide, with corrective factors that are fit to the RCWA simulation results. To calculate the phase shifts $\phi_{R}$, $\phi_{G}$, and $\phi_{B}$ at the RGB wavelengths as a function of the pillar width $w$, we define the proxy function as
$$\phi_{\lambda}(w) = \frac{2\pi}{\lambda}\, n_{\mathrm{eff}} H \;+\; A_{\lambda}\exp\!\left[-\frac{(w - w_{\lambda})^{2}}{2\sigma_{\lambda}^{2}}\right] \;+\; \phi_{0,\lambda}, \qquad \lambda \in \{R, G, B\} \tag{1}$$
The first term corresponds to the general phase shift of a dielectric waveguide, where $n_{\mathrm{eff}}$ and $H$ are the effective refractive index and height of the silicon nitride pillars. The second term is a Gaussian-shaped correction, with $A_{\lambda}$, $w_{\lambda}$, and $\sigma_{\lambda}$ as fitting parameters. Lastly, $\phi_{0,\lambda}$ is a wavelength-dependent phase offset. This proxy function does not model resonance-induced phase variations; however, we do not want to use those phase variations because of their reduced amplitude, and such resonances are expected to be less prominent in the fabricated devices due to sidewall roughness.
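As an illustration, a minimal sketch of such a width-to-phase proxy is given below; the functional form follows Eq. (1), while the numerical values of $n_{\mathrm{eff}}$, $H$, and the Gaussian fit parameters are placeholders, not the fitted ones (in practice the effective index itself also varies with the pillar width).

```python
# Sketch of the proxy phase model in Eq. (1): waveguide term + Gaussian
# correction + offset. All parameter values below are placeholders.
import numpy as np

def proxy_phase(w, lam, n_eff, H, A, w0, sigma, phi0):
    """Phase imparted by a pillar of width w at wavelength lam."""
    waveguide = 2 * np.pi * n_eff * H / lam          # dielectric-waveguide term
    correction = A * np.exp(-((w - w0) ** 2) / (2 * sigma ** 2))
    return waveguide + correction + phi0

widths = np.linspace(60e-9, 250e-9, 200)             # pillar widths to evaluate (placeholder range)
phi_red = proxy_phase(widths, lam=635e-9, n_eff=1.9, H=800e-9,
                      A=0.6, w0=150e-9, sigma=30e-9, phi0=0.0)
```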
Figure 2c shows the design flow for the polychromatic RGB meta-optics with PSFs optimized for the individual RGB colors. For a meta-optic parameterized by an arbitrary two-dimensional pillar-width map $w(x,y)$, we extract three separate phase maps using the proxy functions $\phi_{\lambda}$. We then propagate the electromagnetic field using the angular spectrum method [41] to simulate the PSFs at the focal plane, a distance away from the meta-optics. At the focal plane, we compare the computational ground-truth PSFs defined by the convolutional kernels obtained from knowledge distillation, $K^{\mathrm{GT}}_{\lambda}$, with the optically simulated PSFs, $K^{\mathrm{sim}}_{\lambda}$, in each RGB channel, where the channel-dependent losses are defined as the sum of squared differences over all pixels:
$$\mathcal{L}_{\lambda} = \sum_{i,j}\left[K^{\mathrm{GT}}_{\lambda}(i,j) - K^{\mathrm{sim}}_{\lambda}(i,j)\right]^{2}, \qquad \lambda \in \{R, G, B\} \tag{2}$$
We optimize the two-dimensional pillar-width map, i.e., the meta-optic, by minimizing the net loss, defined as the root mean square of the losses at the three colors, using the Adam optimizer in TensorFlow [42]:
$$\mathcal{L}_{\mathrm{net}} = \sqrt{\tfrac{1}{3}\left(\mathcal{L}_{R}^{2} + \mathcal{L}_{G}^{2} + \mathcal{L}_{B}^{2}\right)} \tag{3}$$
Calculated losses for all the kernels at three different colors are shown in the Supplementary Materials.
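For concreteness, the loop below is a minimal, self-contained sketch of this design flow under simplifying assumptions (placeholder grid size, propagation distance, proxy-phase parameters, and random target PSFs); it is not the authors' design code.

```python
# Sketch of the Fig. 2c design loop: pillar widths -> per-wavelength phase
# (proxy model) -> angular-spectrum propagation -> PSF -> Eq. (2) per-channel
# loss -> Eq. (3) RMS net loss -> Adam update. All values are placeholders.
import numpy as np
import tensorflow as tf

N, pitch, z = 128, 0.35e-6, 1.0e-3                      # grid, pixel pitch, propagation distance
fx = np.fft.fftfreq(N, d=pitch)
FX, FY = np.meshgrid(fx, fx)
wavelengths = [635e-9, 532e-9, 450e-9]                  # R, G, B

def transfer_function(lam):
    kz = 2 * np.pi * np.sqrt(np.maximum(0.0, 1.0 / lam**2 - FX**2 - FY**2))
    return tf.constant(np.exp(1j * kz * z), tf.complex64)

def proxy_phase(w, lam):                                # stand-in for Eq. (1), differentiable
    return 2*np.pi*1.9*0.8e-6/lam + 0.6*tf.exp(-(w - 0.15e-6)**2 / (2*(0.03e-6)**2))

def simulated_psf(widths, lam):                         # phase mask -> focal-plane intensity
    phase = proxy_phase(widths, lam)
    field = tf.exp(tf.complex(tf.zeros_like(phase), phase))
    out = tf.signal.ifft2d(tf.signal.fft2d(field) * transfer_function(lam))
    return tf.abs(out) ** 2

targets = [tf.random.uniform((N, N)) for _ in wavelengths]   # placeholder ground-truth PSFs
widths = tf.Variable(0.15e-6 * tf.ones((N, N)))              # pillar-width map to optimize
opt = tf.keras.optimizers.Adam(1e-3)

for step in range(200):
    with tf.GradientTape() as tape:
        per_channel = [tf.reduce_sum((simulated_psf(widths, lam) - tgt) ** 2)
                       for lam, tgt in zip(wavelengths, targets)]             # Eq. (2)
        net_loss = tf.sqrt(tf.reduce_mean(tf.square(tf.stack(per_channel))))  # Eq. (3)
    opt.apply_gradients([(tape.gradient(net_loss, widths), widths)])
```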

An optical image of the fabricated chip is shown in Figure 3a. A single chip contains a total of 32 convolutional meta-optics (corresponding to 16 positive and 16 negative convolutional kernels) and an additional 5 metalenses, which focus light at the focal plane to aid in the alignment (e.g., tilt, rotation, and distance) between the meta-optics and the camera. Figure 3b shows the schematic of the PSF measurement setup. By swapping the laser diodes, we illuminate the meta-optics with individual R, G, and B coherent light and experimentally characterize the polychromatic PSFs on the camera. A pinhole creates an approximate point source, and the positions of the optics, i.e., the pinhole, meta-optics, and camera, remain the same while changing the laser diodes.
Figure 3c shows both the ground-truth and the measured PSFs of a particular kernel at the individual RGB wavelengths. To quantitatively analyze the difference between the ground-truth and experimentally measured PSFs, we define a cosine similarity $S_{\lambda}$ as:
$$S_{\lambda} = \frac{\sum_{i,j} I^{\mathrm{GT}}_{\lambda}(i,j)\, I^{\mathrm{meas}}_{\lambda}(i,j)}{\sqrt{\sum_{i,j}\left[I^{\mathrm{GT}}_{\lambda}(i,j)\right]^{2}}\,\sqrt{\sum_{i,j}\left[I^{\mathrm{meas}}_{\lambda}(i,j)\right]^{2}}} \tag{4}$$
where $I^{\mathrm{GT}}_{\lambda}$ and $I^{\mathrm{meas}}_{\lambda}$ are the ground-truth and measured intensity profiles of the PSF at the RGB wavelengths, respectively. The calculated $S_{\lambda}$ for the RGB wavelengths are about 0.88, 0.56, and 0.81, respectively. The quantitative discrepancy can be attributed partially to fabrication and measurement imperfections. Additionally, not all polychromatic PSFs are physically realizable, as the phases at different wavelengths are not completely independent. Creating more physically realizable PSFs by co-designing the optical frontend and computational backend, also termed end-to-end design [43, 44], instead of replacing the convolutional layer with optics, may increase $S_{\lambda}$. However, including the meta-optical simulation in the end-to-end design may result in a local optimum, and the fabrication/measurement imperfections will still be present. As we will show later, the computational backend is robust against such discrepancies in the PSFs, and we can easily correct for these errors by introducing an additional fully-connected calibration layer in the digital backend.
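As a quick reference, the metric in Eq. (4) amounts to the normalized inner product of the flattened intensity profiles; a minimal sketch (with toy arrays in place of the measured data) follows.

```python
# Cosine similarity (Eq. 4) between two PSF intensity profiles.
import numpy as np

def cosine_similarity(psf_gt, psf_meas):
    a, b = psf_gt.ravel(), psf_meas.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gt = np.random.rand(33, 33)                 # toy ground-truth PSF
meas = gt + 0.1 * np.random.rand(33, 33)    # a slightly perturbed "measurement"
print(cosine_similarity(gt, meas))          # close to 1 for well-matched PSFs
```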
Then, we test the polychromatic optical encoder on the CIFAR-10 dataset. By replacing the pinhole with an organic light-emitting diode (OLED) display, we convolve the CIFAR-10 images with the characterized PSFs of the meta-optics (Figure 3d). The displayed image size is carefully adjusted according to the convolutional kernel size and the enlargement factor on the camera (more details in Supplementary Materials). Figure 3e shows the computationally and meta-optically convolved RGB images for one of the CIFAR-10 images. The meta-optically convolved image loses some of the high-resolution components, likely due to fabrication imperfections and alignment errors, which are already recognizable from the PSF measurements, as well as the spectral overlap between the RGB color pixels of the camera. However, as we will show later, the computational backend is robust against such discrepancies, since we average-pool the convolved image to a smaller size. We added an additional fully-connected layer, called the calibration layer, to adjust the weights of each kernel and color and to deal with the discrepancies between the optical and digital systems (e.g., normalization, scaling, translation, rotation, tilt, noise). This calibration layer allows us to reuse the pre-trained digital backend while incurring minimal computational cost. Detailed explanations of the calibration layer are given in the Methods and Supplementary Materials.
Network Architecture | Train accuracy (%) | Test accuracy (%) |
---|---|---|
AlexNet | 83.04 ± 0.87 | 81.03 ± 0.89
Compressed digital CNN | 76.94 ± 0.52 | 76.59 ± 0.50
Hybrid optical/digital CNN (w/o calibration layer) | 56.78 ± 0.91 | 56.39 ± 0.92
Hybrid optical/digital CNN (w/ calibration layer) | 73.18 ± 0.58 | 72.06 ± 0.57
Hybrid optical/digital CNN (w/ retrained backend) | 75.05 ± 0.47 | 73.17 ± 0.49
\botrule |
Figure 3f shows the confusion matrices for the classification of the CIFAR-10 data with the original CNN (AlexNet), the compressed CNN obtained by knowledge distillation, and the hybrid optical/digital CNN using the convolutional meta-optics after compression. Even though there are slight differences between the optical and digital convolution results (Figure 3e), after introducing the calibration layer we achieve similar accuracy (less than 5% loss) for both the training and testing datasets (Table 1). It is possible to improve the accuracy further by retraining the backend together with the calibration layer; however, retraining the backend has limited practical use. Additionally, this hybrid approach significantly reduces the computational cost, which can be quantified by the number of multiply-accumulate (MAC) operations. Going from the original CNN to the compressed CNN reduces the computational load, represented by the number of MAC operations, by a factor of , and replacing the convolutional layer with meta-optics reduces it further by a factor of . Detailed calculations of the number of MAC operations for each CNN architecture are described in Extended Data Table A1. Detailed information on the network design and dimension selection is available in the Supplementary Materials.
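As a sanity check, the MAC totals in Extended Data Table A1 can be reproduced by simple layer-wise counting; the sketch below uses the layer dimensions as reconstructed in that table, counting convolution MACs as C_in·C_out·k²·H·W.

```python
# Layer-wise MAC counting for the compressed and hybrid networks (Table A1).
def conv_macs(c_in, c_out, k, h, w):       # MACs of a k x k conv over an h x w output
    return c_in * c_out * k * k * h * w

def fc_macs(n_in, n_out):                  # MACs of a fully-connected layer
    return n_in * n_out

compressed = conv_macs(3, 16, 7, 32, 32) + fc_macs(576, 256) + fc_macs(256, 10)
hybrid = fc_macs(576, 256) + fc_macs(256, 10)   # the convolution is performed optically
print(compressed, hybrid)                        # 2,558,464 and 150,016
```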
To analyze the effectiveness of the meta-optical convolutional layer, we use principal component analysis (Figure 4). For the original CNN and the compressed all-digital CNN, each class is well separated (Figures 4a and b), implying that the key features of the CIFAR-10 images are extracted after convolution. On the other hand, after the optical convolution using meta-optics, without calibration, the different classes are very difficult to distinguish (Figure 4c). Additionally, some clusters exhibit larger sizes and overlapping regions compared with Figure 4(a-b), e.g., the navy blue, brown, and red clusters (confidence ellipses). However, after introducing the calibration layer, the cluster regions become smaller and the separations between classes increase. As shown in Figure 4d, each class becomes well separated and distinguishable, similar to the compressed CNN. This critical role of the calibration layer is consistent with the classification accuracies without and with the calibration layer (Table 1). We note that the calibration layer can potentially be absorbed into the pre-trained digital backend via additional training and does not affect the number of MAC operations for inference (Extended Data Table A1).
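A minimal sketch of this feature-space visualization is given below; the random placeholder features stand in for the pooled and calibrated feature vectors, and scikit-learn is used here purely for illustration.

```python
# Project feature vectors onto two principal components to inspect class clustering.
import numpy as np
from sklearn.decomposition import PCA

features = np.random.rand(1000, 576)             # placeholder for pooled/calibrated features
labels = np.random.randint(0, 10, size=1000)     # placeholder class labels
coords = PCA(n_components=2).fit_transform(features)
# scatter-plot `coords` colored by `labels` to obtain Figure 4-style cluster plots
```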

Our convolutional meta-optics implements the convolutional kernels obtained from the compressed CNN for the CIFAR-10 data. Unlike digital neural networks, optical implementations are extremely difficult to modify once fabricated. This would seem to necessitate different convolutional meta-optics for different datasets. However, we found that the convolutional layer optimized for CIFAR-10 can be readily adapted to classify another dataset, High-10, through a transfer learning process. We added an additional fully-connected layer, which we call the "transfer learning layer", located between the convolutional layer and the former fully-connected layers. By training only the transfer learning layer, we can fit another dataset, i.e., High-10, to the CNN that was pre-optimized for a particular dataset, i.e., CIFAR-10, fine-tuning only a small part of the network without changing the former network structure (see details in Methods). The High-10 images are polychromatic (RGB) and larger than the CIFAR-10 images. To reuse the CNN optimized for the CIFAR-10 data, we resize the High-10 images to the same size as the CIFAR-10 data.
Network Architecture | Train accuracy (%) | Test accuracy (%) |
---|---|---|
AlexNet | 85.31 ± 0.27 | 84.95 ± 0.49
Compressed CNN (without Transfer Learning) | 41.43 ± 0.26 | 40.44 ± 0.39
Compressed CNN (with Transfer Learning) | 67.43 ± 0.22 | 66.01 ± 0.13
Hybrid optical/digital CNN (with Transfer Learning) | 63.46 ± 0.46 | 59.73 ± 0.91
\botrule |
Without transfer learning, the training and testing accuracies are rather low, around 40%. After transfer learning, however, we achieve much higher training and testing accuracies ( and , respectively) on the High-10 data with the convolutional layer and two fully-connected layers. We further experimentally verified that this approach works in our hybrid optical/digital CNN, using the same convolutional meta-optics and digital backend structure used for the CIFAR-10 data, with one additional fully-connected layer. The average training and testing accuracies on the High-10 data are similar (less than loss) to those of the compressed all-digital CNN, about the same margin as in the CIFAR-10 case. The principal component analysis results of both the compressed CNN and the hybrid CNN for the High-10 data are shown in the Supplementary Materials, where we can see how the different classes are separated in their feature maps. Detailed information on the calibration layer design, the calibration data selection, and the PCA visualization is available in the Supplementary Materials.
3 Discussion
3.1 Multichannel dataset
The advantages of knowledge distillation and the meta-optical encoder include a dramatic reduction of computational complexity, as quantified by the number of MAC operations. For the CIFAR-10 dataset, our hybrid optical/digital CNN reduces the number of MAC operations by a factor of . This reduction is about an order of magnitude larger than that for the MNIST hand-written dataset, where the meta-optical encoder reduced the number of MAC operations only by a factor of [31].
On the other hand, the drop in classification accuracy is more significant for the CIFAR-10 dataset than for the MNIST dataset. The train (test) accuracy of our hybrid CNN on the CIFAR-10 dataset drops by () relative to the original CNN. For the MNIST dataset, the train (test) accuracy of the hybrid CNN drops by () relative to the original CNN [31]. While this accuracy drop on the CIFAR-10 dataset is not negligible, our PSF-engineered optical encoder achieves a significantly higher classification accuracy than other free-space optical neural network architectures (which are compatible with conventional camera systems). Our encoder has a classification test (train) accuracy of () on the CIFAR-10 dataset without retraining the backend, using only the projection by a calibration layer. These test (train) accuracies can be improved further, up to (), if we retrain the backend, which is better than the previous state-of-the-art result () that used a complex end-to-end optimization as well as backend retraining with 50 kernels [44]. Our hybrid optical/digital CNN can be further improved by using complex meta-atoms to reproduce the PSFs more faithfully [45] and by using advanced compression methods to reduce the loss during knowledge distillation [46]. Other reports show much lower accuracy than ours [17, 22, 23].
For the ImageNet subset High-10, we obtain the same reduction in the number of MAC operations as for the CIFAR-10 dataset, since we use an identical CNN architecture. The train (test) accuracy of our hybrid CNN drops substantially, by (), compared to the original CNN. The majority of the loss occurs during network compression, as we share the convolutional layer and fully-connected layers optimized for the CIFAR-10 dataset. However, despite these losses, our transfer learning result has a classification accuracy of , still better than other free-space optical neural network systems [17]. Here, we selected an ImageNet subset whose classes are more complicated than, and distinct from, those of the CIFAR-10 dataset. We did not change the optical frontend, but only fine-tuned the digital backend of two fully-connected layers and an additional transfer learning layer, to show the versatility of our hybrid CNN system.
3.2 Energy consumption
In practice, we can implement our hybrid optical/digital CNN simply by replacing a lens with meta-optics during imaging. Hence, the energy consumption is mainly determined by the number of MAC operations. However, it is also important to consider the power of the sensor. Specifically, the sensor power depends on the number of pixels passed to the digital backend. For the original CNN, we only need pixels to capture the image. On the other hand, the hybrid CNN needs pixels to image one convolved image, considering the average pooling (details in the Supplementary Materials), which amounts to pixels for all positive and negative kernels. Hence, our hybrid CNN requires a somewhat larger number of pixels for imaging than the original CNN.
The color camera we used (Allied Vision Prosilica GT1930C) has a total power consumption of at frames per second and pixels, which corresponds to per frame and pixel. Thus, we estimate that the original CNN and the hybrid CNN require energies of about and , respectively, for the image-capturing process per image. However, the energy consumption of the computational backend is much larger for the original CNN than for the hybrid CNN. For a state-of-the-art computational system, the energy consumption per MAC operation is [26, 43]. Thus, the energy consumption of a single object classification task for the hybrid CNN is about , which is more than four orders of magnitude smaller than that of the original CNN, . We note that we can trade off the sensor power (by reducing the number of kernels) against the computational backend power (by increasing MAC operations). However, performing more operations in the optical encoder with a simpler computational backend is always preferred to reduce latency.
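Taking a fixed energy cost per MAC operation, the backend-energy ratio between the original and hybrid CNNs reduces to the MAC-count ratio from Extended Data Table A1, which is the source of the "more than four orders of magnitude" figure; the per-MAC energy cancels in the ratio.

```python
# Backend compute reduction implied by the MAC counts in Extended Data Table A1.
ALEXNET_MACS = 3_651_368_960
HYBRID_MACS = 150_016
ratio = ALEXNET_MACS / HYBRID_MACS      # per-MAC energy cancels in this ratio
print(f"backend compute reduction: ~{ratio:,.0f}x (more than four orders of magnitude)")
```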
3.3 Applications
The hybrid CNN has strong advantages in terms of latency and energy consumption over the original CNN. Additionally, it can be integrated into a commercial imaging system (e.g., a camera) without modifying the physical architecture, other than replacing the lens with PSF-engineered meta-optics. Moreover, the ability to encode color images opens the potential to use the encoder for real-world scenes. However, the sacrifice in accuracy is critical and non-negotiable in cases where safety matters (e.g., autonomous driving vehicles). On the other hand, if the object classification is used for statistical analysis (where the ensemble average can suppress individual inaccuracies), we can tolerate the loss of classification accuracy. Habitat-monitoring drones are one example [47, 48]. Especially in the case of drones, the restriction to minimize weight is crucial, which forces them to store only essential features; on-site data processing can then be beneficial. As our optical encoder minimizes the latency and energy consumption of on-site data processing, habitat drones could survey much larger areas in a single flight.
4 Conclusion
The results in this work provide strong evidence that an optical frontend can significantly reduce the power consumption and latency of ANNs for computer vision tasks. Despite realistic fabrication and measurement errors arising from the optical implementation, the approach achieves state-of-the-art classification accuracy on the multichannel CIFAR-10 data with the addition of a calibration layer and trainable fully-connected layers. The use of a single meta-optical layer to perform complex, multichannel convolutions highlights a unique applicability of meta-optics that cannot be accomplished using traditional optics. In addition, we address the lack of reconfigurability in existing optical implementations using a transfer learning approach, adapting the optical frontend optimized for CIFAR-10 to the High-10 dataset. In this regard, we suggest that a hybrid approach comprising an optical frontend and a reconfigurable digital backend exploits the key advantages of optics (i.e., negligible latency, low loss, large space-bandwidth product) together with the robustness and reconfigurability provided by the backend.
5 Methods
5.1 Knowledge distillation
Typically, the knowledge distillation algorithm is designed to compress neural networks. Here, we use knowledge distillation to transfer the generalized knowledge from a larger, pre-trained teacher network, AlexNet, to a more compact CNN, referred to as the "student network." Specifically, the student network comprises only a single convolutional layer coupled with a backend, which consists of a single fully connected layer and a linear calibration layer. We selected AlexNet as our teacher network for two main reasons: first, AlexNet was the foundational model that successfully addressed the ImageNet dataset; second, compared to more complex networks such as ResNet-18 or VGG-16, the five-convolutional-layer AlexNet is more accessible and easier to implement optically.
The knowledge distillation algorithm includes two types of losses: a student loss and a temperature loss. The student loss minimizes the discrepancy between the student network's predictions and the ground-truth labels. The student probabilities are computed with the softmax function, and the loss is the cross-entropy against the labels:
$$\mathcal{L}_{\mathrm{student}} = \mathcal{L}_{\mathrm{CE}}\!\left(y,\ \mathrm{softmax}(z^{S})\right) \tag{5}$$
where $z^{S}$ represents the student logits after the last fully connected layer and $y$ denotes the ground-truth labels. The temperature loss, on the other hand, optimizes the discrepancy between the student network's predictions and the teacher network's predictions. Knowledge distillation incorporates a softening parameter $T$, known as the distillation temperature, applied to the teacher probabilities. Thus, we can compute this loss as
$$\mathcal{L}_{\mathrm{temp}} = \mathcal{L}_{\mathrm{KL}}\!\left(\mathrm{softmax}(z^{T}/T),\ \mathrm{softmax}(z^{S}/T)\right) \tag{6}$$
where $z^{T}$ are the teacher logits.
Finally, the total loss is calculated as a weighted sum of the two losses:
$$\mathcal{L}_{\mathrm{total}} = \alpha\, \mathcal{L}_{\mathrm{student}} + (1 - \alpha)\, \mathcal{L}_{\mathrm{temp}} \tag{7}$$
where $\alpha$ is the weight balancing the two loss components, $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss function, and $\mathcal{L}_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence loss function [49].
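A minimal TensorFlow sketch of this objective (Eqs. 5-7) is given below; the temperature, the weight α, and the T² rescaling convention are illustrative choices rather than the values used in this work.

```python
# Sketch of the knowledge-distillation objective: hard-label cross-entropy on
# the student plus temperature-softened KL divergence to the teacher.
import tensorflow as tf

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Student loss (Eq. 5): cross-entropy between softmax(student logits) and labels
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        labels, tf.nn.softmax(student_logits))
    # Temperature loss (Eq. 6): KL divergence between T-softened teacher and student
    p_teacher = tf.nn.softmax(teacher_logits / T)
    p_student = tf.nn.softmax(student_logits / T)
    kl = tf.keras.losses.KLDivergence()(p_teacher, p_student)
    # Total loss (Eq. 7): weighted sum (the T**2 factor is the usual convention)
    return alpha * tf.reduce_mean(ce) + (1.0 - alpha) * (T ** 2) * kl

# Toy usage on a batch of two samples with ten classes.
labels = tf.constant([1, 3])
student = tf.random.normal((2, 10))
teacher = tf.random.normal((2, 10))
loss = distillation_loss(student, teacher, labels)
```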
We also find that other key hyperparameters can impact our hybrid CNN system. First, most CNNs, such as ResNet-18 and AlexNet, have multiple convolutional layers, each with more than 200 kernels, to extract useful features and maintain generalization across various datasets. While some pruning strategies show that using 1% of the parameters can achieve similar accuracy [50], applying these algorithms to optical neural networks is non-trivial. Most pruning methods still retain deep ANN structures, which suffer from misalignments that are nearly impossible to eliminate optically [31]. Therefore, compressing into shallower layers with more kernels is preferred. However, each metasurface has a physical size that limits the number of kernels it can contain. To address this limitation, we could use multiple cameras and multiple metasurfaces to increase the number of kernels, thereby improving the classification accuracy and generalization of the hybrid CNN.
5.2 Meta-optics design
For the 16 digital kernels in each of the R, G, and B channels, we use 32 meta-optical kernels: a single meta-optic serves all RGB channels, but we cannot represent both positive and negative weights with optics. Hence, we create 16 positive kernels and 16 negative kernels and then perform digital subtraction on the digital backend. Each convolutional meta-optic consists of scatterers, with scatterers grouped together to enhance robustness to fabrication. Based on the ground-truth digital convolutional kernels, we defined optical PSFs for each RGB channel and inverse-designed the meta-optics to produce those PSFs at the respective RGB wavelengths using the TensorFlow Adam optimizer.
5.3 Meta-optics fabrication
Our meta-optics operate at visible wavelengths. We use silicon nitride on a quartz substrate so that the meta-optics have high transparency across the whole visible regime. We deposit a silicon nitride layer on top of a double-side-polished quartz substrate using plasma-enhanced chemical vapor deposition (Oxford Plasma Lab 100). We spin-coat and bake an electron-beam resist (ZEP-520A) on top of the silicon nitride layer, followed by spin-coating an anti-charging agent (DisCharge H2O). We pattern the resist using electron-beam lithography (JEOL JBX6300FS) and develop it in amyl acetate. After that, we deposit an alumina layer via electron-beam evaporation (CHA SEC-600) and lift it off to form a hard mask. Finally, we etch the silicon nitride through the alumina hard mask using a fluorine-based plasma etcher (Oxford PlasmaLab 100, ICP-180). The sub-wavelength-structured meta-optics have a period of 293 nm, which is half of the camera pixel size used to collect the image.
5.4 Optical measurements
We measure the PSF by placing a laser and a pinhole, representing a point source. We then place the convolutional meta-optics on a 3-axis stage with rotational knobs to align the meta-optics centered on, and parallel to, the beam path. A high-resolution color camera (GT1930C), which has a pixel size of 5.86 μm, is placed away from the meta-optics. We measure the PSFs for each RGB color by swapping among three laser diodes of different wavelengths (Thorlabs CPS450, CPS532, and CPS635). For the image-convolution measurements of the CIFAR-10 dataset, we put a micro-display at the pinhole position and connect it to a computer to show the color images. Since a single meta-optic can represent three different RGB kernels at the same time (see details in Supplementary Materials), a color camera with RGB color pixels can extract the convolved images in the three channels. This eventually saves space for the meta-optics and camera, which is critical in real-world applications. The point source is replaced by an arbitrary two-dimensional image, which can be expressed as a sum of the three color channels. The convolutional meta-optics perform a convolution for each color, and as a result, a convolved image is formed on the camera. Since we determined the enlargement factor of for the PSF, we use the same enlargement factor for the CIFAR-10 images as well. Given the camera pixel size and the CIFAR-10 image size, the projected image size on the camera has to be about . With the given distances between the display and the meta-optics and between the meta-optics and the camera, the CIFAR-10 image size on the display has to be . We use 10,000 images for training (a subset of the original images) and 10,000 images for testing, with an exposure time of 500 ms. Among the 10,000 training and 10,000 testing images, 186 and 201 images, respectively, are excluded from training and testing due to overexposure. All measurement parameters and numbers of images are the same for the High-10 dataset in the transfer learning process.
Another critical factor is the exposure time. Since the optical features are captured by a CCD camera, the exposure time significantly influences the final performance. If the optical features are overexposed, texture information, such as the fur of a cat, might be missing. Conversely, if the optical features are underexposed, most information may also be lost, resulting in a lack of distinction between highlights and shadows in the image. To find the most appropriate exposure, we could use a similar approach to modern cameras, where “18% gray” is considered as the mid-point between black and white on a logarithmic or exponential curve. This standard can help us achieve balanced exposure, ensuring that the captured optical features are neither overexposed nor underexposed.
5.5 Computational backend
As previously discussed, optical fabrication and alignment noise are unavoidable with metasurface kernels. These include scaling, translation, rotation, image aberration, and optical noise. To address this issue, we propose adding a calibration function that remaps the optical convolution outputs to align with those of the previously trained backend. Specifically, we use a fully connected layer as the calibration function, and the corresponding loss function is defined as
$$\mathcal{L}_{\mathrm{cal}} = \left\lVert f_{\mathrm{cal}}\!\left(\mathbf{x}^{\mathrm{opt}}\right) - \mathbf{x}^{\mathrm{dig}} \right\rVert_{2}^{2} \tag{8}$$
where $\mathbf{x}^{\mathrm{opt}}$ denotes the measured optical convolution outputs, $\mathbf{x}^{\mathrm{dig}}$ the corresponding outputs of the digital convolutional layer, and $f_{\mathrm{cal}}$ the fully connected calibration layer.
This approach aims to refine the experimental outputs to align more closely with the pre-designed network. To prevent overfitting, we strategically limit this training to only 20% of the available data, ensuring that our model remains efficient [51, 52].
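A minimal sketch of this calibration step is shown below; the feature dimension of 576 is inferred from the network dimensions in Extended Data Table A1, and the mean-squared feature-matching objective mirrors the reconstructed form of Eq. (8), so both are assumptions rather than the exact implementation.

```python
# Sketch of the calibration layer: a single trainable linear remapping of the
# measured optical features, trained while the pre-trained backend stays frozen.
import tensorflow as tf

n_features = 576                                  # 16 kernels x 6 x 6 pooled outputs (assumed)
calibration = tf.keras.layers.Dense(n_features)   # linear fully-connected calibration layer

def calibration_loss(optical_features, digital_features):
    # feature-matching form assumed for Eq. (8)
    return tf.reduce_mean(tf.square(calibration(optical_features) - digital_features))

# Toy usage on a batch of 8 feature vectors.
opt_feat = tf.random.normal((8, n_features))
dig_feat = tf.random.normal((8, n_features))
loss = calibration_loss(opt_feat, dig_feat)
```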
5.6 Transfer learning
Generalization is a key capability to test for our hybrid optical/digital CNN. Ensuring that the network can generalize well to new, unseen data is crucial for several reasons. First, our hybrid network is compressed from AlexNet, which was originally designed for a large dataset. The pre-trained AlexNet achieves high accuracy across various datasets and can be easily adapted or fine-tuned to out-of-distribution datasets. This adaptability is essential for practical applications where the data distribution may differ from the training set. Second, exploring the generalization capabilities of hybrid models is important because designing and fabricating different metasurface kernels for different tasks is inefficient. By enhancing generalization, we can use a single hybrid model for multiple tasks, reducing the need for extensive redesign and fabrication. Details and schematics of our transfer learning scheme are shown in the Supplementary Materials.
To implement the transfer learning, we use two types of losses: a feature loss and a label loss. The feature loss minimizes the discrepancy between the optical features and the digital features, ensuring that the representations learned by the optical and digital components are aligned. The label loss minimizes the discrepancy between the model's predictions and the actual labels, improving the overall prediction accuracy. During the transfer learning process, the optical frontend and digital backend remain unchanged. We add two fully connected layers between the optical frontend and the backend and fine-tune these layers using the two losses. Specifically, the total loss is
$$\mathcal{L}_{\mathrm{transfer}} = \beta_{1}\,\mathcal{L}_{\mathrm{feature}} + \beta_{2}\,\mathcal{L}_{\mathrm{label}} \tag{9}$$
where $\mathcal{L}_{\mathrm{feature}}$ is the feature loss and $\mathcal{L}_{\mathrm{label}}$ is the label loss, with $\beta_{1}$ and $\beta_{2}$ as the respective weights balancing these losses.
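A minimal sketch of this fine-tuning step is given below; the inserted layer sizes and loss weights are placeholders, the logits are assumed to come from the frozen backend applied to the transferred features, and only the inserted layers would be updated during training.

```python
# Sketch of the transfer-learning objective (Eq. 9): feature-matching term
# plus a label term, with the optical frontend and digital backend frozen.
import tensorflow as tf

transfer_layers = tf.keras.Sequential([        # trainable layers inserted between
    tf.keras.layers.Dense(576, activation="relu"),   # frontend and backend (placeholder sizes)
    tf.keras.layers.Dense(576),
])

def transfer_loss(optical_feat, reference_feat, logits, labels, beta1=1.0, beta2=1.0):
    l_feature = tf.reduce_mean(tf.square(transfer_layers(optical_feat) - reference_feat))
    l_label = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True))
    return beta1 * l_feature + beta2 * l_label   # Eq. (9)
```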
Declarations
Acknowledgements The research is supported by the National Science Foundation (EFRI-BRAID-2223495). Part of this work was conducted at the Washington Nanofabrication Facility/Molecular Analysis Facility, a National Nanotechnology Coordinated Infrastructure (NNCI) site at the University of Washington, with partial support from the National Science Foundation via awards NNCI-1542101 and NNCI-2025489.
Conflict of interest A.M. is a co-founder of Tunoptix, which aims to commercialize meta-optics technology.
Code availability The code and data generated and/or analyzed are available from the corresponding author upon reasonable request.
Author contribution M.C. and J.X. contributed equally to this work. A.M. and E.S. conceived the idea and provided funding. A.W.S. was involved in developing the PSF-engineering approach. J.X. and E.S. developed the knowledge distillation approach. J.X. trained the neural networks and analyzed the experimental data. M.C. developed the polychromatic PSF-engineering approach and designed the meta-optics. M.C. fabricated the meta-optics and conducted the optical experiments. S.H.B. advised on the result analysis. M.C., J.X., and A.M. wrote the manuscript with input from all authors.
Appendix A Extended Data
Layers | AlexNet | Compressed CNN | Hybrid optical/digital CNN
---|---|---|---
Convolution | 3→64, 11×11, 224×224 | 3→16, 7×7, 32×32 | 0
 | 64→192, 5×5, 55×55 | |
 | 192→384, 5×5, 27×27 | |
 | 384→256, 3×3, 13×13 | |
 | 256→256, 3×3, 6×6 | |
Fully-connected | 9216→4096 | 576→256 | 576→576∗
 | 4096→1024 | 256→10 | 576→256
 | 1024→10 | | 256→10
Sum | 3,651,368,960 | 2,558,464 | 150,016
∗Calibration layer; not counted toward the inference MAC total.
\botrule
Appendix B Supplementary Materials
The online version contains supplementary materials.
References
- Thorpe et al. [1996] Thorpe, S., Fize, D., Marlot, C.: Speed of processing in the human visual system. nature 381(6582), 520–522 (1996)
- Graimann et al. [2010] Graimann, B., Allison, B.Z., Pfurtscheller, G.: Brain-computer Interfaces: Revolutionizing Human-computer Interaction. Springer, ??? (2010)
- De Cesarei et al. [2017] De Cesarei, A., Loftus, G.R., Mastria, S., Codispoti, M.: Understanding natural scenes: Contributions of image statistics. Neuroscience & Biobehavioral Reviews 74, 44–57 (2017)
- Iacaruso et al. [2017] Iacaruso, M.F., Gasler, I.T., Hofer, S.B.: Synaptic organization of visual space in primary visual cortex. Nature 547(7664), 449–452 (2017)
- Nassi and Callaway [2009] Nassi, J.J., Callaway, E.M.: Parallel processing strategies of the primate visual system. Nature reviews neuroscience 10(5), 360–372 (2009)
- Yang et al. [2024] Yang, Z., Wang, T., Lin, Y., Chen, Y., Zeng, H., Pei, J., Wang, J., Liu, X., Zhou, Y., Zhang, J., et al.: A vision chip with complementary pathways for open-world sensing. Nature 629(8014), 1027–1033 (2024)
- Gehrig and Scaramuzza [2024] Gehrig, D., Scaramuzza, D.: Low-latency automotive vision with event cameras. Nature 629(8014), 1034–1040 (2024)
- Li et al. [2023] Li, W., Ma, Q., Liu, C., Zhang, Y., Wu, X., Wang, J., Gao, S., Qiu, T., Liu, T., Xiao, Q., et al.: Intelligent metasurface system for automatic tracking of moving targets and wireless communications based on computer vision. Nature Communications 14(1), 989 (2023)
- Siemenn et al. [2024] Siemenn, A.E., Aissi, E., Sheng, F., Tiihonen, A., Kavak, H., Das, B., Buonassisi, T.: Using scalable computer vision to automate high-throughput semiconductor characterization. Nature Communications 15(1), 4654 (2024)
- Yang et al. [2024] Yang, Y., Pan, C., Li, Y., Yangdong, X., Wang, P., Li, Z.-A., Wang, S., Yu, W., Liu, G., Cheng, B., et al.: In-sensor dynamic computing for intelligent machine vision. Nature Electronics 7(3), 225–233 (2024)
- Woodget et al. [2017] Woodget, A.S., Austrums, R., Maddock, I.P., Habit, E.: Drones and digital photogrammetry: from classifications to continuums for monitoring river habitat and hydromorphology. Wiley Interdisciplinary Reviews: Water 4(4), 1222 (2017)
- Zhang [2020] Zhang, T.: Toward automated vehicle teleoperation: Vision, opportunities, and challenges. IEEE Internet of Things Journal 7(12) (2020)
- Wu et al. [2023] Wu, Y., Li, W., Zhang, J., Tang, B., Xiang, J., Li, S., Guo, G.: Driver’s hand-foot coordination and global-regional brain functional connectivity under fatigue: Via graph theory and explainable artificial intelligence. IEEE Transactions on Intelligent Vehicles (2023)
- McMahon [2023] McMahon, P.L.: The physics of optical computing. Nature Reviews Physics 5(12), 717–734 (2023)
- Wang et al. [2024] Wang, X., Redding, B., Karl, N., Long, C., Zhu, Z., Skowronek, J., Pang, S., Brady, D., Sarma, R.: Integrated photonic encoder for low power and high-speed image processing. Nature Communications 15(1), 4510 (2024)
- Chen et al. [2023] Chen, Y., Zhou, T., Wu, J., Qiao, H., Lin, X., Fang, L., Dai, Q.: Photonic unsupervised learning variational autoencoder for high-throughput and low-latency image transmission. Science Advances 9(7), 8437 (2023)
- Xue et al. [2024] Xue, Z., Zhou, T., Xu, Z., Yu, S., Dai, Q., Fang, L.: Fully forward mode training for optical neural networks. Nature 632(8024), 280–286 (2024)
- Bernstein et al. [2023] Bernstein, L., Sludds, A., Panuski, C., Trajtenberg-Mills, S., Hamerly, R., Englund, D.: Single-shot optical neural network. Science Advances 9(25), 7904 (2023)
- Xu et al. [2024] Xu, Z., Zhou, T., Ma, M., Deng, C., Dai, Q., Fang, L.: Large-scale photonic chiplet taichi empowers 160-tops/w artificial general intelligence. Science 384(6692), 202–209 (2024)
- Xia et al. [2024] Xia, F., Kim, K., Eliezer, Y., Han, S., Shaughnessy, L., Gigan, S., Cao, H.: Nonlinear optical encoding enabled by recurrent linear scattering. Nature Photonics, 1–9 (2024)
- Bai et al. [2023] Bai, B., Li, Y., Luo, Y., Li, X., Çetintaş, E., Jarrahi, M., Ozcan, A.: All-optical image classification through unknown random diffusers using a single-pixel diffractive network. Light: Science & Applications 12(1), 69 (2023)
- Huo et al. [2023] Huo, Y., Bao, H., Peng, Y., Gao, C., Hua, W., Yang, Q., Li, H., Wang, R., Yoon, S.-E.: Optical neural network via loose neuron array and functional learning. Nature Communications 14(1), 2535 (2023)
- Rahman and Ozcan [2023] Rahman, M.S.S., Ozcan, A.: Time-lapse image classification using a diffractive neural network. Advanced Intelligent Systems 5(5), 2200387 (2023)
- Yang et al. [2023] Yang, M., Robertson, E., Esguerra, K., Busch, K., Wolters, J.: Optical convolutional neural network with atomic nonlinearity. Optics Express 31(10), 16451–16459 (2023)
- Ryou et al. [2021] Ryou, A., Whitehead, J., Zhelyeznyakov, M., Anderson, P., Keskin, C., Bajcsy, M., Majumdar, A.: Free-space optical neural network based on thermal atomic nonlinearity. Photon. Res. 9, 128–134 (2021)
- Wang et al. [2023] Wang, T., Sohoni, M.M., Wright, L.G., Stein, M.M., Ma, S.-Y., Onodera, T., Anderson, M.G., McMahon, P.L.: Image sensing with multilayer nonlinear optical neural networks. Nature Photonics 17(5), 408–415 (2023)
- Jang and Tong [2024] Jang, H., Tong, F.: Improved modeling of human vision by incorporating robustness to blur in convolutional neural networks. Nature Communications 15(1), 1989 (2024)
- Ji et al. [2021] Ji, Y., Zhang, H., Zhang, Z., Liu, M.: Cnn-based encoder-decoder networks for salient object detection: A comprehensive review and recent advances. Information Sciences 546, 835–857 (2021)
- Xiang et al. [2022] Xiang, J., Colburn, S., Majumdar, A., Shlizerman, E.: Knowledge distillation circumvents nonlinearity for optical convolutional neural networks. Applied Optics 61(9), 2173–2183 (2022)
- Goodman [2005] Goodman, J.W.: Introduction to Fourier Optics. Roberts and Company publishers, ??? (2005)
- Wirth-Singh et al. [2024] Wirth-Singh, A., Xiang, J., Choi, M., Fröch, J.E., Huang, L., Colburn, S., Shlizerman, E., Majumdar, A.: Compressed meta-optical encoder for image classification. arXiv preprint arXiv:2406.06534 (2024)
- Yang et al. [2010] Yang, J., Yu, K., Huang, T.: Supervised translation-invariant sparse coding. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3517–3524 (2010). IEEE
- Krizhevsky [2009] Krizhevsky, A.: Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/kriz/learning-features-2009-TR.pdf (2009)
- Coates et al. [2011] Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223 (2011). JMLR Workshop and Conference Proceedings
- Anderson et al. [2023] Anderson, M.G., Ma, S.-Y., Wang, T., Wright, L.G., McMahon, P.L.: Optical transformers. arXiv preprint arXiv:2302.10360 (2023)
- Zhou and Chai [2020] Zhou, F., Chai, Y.: Near-sensor and in-sensor computing. Nature Electronics 3(11), 664–671 (2020)
- Tara et al. [2024] Tara, V., Chen, R., Fröch, J.E., Fang, Z., Fang, J., Audhkhasi, R., Choi, M., Majumdar, A.: Non-volatile reconfigurable transmissive notch filter using wide bandgap phase change material antimony sulfide. IEEE Journal of Selected Topics in Quantum Electronics (2024)
- Kang et al. [2024] Kang, D., Heo, H., Yang, Y., Seong, J., Kim, H., Kim, J., Rho, J.: Liquid crystal-integrated metasurfaces for an active photonic platform. Opto-Electronic Advances, 230216–1 (2024)
- Hinton et al. [2015] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
- Chang et al. [2018] Chang, J., Sitzmann, V., Dun, X., Heidrich, W., Wetzstein, G.: Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification. Scientific reports 8(1), 1–10 (2018)
- Shimobaba et al. [2012] Shimobaba, T., Matsushima, K., Kakue, T., Masuda, N., Ito, T.: Scaled angular spectrum method. Optics letters 37(19), 4128–4130 (2012)
- Developers [2022] Developers, T.: Tensorflow. Zenodo (2022)
- Huang et al. [2024] Huang, L., Tanguy, Q.A., Fröch, J.E., Mukherjee, S., Böhringer, K.F., Majumdar, A.: Photonic advantage of optical encoders. Nanophotonics 13(7), 1191–1196 (2024)
- Wei et al. [2023] Wei, K., Li, X., Froech, J., Chakravarthula, P., Whitehead, J., Tseng, E., Majumdar, A., Heide, F.: Spatially varying nanophotonic neural networks. arXiv preprint arXiv:2308.03407 (2023)
- Huang et al. [2024] Huang, L., Han, Z., Wirth-Singh, A., Saragadam, V., Mukherjee, S., Fröch, J.E., Tanguy, Q.A.A., Rollag, J., Gibson, R., Hendrickson, J.R., Hon, P.W.C., Kigner, O., Coppens, Z., Böhringer, K.F., Veeraraghavan, A., Majumdar, A.: Broadband thermal imaging using meta-optics. Nature Communications 15(1662) (2024) https://doi.org/10.1038/s41467-024-45904-w
- Kim and Rush [2016] Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947 (2016)
- Koger et al. [2023] Koger, B., Deshpande, A., Kerby, J.T., Graving, J.M., Costelloe, B.R., Couzin, I.D.: Quantifying the movement, behaviour and environmental context of group-living animals using drones and computer vision. Journal of Animal Ecology 92(7), 1357–1371 (2023)
- Chalmers et al. [2021] Chalmers, C., Fergus, P., Curbelo Montanez, C.A., Longmore, S.N., Wich, S.A.: Video analysis for the detection of animals using convolutional neural networks and consumer-grade drones. Journal of Unmanned Vehicle Systems 9(2), 112–127 (2021)
- Seo et al. [2020] Seo, G., Yoo, J., Cho, J., Kwak, N.: Kl-divergence-based region proposal network for object detection. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 2001–2005 (2020). IEEE
- Liang et al. [2021] Liang, T., Glossner, J., Wang, L., Shi, S., Zhang, X.: Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 461, 370–403 (2021)
- Xiang and Shlizerman [2023] Xiang, J., Shlizerman, E.: Tkil: Tangent kernel optimization for class balanced incremental learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3529–3539 (2023)
- Zheng et al. [2020] Zheng, Y., Xiang, J., Su, K., Shlizerman, E.: Bi-maml: Balanced incremental approach for meta learning. arXiv preprint arXiv:2006.07412 (2020)