These authors contributed equally to this work.
Xiaoqing Zheng¹
¹ Department of Computer Science, Fudan University
² Department of Computer Science, University College London
Towards Biologically Plausible Computing: A Comprehensive Comparison
Abstract
Backpropagation is a cornerstone algorithm for training neural networks in supervised learning: it uses gradient descent to update network weights by minimizing the discrepancy between actual and desired outputs. Despite its pivotal role in propelling deep learning advances, the biological plausibility of backpropagation is questioned because of its requirements for weight symmetry, global error computation, and dual-phase training. To address this long-standing challenge, many studies have endeavored to devise biologically plausible training algorithms. However, a fully biologically plausible algorithm for training multilayer neural networks remains elusive, and interpretations of biological plausibility vary among researchers. In this study, we establish criteria for biological plausibility that a desirable learning algorithm should meet. Using these criteria, we evaluate a range of existing algorithms considered to be biologically plausible, including Hebbian learning, spike-timing-dependent plasticity, feedback alignment, target propagation, predictive coding, the forward-forward algorithm, perturbation learning, local losses, and energy-based learning. We also evaluate these algorithms empirically across diverse network architectures and datasets, and we compare the feature representations they learn with brain activity recorded by non-invasive devices under identical stimuli, aiming to identify which algorithm can most accurately replicate brain activity patterns. We hope that this study will inspire the development of new biologically plausible algorithms for training multilayer networks, thereby fostering progress in both neuroscience and machine learning.
keywords: Biologically Plausible Computing, Learning Algorithms, Deep Neural Networks
1 Introduction
Backpropagation [1] has been instrumental in the rapid development of deep learning [2], establishing itself as the standard approach for training neural networks in supervised learning settings. This algorithm leverages a gradient descent method to iteratively adjust network weights, thereby minimizing the errors between the actual outputs of the network and the desired outputs. Despite its undeniable success and widespread adoption in various applications ranging from image recognition [3] to natural language processing [4, 5], the biological plausibility of backpropagation remains a subject of intense debate among researchers in both neuroscience and computational science [6, 7, 8].
The primary criticisms of backpropagation’s biological plausibility stem from several unrealistic requirements: the symmetry of weight updates in the forward and backward passes [9], the computation of global errors that must be propagated backward through all layers [10], and the necessity of a dual-phase training process involving distinct forward and backward passes [11]. These features are not only computationally intensive but also lack clear analogs in neurobiological processes, which operate under constraints of local information processing and low energy consumption.
Recognizing these limitations, the research community has made significant strides toward developing alternative training algorithms that could potentially align more closely with biological processes. Efforts have ranged from revisiting classical theories such as Hebbian learning [12] to exploring newer concepts like spike-timing-dependent plasticity (STDP) [13] and feedback alignment [14]. Each of these approaches offers a unique perspective on how synaptic changes might occur in a biologically plausible manner, yet a consensus on a fully effective and biologically accurate training method remains out of reach.
In this paper, we aim to critically assess the current landscape of what are considered biologically plausible learning algorithms. We begin by establishing a set of criteria (See Section 2) that any algorithm must meet to be considered biologically plausible. These criteria are designed to encapsulate essential aspects of neurobiological learning, such as the locality of computations, the absence of a global error signal, and energy efficiency in synaptic adjustments.
Following the establishment of these criteria, we embark on a comprehensive evaluation of various learning algorithms that have been proposed in the literature as biologically plausible models. These include, but are not limited to, Hebbian learning [12], STDP [13], feedback alignment [14], target propagation [15], predictive coding [16], the forward-forward algorithm [17], perturbation learning [18, 19], local losses [20], and energy-based learning [21, 22]. Our evaluation not only examines the theoretical foundations and computational efficiency of these algorithms but also involves empirical assessments across various neural network architectures and datasets.
Furthermore, we extend our analysis to include a comparative study of the feature representations learned by these algorithms against actual brain activity patterns. This was achieved by using non-invasive brain recording techniques, such as fMRI and EEG, to record neural responses to identical stimuli and compare these responses to the activations within artificial neural networks trained by the aforementioned algorithms.
By providing a thorough analysis of these algorithms and their ability to model brain-like learning processes, we aspire to contribute to the ongoing dialogue between the fields of neuroscience and artificial intelligence. Ultimately, we hope that this study will not only shed light on the current capabilities and limitations of proposed biologically plausible learning algorithms but also inspire further research and development in this crucial area. This endeavor aims to bridge the gap between biological learning processes and artificial learning systems, paving the way for the development of more efficient, robust, and biologically inspired computational models.
2 Criteria for Biological Plausibility
As the field of artificial intelligence strives to develop algorithms that are not only efficient but also mimic the fundamental mechanisms of human cognition, the concept of biologically plausible computing has gained significant interest. Traditional backpropagation, while effective for training deep neural networks, diverges from known biological processes in several key areas. This divergence has prompted researchers to explore alternative algorithms that might adhere more closely to the principles observed in natural neural systems. However, the notion of biological plausibility is not universally defined and varies significantly across different studies. To navigate this complexity, it is essential to establish clear criteria against which these models can be evaluated. In this section, we outline these criteria and assess existing models accordingly.
We propose five criteria for evaluating the biological plausibility of learning algorithms, summarized from existing literature.
1. Asymmetry of Forward and Backward Weights: In conventional neural networks, the forward-path neurons transmit their synaptic weights to a feedback path, a process known as weight transport, which is biologically implausible. Real neurons are unlikely to share precise synaptic weights in such a manner.
2. Local Error Representation: Biological synapses are believed to modify their strength based on local information, without access to a global error signal. This contrasts with the gradient descent approach, where the direction of the error gradient is typically computed using global information.
3. Non-parallel Training (or Non-Two-Stage Learning): Traditional training methods often require a clear distinction between the phases of forward and backward propagation, which is not a feature of biological learning. Bio-plausible methods are explored for their ability to simplify the learning process into more continuous, possibly overlapping phases that better mimic biological learning dynamics.
4. Models of Neurons: The majority of artificial neural networks utilize neurons that output continuous values, intended to represent the firing rates of biological neurons, which in reality communicate via spikes. This discrepancy is addressed through models that incorporate more realistic, spiking neuron models and learning rules adapted to such models.
5. Unsigned Error Signals: In biological systems, error signals are not typically signed or extreme-valued as in many artificial systems. Some learning rules attempt to approximate the way biological systems might handle error feedback without relying on these artificial constructs.
3 Brain-Inspired Learning Algorithms
In this section, we review nine representative brain-inspired learning algorithms in detail: Hebbian learning, spike-timing-dependent plasticity, feedback alignment, target propagation, predictive coding, the forward-forward algorithm, perturbation learning, local losses, and energy-based learning. An illustration of these methods is shown in Figure 1.

3.1 Hebbian Learning
Hebbian learning is a fundamental concept in neural network models aimed at explaining how neurons in the brain adapt and learn from experience [12]. Proposed by [23], the theory suggests that the efficiency of a neuron in contributing to the firing of another neuron can increase if the two neurons are repeatedly involved in each other’s activation. The theory is often summarized as “Cells that fire together wire together.” It is biologically plausible and ecologically valid, relying solely on inputs entering the system to produce patterns of activity, without explicit tasks or teaching signals [24].
In neural network models, Hebbian learning is implemented through changes in the strength of connection weights between units. The strength of a connection weight determines the efficacy of a sending unit in activating a receiving unit. Through Hebbian learning, weights change as a function of the activity levels of the units involved. The basic form of the Hebbian learning rule can be expressed as:
\[
\Delta w_{ij} = \eta\, x_i\, y_j \tag{1}
\]
where $\Delta w_{ij}$ represents the weight change, $x_i$ is the activity of the pre-synaptic unit $i$, $y_j$, a function of the inputs and the weights, is the post-synaptic activation of the neuron $j$, and the learning rate $\eta$ determines the speed at which the weights change in response to unit activation. The activation level of a unit typically falls within the range of 0 to 1 and is calculated as a nonlinear function of the activation of other units and the strength of their connections to the unit.
The main problem with rule (1) is that it only allows weights to grow, never decrease. To prevent the weight vector from growing unbounded, it is possible to normalize it after every update by introducing a weight decay term proportional to the value of the weight, giving:
\[
\Delta w_{ij} = \eta\, y_j \left( x_i - w_{ij} \right) \tag{2}
\]
Equation (2) has the physical interpretation that at each iteration, the weight vector is modified by taking a step towards the input, with the size of the step being proportional to the similarity between the input and the weight vector. Consequently, if a similar input is presented again in the future, the neuron will be more likely to produce a stronger response. If an input (or a cluster of similar inputs) is repeatedly presented to the neuron, the weight vector tends to converge towards it, eventually acting as a matching filter. In other words, the input is memorized in the synaptic weights. From this perspective, the neuron can be seen as an entity that, when stimulated with a frequent pattern, learns to recognize it.
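To make the matching-filter intuition concrete, here is a minimal NumPy sketch of the decayed Hebbian rule of Equation (2) applied to a single neuron repeatedly stimulated with a noisy pattern. All names and constants (the learning rate `eta`, the sigmoid activation, the step count) are illustrative assumptions, not details from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

eta = 0.1                      # learning rate (illustrative)
w = rng.normal(size=10)        # synaptic weight vector
pattern = rng.normal(size=10)  # a frequently presented input pattern

for _ in range(500):
    x = pattern + 0.1 * rng.normal(size=10)   # noisy repetition of the pattern
    y = 1.0 / (1.0 + np.exp(-w @ x))          # post-synaptic activation in (0, 1)
    # Eq. (2): step towards the input, scaled by the post-synaptic activity;
    # the -y*w contribution is the decay that keeps w bounded.
    w += eta * y * (x - w)

# After training, w acts as a matched filter for the repeated pattern.
print(np.corrcoef(w, pattern)[0, 1])
```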
The abstract nature of the Hebbian learning rule allows it to be applicable to different types of neural networks and learning tasks. There can be multiple methods to implement Hebbian learning. HebbNet[25] is a neural network that utilizes an improved Hebbian approach with updated activation thresholds and gradient sparsity. SoftHebb[26] is a variant that combines standard deep learning elements with a Hebbian-like plasticity. It maintains a Bayesian generative model of the data without supervision and minimizes cross-entropy, offering advantages beyond traditional Hebbian efficiency.
3.2 Spike-timing-dependent Plasticity
In biological neural networks, signals are transmitted between neurons through spikes. The pre-synaptic neuron, responsible for signaling, generates a spike, releasing neurotransmitters within the synapse—a structure connecting it to the post-synaptic neuron, which, in turn, triggers the rise of the postsynaptic potential. As the potential accumulates and surpasses a certain threshold, a spike occurs in the postsynaptic neuron. The efficacy of pre-synaptic excitement in inducing post-synaptic excitement serves as the synaptic weight between two layers of neurons.
When a post-synaptic spike closely follows a pre-synaptic spike, it implies a strong association between the excitations of the two neurons, suggesting a causal relationship. In such cases, it is reasonable to strengthen the synaptic weight between them. The shorter the interval between the spikes, the stronger the inferred causal relationship, leading to a greater increase in weight. Conversely, if the post-synaptic spike precedes the pre-synaptic spike, the intuitive assumption is that there is no causal relationship, resulting in a reduction of the synaptic weight. Spike-timing-dependent plasticity (STDP), a learning rule within the family of Hebbian learning models, adjusts synaptic weights based on the timing of spikes between pre- and post-synaptic neurons at the synapse, in line with this intuition.
As proposed by [13], pre-synaptic excitation preceding post-synaptic excitation leads to long-term potentiation (LTP), resulting in an increased synaptic weight. Conversely, when pre-synaptic excitation follows post-synaptic excitation, it induces long-term depression (LTD), causing a decrease in synaptic weight. The magnitude of weight change is contingent upon the timing difference between spikes. This phenomenon, initially observed in neuroscience experiments, has been formulated as a learning rule for neuronal synapses. The mathematical expression for this rule is:
\[
\Delta w =
\begin{cases}
A_+\, e^{-\Delta t / \tau_+}, & \Delta t > 0 \\
-A_-\, e^{\Delta t / \tau_-}, & \Delta t < 0
\end{cases}
\tag{3}
\]
where $\Delta t = t_{\text{post}} - t_{\text{pre}}$ represents the time difference between the spikes of the two neurons at the synapse, $A_+$ and $A_-$ determine the maximum amounts of synaptic modification, and $\tau_+$ and $\tau_-$ are the time constants.
Spike-timing-dependent plasticity (STDP) manifests across a variety of species. Following this initial discussion of STDP, we transition to an exploration of its associated biological mechanisms, particularly the Ca²⁺ hypothesis [27, 28]. Notably, a high influx of Ca²⁺ in the post-synaptic neuron induces long-term potentiation (LTP), while a moderate influx results in long-term depression (LTD). Distinct Ca²⁺ levels activate specific molecular pathways: a high level activates CaMKII for LTP, while moderate levels activate PP1 and calcineurin for LTD [29]. The level of Ca²⁺ influx hinges on the interval between the excitatory postsynaptic potential (EPSP) triggered by the pre-synaptic neuron’s activation and the backpropagating action potential (BAP) initiated by the post-synaptic neuron’s activation. Positive intervals signify BAP occurring shortly after EPSP, while negative intervals denote BAP preceding EPSP. When the interval is positive, the BAP unblocks NMDA receptors [30]. Simultaneously, the EPSP deactivates specific ion channels, augmenting the BAP magnitude [31, 32]. This intensified BAP activates voltage-dependent Ca²⁺ channels (VDCCs) [33]. The interplay between EPSP and BAP in positive intervals results in a substantial Ca²⁺ influx in the post-synaptic neuron. Conversely, when the interval is negative, the afterdepolarization of the BAP coinciding with the EPSP induces a moderate Ca²⁺ influx [29]. The BAP, causing Ca²⁺ influx through VDCCs, inhibits NMDA receptors [34]. These interactions between BAP and EPSP in negative intervals lead to a moderate level of Ca²⁺.
The STDP learning rule has many different mathematical forms and boasts numerous variations. Rather than computing $\Delta t$ between the current spike and every single past spike whenever a spike occurs at the synapse, a more efficient approach involves maintaining a synapse trace for both pre- and post-synaptic neurons. These traces undergo continuous modification over successive timesteps:
\[
x_{tr}(t+1) = \lambda\, x_{tr}(t) + s(t) \tag{4}
\]
where $t$ and $t+1$ are two adjacent timesteps, $x_{tr}$ stands for the synapse trace, which decays over time by a coefficient $\lambda$ ($\lambda < 1$), and $s(t) = 1$ when a spike occurs at $t$ and $s(t) = 0$ if no spike occurs.
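The trace mechanism of Equation (4) can be sketched in a few lines. The following toy example tracks a single synapse with pre- and post-synaptic traces; the parameter values (`a_plus`, `a_minus`, `lam`) and the spike trains are made up for illustration.

```python
import numpy as np

a_plus, a_minus = 0.01, 0.012   # maximum potentiation / depression (illustrative)
lam = 0.9                        # per-step trace decay coefficient (lambda < 1)
w = 0.5                          # synaptic weight
pre_trace = post_trace = 0.0     # synapse traces x_tr for the pre/post neuron

pre_spikes = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
post_spikes = np.array([0, 0, 1, 0, 0, 1, 0, 1, 0, 0])

for s_pre, s_post in zip(pre_spikes, post_spikes):
    # Eq. (4): traces decay by lambda and jump when a spike occurs.
    pre_trace = lam * pre_trace + s_pre
    post_trace = lam * post_trace + s_post
    # LTP: a post-synaptic spike reads out the pre trace (pre fired recently first).
    w += a_plus * pre_trace * s_post
    # LTD: a pre-synaptic spike reads out the post trace (post fired first).
    w -= a_minus * post_trace * s_pre

print(w)
```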
Numerous studies on the effects of Spike-Timing-Dependent Plasticity (STDP) through virtual neural network simulations have been conducted. Research indicates that STDP confers selectivity to neurons towards different input signals [3] and fosters synchronous neuron activities [35, 36]. Notably, the weights between pre- and post-synaptic neurons, influenced by STDP, typically exhibit a bimodal distribution. Moreover, STDP demonstrates the ability to maximize mutual information between input and output neuron spikes [37]. Applying the STDP learning rule enables a single neuron to detect a single hidden repeating spatiotemporal pattern [38], while multiple neurons can selectively detect different patterns [39] from a large set of spike trains with inherent noise. This suggests that STDP serves as an unsupervised learning algorithm with the capability to handle time-series data.
Significant efforts have been devoted to applying the STDP algorithm, primarily through spiking neural networks, to various machine learning tasks, particularly in the field of image processing. Experimental datasets include well-known sets such as MNIST, CIFAR-10, and the Caltech face and motorbike datasets. Diverse neural network architectures have been employed, ranging from hierarchical networks inspired by the human visual cortex [40, 41, 42] and single-layer fully connected networks with inhibitory neurons [43, 44], to deep convolutional networks [45] and liquid state machines [46]. It is noteworthy that certain techniques are commonly employed across these studies. For instance, preprocessing of the original image, including the use of neuromorphic datasets and the application of Difference of Gaussian filters, specific receptive field configurations, latency encoding, and lateral inhibition, have been integral components of many works.
In practical experiments, Equation (3) is usually used to apply STDP, but this learning rule has only been observed in a limited range of brain areas under specific conditions. Two classes of neurons, excitatory and inhibitory, play important roles in learning, and the STDP between them differs markedly; moreover, STDP between neurons at the same brain location can differ across species [47]. The trigger patterns of STDP in vivo also differ from the binary spike-pair pattern of Equation (3) [48], including bursts of different frequencies applied to the pre-synaptic neuron, spike triplets between pre- and post-synaptic neurons, and burst-spike pairs. STDP in vivo is further influenced by the dendritic position of the post-synaptic membrane [49]. In addition to the spiking of the pre- and post-synaptic neurons, STDP is also regulated by other inputs; in particular, neuromodulators and inhibitory activity in the network can affect both the magnitude and the temporal window of STDP [50]. These in vivo phenomena suggest that the STDP rule commonly used in practice may itself fall short of full biological plausibility.
3.3 Feedback Alignment
The concept of Feedback Alignment (FA), originally proposed by Lillicrap et al. [14], presents an intriguing paradigm in neural network training. One of FA’s remarkable attributes lies in its resolution of the weight transport problem, as elucidated in [51], aligning it more closely to the mechanisms observed in the human brain. Subsequent research derived from FA has not only facilitated the localization of error signal transmission [52], but also delved into aspects like update locking [53], further enhancing its biological plausibility.
Diverging from conventional backpropagation methods, FA shows that one can achieve similar effects and accuracy on classification tasks by replacing feedback weights with fixed random synaptic weights. Central to this theory is the notion that the precise alignment of feedback weights to the transpose of the forward weights is unnecessary during training. Instead, a designated weight matrix suffices in steering the network in roughly the same direction as backpropagation, thereby facilitating network training. The replacement matrix chosen only has to ensure the fulfillment of the following equation on an average basis:
\[
e^{\top} W B\, e > 0 \tag{5}
\]
where $e$ denotes the error of the network’s output, $W$ represents the synaptic weights of the forward path, and $B$ is the fixed random feedback matrix. However, studies [54] have revealed a notable decline in FA’s performance when employing deeper convolutional architectures, in contrast to its performance comparable to BP in simpler MLP networks.
Direct Feedback Alignment (DFA) [52] , an offshoot of FA, emerges as a prominent avenue of research aimed at addressing this limitation. DFA tackles this challenge by introducing direct feedback paths and distinct fixed random weights for each hidden layer. This strategic design localizes learning signals by establishing direct connections between errors and individual layers, bypassing the conventional layer-by-layer backpropagation from the output. Consequently, training deeper networks becomes more feasible, attributed to the disentangled feedback paths offering increased flexibility in transmitting error signals. Learning is thus regarded as an extension of a forward pass, marking DFA as a notable stride towards biological plausibility.
To formalize this, traditional FA updates each hidden layer based on the modulated error of the following layer, depicted as:
\[
\delta_l = \left( B_{l+1}\, \delta_{l+1} \right) \odot f'(a_l) \tag{6}
\]
while DFA directly updates each hidden layer in accordance with a distinct feedback matrix and the error of the output layer:
\[
\delta_l = \left( B_l\, e \right) \odot f'(a_l) \tag{7}
\]
Here $B_l$ represents a fixed random weight matrix, $\odot$ indicates element-wise multiplication, and $f'(a_l)$ denotes the derivative of the non-linearity applied to the pre-activations $a_l$ of hidden layer $l$.
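Below is a minimal NumPy sketch of a single DFA update (Equation 7) for a two-hidden-layer MLP, assuming a softmax output with cross-entropy loss; all sizes, the tanh nonlinearity, and the learning rate are illustrative choices rather than settings from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def tanh(a): return np.tanh(a)
def tanh_prime(a): return 1.0 - np.tanh(a) ** 2

# Forward weights and fixed random feedback matrices (never trained).
W1, W2, W3 = (rng.normal(scale=0.1, size=s) for s in [(784, 256), (256, 128), (128, 10)])
B1 = rng.normal(scale=0.1, size=(10, 256))
B2 = rng.normal(scale=0.1, size=(10, 128))
eta = 0.01

x = rng.normal(size=(32, 784))
y = np.eye(10)[rng.integers(0, 10, 32)]

a1 = x @ W1; h1 = tanh(a1)
a2 = h1 @ W2; h2 = tanh(a2)
logits = h2 @ W3
p = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
e = p - y                          # output error (softmax + cross-entropy)

# Eq. (7): each hidden layer receives the output error through its own
# fixed random matrix B_l instead of the transposed forward weights.
d2 = (e @ B2) * tanh_prime(a2)
d1 = (e @ B1) * tanh_prime(a1)

W3 -= eta * h2.T @ e
W2 -= eta * h1.T @ d2
W1 -= eta * x.T @ d1
```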
Empirical evidence underscores the efficacy of DFA, notably in reducing training time and narrowing the accuracy gap compared to BP on datasets like MNIST and CIFAR [53]. Owing to its simplicity and effectiveness, DFA is widely adopted as a foundational model in subsequent research endeavors. Nevertheless, experimental outcomes [55] remain somewhat limited when scaling to more intricate network architectures or larger datasets like ImageNet.
3.4 Target Propagation
The backpropagation algorithm lacks biological plausibility, as in the human brain, biological neurons are interspersed with linear and nonlinear elements[56]. Utilizing backpropagation through feedback paths for the propagation of credit assignment necessitates precise knowledge of the nonlinear derivatives employed in the corresponding feedforward computations. Additionally, it requires alternating between exact feedforward propagation and backpropagation processes across different neuronal layers. This mechanism of gradient communication and weight transfer is biologically impractical[57].
Inspired by earlier research, Bengio et al. [58] proposed Target Propagation (TP) as a novel credit assignment approach in response to the challenges posed by the backpropagation method. TP assigns a target value $\hat{h}_l$ to each layer $l$, rather than employing a loss gradient. These target values are designed to be in close proximity to the activation values, with the potential to yield a reduced loss if achieved during the feedforward phase.
A distinct aspect of Target Propagation[59] is that its backward pass operates within the same dimensional framework as the forward-pass neural activities. The objective in this phase is to align the layer activities with those induced backward, thereby facilitating the generation of the desired output. Upon receiving an input, the final output layer undergoes feedforward propagation and is directly optimized to minimize the loss. In contrast, the remaining layers are oriented towards aligning with their assigned target values. The training process involves two types of losses at each layer level. Inverse loss is used to train an approximate inverse, which is parameterized in a manner analogous to the forward computation:
\[
\mathcal{L}^{\text{inv}}_l = \left\| g_l\!\left( f_l(h_{l-1}); \lambda_l \right) - h_{l-1} \right\|_2^2 \tag{8}
\]
where $g_l$ is the approximate inverse: $g_l(f_l(h_{l-1})) \approx h_{l-1}$. Forward loss imposes a penalty on layer parameters that result in activations divergent from their designated targets:
\[
\mathcal{L}_l = \left\| f_l(h_{l-1}; \theta_l) - \hat{h}_l \right\|_2^2 \tag{9}
\]
These losses are localized, impacting only the parameters of the individual layer, and do not take into account any implicit dependencies on the parameters of other layers.
Vanilla Target Propagation calculates targets by back-propagating the targets from higher layers through layer-specific inverses:
\[
\hat{h}_{l-1} = g_l(\hat{h}_l) \tag{10}
\]
However, this simplistic approach may encounter difficulties in scenarios where different instances of the same class present diverse appearances. In such cases, TP tends to enforce uniformity in their representations across all layers, including the early ones. To address this issue, the Difference Target Propagation (DTP) was introduced, incorporating linear correction terms into the feedback process:
\[
\hat{h}_{l-1} = g_l(\hat{h}_l) + \left[ h_{l-1} - g_l(h_l) \right] \tag{11}
\]
The second term is the reconstruction error, providing a linear stabilizer for the inaccuracies in the inverse functions. This enhancement significantly improves the recognition performance of the Target Propagation method. In the seminal study by Lee et al. [59], the target for the penultimate layer was determined using network loss gradients, deviating from standard TP methods. [60] propose Simplified DTP (SDTP) as a refinement of DTP, where the target for the penultimate layer is computed according to Equation (10). This modification effectively eliminates the biologically unrealistic aspects of gradient communication and the associated weight transport in the TP algorithm.
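The DTP target computation (Equation 11) and the two local losses (Equations 8 and 9) can be sketched for a single pair of layers as follows; the layer shapes, the Tanh nonlinearities, and the stand-in target from the layer above are illustrative assumptions, not the setup of [59].

```python
import torch
import torch.nn as nn

# f is the forward layer f_l, g its learned approximate inverse g_l.
f = nn.Sequential(nn.Linear(64, 32), nn.Tanh())
g = nn.Sequential(nn.Linear(32, 64), nn.Tanh())

h_prev = torch.randn(8, 64)                 # h_{l-1}
h = f(h_prev)                               # h_l
h_target = h - 0.1 * torch.randn_like(h)    # stand-in for the target from above

# Eq. (11): back-project the target and add the reconstruction-error correction.
with torch.no_grad():
    h_prev_target = g(h_target) + (h_prev - g(h))

# Eq. (8): train g to invert f (detach so gradients reach only g).
inverse_loss = ((g(f(h_prev).detach()) - h_prev) ** 2).mean()
# Eq. (9): train f to hit the target assigned to its output.
forward_loss = ((f(h_prev) - h_prev_target) ** 2).mean()
```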
3.5 Predictive Coding
Originally proposed by [16], predictive coding (PC) is an influential theory in computational and cognitive neuroscience. The central idea of the theory is that the brain is composed of a hierarchy of layers, while high layers predict the activities of adjacent low layers. The entire brain maintains a cognitive model of the world, activities that cannot be accurately predicted are regarded as prediction errors which will be transmitted upwards for high-level to process. Over time, the synaptic connections between high and low levels are updated until the prediction error of the entire system is minimized. While predictive coding originated in theoretical neuroscience as a model of information processing in the cortex, recent work has developed the idea into a general-purpose algorithm able to train neural networks using only local computations. Compared to the classical backpropagation algorithm, hierarchical predictive coding models are considered to be more biologically plausible. When Rao and Ballard first proposed the hierarchical predictive coding model in 1999, they cautiously suggested the biological interpretability of predictive coding by indirectly highlighting similarities between the hierarchical predictive coding model and classical effects such as endstopping in the visual processing regions of the brain cortex [16]. Since then, the emergence of predictive coding theory has, to some extent, challenged cognitive neuroscience. Consequently, a substantial amount of neuroanatomical research has been conducted to validate or refute predictive coding. This research includes a wealth of anatomical and physiological evidence supporting predictive coding as an assumption for information transmission within the cortical hierarchy, especially in early visual processing [61]. However, despite abundant indirect evidence suggesting the potential existence of predictive coding mechanisms in the cortical regions of the brain, crucial direct evidence is still lacking. Caution should be exercised in assessing the biological plausibility of predictive coding.
Hierarchical predictive coding networks (PCNs) composed of $L$ hidden layers include two kinds of neurons: $x_l(t)$ denotes the neuron that encodes time-dependent predictions in layer $l$ at time $t$, and $e_l(t)$ denotes the neuron that computes prediction errors in layer $l$ at time $t$ ($0 \le l \le L$). In such cases, the input signal is transmitted from low level to high level, and high-level neurons predict the values of the adjacent lower layer according to:
\[
\hat{x}_l(t) = f\!\left( W_{l+1}\, x_{l+1}(t) \right) \tag{12}
\]
where $f$ is a nonlinear function and $W_{l+1}$ denotes the matrix of weights connecting layer $l+1$ to layer $l$. The prediction error represents the difference between actual activity and its prediction, denoted by $e_l(t) = x_l(t) - \hat{x}_l(t)$. The errors are then propagated down the hierarchy and used in the learning process to update the weights of the network. Ultimately, the learning algorithm optimizes a global energy function, defined as the sum of squared prediction errors at each layer:
\[
\mathcal{F}(t) = \sum_{l=0}^{L} \left\| e_l(t) \right\|_2^2 \tag{13}
\]
During training, the highest layer is fixed to an input data point, and the lowest layer is fixed to a label or target vector. During a process called inference, the weight parameters are fixed, and the neural activities are continuously updated to minimize the energy function by running gradient descent until convergence, at which point a single weight update is performed. During the weight update, the value nodes are fixed, and the weight parameters are updated via gradient descent on the same energy function. When defining inference and weight update this way, every computation only needs local information to be updated [62].
During testing, only the lowest layer is fixed to the data, so the network infers the label given a test point. This process is equivalent to the inference phase described above: the weight parameters are fixed, and the neural activities are updated until convergence by running gradient descent on the energy function. Note that different works follow different paradigms for the order of updating the neural activities $x$ and weights $W$. In most works, all neural activities are updated simultaneously for $T$ iterations with the aim of reaching convergence of the inference process, and the weights are updated once upon convergence [63, 62, 64]. However, in certain cases, it has been noted that updating the weights and activities of different layers at different moments yields better performance [65, 66].
PCNs can be mathematically derived as variational inference on hierarchical Gaussian generative models. The hierarchical model consists of multiple layers indexed by $l$. The distribution of activations at each layer can be approximated as a Gaussian distribution with a mean given by a nonlinear function of the layer above, parameterized by $W$, and an identity covariance: $p(x_l \mid x_{l+1}) = \mathcal{N}\!\left( x_l;\, f(W_{l+1}\, x_{l+1}),\, I \right)$.
For example, consider a scenario where the input layer is held constant at a specific data item $d$. Our objective is to deduce the state of the remaining network conditioned on $d$. Solving this inference problem can be accomplished through variational inference. Broadly speaking, variational inference tackles the challenge of approximating an intractable inference by framing it as an optimization problem. This involves optimizing the parameters of an approximate variational posterior distribution $q(x)$ to minimize its divergence from the optimal posterior $p(x \mid d)$. The optimization process revolves around minimizing an upper bound on this divergence, referred to as the variational free energy $\mathcal{F}$. In PCNs, our assumption is that the variational posterior factorizes into independent posteriors for each layer: $q(x) = \prod_l q(x_l)$. When combined with the Laplace approximation, this simplification enables us to express the free energy as a sum of squared prediction errors:
\[
\mathcal{F} = \sum_l \left\| x_l - f(W_{l+1}\, x_{l+1}) \right\|_2^2 = \sum_l \left\| e_l \right\|_2^2 \tag{14}
\]
As mentioned above, $e_l$ is the prediction error for each layer. When applied to ANNs, we generally operate under the assumption that the dependencies between layers are characterized by a parameter matrix $W$, which corresponds to the weights within an ANN. Consequently, updates to both the activations and the weights can be performed through gradient descent on the free energy:
\[
\Delta x_l \propto -\frac{\partial \mathcal{F}}{\partial x_l} \tag{15}
\]
\[
\Delta W_l \propto -\frac{\partial \mathcal{F}}{\partial W_l} \tag{16}
\]
The operation of PCNs involves two distinct phases. Initially, the activations $x$ undergo updates to minimize the free energy until they attain equilibrium. Subsequently, the weights $W$ undergo a single-step update based on the equilibrium values of the activations. These phases are commonly referred to as inference and learning.
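The two phases can be sketched with autograd performing the bookkeeping of Equations (15) and (16). The chain below clamps the top layer to a data vector; the layer sizes, step sizes, and iteration counts are illustrative, and a practical PCN would repeat this over many data points.

```python
import torch

torch.manual_seed(0)
dims = [10, 20, 30]   # x_0 (output/label side) ... x_2 (clamped to the data)
# W[l] predicts layer l from layer l+1, matching Eq. (12).
W = [(0.1 * torch.randn(dims[l], dims[l + 1])).requires_grad_() for l in range(2)]

data = torch.randn(dims[2])
x = [torch.zeros(dims[0], requires_grad=True),
     torch.zeros(dims[1], requires_grad=True),
     data]                                     # top layer fixed to the data

def free_energy():
    # Eq. (14): sum of squared prediction errors across layers.
    F = 0.0
    for l in range(2):
        pred = torch.tanh(W[l] @ x[l + 1])
        F = F + ((x[l] - pred) ** 2).sum()
    return F

# Inference (Eq. 15): relax the activities towards equilibrium, weights fixed.
opt_x = torch.optim.SGD(x[:2], lr=0.05)
for _ in range(100):
    opt_x.zero_grad(); free_energy().backward(); opt_x.step()

# Learning (Eq. 16): one weight update at the equilibrium activities.
opt_w = torch.optim.SGD(W, lr=0.01)
opt_w.zero_grad(); free_energy().backward(); opt_w.step()
```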
PC converts the feedforward pass of an artificial neural network (ANN) into an inference problem, which requires manipulating the activations of the ANN layers under certain constraints on the input or output layers, or both. The uncertainty about the optimal activations is represented by a Gaussian distribution with a mean given by the top-down prediction from the higher layer. Importantly, this inference problem is solved dynamically in each inference stage. Furthermore, the conditioning variables can be modified adaptively according to the task demands. During execution, the PCN utilizes its learned generative model embedded in the weights $W$, enabling it to cope with diverse inference problems. This demonstrates the enhanced flexibility of PCNs over conventional ANNs, as evidenced by recent studies.
A common assumption is that PC provides the best inversion method for hierarchical Gaussian generative models. However, this assumption does not hold when PC is compared to state-of-the-art generative AI systems that employ deep neural networks, which achieve superior performance in various domains. Moreover, several aspects of cortical microcircuitry remain that have so far resisted simple interpretation within the PC framework. As such, predictive coding might be ‘right’ in some sense while still missing core aspects of the computation that actually goes on in the cortex. Memory is another vital function of the brain: cortical areas appear to implement short-term and long-term memory by means of persistent neural activity, yet modeling these memory processes within a predictive coding framework poses significant challenges.
3.6 Forward-Forward Algorithm
Proposed by Geoffrey Hinton [17], the Forward-Forward algorithm is a new learning procedure for updating network weights that is more biologically plausible than backpropagation. Compared with backpropagation, it updates the weights of each layer using two forward propagation passes; specifically, it updates network weights layer by layer in place, without the need to store neural activities or wait for backpropagated error gradients. In addition, it does not require full knowledge of the forward computation and can therefore tolerate black-box functions. These properties show greater biological plausibility, more akin to the functioning of the cerebral cortex, and make the algorithm worthy of attention.
With the advantages of the algorithm established, it is pertinent to explore its basic design. During the training phase, the algorithm uses two forward passes in place of the usual forward and backward passes: one with positive data and the other with negative data. Each layer has its own objective function; as a result, the training of each layer is independent, and the parameter updates of earlier layers do not depend on the activities of the neurons in the layers behind them. The training target of each layer is to increase the goodness of positive data and decrease the goodness of negative data; in other words, the training objective for each layer is to differentiate between positive and negative data. For a given layer, the probability that an input vector is positive can be formulated as follows:
\[
p(\text{positive}) = \sigma\!\left( \sum_j y_j^2 - \theta \right) \tag{17}
\]
where $y_j$ is the activity of hidden unit $j$, $\theta$ is a threshold, and $\sigma$ is the logistic function.
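A sketch of the resulting layer-local objective: goodness is the sum of squared activities, and the layer is trained to push goodness above the threshold for positive data and below it for negative data. The threshold value, layer sizes, and random inputs below are placeholders.

```python
import torch

def goodness(h):
    # Sum of squared activities per example, as in Eq. (17).
    return (h ** 2).sum(dim=1)

def ff_layer_loss(h_pos, h_neg, theta=2.0):
    # Maximize sigma(goodness - theta) for positive data and
    # sigma(theta - goodness) for negative data.
    logits = torch.cat([goodness(h_pos) - theta, theta - goodness(h_neg)])
    return torch.nn.functional.softplus(-logits).mean()   # -log sigmoid(logits)

layer = torch.nn.Sequential(torch.nn.Linear(784, 500), torch.nn.ReLU())
x_pos, x_neg = torch.randn(32, 784), torch.randn(32, 784)
loss = ff_layer_loss(layer(x_pos), layer(x_neg))
loss.backward()   # gradients stay within this layer; no backward pass across layers
```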
At present, the Forward-Forward algorithm shows comparable speed and accuracy on some small problems [17], but for large models its performance is inferior to the backpropagation algorithm. Additionally, the Forward-Forward algorithm still has aspects that do not satisfy biological plausibility: for example, the output of each layer is still a continuous signal rather than spikes, and the error signals remain signed. In conclusion, while the Forward-Forward algorithm exhibits significant strengths and an innovative design in terms of biological plausibility, it is not without limitations. This, however, does not diminish its potential, and we hope this exploration of the Forward-Forward algorithm will contribute to a further understanding of biological plausibility.
3.7 Perturbation Learning
The concept of “perturbation learning” was first proposed by [18], which presented an associative reinforcement learning algorithm for networks containing stochastic units. Then, [19] introduced a perturbation learning paradigm for linear feedback neural networks. In recent years, perturbation learning methods for training neural networks have mostly meant applying a random perturbation to network components: if a decrease in the error is observed, the perturbation is accepted; otherwise, it is rejected and another perturbation is tried. There are generally two perturbation learning methods in use today: weight perturbation and node perturbation. While perturbation learning may necessitate many iterations for neural networks to achieve convergence, it can be considered a biologically plausible training methodology, as it involves no direct gradient descent guided by global targets.
3.7.1 Weight Perturbation
The technique that employs perturbations to rectify the connection weights of a learning machine, such as neural networks, is referred to as weight perturbation learning. Weight perturbation learning has been introduced as a learning rule involving the addition of perturbations to the learnable parameters of neural networks. The generalization performance of weight perturbation learning has been scrutinized through statistical mechanical methods, revealing an asymptotic generalization property analogous to perceptron learning.
Following [19], we denote the input of a neural network as $x$ and the learnable weights as $W$; the output can then be represented as $y = f(W, x)$. Denote the desired corresponding output as $y^*$, and assume there is an ideal weight $W^*$ such that $y^* = f(W^*, x)$. If we choose the mean square error as our loss function, the loss can be defined as:
\[
E(W) = \frac{1}{2} \left\| y - y^* \right\|^2 \tag{18}
\]
where $y = f(W, x)$. For weight perturbation, a noise matrix $\Delta W$ drawn from a Gaussian distribution with mean $0$ and variance $\sigma^2$ perturbs $W$:
\[
\tilde{W} = W + \Delta W, \qquad \Delta W \sim \mathcal{N}(0, \sigma^2) \tag{19}
\]
If the loss decreases, this perturbation of $W$ is accepted, and the weight is updated to:
\[
W \leftarrow W + \eta\, \Delta W \tag{20}
\]
where $\eta$ is the learning rate. This procedure is repeated until no perturbation can be found that decreases the loss $E$.
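Weight perturbation (Equations 18-20) reduces to a simple accept/reject loop. The sketch below fits a linear model to targets generated by an ideal weight matrix; `sigma`, `eta`, and the iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
W_star = rng.normal(size=(3, 5))      # ideal weights W*
y_star = W_star @ x                   # desired output y* = f(W*, x)
loss = lambda W: 0.5 * np.sum((W @ x - y_star) ** 2)   # Eq. (18)

sigma, eta = 0.1, 0.5
for _ in range(2000):
    dW = sigma * rng.normal(size=W.shape)   # Eq. (19): Gaussian perturbation
    if loss(W + dW) < loss(W):              # accept only if the loss decreases
        W = W + eta * dW                    # Eq. (20)

print(loss(W))
```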
3.7.2 Node Perturbation
Different from weight perturbation learning, node perturbation learning constitutes a variant of the statistical gradient descent algorithm applicable to scenarios where the objective function is not explicitly defined, especially in reinforcement learning. This method approximates the gradient of the objective function by assessing changes in the objective function resulting from perturbations. The baseline, denoting the objective function value for an unperturbed output, plays a pivotal role in this process. This approach can be conceptualized as reinforcement learning with a scalar reward, wherein all weight vectors are adjusted based on the scalar reward, in contrast to gradient methods that employ target vectors. Consequently, node perturbation learning exhibits versatility, serving not only as a valuable neural network learning algorithm but also amenable to formulation as reinforcement learning or application within a brain model.
Again, we choose Equation (18) as the loss function. Different from weight perturbation, node perturbation introduces a random noise vector $\xi$ on the output rather than on the weights:
\[
\tilde{y} = f(W, x) + \xi, \qquad \xi \sim \mathcal{N}(0, \sigma^2) \tag{21}
\]
If the loss decreases, this perturbation is accepted, and the weights are updated so as to move the output towards the accepted perturbed output; for a linear model, one simple realization is the rank-one update
\[
W \leftarrow W + \eta\, \xi\, x^{\top} \tag{22}
\]
This procedure is repeated until no perturbation can be found that decreases the loss $E$.
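Node perturbation differs only in where the noise enters. In the sketch below, the output of the same linear model is perturbed, and an accepted perturbation is realized as a rank-one weight step that nudges the output towards the perturbed one; this particular update is one simple choice (the form given as Equation 22 above), not the only one in the literature.

```python
import numpy as np

rng = np.random.default_rng(1)

W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
y_star = rng.normal(size=(3, 5)) @ x
loss_of = lambda y: 0.5 * np.sum((y - y_star) ** 2)

sigma, eta = 0.1, 0.1
for _ in range(2000):
    y = W @ x
    xi = sigma * rng.normal(size=y.shape)   # Eq. (21): perturb the output
    if loss_of(y + xi) < loss_of(y):        # accept if the loss decreases
        W = W + eta * np.outer(xi, x)       # Eq. (22): move y towards y + xi

print(loss_of(W @ x))
```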
3.7.3 Forward Gradient Learning
Forward Gradient learning is a recently proposed method that implements perturbation learning using the forward-mode AD technique [67, 68]. In the context of biological plausibility, Forward Gradient learning dispenses with the need for weight transport and necessitates only a single-phase update in the training process. Notably, the target signal is conveyed through the feed-forward process, distinguishing it from backpropagation. Furthermore, recent studies have demonstrated the applicability of this training algorithm to Spiking Neural Networks (SNNs), emphasizing its promise in structures more akin to the human brain.
Forward-mode Automatic Differentiation (AD) was first proposed by [69]. For a given function $f: \mathbb{R}^n \to \mathbb{R}^m$, forward-mode AD computes the Jacobian-vector product $J_f(x)\, v$, which is defined as the directional derivative at $x$ along a perturbation vector $v$:
\[
J_f(x)\, v = \lim_{h \to 0} \frac{f(x + h v) - f(x)}{h} \tag{23}
\]
Note that the Jacobian-vector product is computed in a single forward evaluation. By contrast, reverse-mode Automatic Differentiation, commonly referred to as backpropagation, computes the vector-Jacobian product $v^{\top} J_f(x)$ through a combination of forward and backward passes. Given the efficiency of forward-mode AD in comparison to backpropagation, researchers have been exploring learning methods built on forward-mode AD. The majority of deep learning approaches optimize the parameters of neural architectures through gradient descent, although calculating the exact gradient is computationally demanding in terms of both space and time. In contrast, Forward Gradient learning utilizes directional derivatives along randomly selected directions as an unbiased estimator of the true gradient, which is then used to perform gradient descent. Despite introducing some additional variance to the gradient estimate, this approach incurs significantly lower computational costs than computing the true gradient. Furthermore, the elimination of the need for backpropagation in this process enhances its biological plausibility.
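A single forward-gradient step can be written with PyTorch's forward-mode AD (`torch.func.jvp`, available in PyTorch 2.x). The regression setup and step size are illustrative; the key point is that the directional derivative is obtained in one forward evaluation and then scaled onto the random direction.

```python
import torch
from torch.func import jvp

torch.manual_seed(0)
W = torch.randn(10, 5)
x, y = torch.randn(32, 5), torch.randn(32, 10)

def loss_fn(W):
    return ((x @ W.T - y) ** 2).mean()

v = torch.randn_like(W)                      # random direction (tangent vector)
loss, dir_deriv = jvp(loss_fn, (W,), (v,))   # one forward pass: grad(L) . v

# Forward gradient: (grad(L) . v) v is an unbiased estimate of grad(L)
# when v is drawn from a standard Gaussian.
g = dir_deriv * v
W = W - 0.01 * g                             # gradient-descent step
```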
While this methodology proves adequate for addressing small-scale problems, when applied to large-scale neural architectures, the expanded dimensions within the loss space introduce a greater degree of perturbation possibilities. Consequently, this leads to increased computational expenses and diminished effectiveness. To address this issue, greedy local learning objectives have been introduced to scale this method to more intricate tasks [70]. The same study also proposed an alternative method for estimating gradients, which involves activity perturbation instead of weight perturbation within the same framework, and notably, it yielded improved performance.
3.8 Local Losses
The Local Losses method [71, 15] represents a paradigm shift from global error correction to layer-specific training. Instead of propagating a single global error signal backward through the entire network, Local Losses employs a local error signal for each layer: each layer is equipped with its own classification layer that generates the error signal used to update that specific layer’s parameters. This approach aligns more closely with biological neural processes, where learning appears to be more distributed and locally governed.
Formally, consider a neural network composed of $L$ layers. Let $h_l$ denote the activations of the $l$-th layer, where $h_0 = x$ is the input and $f_l$ represents the transformation applied by layer $l$, i.e., $h_l = f_l(h_{l-1})$. In traditional backpropagation, the overall loss depends on the final output $h_L$ and the target labels $y$:
\[
\mathcal{L}_{\text{global}} = \mathcal{L}(h_L, y)
\]
Gradients of this global loss with respect to each layer's parameters are computed and propagated backward from the output layer to the input layer. Conversely, in the Local Losses method, each layer $l$ has its own local loss function $\mathcal{L}_l$, which depends on the activations $h_l$ and a set of auxiliary target labels $y_l$:
\[
\mathcal{L}_l = \mathcal{L}_l(h_l, y_l)
\]
The auxiliary targets can be derived from the true labels or be unsupervised signals appropriate for the task at hand.
Each layer $l$ is paired with a classifier $c_l$, which maps the activations $h_l$ to a prediction $\hat{y}_l$:
\[
\hat{y}_l = c_l(h_l)
\]
The local loss for layer $l$ can then be written as:
\[
\mathcal{L}_l = \mathcal{L}\!\left( c_l(h_l),\, y_l \right)
\]
The parameters of both the transformation $f_l$ and the classifier $c_l$ are updated based on the gradient of this local loss:
\[
\theta_l \leftarrow \theta_l - \eta\, \nabla_{\theta_l} \mathcal{L}_l
\]
where $\theta_l$ represents the parameters of layer $l$ (including its classifier), and $\eta$ is the learning rate.
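A sketch of this scheme in PyTorch: each layer is trained by its own optimizer against its own auxiliary classifier, and detaching the activations between layers ensures that no gradient crosses layer boundaries. The architecture, head design, and hyperparameters are illustrative; here the auxiliary targets are simply the true labels.

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
                        nn.Sequential(nn.Linear(256, 128), nn.ReLU())])
heads = nn.ModuleList([nn.Linear(256, 10), nn.Linear(128, 10)])   # classifiers c_l
opts = [torch.optim.SGD(list(f.parameters()) + list(c.parameters()), lr=0.1)
        for f, c in zip(layers, heads)]

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
h = x
for f, c, opt in zip(layers, heads, opts):
    h = f(h.detach())                  # detach: no gradient crosses layers
    local_loss = nn.functional.cross_entropy(c(h), y)
    opt.zero_grad(); local_loss.backward(); opt.step()
```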
This layer-specific training allows each layer to learn independently, guided by its own local loss. This method mimics the potentially distributed nature of learning in the brain, where different regions may adapt based on localized feedback. The Local Losses approach offers several benefits, including biological plausibility, scalability, and robustness. However, this approach also presents challenges: 1. Auxiliary Target Design: Designing appropriate auxiliary targets for each layer can be complex and may require domain-specific knowledge; 2. Coordination: Ensuring that independently trained layers work harmoniously to achieve the overall task requires careful architectural and procedural design.
3.9 Energy-based Learning
The continuous Hopfield models [21] are recurrent neural networks that serve as content-addressable memory systems. An appropriate “energy” function is constructed that is always decreased by any state change produced by the activity of each neuron. The energy function’s minima correspond to the preferred states of the model. The examination of iterations of Hopfield models continues to be a subject of ongoing research over several decades, including the exploration of diverse learning mechanisms. Hebbian learning and the Storkey learning rule represent two established traditional learning approaches in Hopfield models, while in recent years, a novel learning paradigm known as equilibrium propagation [22] has been introduced.
The combination of Equilibrium Propagation with Hopfield models is regarded as having significant biological plausibility. Hopfield models are commonly seen as approximations of human memory systems, deviating from the typical layered architecture observed in most artificial neural networks (ANNs). The weight update functions of Equilibrium Propagation can be constructed by synaptic learning rules based on pre- and post-synaptic activities. While this algorithm, akin to Back-propagation, calculates the gradient of the objective function during the second phase, it is notable that the neural computations in both phases are consistent, enhancing the possibility of a biological implementation.
Equilibrium Propagation comprises a two-phase training process. In the initial phase, a prediction is generated by fixing the inputs and allowing the Hopfield network to converge to a local minimum of the energy function. Subsequently, during the second phase, the outputs are nudged toward their desired targets, and the network converges to a new state with a marginally reduced prediction error. Temporal derivatives of the neural activities in Equilibrium Propagation and Back-propagation have been proven equal by [72]. This algorithm also exhibits a connection to contrastive Hebbian learning, as it contrasts the second-phase fixed point, reached by reducing the total energy under the output nudge, with the first-phase fixed point achieved through prediction. In the wake of its introduction, Equilibrium Propagation has been extended to encompass Convolutional and Spiking Neural Networks [73, 74], highlighting a compelling approach as a biologically plausible method for gradient computation in deep neural networks. However, an existing issue is that Hopfield models require symmetric connections between nodes, and there is as yet no biological evidence supporting the existence of such symmetric connections among actual neurons in the brain [22].
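The two-phase structure can be sketched on a tiny Hopfield-style energy, with autograd performing the state relaxation. This is a loose illustration under several simplifications (each phase is relaxed from scratch, and all sizes and constants are arbitrary), not a faithful reimplementation of [22].

```python
import torch

torch.manual_seed(0)
n_x, n_h, n_y = 20, 16, 4
W1 = 0.1 * torch.randn(n_x, n_h)
W2 = 0.1 * torch.randn(n_h, n_y)
x = torch.rand(n_x)
target = torch.zeros(n_y); target[1] = 1.0
beta, eta = 0.5, 0.1
rho = torch.sigmoid

def energy(h, o):
    # Hopfield-style energy with symmetric layer-to-layer couplings.
    return (0.5 * (h ** 2).sum() + 0.5 * (o ** 2).sum()
            - rho(x) @ W1 @ rho(h) - rho(h) @ W2 @ rho(o))

def settle(beta_nudge):
    # Relax the states by gradient descent on E (+ beta * output cost).
    h = torch.zeros(n_h, requires_grad=True)
    o = torch.zeros(n_y, requires_grad=True)
    opt = torch.optim.SGD([h, o], lr=0.1)
    for _ in range(100):
        opt.zero_grad()
        F = energy(h, o) + beta_nudge * 0.5 * ((o - target) ** 2).sum()
        F.backward(); opt.step()
    return h.detach(), o.detach()

h0, o0 = settle(0.0)     # first phase: free equilibrium (prediction)
h1, o1 = settle(beta)    # second phase: outputs nudged towards the target

# EP weight update: contrast of pre/post co-activity between the two phases.
W1 += (eta / beta) * (torch.outer(rho(x), rho(h1)) - torch.outer(rho(x), rho(h0)))
W2 += (eta / beta) * (torch.outer(rho(h1), rho(o1)) - torch.outer(rho(h0), rho(o0)))
```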
4 Biological Plausibility
As depicted in Table 1, we assess the biological plausibility of the aforementioned algorithms based on the criteria outlined in Section 2.
| Algorithms | Asymmetry Weights | Local Error | Non-parallel Training | Neuron Model | Unsigned Errors |
| --- | --- | --- | --- | --- | --- |
| Hebbian Learning | ✓ | ✓ | ✓ | ✓ | ✓ |
| STDP | ✓ | ✓ | ✓ | ✓ | ✓ |
| Feedback Alignment | ✓ | ✓ | | | |
| Target Propagation | ✓ | ✓ | | | |
| Predictive Coding | ✓ | ✓ | | | |
| Forward-Forward | ✓ | ✓ | ✓ | | |
| Perturbation Learning | ✓ | ✓ | ✓ | ✓ | |
| Local Losses | ✓ | ✓ | | | |
| Energy-based Learning | ✓ | ✓ | ✓ | | |
From Table 1, we observe that:
- Hebbian learning and STDP are the most biologically plausible methods, as they meet all the criteria. Hebbian learning aligns with the principle that neurons that fire together wire together, and STDP refines this by adjusting synaptic strength based on the precise timing of spikes. Both algorithms avoid backpropagation, relying on local information for synaptic changes, and do not require symmetric weights, thereby adhering closely to biological processes.
- Following them, perturbation learning also exhibits strong biological plausibility, but it still has issues with negative values in error propagation. Perturbation learning makes small adjustments to synaptic weights based on perturbations of the network's components, which aligns with local error processing. However, the presence of negative error signals does not match the typically unsigned nature of biological error signals.
- The forward-forward algorithm and energy-based learning satisfy three criteria: asymmetry of weights, local error representation, and non-parallel training. The forward-forward algorithm simplifies the learning process by dispensing with a separate backward phase, which is more in line with biological systems. Energy-based learning focuses on minimizing a global energy function but still uses local errors and maintains weight asymmetry. Neither method, however, addresses the spiking nature of biological neurons or the unsigned nature of biological errors.
- Feedback alignment and target propagation mainly focus on the learnable weights and therefore match only two criteria: asymmetry of weights and local error representation. Feedback alignment replaces the backpropagation of errors with a random feedback matrix, ensuring weight asymmetry and local error signals, but it still involves a form of two-phase training. Target propagation, while adjusting weights to minimize output error, likewise relies on weight asymmetry and local errors but does not satisfy the other biological criteria.
- Lastly, Local Losses is the least biologically plausible method, as it is essentially the backpropagation algorithm trained layer by layer. Although it keeps error signals local to each layer and avoids transporting weights across layers, within each layer it retains backpropagation's staged, two-phase training, unlike continuous biological learning, and it does not address spiking neuron models or unsigned error signals. The criterion "Neuron Model" indicates whether the algorithm can be realized in a spiking-neuron version, highlighting the importance of mimicking the spiking behavior of biological neurons.
5 Experiments
In this section, we conduct a series of comprehensive experiments. First, we empirically evaluate the performance of all the biologically plausible algorithms on image classification tasks across diverse model structures and datasets. Moreover, to assess how well these algorithms can reproduce the activity patterns of human brains, we compare the feature representations learned by the algorithms with non-invasive brain activity recordings.
5.1 Datasets
In this section, we introduce the datasets we utilized.
MNIST The MNIST dataset consists of 70,000 grayscale images of handwritten digits, divided into 60,000 training images and 10,000 testing images. Each image is 28x28 pixels, and the digits range from 0 to 9. This dataset is a benchmark in the field of machine learning and image classification due to its simplicity and ease of use, providing a solid baseline for evaluating the performance of different algorithms.
CIFAR The CIFAR dataset, created by the Canadian Institute For Advanced Research, is a widely-used benchmark in machine learning and computer vision, consisting of two versions: CIFAR-10 and CIFAR-100. CIFAR-10 contains 60,000 32x32 color images in 10 classes, split into 50,000 training and 10,000 testing images, covering categories such as airplanes, cars, and animals. CIFAR-100 extends this to 100 classes grouped into 20 superclasses, with the same train-test split ratio.
Haxby The Haxby dataset is a neuroimaging dataset that records brain activity using functional Magnetic Resonance Imaging (fMRI) while subjects view images from different categories. This dataset is particularly valuable for our research as it allows us to compare the feature representations learned by biologically plausible algorithms with actual brain activity patterns. The dataset includes multiple subjects and a variety of visual stimuli, making it ideal for studying how well machine-learning models can mimic human brain processing.
By utilizing these datasets, we aim to provide a comprehensive evaluation of the algorithms’ performance across simple and complex visual tasks, as well as their ability to replicate human brain activity patterns.
5.2 Implementation Details
In our experiments, we employed two primary types of models: Convolutional Neural Networks (CNNs) and Multi-Layer Perceptrons (MLPs). The CNN model consists of three convolutional layers, with channel sizes increasing from the input to 64, 128, and 256, followed by a final fully connected (FC) layer mapping to the output classes; each convolutional layer is followed by a max pooling layer. The activation function used throughout the CNN is ReLU. The MLP model is composed of an input layer, two hidden layers with 1024 and 256 units respectively, and an output layer, all utilizing ReLU activation functions. The backpropagation baseline trained both models with categorical cross-entropy loss, optimized with stochastic gradient descent (SGD) with momentum. Hyperparameters, such as the learning rate and batch size, were tuned on a validation set. This setup enabled us to evaluate and compare the performance of the biologically plausible algorithms across different network architectures and datasets, providing insight into their effectiveness in mimicking human brain activity patterns.
Specifically, for the STDP method, we utilize the PyTorch-based framework SpikingJelly [75], which is tailored for spiking neural networks.
5.3 Image Classification
We evaluate all the bio-plausible algorithms on image classification benchmarks; the results are reported in Table 2.
| Algorithm | MNIST (MLP) | MNIST (CNN) | CIFAR-10 (MLP) | CIFAR-10 (CNN) | CIFAR-100 (MLP) | CIFAR-100 (CNN) |
| --- | --- | --- | --- | --- | --- | --- |
| Backpropagation | | | | | | |
| Hebbian Learning | | | | | | |
| STDP | | | | | | |
| Feedback Alignment | | | | | | |
| Direct Feedback Alignment | | | | | | |
| Target Propagation | | | | | | |
| Predictive Coding | | | | | | |
| Forward-Forward | | | | | | |
| Perturbation Learning | | | | | | |
| Local Losses | | | | | | |
| Equilibrium Propagation | | | | | | |
As shown in Table 2, we can draw the following conclusions: (1) Local losses, being the most similar to backpropagation, achieves performance closest to traditional backpropagation. This aligns with our expectations, as layer-by-layer training of neural networks still relies on global label signals, utilizing local errors exclusively for weight updates. (2) While Hebbian learning and STDP are the most biologically plausible algorithms, they exhibit significantly poorer performance than the alternative methods; in particular, both fail to converge on the CIFAR-100 dataset. (3) The forward-forward algorithm and equilibrium propagation (an energy-based method) appear effective primarily with the MLP architecture on tasks such as MNIST. However, these methods encounter difficulties and fail to converge when applied to more complex tasks or architectures such as CNNs.
Taking Table 1 into consideration, it is evident that the current bio-plausible learning algorithms, while showing promise, often fall short of achieving the high performance demonstrated by traditional backpropagation, especially on more complex datasets like CIFAR-10 and CIFAR-100. Algorithms like Feedback Alignment, Direct Feedback Alignment, Predictive Coding, and Local Losses demonstrate competitive performance on simpler datasets such as MNIST, indicating their potential viability. However, their diminished effectiveness on more challenging tasks underscores the need for further advancements in this field.
For the community to make significant strides, it is necessary to develop a learning algorithm that seamlessly integrates biological plausibility with high performance. Such an algorithm would need to address the key criteria of biological learning, including asymmetry of forward and backward weights, local error representation, non-parallel training, realistic neuron models, and unsigned error signals, without compromising on accuracy and scalability. This endeavor will likely involve innovative approaches that combine insights from neuroscience with advanced machine learning techniques.
The development of such an algorithm would not only bridge the gap between biological realism and computational efficiency but also potentially lead to more robust, efficient, and adaptable learning systems. It would pave the way for a new era of machine learning that is not only inspired by but also closely aligned with the principles of biological learning, ultimately enhancing our ability to develop intelligent systems that can learn and adapt in more human-like ways.
5.4 Comparative Analysis of Representations Learned by Algorithms and Activity Patterns Elicited by Human Brains
To investigate the biological plausibility of the algorithms discussed above, we developed a method to compute the representational similarity between biologically plausible algorithms and the human brain. Specifically, we use NeuroRA, a toolbox designed for representational similarity analysis. NeuroRA lets us work with various types of neural data, such as fMRI signals from predefined regions of interest (ROIs), and compute Representational Dissimilarity Matrices (RDMs). Using the Haxby dataset, which is supported by NeuroRA, we train our models and compute an average representation for each category in the dataset. From these average representations, we calculate each model's RDM. We then measure the similarity between the human-brain RDMs and the model RDMs using cosine similarity as our similarity metric. The results are presented in Table 3.
Table 3: Classification accuracy and RDM similarity with human brain activity on the Haxby dataset.
Algorithm | Metric | Haxby (MLP) | Haxby (CNN)
Backpropagation | Accuracy | |
 | Similarity | |
Feedback Alignment | Accuracy | |
 | Similarity | |
Predictive Coding | Accuracy | |
 | Similarity | |
Local Losses | Accuracy | |
 | Similarity | |
Forward-Forward | Accuracy | |
 | Similarity | |
Note that we include only these algorithms because they were the only ones that converged during training; in particular, the CNN trained with the forward-forward algorithm failed to converge. From Table 3, we find that: (1) Backpropagation significantly outperforms the bio-plausible algorithms in accuracy, which is consistent with our findings on MNIST and CIFAR. (2) In terms of RDM similarity, backpropagation is not the top performer; predictive coding shows the highest similarity. This suggests that although backpropagation achieves high accuracy, the representations it learns may be less biologically plausible. (3) There appears to be an interplay between accuracy and similarity: backpropagation attains relatively high RDM similarity alongside its superior performance, suggesting that higher accuracy may itself contribute to higher similarity. We therefore speculate that if algorithms such as predictive coding and forward-forward were improved to match the accuracy of backpropagation, they might exhibit even higher similarity scores. This hypothesis suggests that improving the accuracy of biologically plausible algorithms could increase their alignment with human neural representations, a promising direction for developing models that are both effective and biologically plausible.
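In essence, the RDM comparison used above reduces to a few lines of NumPy. The sketch below assumes a 1 − Pearson-correlation dissimilarity for the model RDM (one common choice, assumed here) and treats `model_reps` and `brain_rdm` as placeholders for the per-category model activations and a brain RDM obtained from NeuroRA:

```python
import numpy as np

def compute_rdm(reps):
    """reps: (n_categories, n_features) array of average representations,
    one row per category. Returns an (n, n) RDM using 1 - Pearson
    correlation as the dissimilarity (a common choice; assumed here)."""
    return 1.0 - np.corrcoef(reps)

def rdm_similarity(rdm_a, rdm_b):
    """Cosine similarity between the upper triangles of two RDMs
    (the diagonal and lower triangle are redundant)."""
    iu = np.triu_indices_from(rdm_a, k=1)
    a, b = rdm_a[iu], rdm_b[iu]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# model_reps: per-category means of a hidden layer's activations;
# brain_rdm: the subject-averaged RDM computed by NeuroRA.
# score = rdm_similarity(compute_rdm(model_reps), brain_rdm)
```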
6 Open Research Questions
The field of biologically plausible deep learning is in its nascent stages, and there are several critical research questions that need to be addressed to advance the state-of-the-art. This discussion aims to delve into these questions, exploring potential pathways and challenges associated with scaling these models, learning temporal sequences, and optimizing neural circuit dynamics.
6.1 Scaling Biologically Plausible Implementations
Biologically inspired models often face challenges in scalability due to their intricate architectures and the need to adhere to the constraints of biological plausibility. Traditional deep learning models have demonstrated remarkable success in handling complex, high-dimensional tasks, such as image and speech recognition, but they often rely on computational techniques and resources that are not biologically feasible. To bridge this gap, researchers must explore methods to enhance the scalability of biologically plausible models without compromising their inherent principles. This involves investigating new algorithms, optimizing hardware implementations, and perhaps most critically, developing hybrid approaches that combine the strengths of both biologically inspired and traditional deep learning methods.
6.2 Learning Temporal Sequences
Biological networks excel at processing temporal sequences, a capability that remains a significant challenge for artificial systems. The dynamic and recurrent nature of biological neurons allows for sophisticated temporal processing, enabling organisms to navigate, predict, and learn from sequential events in their environments. In contrast, most artificial neural networks, particularly those used in deep learning, struggle with temporal dependencies unless specifically designed with recurrent or attention-based mechanisms. Developing biologically plausible methods for learning temporal sequences could involve leveraging the properties of spiking neural networks (SNNs) or other neuromorphic computing approaches that naturally accommodate time-dependent processing. Additionally, understanding how biological networks balance short-term and long-term memory and applying these principles to artificial systems could lead to more robust and efficient temporal learning models.
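As a minimal illustration of this point, the leaky integrate-and-fire (LIF) neuron sketched below carries membrane state across time steps, so its output spike train depends on the temporal structure of its input; all parameter values are illustrative:

```python
import numpy as np

def lif_run(inputs, tau=10.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """Run a leaky integrate-and-fire neuron over an input sequence.
    The membrane potential v integrates the input and leaks with time
    constant tau, so the output depends on the input's history, not
    only on the current time step."""
    v, spikes = v_reset, []
    for x in inputs:
        v += (x - v) * dt / tau   # leaky integration toward the input
        if v >= v_th:             # threshold crossing emits a spike
            spikes.append(1)
            v = v_reset           # hard reset after the spike
        else:
            spikes.append(0)
    return spikes

# A sustained weak drive and intermittent strong bursts produce
# different spike trains, even over the same number of time steps.
print(lif_run([1.2] * 20))
print(lif_run([3.0, 0.0, 0.0, 0.0, 3.0] * 4))
```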
6.3 Optimizing Neural Circuit Dynamics
Biological systems are not only efficient in learning but also in their energy consumption and adaptive capabilities. The optimization of neural circuit dynamics to support efficient learning is a multifaceted problem that encompasses the minimization of energy use, the maximization of learning speed, and the enhancement of adaptability to changing environments. Biological neurons exhibit a variety of dynamic behaviors, such as synaptic plasticity and homeostasis, which contribute to their learning efficiency. Translating these dynamics into artificial systems requires a deep understanding of the underlying biological processes and the development of algorithms that can mimic these processes effectively. This might involve creating more sophisticated models of synaptic plasticity, incorporating adaptive mechanisms that allow artificial networks to self-tune in response to environmental changes, and designing hardware that can support these dynamic processes efficiently.
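As a toy example of such an adaptive mechanism, homeostatic synaptic scaling can be written as a multiplicative rule that nudges a unit's average activity toward a target rate; the rule and its constants below are illustrative sketches, not a model used in our experiments:

```python
import numpy as np

def homeostatic_scaling(w, avg_rate, target_rate=0.1, eta=0.01):
    """Multiplicatively scale a unit's incoming weights so that its
    time-averaged firing rate drifts toward a target value: units
    firing too much are scaled down, units firing too little scaled up."""
    return w * (1.0 + eta * (target_rate - avg_rate))

w = np.random.rand(100) * 0.1      # incoming weights of one unit
avg_rate = 0.25                    # measured average activity (too high)
w = homeostatic_scaling(w, avg_rate)  # all weights scaled down slightly
```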
6.4 Fully Biologically Plausible Algorithms
The development of fully biologically plausible algorithms is a crucial area of research within the field of biologically inspired deep learning. These algorithms aim to replicate the principles and mechanisms of learning observed in biological systems, such as local learning rules, Hebbian plasticity, and the balance between excitatory and inhibitory signals. Achieving full biological plausibility involves not only mimicking the structural and functional aspects of biological neurons but also adhering to the constraints of biological systems, such as limited energy resources and real-time processing capabilities.
6.5 Hardware Implementation
The advancement of hardware specifically designed for biologically plausible models is vital for the practical application of these systems. Neuromorphic chips, such as those designed for spiking neural networks (SNNs) [76], represent a significant step towards this goal. These chips, like IBM’s TrueNorth [77] and Intel’s Loihi [78], aim to emulate the structure and functionality of biological neural circuits, enabling more efficient processing of information through event-based computation and asynchronous communication. Unlike traditional processors, neuromorphic hardware can inherently handle the temporal dynamics and adaptive behaviors characteristic of biological systems. This includes the ability to support synaptic plasticity, dynamic learning, and low-power operation. Developing and optimizing such hardware involves overcoming challenges related to scalability, integration with existing computing infrastructure, and ensuring that the hardware can support the complex dynamics required for advanced learning tasks. By focusing on these areas, researchers can create more effective and efficient hardware solutions that bring biologically plausible deep learning closer to practical reality.
7 Conclusion
In this study, we first established a set of criteria for biological plausibility and applied them to assess a range of representative existing algorithms, including Hebbian learning, spike-timing-dependent plasticity, feedback alignment, target propagation, predictive coding, the forward-forward algorithm, perturbation learning, local losses, and energy-based learning, across diverse network architectures and datasets. We also compared the feature representations learned by these algorithms with non-invasive recordings of brain activity under identical stimuli to identify which algorithms most accurately replicate brain activity patterns. We find that predictive coding and the forward-forward algorithm achieve the highest similarity with brain activity patterns, indicating that these two methods not only perform reasonably well but also exhibit a degree of biological plausibility. Our findings provide a comprehensive assessment of, and insights into, biologically plausible algorithms, and we hope they will inspire new developments that bridge the gap between neuroscience and machine learning.
References
- Rumelhart et al. [1986] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L. (eds.) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 318–362. MIT Press (1986)
- LeCun et al. [2015] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
- Delorme et al. [2001] Delorme, A., Perrinet, L.U., Thorpe, S.J.: Networks of integrate-and-fire neurons using rank order coding b: Spike timing dependent plasticity and emergence of orientation selectivity. Neurocomputing 38-40, 539–545 (2001)
- Devlin et al. [2019] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics (2019)
- Brown et al. [2020] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
- Bianchini et al. [1997] Bianchini, M., Fanelli, S., Gori, M., Maggini, M.: Terminal attractor algorithms: A critical analysis. Neurocomputing 15(1), 3–13 (1997)
- Payeur et al. [2021] Payeur, A., Guerguiev, J., Zenke, F., Richards, B.A., Naud, R.: Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nature neuroscience 24(7), 1010–1019 (2021)
- Zahid et al. [2023] Zahid, U., Guo, Q., Fountas, Z.: Predictive coding as a neuromorphic alternative to backpropagation: A critical evaluation. Neural Computation 35(12), 1881–1909 (2023)
- Stork [1989] Stork, D.G.: Is backpropagation biologically plausible? In: International 1989 Joint Conference on Neural Networks, pp. 241–246 (1989). IEEE
- Crick [1989] Crick, F.: The recent excitement about neural networks. Nature 337(6203), 129–132 (1989)
- Lillicrap et al. [2020] Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
- Munakata and Pfaffly [2004] Munakata, Y., Pfaffly, J.: Hebbian learning and development. Developmental science 7(2), 141–148 (2004)
- Song et al. [2000] Song, S., Miller, K.D., Abbott, L.F.: Competitive hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience 3, 919–926 (2000)
- Lillicrap et al. [2016] Lillicrap, T.P., Cownden, D., Tweed, D.B., Akerman, C.J.: Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications 7 (2016)
- Bengio [2014] Bengio, Y.: How auto-encoders could provide credit assignment in deep networks via target propagation. CoRR abs/1407.7906 (2014)
- Rao and Ballard [1999] Rao, R.P., Ballard, D.H.: Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience 2(1), 79–87 (1999)
- Hinton [2022] Hinton, G.: The forward-forward algorithm: Some preliminary investigations. arXiv preprint arXiv:2212.13345 (2022)
- Williams [1992] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256 (1992)
- Werfel et al. [2003] Werfel, J., Xie, X., Seung, H.S.: Learning curves for stochastic gradient descent in linear feedforward networks. In: Advances in Neural Information Processing Systems, pp. 1197–1204 (2003)
- Marblestone et al. [2016] Marblestone, A.H., Wayne, G., Körding, K.P.: Toward an integration of deep learning and neuroscience. Frontiers Comput. Neurosci. 10, 94 (2016)
- Hopfield [1984] Hopfield, J.J.: Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the national academy of sciences 81(10), 3088–3092 (1984)
- Scellier and Bengio [2017] Scellier, B., Bengio, Y.: Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience 11, 24 (2017)
- Hebb [1949] Hebb, D.O.: The organisation of behaviour: a neuropsychological theory (1949)
- Löwel and Singer [1992] Löwel, S., Singer, W.: Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science 255(5041), 209–212 (1992) https://doi.org/10.1126/science.1372754
- Gupta et al. [2021] Gupta, M., Ambikapathi, A., Ramasamy, S.: Hebbnet: A simplified hebbian learning framework to do biologically plausible learning. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3115–3119 (2021). IEEE
- Moraitis et al. [2022] Moraitis, T., Toichkin, D., Journé, A., Chua, Y., Guo, Q.: Softhebb: Bayesian inference in unsupervised hebbian soft winner-take-all networks. Neuromorphic Computing and Engineering 2(4), 044017 (2022)
- Artola and Singer [1993] Artola, A., Singer, W.: Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. Trends in neurosciences 16(11), 480–487 (1993)
- Lisman [1989] Lisman, J.: A mechanism for the hebb and the anti-hebb processes underlying learning and memory. Proceedings of the National Academy of Sciences 86(23), 9574–9578 (1989)
- Caporale and Dan [2008] Caporale, N., Dan, Y.: Spike timing–dependent plasticity: a hebbian learning rule. Annu. Rev. Neurosci. 31, 25–46 (2008)
- Kampa et al. [2004] Kampa, B.M., Clements, J., Jonas, P., Stuart, G.J.: Kinetics of Mg2+ unblock of NMDA receptors: implications for spike-timing dependent synaptic plasticity. The Journal of Physiology 556(2), 337–345 (2004) https://doi.org/10.1113/jphysiol.2003.058842
- Watanabe et al. [2002] Watanabe, S., Hoffman, D.A., Migliore, M., Johnston, D.: Dendritic k + channels contribute to spike-timing dependent long-term potentiation in hippocampal pyramidal neurons. Proceedings of the National Academy of Sciences 99(12), 8366–8371 (2002) https://doi.org/10.1073/pnas.122210599
- Hoffman et al. [1997] Hoffman, D.A., Magee, J.C., Colbert, C.M., Johnston, D.: K+ channel regulation of signal propagation in dendrites of hippocampal pyramidal neurons. Nature 387(6636), 869–875 (1997) https://doi.org/10.1038/43119
- Bi and Poo [1998] Bi, G.-q., Poo, M.-m.: Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. The Journal of Neuroscience 18(24), 10464–10472 (1998) https://doi.org/10.1523/jneurosci.18-24-10464.1998
- Rosenmund et al. [1995] Rosenmund, C., Feltz, A., Westbrook, G.L.: Calcium-dependent inactivation of synaptic nmda receptors in hippocampal neurons. Journal of Neurophysiology 73(1), 427–430 (1995) https://doi.org/10.1152/jn.1995.73.1.427
- Zhigulin and Rabinovich [2004] Zhigulin, V.P., Rabinovich, M.I.: An important role of spike timing dependent synaptic plasticity in the formation of synchronized neural ensembles. Neurocomputing 58-60, 373–378 (2004)
- Bofill-i-Petit and Murray [2004] Bofill-i-Petit, A., Murray, A.F.: Synchrony detection and amplification by silicon neurons with stdp synapses. IEEE Transactions on Neural Networks 15, 1296–1304 (2004)
- Toyoizumi et al. [2004] Toyoizumi, T., Pfister, J.-P., Aihara, K., Gerstner, W.: Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In: Neural Information Processing Systems (2004)
- Masquelier et al. [2008] Masquelier, T., Guyonneau, R., Thorpe, S.J.: Spike timing dependent plasticity finds the start of repeating patterns in continuous spike trains. PLoS ONE 3 (2008)
- Masquelier et al. [2009] Masquelier, T., Guyonneau, R., Thorpe, S.J.: Competitive stdp-based spike pattern learning. Neural Computation 21, 1259–1276 (2009)
- Beyeler et al. [2013] Beyeler, M., Dutt, N.D., Krichmar, J.L.: Categorization and decision-making in a neurobiologically plausible spiking network using a stdp-like learning rule. Neural Networks 48, 109–124 (2013)
- Masquelier and Thorpe [2007] Masquelier, T., Thorpe, S.J.: Unsupervised learning of visual features through spike timing dependent plasticity. PLoS Computational Biology 3 (2007)
- Kheradpisheh et al. [2015] Kheradpisheh, S.R., Ganjtabesh, M., Masquelier, T.: Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition. Neurocomputing 205, 382–392 (2015)
- Diehl and Cook [2015] Diehl, P.U., Cook, M.: Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in Computational Neuroscience 9 (2015)
- Querlioz et al. [2013] Querlioz, D., Bichler, O., Dollfus, P., Gamrat, C.: Immunity to device variations in a spiking neural network with memristive nanodevices. IEEE Transactions on Nanotechnology 12, 288–295 (2013)
- Kheradpisheh et al. [2016] Kheradpisheh, S.R., Ganjtabesh, M., Thorpe, S.J., Masquelier, T.: Stdp-based spiking deep neural networks for object recognition. Neural Networks 99, 56–67 (2016)
- Ivanov and Michmizos [2021] Ivanov, V.A., Michmizos, K.P.: Increasing liquid state machine performance with edge-of-chaos dynamics organized by astrocyte-modulated plasticity. ArXiv abs/2111.01760 (2021)
- Tzounopoulos et al. [2007] Tzounopoulos, T., Rubio, M.E., Keen, J.E., Trussell, L.O.: Coactivation of pre- and postsynaptic signaling mechanisms determines cell-specific spike-timing-dependent plasticity. Neuron 54(2), 291–301 (2007) https://doi.org/10.1016/j.neuron.2007.03.026
- Softky and Koch [1993] Softky, W., Koch, C.: The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. The Journal of Neuroscience 13(1), 334–350 (1993) https://doi.org/10.1523/jneurosci.13-01-00334.1993
- Rao and Sejnowski [2001] Rao, R.P.N., Sejnowski, T.J.: Spike-timing-dependent hebbian plasticity as temporal difference learning. Neural Computation 13(10), 2221–2237 (2001) https://doi.org/10.1162/089976601750541787
- Bear and Singer [1986] Bear, M.F., Singer, W.: Modulation of visual cortical plasticity by acetylcholine and noradrenaline. Nature 320(6058), 172–176 (1986) https://doi.org/10.1038/320172a0
- Grossberg [1987] Grossberg, S.: Competitive learning: From interactive activation to adaptive resonance. Cogn. Sci. 11, 23–63 (1987)
- Nøkland [2016] Nøkland, A.: Direct feedback alignment provides learning in deep neural networks. In: Neural Information Processing Systems (2016)
- Frenkel et al. [2021] Frenkel, C., Lefebvre, M., Bol, D.: Learning without feedback: Fixed random learning signals allow for feedforward training of deep neural networks. Frontiers in Neuroscience 15 (2021)
- Liao et al. [2015] Liao, Q., Leibo, J.Z., Poggio, T.A.: How important is weight symmetry in backpropagation? ArXiv abs/1510.05067 (2015)
- Bartunov et al. [2018] Bartunov, S., Santoro, A., Richards, B.A., Hinton, G.E., Lillicrap, T.P.: Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In: Neural Information Processing Systems (2018)
- Grossberg [1987] Grossberg, S.: Competitive learning: From interactive activation to adaptive resonance. Cognitive Science 11(1), 23–63 (1987) https://doi.org/10.1111/j.1551-6708.1987.tb00862.x
- Lillicrap et al. [2020] Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) https://doi.org/10.1038/s41583-020-0277-3
- Bengio [2014] Bengio, Y.: How auto-encoders could provide credit assignment in deep networks via target propagation. CoRR abs/1407.7906 (2014)
- Lee et al. [2014] Lee, D.-H., Zhang, S., Fischer, A., Bengio, Y.: Difference target propagation. In: ECML/PKDD (2014)
- Bartunov et al. [2018] Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G., Lillicrap, T.: Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In: Neural Information Processing Systems (2018)
- Friston [2008] Friston, K.: Hierarchical models in the brain. PLoS Computational Biology 4(11), e1000211 (2008)
- Whittington and Bogacz [2017] Whittington, J.C., Bogacz, R.: An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation 29(5), 1229–1262 (2017)
- Millidge et al. [2022] Millidge, B., Tschantz, A., Buckley, C.L.: Predictive coding approximates backprop along arbitrary computation graphs. Neural Computation 34(6), 1329–1368 (2022)
- Buckley et al. [2017] Buckley, C.L., Kim, C.S., McGregor, S., Seth, A.K.: The free energy principle for action and perception: A mathematical review. Journal of Mathematical Psychology 81, 55–79 (2017)
- Han et al. [2018] Han, K., Wen, H., Zhang, Y., Fu, D., Culurciello, E., Liu, Z.: Deep predictive coding network with local recurrent processing for object recognition. Advances in neural information processing systems 31 (2018)
- Ororbia and Kifer [2022] Ororbia, A., Kifer, D.: The neural coding framework for learning generative models. Nature communications 13(1), 2064 (2022)
- Baydin et al. [2022] Baydin, A.G., Pearlmutter, B.A., Syme, D., Wood, F., Torr, P.: Gradients without backpropagation. arXiv preprint arXiv:2202.08587 (2022)
- Silver et al. [2021] Silver, D., Goyal, A., Danihelka, I., Hessel, M., Hasselt, H.: Learning by directional gradient descent. In: International Conference on Learning Representations (2021)
- Wengert [1964] Wengert, R.E.: A simple automatic derivative evaluation program. Communications of the ACM 7(8), 463–464 (1964)
- Ren et al. [2022] Ren, M., Kornblith, S., Liao, R., Hinton, G.: Scaling forward gradient with local losses. arXiv preprint arXiv:2210.03310 (2022)
- Bengio et al. [2006] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: Schölkopf, B., Platt, J.C., Hofmann, T. (eds.) Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pp. 153–160. MIT Press (2006)
- Scellier and Bengio [2019] Scellier, B., Bengio, Y.: Equivalence of equilibrium propagation and recurrent backpropagation. Neural computation 31(2), 312–329 (2019)
- Laborieux et al. [2021] Laborieux, A., Ernoult, M., Scellier, B., Bengio, Y., Grollier, J., Querlioz, D.: Scaling equilibrium propagation to deep convnets by drastically reducing its gradient estimator bias. Frontiers in neuroscience 15, 633674 (2021)
- O’Connor et al. [2019] O’Connor, P., Gavves, E., Welling, M.: Training a spiking neural network with equilibrium propagation. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1516–1523 (2019). PMLR
- Fang et al. [2023] Fang, W., Chen, Y., Ding, J., Yu, Z., Masquelier, T., Chen, D., Huang, L., Zhou, H., Li, G., Tian, Y.: Spikingjelly: An open-source machine learning infrastructure platform for spike-based intelligence. Science Advances 9(40), 1480 (2023)
- Maass [1997] Maass, W.: Networks of spiking neurons: the third generation of neural network models. Neural Networks 10(9), 1659–1671 (1997)
- Akopyan et al. [2015] Akopyan, F., Sawada, J., Cassidy, A., Alvarez-Icaza, R., Arthur, J., Merolla, P., Imam, N., Nakamura, Y., Datta, P., Nam, G.-J., et al.: Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE transactions on computer-aided design of integrated circuits and systems 34(10), 1537–1557 (2015)
- Davies et al. [2018] Davies, M., Srinivasa, N., Lin, T.-H., Chinya, G., Cao, Y., Choday, S.H., Dimou, G., Joshi, P., Imam, N., Jain, S., et al.: Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1), 82–99 (2018)