
\DeclareMathOperator{\var}{Var}

\coltauthor{\Name{Max Dabagia} \Email{maxdabagia@gatech.edu}\\
\addr{School of Computer Science, Georgia Tech}
\AND \Name{Christos H. Papadimitriou} \Email{christos@columbia.edu}\\
\addr{Department of Computer Science, Columbia University}
\AND \Name{Santosh S. Vempala} \Email{vempala@gatech.edu}\\
\addr{School of Computer Science, Georgia Tech}}

Assemblies of Neurons Learn to Classify Well-Separated Distributions

Abstract

An assembly is a large population of neurons whose synchronous firing is hypothesized to represent a memory, concept, word, or other cognitive category. Assemblies are believed to provide a bridge between high-level cognitive phenomena and low-level neural activity. Recently, a computational system called the Assembly Calculus (AC), with a repertoire of biologically plausible operations on assemblies, has been shown capable of simulating not only arbitrary space-bounded computation, but also complex cognitive phenomena such as language, reasoning, and planning. However, the mechanism whereby assemblies can mediate learning has not been known. Here we present such a mechanism, and prove rigorously that, for simple classification problems defined on distributions of labeled assemblies, a new assembly representing each class can be reliably formed in response to a few stimuli from the class; this assembly is henceforth reliably recalled in response to new stimuli from the same class. Furthermore, such class assemblies will be distinguishable as long as the respective classes are reasonably separated — for example, when they are clusters of similar assemblies, or more generally separable with margin by a linear threshold function. To prove these results, we draw on random graph theory with dynamic edge weights to estimate sequences of activated vertices, yielding strong generalizations of previous calculations and theorems in this field over the past five years. These theorems are backed up by experiments demonstrating the successful formation of assemblies which represent concept classes on synthetic data drawn from such distributions, and also on MNIST, which lends itself to classification through one assembly per digit. Seen as a learning algorithm, this mechanism is entirely online, generalizes from very few samples, and requires only mild supervision — all key attributes of learning in a model of the brain. We argue that this learning mechanism, supported by separate sensory pre-processing mechanisms for extracting attributes, such as edges or phonemes, from real world data, can be the basis of biological learning in cortex.


1 Introduction

The brain has been a productive source of inspiration for AI, from the perceptron and the neocognitron to deep neural nets. Machine learning has since advanced to dizzying heights of analytical understanding and practical success, but the study of the brain has lagged behind in one important dimension: After half a century of intensive effort by neuroscientists (both computational and experimental), and despite great advances in our understanding of the brain at the level of neurons, synapses, and neural circuits, we still have no plausible mechanism for explaining intelligence, that is, the brain’s performance in planning, decision-making, language, etc. As Nobel laureate Richard Axel put it, “we have no logic for translating neural activity into thought and action” (Axel, 2018).

Recently, a high-level computational framework was developed with the explicit goal of filling this gap: the Assembly Calculus (AC) (Papadimitriou et al., 2020), a computational model whose basic data type is the assembly of neurons. Assemblies, called “the alphabet of the brain” (Buzsáki, 2019), are large sets of neurons whose simultaneous excitation is tantamount to the subject’s thinking of an object, idea, episode, or word (see Piantadosi et al. (2016)). Dating back to the birth of neuroscience, the “million-fold democracy” by which groups of neurons act collectively without central control was first proposed by Sherrington (1906), and was the empirical phenomenon that Hebb attempted to explain with his theory of plasticity (Hebb, 1949). Assemblies are initially created to record memories of external stimuli (Quiroga, 2016), and are believed to be subsequently recalled, copied, altered, and manipulated in the non-sensory brain (Piantadosi et al., 2012; Buzsáki, 2010). The Assembly Calculus provides a repertoire of operations for such manipulation, namely project, reciprocal-project, associate, pattern-complete, and merge, encompassing a complete computational system. Since the Assembly Calculus is, to our knowledge, the only extant computational system whose purpose is to bridge the gap identified by Axel in the above quote (Axel, 2018), it is of great interest to establish that complex cognitive functions can be plausibly expressed in it. Indeed, significant progress has been made over the past year; see for example Mitropolsky et al. (2021) for a parser of English and d’Amore et al. (2021) for a program mediating planning in the blocks world, both written in the AC programming system. Yet despite these recent advances, one fundamental question is left unanswered: If the Assembly Calculus is a meaningful abstraction of cognition and intelligence, why does it not have a learn command? How can the brain learn through assembly representations?

This is the question addressed and answered in this paper. Since assembly operations constitute a new learning framework and device, one has to start from the most basic questions: Can this model classify assembly-encoded stimuli that are separated through clustering, or by half-spaces? Recall that learning linear thresholds is a theoretical cornerstone of supervised learning, leading to a legion of fundamental algorithms: Perceptron, Winnow, multiplicative weights, Isotron, kernels and SVMs, and many variants of gradient descent.

Following Papadimitriou et al. (2020), we model the brain as a directed graph of excitatory neurons with dynamic edge weights (due to plasticity). The brain is subdivided into areas, for simplicity each containing $n$ neurons connected through a $G_{n,p}$ random directed graph (Erdős and Rényi, 1960). Certain ordered pairs of areas are also connected, through random bipartite graphs. We assume that neurons fire in discrete time steps. At each time step, each neuron in a brain area will fire if its synaptic input from the firings of the previous step is among the $k$ highest out of the $n$ neurons in its brain area. This selection process is called $k$-cap, and is an abstraction of the process of inhibition in the brain, in which a separate population of inhibitory neurons is induced to fire by the firing of excitatory neurons in the area, and through negatively-weighted connections prevents all but the most stimulated excitatory neurons from firing. Synaptic weights are altered via Hebbian plasticity and homeostasis (see Section 2 for a full description). In this stylized mathematical model, reasonably consistent with what is known about the brain, it has been shown that the operations of the Assembly Calculus converge and work as specified (with high probability relative to the underlying random graphs). These results have also been replicated by simulations in the model above, and also in more biologically realistic networks of spiking neurons (see Legenstein et al. (2018); Papadimitriou et al. (2020)). In this paper we develop, in the same model, mechanisms for learning to classify well-separated classes of stimuli, including clustered distributions and linear threshold functions with margin. Moreover, considering that learning from few examples, and with mild supervision, are crucial characteristics of any brain-like learning algorithm, we show that learning with assemblies does both quite naturally.

2 A mathematical model of the brain

Here we outline the basics of the model in Papadimitriou et al. (2020). There are a finite number $a$ of brain areas, denoted $X,Y,\ldots$ (but in this paper, we will only need one brain area where learning happens, plus another area where the stimuli are presented). Each area is a random directed graph with $n$ nodes called neurons, with each directed edge present independently with probability $p$ (for simplicity, we take $n$ and $p$ to be the same across areas). Some ordered pairs of brain areas are also connected by random bipartite graphs, with the same connection probability $p$. Importantly, each area may be inhibited, which means that its neurons cannot fire; the status of the areas is determined by explicit inhibit/disinhibit commands of the AC.\footnote{The brain’s neuromodulatory systems (Jones, 2003; Harris and Thiele, 2011) are plausible candidates to implement these mechanisms.} This defines a large random graph $G=(N,E)$ with $|N|=an$ nodes and a number $|E|$ of directed edges which is in expectation $(a+b)pn^{2}-apn$, where $b$ is the number of pairs of areas that are connected. Each edge $(i,j)\in E$ in this graph, called a synapse, has a dynamic non-negative weight $w_{ij}(t)$, initially $1$.
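To make the connectivity concrete, the random graphs underlying a pair of areas can be sampled as follows (a minimal Python/NumPy sketch; the parameter values and variable names are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1_000, 0.1  # the small-scale parameters quoted later in this section

# Recurrent connectivity within an area: a directed G_{n,p} graph, each
# synapse present independently with probability p and initial weight 1.
recurrent = (rng.random((n, n)) < p).astype(float)
np.fill_diagonal(recurrent, 0.0)  # exclude self-loops (an illustrative choice)

# Afferent connectivity from one area to another: a random bipartite
# graph with the same connection probability p.
afferent = (rng.random((n, n)) < p).astype(float)
```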

This framework gives rise to a discrete-time dynamical system, as follows: The state of the system at any time step $t$ consists of (a) a bit for each area $X$, inh$(X,t)$, initially 0, denoting whether the area is inhibited; (b) a bit for each neuron $i$, fires$(i,t)$, denoting whether $i$ spikes at time $t$ (zero if the area $X$ of $i$ has inh$(X,t)=1$); and (c) the weights of all synapses $w_{ij}(t)$, initially one.

The state transition of the dynamical system is as follows: For each neuron $i$ in area $X$ with inh$(X,t+1)=0$ (see the next paragraph for how inh$(X,t+1)$ is determined), define its synaptic input at time $t+1$,

\mbox{SI}(i,t+1)=\sum_{(j,i)\in E}\hbox{fires}(j,t)\,w_{ji}(t).

For each $i$ in area $X$ with inh$(X,t+1)=0$, we set fires$(i,t+1)=1$ iff $i$ is among the $k$ neurons in its area that have highest SI$(i,t+1)$ (breaking ties arbitrarily). This is the $k$-cap operation, a basic ingredient of the AC framework, modeling the inhibitory/excitatory balance of a brain area.\footnote{Binas et al. (2014) showed rigorously how a $k$-cap dynamic could emerge in a network of excitatory and inhibitory neurons.} As for the synaptic weights,

w_{ji}(t+1)=w_{ji}(t)\left(1+\beta\cdot\hbox{fires}(j,t)\,\hbox{fires}(i,t+1)\right).

That is, if $j$ fires at time $t$ and $i$ fires at time $t+1$, Hebbian plasticity dictates that $w_{ji}$ be increased by a factor of $1+\beta$ at time $t+1$. So that the weights do not grow without bound, a homeostasis process renormalizes, at a slower time scale, the sum of weights along the incoming synapses of each neuron (see Davis (2006) and Turrigiano (2011) for reviews of this mechanism in the brain).
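The following sketch combines the synaptic input, the $k$-cap, and the Hebbian update into one state transition (illustrative Python continuing the NumPy setup above; the function and argument names are ours, and homeostasis is omitted):

```python
import numpy as np

def step(prev_sensory, prev_area, afferent, recurrent, k, beta):
    """One discrete time step of a (disinhibited) learning area.

    prev_sensory, prev_area: 0/1 vectors of the neurons that fired at time t.
    Returns the 0/1 vector of neurons firing at time t+1, updating the
    weight matrices in place via Hebbian plasticity.
    """
    # Synaptic input SI(i, t+1): weighted sum over presynaptic neurons that fired.
    inputs = prev_sensory @ afferent + prev_area @ recurrent

    # k-cap: exactly the k neurons with the highest input fire (ties arbitrary).
    fires = np.zeros_like(inputs)
    fires[np.argsort(inputs)[-k:]] = 1.0

    # Hebbian update: w_ji *= (1 + beta) whenever j fired at t and i fires at t+1.
    afferent += beta * afferent * np.outer(prev_sensory, fires)
    recurrent += beta * recurrent * np.outer(prev_area, fires)
    return fires
```

Homeostasis would, at a slower time scale, additionally rescale each neuron's incoming weights (each column of the matrices) to a fixed sum.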

Finally, the AC is a computational system driving the dynamical system by executing commands at each time step $t$ (like a programming language driving the physical system that is the computer’s hardware). The AC commands (dis)inhibit$(X)$ change the inhibition status of an area at time $t$; and the command fire$(x)$, where $x$ is the name of an assembly (defined next) in a disinhibited area, overrides the selection by $k$-cap, and causes the $k$ neurons of assembly $x$ to fire at time $t$.

An assembly is a highly interconnected (in terms of both number of synapses and their weights) set of $k$ neurons in an area encoding a real-world entity. Initially, assembly-like representations exist only in a special sensory area, as representations of perceived real-world entities such as a heard (or read) word. Assemblies in the remaining, non-sensory areas are an emergent behavior of the system, copied and re-copied, merged, associated, etc., through further commands of the AC. This is how the model is able to simulate arbitrary $\frac{n}{k}$-space-bounded computations (Papadimitriou et al., 2020). The most basic such command is project$(x,Y,y)$, which, starting from an assembly $x$ in area $X$, creates in area $Y$ (where there is connectivity from $X$ to $Y$) a new assembly $y$, which has strong synaptic connectivity from $x$ and which will henceforth fire whenever $x$ fired in the previous step and $Y$ is not inhibited. This command entails disinhibiting the areas $X,Y$, and then firing (the neurons in) assembly $x$ for the next $T$ time steps. It was shown by Papadimitriou and Vempala (2019) that, with high probability, after a small number of steps, a stable assembly $y$ in $Y$ will emerge, which is densely intraconnected and has high connectivity from $x$. The mechanism achieving this convergence involves synaptic input from $x$, which creates an initial set $y^{1}$ of firing neurons in $Y$, which then evolves to $y^{t},t=2,\ldots$ through sustained synaptic input from $x$ and recurrent input from $y^{t-1}$, while these two effects are further enhanced by plasticity.

Incidentally, this convergence proof (see Legenstein (2018); Papadimitriou and Vempala (2019); Papadimitriou et al. (2020) for a sequence of sharpened versions of this proof over the past years) is the most mathematically sophisticated contribution of this theory to date. The theorems of the present paper can be seen as substantial generalizations of that result: Whereas in previous work an assembly is formed as a copy of one stimulus firing repeatedly (memorization), so that this new assembly will henceforth fire whenever the same stimulus is presented again, in this paper we show rigorously that an assembly will be formed in response to the sequential firing of many stimuli, all drawn from the same distribution (generalization), and the formed assembly will fire reliably every time another stimulus from the same distribution is presented.

Figure 1: A mathematical model of learning in the brain. Our model (left) has a sensory area (column) connected to a brain area (square), both made up of spiking neurons. Two different stimulus classes (with their core sets in blue/green on the left) project from the sensory area via synaptic connections (arrows). Assemblies in the brain area (with core sets in the corresponding colors) form in response to these stimulus classes, each of which consistently fires when a constant fraction of the associated stimulus class’s core set does. Our learning algorithm (right) consists of presenting a stream of stimuli from each class.

The key parameters of our model are $n,k,p$, and $\beta$. Intended values for the brain are $n=10^{7},k=10^{4},p=10^{-3},\beta=0.1$, but in our simulations we have also had success on a much smaller scale, with $n=10^{3},k=10^{2},p=10^{-1},\beta=0.1$. $\beta$ is an important parameter, in that adequately large values of $\beta$ guarantee the convergence of the AC operations. For a publicly available simulator of the Assembly Calculus (in which the Learning System below can be readily implemented) see \url{http://brain.cc.gatech.edu}.

The learning mechanism.

For the purpose of demonstrating learning within the framework of AC, we consider the specific setting described below. First, there is a special area, called the sensory area, in which training and testing data are encoded as assembly-like representations called stimuli. There is only one other brain area (besides the sensory area), and that is where learning happens, through the formation of assemblies in response to sequences of stimuli.

A stimulus is a set of about $k$ neurons firing simultaneously (“presented”) in the sensory area. Note that, exceptionally in the sensory area, a number of neurons that is a little different from $k$ may fire at a step. A stimulus class $A$ is a distribution over stimuli, defined by three parameters: two scalars $r,q\in[0,1]$, $r>q$, and a set of $k$ neurons $S_{A}$ in the sensory area. To generate a stimulus $x\in\{0,1\}^{n}$ in the class $A$, each neuron $i\in S_{A}$ is chosen with probability $r$, while for each $i\not\in S_{A}$, the probability of choosing neuron $i$ is $qk/n$. It follows immediately that, in expectation, an $r$ fraction of the neurons in the stimulus core are set to $1$ and the number of neurons outside the core that are set to $1$ is also $O(k)$.
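A minimal sampler for such a class (illustrative Python; we take $S_A$ to be an arbitrary index set passed in as `core`):

```python
import numpy as np

def sample_stimulus(rng, n, k, r, q, core):
    """Draw x in {0,1}^n: core neurons fire with probability r, and each
    of the remaining neurons fires with probability q*k/n."""
    x = (rng.random(n) < q * k / n).astype(float)
    x[core] = (rng.random(len(core)) < r).astype(float)
    return x
```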

The presentation of a sequence of stimuli from a class $A$ in the sensory area evokes in the learning system a response $R$, a distribution over assemblies in the brain area. We show that, as a consequence of plasticity and $k$-cap, this distribution $R$ will be highly concentrated, in the following sense: Consider the set $S_{R}$ of all assemblies $x$ that have positive probability in $R$. Then the numbers of neurons in both the intersection $R^{*}=\bigcap_{x\in S_{R}}x$, called the core of $R$, and the union $\bar{R}=\bigcup_{x\in S_{R}}x$ are close to $k$: in particular, $k-o(k)$ and $k+o(k)$ respectively.\footnote{The larger the plasticity, the closer these two values are (see Papadimitriou and Vempala (2019), Fig. 2).} In other words, neurons in $R^{*}$ fire far more often on average than neurons in $\bar{R}\setminus R^{*}$.
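In simulations, the core and the union of a response are easy to estimate by intersecting and uniting the observed caps (a sketch, assuming `caps` is a list of 0/1 response vectors, one per presented stimulus):

```python
import numpy as np

def core_and_union(caps):
    """Estimate the core R* (neurons in every cap) and the union R-bar."""
    caps = np.asarray(caps, dtype=bool)
    return caps.all(axis=0), caps.any(axis=0)
```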

Finally, our learning protocol is this: Beginning with the brain area at rest, stimuli are repeatedly sampled from a class and made to fire. After a small number of training samples, the brain area returns to rest, and then the same procedure is repeated for the next stimulus class, and so on. Then testing stimuli are presented in random order to test the extent of learning. (See Algorithm 1, written in an AC-derived programming language.)

{algorithm}

The learning mechanism. ($B$ denotes the brain area.)
\KwIn{A set of stimulus classes $A_{1},\ldots,A_{c}$; $T\geq 1$}
\KwOut{A set of assemblies $y_{1},\ldots,y_{c}$ in the brain area encoding these classes}
\ForEach{stimulus class $i$}{
  inh$(B)\leftarrow 0$\;
  \ForEach{time step $1\leq t\leq T$}{
    Sample $x\sim A_{i}$ and fire $x$\;
  }
  $y_{i}\leftarrow$ \texttt{read}$(B)$\;
  inh$(B)\leftarrow 1$\;
}

That is, we sample $T$ stimuli $x$ from each class, fire each $x$ to cause synaptic input in the brain area, and after the $T$th sample has fired we record the assembly which has been formed in the brain area. This is the representation for this class.
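Assembled from the illustrative helpers above, the whole of Algorithm 1 is a short loop (again a sketch, not the authors' reference implementation):

```python
import numpy as np

def learn_classes(rng, cores, afferent, recurrent, n, k, r, q, beta, T=5):
    """Present T samples from each class; record the resulting assemblies."""
    assemblies = []
    for core in cores:                    # one stimulus class at a time
        area_fires = np.zeros(n)          # the brain area starts at rest
        for _ in range(T):
            x = sample_stimulus(rng, n, k, r, q, core)
            area_fires = step(x, area_fires, afferent, recurrent, k, beta)
        assemblies.append(area_fires.copy())   # read(B): the final cap
        # inh(B) <- 1: the area is inhibited before the next class begins
    return assemblies
```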

Related work

There are numerous learning models in the neuroscience literature. In a variation of the model we consider here, Rangamani and Gandhi (2020) have considered supervised learning of Boolean functions using assemblies of neurons, by setting up separate brain areas for each label value. Amongst other systems with rigorous guarantees, assemblies are superficially similar to the “items” of Valiant’s neuroidal model (Valiant, 1994), in which supervised learning experiments have been conducted (Valiant, 2000; Feldman and Valiant, 2009), where an output neuron is clamped to the correct label value, while the network weights are updated under the model. The neuroidal model is considerably more powerful than ours, allowing for arbitrary state changes of neurons and synapses; in contrast, our assemblies rely on only two biologically sound mechanisms, plasticity and inhibition.

Hopfield nets (Hopfield, 1982) are recurrent networks of neurons with symmetric connection weights which will converge to a memorized state from a sufficiently similar one, when properly trained using a local and incremental update rule. In contrast, the memorized states our model produces (which we call assemblies) emerge through plasticity and randomization from the structure of a random directed network, whose weights are asymmetric and nonnegative, and in which inhibition — not the sign of total input — selects which neurons will fire.

Stronger learning mechanisms have recently been proposed. Inspired by the success of deep learning, a large body of work has shown that cleverly laid-out microcircuits of neurons can approximate backpropagation to perform gradient descent (Lillicrap et al., 2016; Sacramento et al., 2017; Guerguiev et al., 2017; Sacramento et al., 2018; Whittington and Bogacz, 2019; Lillicrap et al., 2020). These models rely crucially on novel types of neural circuits which, although biologically possible, are not presently known or hypothesized in neurobiology, nor are they proposed as a theory of the way the brain works. These models are capable of matching the performance of deep networks on many tasks, which are more complex than the simple, classical learning problems we consider here. The difference between this work and ours is, again, that here we are showing that learning arises naturally from well-understood mechanisms in the brain, in the context of the assembly calculus.

3 Results

A handful of stimuli sampled from an input distribution are presented sequentially at the sensory area. The only form of supervision required is that all training samples from a given class be presented consecutively. Plasticity and inhibition alone ensure that, in response to this activation, an assembly will be formed for each class, and that this same assembly will be recalled at testing upon presentation of other samples from the same distribution. In other words, learning happens. In fact, despite all these limitations, we show that the device is an efficient learner of interesting concept classes.

Our first theorem is about the creation of an assembly in response to inputs from a stimulus class. This is a generalization of a theorem from Papadimitriou and Vempala (2019), where the input stimulus was held constant; here the input is a stream of random samples from the same stimulus class. Like all our results, it is a statement holding with high probability (WHP), where the underlying random event is the random graph and the random samples. When sampled stimuli fire, the assembly in the brain area changes. The neurons participating in the current assembly (those whose synaptic input from the previous step is among the $k$ highest) are called the current winners. A first-time winner is a current winner that participated in no previous assembly (for the current stimulus class).

Theorem 3.1 (Creation).

Consider a stimulus class $A$ projected to a brain area. Assume that

\beta\geq\beta_{0}=\frac{1}{r^{2}}\cdot\frac{\left(\sqrt{2}-r^{2}\right)\sqrt{2\ln\left(\frac{n}{k}\right)}+\sqrt{6}}{\sqrt{kp}+\sqrt{2\ln\left(\frac{n}{k}\right)}}

Then WHP no first-time winners will enter the cap after $O(\log k)$ rounds, and moreover the total number of winners $\bar{A}$ can be bounded as

|\bar{A}|\leq\frac{k}{1-\exp\left(-(\beta/\beta_{0})^{2}\right)}\leq k+O\left(\frac{\log n}{r^{3}p\beta^{2}}\right)
Remark 3.2.

The theorem implies that for a small constant $c$, it suffices to have plasticity parameter

\beta\geq\frac{1}{r^{2}}\cdot\frac{c}{\sqrt{kp/(2\ln(n/k))}+1}.
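As a numerical sanity check, the bound $\beta_{0}$ of Theorem 3.1 is easy to evaluate (a sketch; the parameter values are the small-scale ones from Section 2, with $r$ chosen illustratively):

```python
import numpy as np

def beta_0(n, k, p, r):
    """Evaluate the plasticity lower bound of Theorem 3.1."""
    L = np.sqrt(2 * np.log(n / k))
    return ((np.sqrt(2) - r**2) * L + np.sqrt(6)) / (r**2 * (np.sqrt(k * p) + L))

print(beta_0(n=1_000, k=100, p=0.1, r=0.9))  # roughly 0.87
```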

Our second theorem is about recall for a single assembly, when a new stimulus from the same class is presented. We assume that examples from a stimulus class $A$ have been presented, and a response assembly $A^{*}$ encoding this class has been created, per the previous theorem.

Theorem 3.3 (Recall).

WHP over the stimulus class, the set $C_{1}$ firing in response to a test stimulus from the class $A$ will overlap $A^{*}$ in a fraction of at least $1-e^{-kpr}$, i.e.

\frac{|C_{1}\setminus A^{*}|}{k}\leq e^{-kpr}

The proof entails showing that the average weight of incoming connections to a neuron in $A^{*}$ from neurons in $S_{A}$ is at least

1+\frac{1}{\sqrt{r}}\left(\sqrt{2}+\sqrt{\frac{2}{kpr}\ln\left(\frac{n}{k}\right)+2}\right)

Our third theorem is about the creation of a second assembly corresponding to a second stimulus class. This can easily be extended to many classes and assemblies. As in the previous theorem, we assume that $O(\log k)$ examples from the stimulus class $A$ have been presented, and $\bar{A}$ has been created. Then we introduce $B$, a second stimulus class, with $|S_{A}\cap S_{B}|=\alpha k$, and present $O(\log k)$ samples to induce a series of caps $B_{1},B_{2},\ldots$, with $B^{*}$ as their union.

Theorem 3.4 (Multiple Assemblies).

The total support of $B^{*}$ can be bounded WHP as

|B^{*}|\leq\frac{k}{1-\exp\left(-(\beta/\beta_{0})^{2}\right)}\leq k+O\left(\frac{\log n}{r^{3}p\beta^{2}}\right)

Moreover, WHP, the overlap of the core sets $A^{*}$ and $B^{*}$ will preserve the overlap of the stimulus classes, so that $|A^{*}\cap B^{*}|\leq\alpha k$.

This time the proof relies on the fact that the average weight of incoming connections to a neuron in $A^{*}$ is upper-bounded by

\gamma\leq 1+\frac{\sqrt{2\ln\left(\frac{n}{k}\right)}-\sqrt{2\ln\left((1+r)/r\alpha\right)}}{\alpha r\sqrt{kp}}

Our fourth theorem is about classification after the creation of multiple assemblies, and shows that random stimuli from any class are mapped to their corresponding assembly. We state it here for two stimulus classes, but again it extends to several. We assume that stimulus classes $A$ and $B$ overlap in their core sets by a fraction of $\alpha$, and that they have been projected to form distributions of assemblies $A^{*}$ and $B^{*}$, respectively.

Theorem 3.5 (Classification).

If a random stimulus chosen from a particular class (WLOG, say $B$) fires, causing a set $C_{1}$ of learning-area neurons to fire, then WHP over the stimulus class the fraction of neurons in both the cap $C_{1}$ and $B^{*}$ will be at least

\frac{|C_{1}\cap B^{*}|}{k}\geq 1-2\exp\left(-\frac{1}{2}(\gamma\alpha-1)^{2}kpr\right)

where $\gamma$ is a lower bound on the average weight of incoming connections to a neuron in $A^{*}$ (resp. $B^{*}$) from neurons in $S_{A}$ (resp. $S_{B}$).

Taken together, the above results guarantee that this mechanism can learn to classify well-separated distributions, where each distribution has a constant fraction of its nonzero coordinates in a subset of $k$ input coordinates. The process is naturally interpretable: an assembly is created for each distribution, so that random stimuli are mapped to their corresponding assemblies, and the assemblies for different distributions overlap no more than the core subsets of their corresponding distributions do.

Finally, we consider the setting where the labeling function is a linear threshold function, parameterized by an arbitrary nonnegative vector $v$ and margin $\Delta$. We will create a single assembly to represent examples on one side of the threshold, i.e. those for which $v\cdot X\geq\|v\|_{1}k/n$. We define $\mathcal{D}_{+}$ to be the distribution of these examples, where each coordinate is an independent Bernoulli variable with mean $\mathbb{E}(X_{i})=k/n+\Delta v_{i}$, and define $\mathcal{D}_{-}$ to be the distribution of negative examples, where each coordinate is again an independent Bernoulli variable, now all identically distributed with mean $k/n$. (Note that the support of the positive and negative distributions is the same; there is a small probability of drawing a positive example from the negative distribution, or vice versa.) To serve as a classifier, a fraction $1-\epsilon_{+}$ of neurons in the assembly must be guaranteed to fire for a positive example, and no more than a fraction $\epsilon_{-}<1-\epsilon_{+}$ guaranteed to fire for a negative one. A test example is then classified as positive if at least a $1-\epsilon$ fraction of neurons in the assembly fire (for $\epsilon\in[\epsilon_{-},1-\epsilon_{+}]$), and negative otherwise. The last theorem shows that this can in fact be done with high probability, as long as the normal vector $v$ of the linear threshold is neither too dense nor too sparse. Additionally, we assume synapses are subject to homeostasis in between training and evaluation; that is, all of the incoming weights to a neuron are normalized to sum to 1.
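This data model is straightforward to sample from (illustrative Python; `v` is assumed nonnegative with $\|v\|_{2}=1$ and $k/n+\Delta v_{i}\leq 1$, as in Theorem 3.6):

```python
import numpy as np

def sample_example(rng, n, k, v, delta, positive=True):
    """Positive examples: coordinate i fires with probability k/n + delta*v_i.
    Negative examples: every coordinate fires with probability k/n."""
    means = k / n + delta * v if positive else np.full(n, k / n)
    return (rng.random(n) < means).astype(float)
```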

Theorem 3.6 (Learning Linear Thresholds).

Let $v$ be a nonnegative vector normalized to be of unit Euclidean length ($\|v\|_{2}=1$). Assume that $\Omega(k)=\|v\|_{1}\leq\sqrt{n}/2$ and

\Delta^{2}\beta\geq\sqrt{\frac{2k}{p}}\left(\sqrt{2\ln(n/k)+2}+1\right).

Then, sequentially presenting $\Omega(\log k)$ samples drawn at random from $\mathcal{D}_{+}$ forms an assembly $A^{*}$ that correctly separates $\mathcal{D}_{+}$ from $\mathcal{D}_{-}$: with probability $1-o(1)$, a randomly drawn example from $\mathcal{D}_{+}$ will result in a cap which overlaps at least $3k/4$ neurons in $A^{*}$, and an example from $\mathcal{D}_{-}$ will create a cap which overlaps no more than $k/4$ neurons in $A^{*}$.

Remark 3.7.

The bound on $\Delta^{2}\beta$ leads to two regimes of particular interest: In the first,

\beta\geq\frac{\sqrt{2\ln(n/k)+2}+1}{\sqrt{kp}}

and $\Delta\geq\sqrt{k}$, which is similar to the plasticity parameter required for a fixed stimulus (Papadimitriou and Vempala, 2019) or stimulus classes; in the second, $\beta$ is a constant, and

\Delta\geq\left(\frac{2k}{\beta^{2}p}\right)^{1/4}\left(\sqrt{2\ln(n/k)+2}+1\right)^{1/2}.
Remark 3.8.

We can ensure that the number of neurons outside of $A^{*}$ firing for a positive example, or in $A^{*}$ firing for a negative example, are both $o(k)$ with small overhead,\footnote{I.e., increasing the plasticity constant $\beta$ by a factor of $1+o(1)$.} so that plasticity can be active during the classification phase.

Since our focus in this paper is on highlighting the brain-like aspects of this learning mechanism, we emphasize stimulus classes as a case of particular interest, as they are a probabilistic generalization of the single stimuli considered in Papadimitriou and Vempala (2019). Linear threshold functions are an equally natural way to generalize a single $k$-sparse stimulus, say $v$; all the 0/1 points on the positive side of the threshold $v^{\top}x\geq\alpha k$ have at least an $\alpha$ fraction of the $k$ neurons of the stimulus active.

Finally, reading the output of the device by the Assembly Calculus is simple: Add a readout area to the two areas so far (stimulus and learning), and project to this area one of the assemblies formed in the learning area for each stimulus class. The assembly in the learning area that fires in response to a test sample will cause the assembly in the readout area corresponding to the class to fire, and this can be sensed through the readout operation of the AC.

Proof overview.

The proofs of all five theorems can be found in the Appendix. The proofs hinge on showing that large numbers of certain neurons of interest will be included in the cap on a particular round — or excluded from it. More specifically:

  • To create an assembly, the sequence of caps should converge to the assembly’s core set. In other words, WHP an increasing fraction of the neurons selected by the cap in a particular step will also be selected at the next one.

  • For recall, a large fraction of the assembly should fire (i.e. be included in the cap) when presented with an example from the class.

  • To differentiate stimuli (i.e. classify), we need to ensure that a large fraction of the correct assembly will fire, while no more than a small fraction of the other assemblies do.

Following Papadimitriou and Vempala (2019), we observe that if the probability of a neuron having input at least $t$ is no more than $\epsilon$, then no more than an $\epsilon$ fraction of the cohort of neurons will have input exceeding $t$ (with constant probability). By approximating the total input to a neuron as Gaussian and using well-known bounds on Gaussian tail probabilities, we can solve for $t$, which gives an explicit input threshold neurons must surpass to make a particular cap. Then, we argue that the advantage conferred by plasticity, combined with the similarity of examples from the same class, gives the neurons of interest enough of an advantage that the input to all but a small constant fraction will exceed the threshold.
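To illustrate this style of calculation, the input level that roughly $k$ of $n$ neurons exceed can be obtained by inverting the Gaussian tail at mass $k/n$ (a sketch using the first-round input distribution from the proof of Theorem 3.1; the tail bound of Lemma .4 replaces the exact quantile in the proofs):

```python
import numpy as np
from scipy.stats import norm

n, k, p, r, q = 1_000, 100, 0.1, 0.9, 0.1

# First-round input is approximately N(kp(r+q), kp(r+q)) by Lemma .2.
mu = k * p * (r + q)
sigma = np.sqrt(k * p * (r + q))

# Threshold t with Pr(input >= t) = k/n, so about k of n neurons exceed it;
# the closed form C_1 = mu + sigma * sqrt(2 ln(n/k)) upper-bounds this.
t = norm.isf(k / n, loc=mu, scale=sigma)
print(t)
```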

4 Experiments

The learning algorithm has been run on both synthetic and real-world datasets, as illustrated in the figures below. Code for experiments is available at \url{https://github.com/mdabagia/learning-with-assemblies}.

Beyond the basic method of presenting a few examples from the same class and allowing plasticity to alter synaptic weights, the training procedure is slightly different for each of the concept classes (stimulus classes, linearly separated data, and MNIST digits). In each case, we renormalize the incoming weights of each neuron to sum to one after concluding the presentation of each class, and classification is performed on top of the learned assemblies by predicting the class corresponding to the assembly with the most neurons firing.

  • For stimulus classes, we estimate the assembly for each class as composed of the neurons which fired in response to the last training example, which in practice are the same as those most likely to fire for a random test example.

  • For a linear threshold, we only present positive examples, and thus only form an assembly for one class. As with stimulus classes, the neurons in the assembly can be estimated by the last training cap or by averaging over test examples. We classify by comparing against a fixed threshold, generally half the cap size.

Additionally, it is important to set the plasticity parameter ($\beta$) large enough that assemblies are reliably formed. We had success with $\beta=0.1$ for stimulus classes and $\beta=1.0$ for linear thresholds.
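The readout on top of the learned assemblies is a simple overlap count (a sketch; `assemblies` is as returned by the illustrative training loop above, and `cap` is the response to a test example):

```python
import numpy as np

def classify(cap, assemblies):
    """Predict the class whose assembly shares the most neurons with the cap."""
    overlaps = [np.sum((cap > 0) & (a > 0)) for a in assemblies]
    return int(np.argmax(overlaps))
```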

In Figure 2 (a) & (b), we demonstrate learning of two stimulus classes, while in Figure 2 (c) & (d), we demonstrate the result of learning a well-separated linear threshold function with assemblies. Both achieved perfect accuracy. Additionally, assemblies readily generalize to a larger number of classes (see Figure 6 in the appendix). We also observed sharp threshold transitions in classification performance as the key parameters of the model are varied (see Figures 3 & 4).

Figure 2: Assemblies learned for various concept classes. On the top two lines, we show assemblies learned for stimulus classes, and on the bottom two lines, for a linear threshold with margin. In (a) & (c) we exhibit the distribution of firing probabilities over neurons of the learning area. In (b) & (d) we show the average overlap of different input samples (red square) and the overlaps of the corresponding representations in the assemblies (blue square). Using a simple sum readout over assembly neurons, both stimulus classes and linear thresholds are classified with 100% accuracy. Here, $n=10^{3},k=10^{2},p=0.1,r=0.9,q=0.1,\Delta=1.0$, with 5 samples per class, and $\beta=0.01$ (stimulus classes) and $\beta=1.0$ (linear threshold).
Figure 3: Mean (dark line) and range (shaded area) of classification accuracy for two stimulus classes (left) and a fixed linear threshold (right) over 20 trials, as the classes become more separable. For stimulus classes, we vary the firing probability of neurons in the stimulus core while fixing the probability for the rest at $k/n$, while for the linear threshold, we vary the margin. For both we used 5 training examples with $n=1000,k=100,p=0.1$, and $\beta=0.1$ (stimulus classes), $\beta=1.0$ (linear threshold).
Figure 4: Mean (dark line) and range (shaded area) of classification accuracy of two stimulus classes for various values of the number of neurons ($n$, left) and the cap size ($k$, right). For variable $n$, we let $k=n/10$; for variable $k$, we fix $n=1000$. Other parameters are fixed at $p=0.1,r=0.9,q=k/n$, and $\beta=0.1$.

There are a number of possible extensions to the simplest strategy, where within a single brain region we learn an assembly for each concept class and classify based on which assembly is most activated in response to an example. We compared the performance of various classification models on MNIST as the number of features increases. The high-level model is to extract a certain number of features using one of five different methods, and then find the best linear classifier (on the training data) over these features to measure performance (on the test data). The five feature extractors are:

  • Linear features. Each feature’s weights are sampled i.i.d. from a Gaussian with standard deviation $0.1$.

  • Nonlinear features. Each feature is a binary neuron: it has $784$ i.i.d. Bernoulli$(0.2)$ weights, and ‘fires’ (has output $1$, otherwise $0$) if its total input exceeds the expected input ($70\times 0.2$).

  • Large-area assembly features. In a single brain area of size $m$ with cap size $m/10$, we attempt to form an assembly for each class. The area sees a sequence of $5$ examples from each class, with homeostasis applied after each class. Weights are updated according to Hebbian plasticity with $\beta=1.0$. Additionally, we apply a negative bias: a neuron which has fired for a given class is heavily penalized against firing for subsequent classes.

  • ‘Random’ assembly features. For a total of $m$ features, we create $m/100$ different areas of $100$ neurons each, with cap size $10$. We then repeat the large-area training procedure above in each area, with the order of presentation of the classes randomized for each area.

  • ‘Split’ assembly features. For a total of $m$ features, we create $10$ different areas of $m/10$ neurons each, with cap size $m/100$. Area $i$ sees a sequence of $5$ examples from class $i$. Weights are updated according to Hebbian plasticity, and homeostasis is applied after training.

After extracting features, we train the linear classification layer to minimize cross-entropy loss on the standard MNIST training set ($60000$ images) and finally test on the full test set ($10000$ images).
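As one example, the nonlinear (binary random feature) baseline can be sketched as follows (illustrative Python; `X` is an array of flattened MNIST images with rows of length $784$, and any off-the-shelf multinomial logistic regression can serve as the linear layer on top):

```python
import numpy as np

def binary_random_features(X, m, rng, p_w=0.2):
    """m binary neurons with i.i.d. Bernoulli(p_w) weights over 784 pixels;
    each fires iff its total input exceeds the expected input 70 * p_w."""
    W = (rng.random((784, m)) < p_w).astype(float)
    return (X @ W > 70 * p_w).astype(float)
```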

Figure 5: MNIST test accuracy as the number of features increases, for various classification models. ‘Split’ assembly features, which form an assembly for class $i$ in area $i$, achieve the highest accuracy with the largest number of features.

The results as the total number of features ranges from $1000$ to $10000$ are shown in Fig. 5. ‘Split’ assembly features are ultimately the best of the five, achieving $96\%$ accuracy with $10000$ features. However, nonlinear features outperform ‘split’ and large-area features, and match ‘random’ assembly features, when the number of features is less than $8000$. For reference, the linear classifier reaches $89\%$, while a two-layer neural network of width $800$ trained end-to-end reaches $98.4\%$.

Going further, one could even create a hierarchy of brain areas, so that the areas in the first “layer” all project to a higher-level area, in hopes of forming assemblies for each digit in the higher-level area which are more robust. In this paper, our goal was to highlight the potential to form useful representations of a classification dataset using assemblies, and so we concentrated on a single layer of brain areas with a very simple classification layer on top. It will be interesting to explore what is possible with more complex architectures.

5 Discussion

Assemblies are widely believed to be involved in cognitive phenomena, and the AC provides evidence of their computational aptitude. Here we have made the first steps towards understanding how learning can happen in assemblies. Normally, an assembly is associated with a stimulus, such as Grandma. We have shown that this can be extended to a distribution over stimuli. Furthermore, for a wide range of model parameters, distinct assemblies can be formed for multiple stimulus classes in a single brain area, so long as the classes are reasonably differentiated.

A model of the brain at this level of abstraction should allow for the kind of classification that the brain does effortlessly — e.g., the mechanism that enables us to understand that individual frames in a video of an object depict the same object. With this in mind, the learning algorithm we present is remarkably parsimonious: it generalizes from a handful of examples which are seen only once, and requires no outside control or supervision other than ensuring multiple samples from the same concept class are presented in succession (and this latter requirement could be relaxed in a more complex architecture which channels stimuli from different classes). Finally, even though our results are framed within the Assembly Calculus and the underlying brain model, we note that they have implications far beyond this realm. In particular, they suggest that any recurrent neural network, equipped with the mechanisms of plasticity and inhibition, will naturally form an assembly-like group of neurons to represent similar patterns of stimuli.

But of course, many questions remain. In this first step we considered a single brain area — whereas it is known that assemblies draw their computational power from the interaction, through the AC, among many areas. We believe that a more general architecture encompassing a hierarchy of interconnected brain areas, where the assemblies in one area act like stimulus classes for others, can succeed in learning more complex tasks — and even within a single brain area improvements can result from optimizing the various parameters, something that we have not tried yet.

In another direction, here we only considered Hebbian plasticity, the simplest and most well-understood mechanism for synaptic changes. Evidence is mounting in experimental neuroscience that the range of plasticity mechanisms is far more diverse (Magee and Grienberger, 2020), and in fact it has been demonstrated recently (Payeur et al., 2021) that more complex rules are sufficient to learn harder tasks. Which plasticity rules make learning by assemblies more powerful?

We showed that assemblies can learn nonnegative linear threshold functions with sufficiently large margins. Experimental results suggest that the requirement of nonnegativity is a limitation of our proof technique, as empirically assemblies readily learn arbitrary linear threshold functions (with margin). What other concept classes can assemblies provably learn? We know from support vector machines that linear threshold functions can be the basis of far more sophisticated learning when their input is pre-processed in specific ways, while the celebrated results of Rahimi and Recht (2007) demonstrated that certain families of random nonlinear features can approximate sophisticated kernels quite well. What would constitute a kernel in the context of assemblies? The sensory areas of the cortex (of which the visual cortex is the best studied example) do pre-process sensory inputs extracting features such as edges, colors, and motions. Presumably learning by the non-sensory brain — which is our focus here — operates on the output of such pre-processing. We believe that studying the implementation of kernels in cortex is a very promising direction for discovering powerful learning mechanisms in the brain based on assemblies.

\acks

We thank Shivam Garg, Chris Jung, and Mirabel Reid for helpful discussions. MD is supported by an NSF Graduate Research Fellowship. SV is supported in part by NSF awards CCF-1909756, CCF-2007443 and CCF-2134105. CP is supported by NSF Awards CCF-1763970 and CCF-1910700, and by a research contract with Softbank.

References

  • Axel (2018) Richard Axel. Q & A. Neuron, 99:1110–1112, 2018.
  • Binas et al. (2014) Jonathan Binas, Ueli Rutishauser, Giacomo Indiveri, and Michael Pfeiffer. Learning and stabilization of winner-take-all dynamics through interacting excitatory and inhibitory plasticity. Frontiers in computational neuroscience, 8:68, 2014.
  • Buzsáki (2010) György Buzsáki. Neural syntax: cell assemblies, synapsembles, and readers. Neuron, 68(3), 2010.
  • Buzsáki (2019) György Buzsáki. The Brain from Inside Out. Oxford University Press, 2019.
  • d’Amore et al. (2021) Francesco d’Amore, Daniel Mitropolsky, Pierluigi Crescenzi, Emanuele Natale, and Christos H Papadimitriou. Planning with biological neurons and synapses. arXiv preprint arXiv:2112.08186, 2021.
  • Davis (2006) Graeme W Davis. Homeostatic control of neural activity: from phenomenology to molecular design. Annu. Rev. Neurosci., 29:307–323, 2006.
  • Erdős and Rényi (1960) Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
  • Feldman and Valiant (2009) Vitaly Feldman and Leslie G. Valiant. Experience-induced neural circuits that achieve high capacity. Neural Computation, 21(10):2715–2754, 2009. 10.1162/neco.2009.08-08-851.
  • Guerguiev et al. (2017) Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. ELife, 6:e22901, 2017.
  • Harris and Thiele (2011) Kenneth D Harris and Alexander Thiele. Cortical state and attention. Nature reviews neuroscience, 12(9):509–523, 2011.
  • Hebb (1949) Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Wiley, New York, 1949.
  • Hopfield (1982) John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
  • Jones (2003) Barbara E Jones. Arousal systems. Front Biosci, 8(5):438–51, 2003.
  • Legenstein et al. (2018) R. Legenstein, W. Maass, C. H. Papadimitriou, and S. S. Vempala. Long-term memory and the densest k-subgraph problem. In Proc. of 9th Innovations in Theoretical Computer Science (ITCS) conference, Cambridge, USA, Jan 11-14. 2018, 2018.
  • Legenstein (2018) Robert A Legenstein. Long term memory and the densest k-subgraph problem. In 9th Innovations in Theoretical Computer Science Conference, 2018.
  • Lillicrap et al. (2016) Timothy P. Lillicrap, Daniel Cownden, Douglas Blair Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. In Nature communications, 2016.
  • Lillicrap et al. (2020) Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. Backpropagation and the brain. Nature Reviews Neuroscience, pages 1–12, 2020.
  • Magee and Grienberger (2020) Jeffrey C Magee and Christine Grienberger. Synaptic plasticity forms and functions. Annual review of neuroscience, 43:95–117, 2020.
  • Mitropolsky et al. (2021) Daniel Mitropolsky, Michael J Collins, and Christos H Papadimitriou. A biologically plausible parser. To appear in TACL, 2021.
  • Papadimitriou and Vempala (2019) Christos H Papadimitriou and Santosh S Vempala. Random projection in the brain and computation with assemblies of neurons. In 10th Innovations in Theoretical Computer Science Conference, 2019.
  • Papadimitriou et al. (2020) Christos H Papadimitriou, Santosh S Vempala, Daniel Mitropolsky, Michael Collins, and Wolfgang Maass. Brain computation by assemblies of neurons. Proceedings of the National Academy of Sciences, 117(25):14464–14472, 2020.
  • Payeur et al. (2021) Alexandre Payeur, Jordan Guerguiev, Friedemann Zenke, Blake A Richards, and Richard Naud. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nature neuroscience, pages 1–10, 2021.
  • Piantadosi et al. (2012) Steven T Piantadosi, Joshua B Tenenbaum, and Noah D Goodman. Bootstrapping in a language of thought: A formal model of numerical concept learning. Cognition, 123(2):199–217, 2012.
  • Piantadosi et al. (2016) Steven T Piantadosi, Joshua B Tenenbaum, and Noah D Goodman. The logical primitives of thought: Empirical foundations for compositional cognitive models. Psychological review, 123(4):392, 2016.
  • Quiroga (2016) Rodrigo Quian Quiroga. Neuronal codes for visual perception and memory. Neuropsychologia, 83:227–241, 2016.
  • Rahimi and Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007.
  • Rangamani and Gandhi (2020) Akshay Rangamani and A Gandhi. Supervised learning with brain assemblies. Preprint, private communication, 2020.
  • Sacramento et al. (2017) Joao Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic error backpropagation in deep cortical microcircuits. arXiv preprint arXiv:1801.00062, 2017.
  • Sacramento et al. (2018) João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. Advances in Neural Information Processing Systems, 31:8721–8732, 2018.
  • Sherrington (1906) Charles Scott Sherrington. The Integrative Action of the Nervous System, volume 2. Yale University Press, 1906.
  • Turrigiano (2011) Gina Turrigiano. Too many cooks? intrinsic and synaptic homeostatic mechanisms in cortical circuit refinement. Annual review of neuroscience, 34:89–103, 2011.
  • Valiant (1994) Leslie G. Valiant. Circuits of the mind. Oxford University Press, 1994. ISBN 978-0-19-508926-4.
  • Valiant (2000) Leslie G. Valiant. A neuroidal architecture for cognitive computation. J. ACM, 47(5):854–882, 2000. 10.1145/355483.355486.
  • Whittington and Bogacz (2019) James CR Whittington and Rafal Bogacz. Theories of error back-propagation in the brain. Trends in cognitive sciences, 23(3):235–250, 2019.

Appendix: Further experimental results

Figure 6: Assemblies learned for four stimulus classes. On the left, the distributions of firing probabilities over neurons; on the right, the average overlap of the assemblies. Each additional class overlaps with previous ones, yet a simple readout over assembly neurons allows for perfect classification accuracy. Here, $n=10^{3},k=10^{2},p=0.1,r=0.9,q=0.1,\beta=0.1$, with 5 samples per class.
Figure 7: Assemblies formed during training. Each row is the response of the neural population to examples from the respective class. At each step, a new example from the appropriate class is presented. Due to plasticity, a core set emerges for the class after only a few rounds. (Here, five rounds are shown.) Inputs are drawn from two stimulus classes, with $n=10^{3},k=10^{2},p=0.1,r=0.9,q=0.1$ and $\beta=0.1$.

Appendix: Proofs

Preliminaries

We will need a few lemmas. The first is the well-known Berry-Esseen theorem:

Lemma .1.

Let $X_{1},\ldots,X_{n}$ be independent random variables with $\mathbb{E}[X_{i}]=0$, $\mathbb{E}[X_{i}^{2}]=\sigma_{i}^{2}$, and $\mathbb{E}[|X_{i}|^{3}]=\rho_{i}<\infty$. Let

S=\left(\sum_{i=1}^{n}\sigma_{i}^{2}\right)^{-1/2}\left(\sum_{i=1}^{n}X_{i}\right)

Denote by $F_{S}$ the CDF of $S$, and by $\Phi$ the standard normal CDF. There exists some constant $C$ such that

\sup_{x\in\mathbb{R}}|F_{S}(x)-\Phi(x)|\leq C\left(\sum_{i=1}^{n}\sigma_{i}^{2}\right)^{-1/2}\max_{i}\frac{\rho_{i}}{\sigma_{i}^{2}}

This implies the following:

Lemma .2.

Let $X_{1},\ldots,X_{2n}$ denote the weights of the edges incoming to a neuron in the brain area from its neighbors (i.e. $X_{i}=w_{i}$ w.p. $p$, and $0$ otherwise). Denote their sum by $S=\sum_{i}X_{i}$, and consider the normal random variable

Y\sim\mathcal{N}(\mathbb{E}[S],\var[S])

Then for $p=o(1)$ and $w_{i}=o(\sqrt{np})$,

\sup_{t\in\mathbb{R}}|\Pr(S<t)-\Pr(Y<t)|=o(1)
Proof .3.

Let $\tilde{X}_{i}=X_{i}-\mathbb{E}[X_{i}]$. We have

\mathbb{E}[\tilde{X}_{i}^{2}]=p(1-p)^{2}w_{i}^{2}+(1-p)p^{2}w_{i}^{2}=p(1-p)w_{i}^{2}

and

\mathbb{E}[|\tilde{X}_{i}|^{3}]=p(1-p)^{3}w_{i}^{3}+(1-p)p^{3}w_{i}^{3}\leq p(1-p)w_{i}^{3}

Let

\tilde{S}=\frac{S-\mathbb{E}[S]}{\sqrt{\var[S]}}=\left(\sum_{i=1}^{2n}\mathbb{E}[\tilde{X}_{i}^{2}]\right)^{-1/2}\left(\sum_{i=1}^{2n}\tilde{X}_{i}\right)

Then by Lemma .1, there exists some constant $C$ so that

\sup_{x\in\mathbb{R}}|F_{\tilde{S}}(x)-\Phi(x)|\leq C\left(\sum_{i=1}^{2n}\mathbb{E}[\tilde{X}_{i}^{2}]\right)^{-1/2}\max_{i}\frac{\mathbb{E}[|\tilde{X}_{i}|^{3}]}{\mathbb{E}[\tilde{X}_{i}^{2}]}

As $1\leq w_{i}=o(\sqrt{np})$, it follows that

\sum_{i=1}^{2n}\mathbb{E}[\tilde{X}_{i}^{2}]=p(1-p)\sum_{i=1}^{2n}w_{i}^{2}\geq 2p(1-p)n

and

\frac{\mathbb{E}[|\tilde{X}_{i}|^{3}]}{\mathbb{E}[\tilde{X}_{i}^{2}]}\leq w_{i}=o(\sqrt{np})

Hence,

\sup_{x\in\mathbb{R}}|F_{\tilde{S}}(x)-\Phi(x)|\leq C\frac{\max_{i}w_{i}}{\sqrt{2p(1-p)n}}=o\left(\frac{1}{\sqrt{1-p}}\right)=o(1)

for $p=o(1)$. Noting that

\Pr(S<t)=\Pr\left(\tilde{S}<\frac{t-\mathbb{E}[S]}{\sqrt{\var[S]}}\right)=F_{\tilde{S}}\left(\frac{t-\mathbb{E}[S]}{\sqrt{\var[S]}}\right)

and using the affine property of normal random variables completes the proof.

In particular, we may substitute appropriate Gaussian tail bounds (such as the one provided in the following lemma) for tail bounds on sums of independent weighted Bernoullis throughout.

Lemma .4.

For $X\sim\mathcal{N}(0,1)$ and $t>0$,

\Pr(X>t)\leq\frac{1}{\sqrt{2\pi}\,t}e^{-t^{2}/2}

For $X\sim\mathcal{N}(\mu,\sigma^{2})$ and $\Pr(X>t)=p$, we have

t=\mu+\sigma\sqrt{2\ln(1/p)+\ln(2\ln(1/p))}+o(1)
Proof .5.

Recall that

\Pr(X>t)=\int_{t}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\tau^{2}/2}\,\mathrm{d}\tau

Then observing that for $\tau\geq t$,

e^{-\tau^{2}/2}\leq\frac{\tau}{t}e^{-\tau^{2}/2}

we have

\Pr(X>t)\leq\frac{1}{\sqrt{2\pi}\,t}\int_{t}^{\infty}\tau e^{-\tau^{2}/2}\,\mathrm{d}\tau=\frac{1}{\sqrt{2\pi}\,t}e^{-t^{2}/2}

For the second part, simply solve for $t$.

Next, we will use the distribution of a normal random variable, conditioned on its sum with another normal random variable.

Lemma .6.

For $X\sim\mathcal{N}(\mu_{x},\sigma_{x}^{2})$, $Y\sim\mathcal{N}(\mu_{y},\sigma_{y}^{2})$, and $Z=X+Y$, conditioning $X$ on $Z$ gives

X|(Z=z)\sim\mathcal{N}\left(\frac{\sigma_{x}^{2}}{\sigma_{x}^{2}+\sigma_{y}^{2}}z+\frac{\sigma_{y}^{2}\mu_{x}-\sigma_{x}^{2}\mu_{y}}{\sigma_{x}^{2}+\sigma_{y}^{2}},\;\frac{\sigma_{x}^{2}\sigma_{y}^{2}}{\sigma_{x}^{2}+\sigma_{y}^{2}}\right)
Proof .7.

Bayes’ theorem provides

f_{X|Z=z}(x,z)=\frac{f_{Z|X=x}(z,x)f_{X}(x)}{f_{Z}(z)}

where $f_{W}$ is the probability density function for variable $W$. From the definition, $f_{Z|X=x}(z,x)=f_{Y}(z-x)$. Then substituting the Gaussian probability density function and simplifying, we have

\begin{align*}
f_{X|Z=z}(x,z) &= \frac{\frac{1}{\sqrt{2\pi\sigma_{y}^{2}}}\exp\left(-\frac{(z-x-\mu_{y})^{2}}{2\sigma_{y}^{2}}\right)\cdot\frac{1}{\sqrt{2\pi\sigma_{x}^{2}}}\exp\left(-\frac{(x-\mu_{x})^{2}}{2\sigma_{x}^{2}}\right)}{\frac{1}{\sqrt{2\pi(\sigma_{x}^{2}+\sigma_{y}^{2})}}\exp\left(-\frac{(z-(\mu_{x}+\mu_{y}))^{2}}{2(\sigma_{x}^{2}+\sigma_{y}^{2})}\right)}\\
&= \frac{1}{\sqrt{2\pi\frac{\sigma_{x}^{2}\sigma_{y}^{2}}{\sigma_{x}^{2}+\sigma_{y}^{2}}}}\exp\left(-\frac{1}{2\frac{\sigma_{x}^{2}\sigma_{y}^{2}}{\sigma_{x}^{2}+\sigma_{y}^{2}}}\left(x-\left(\frac{\sigma_{x}^{2}}{\sigma_{x}^{2}+\sigma_{y}^{2}}z+\frac{\sigma_{y}^{2}\mu_{x}-\sigma_{x}^{2}\mu_{y}}{\sigma_{x}^{2}+\sigma_{y}^{2}}\right)\right)^{2}\right)
\end{align*}

The following is the distribution of a binomial variable XX, given that we know the value of another binomial variable YY which uses XX as its number of trials.

Lemma .8.

Denote by $\mathcal{B}(n,p)$ the binomial distribution over $n$ trials with probability of success $p$. Let $X\sim\mathcal{B}(n,p)$ and $Y|X\sim\mathcal{B}(X,q)$. Then

X|Y\sim Y+\mathcal{B}\left(n-Y,\tfrac{p(1-q)}{1-pq}\right)
Proof .9.

Via Bayes’ rule,

\Pr(X=x|Y=y)=\frac{\Pr(Y=y|X=x)\Pr(X=x)}{\Pr(Y=y)}

It is well-known that $Y\sim\mathcal{B}(n,pq)$. Hence, using the formulae for the distributions and simplifying,

\begin{align*}
\Pr(X=x|Y=y) &= \frac{\binom{x}{y}q^{y}(1-q)^{x-y}\binom{n}{x}p^{x}(1-p)^{n-x}}{\binom{n}{y}(pq)^{y}(1-pq)^{n-y}}\\
&= \binom{n-y}{x-y}\frac{(p(1-q))^{x-y}(1-p)^{n-x}}{(1-pq)^{n-y}}\\
&= \binom{n-y}{x-y}\left(\frac{p(1-q)}{1-pq}\right)^{x-y}\left(\frac{1-p}{1-pq}\right)^{n-x}\\
&= \binom{n-y}{x-y}\left(\frac{p(1-q)}{1-pq}\right)^{x-y}\left(1-\frac{p(1-q)}{1-pq}\right)^{n-x}
\end{align*}

Note that for $Z\sim\mathcal{B}(n-y,\tfrac{p(1-q)}{1-pq})$, we have

\Pr(X=x|Y=y)=\Pr(Z=x-y)

and so $X|Y\sim Y+Z$.

The next observation is useful: exponentiating a random variable with a base close to one yields a highly concentrated variable.

Lemma .10.

Let X𝒩(μx,σx2)X\sim\mathcal{N}(\mu_{x},\sigma_{x}^{2}) be a normal variable, and let Y=(1+β)XY=(1+\beta)^{X}. Then YY is lognormal with {align*} E(Y) &= (1+β)^μ_x(1+β)^ln(1+β)σ_x^2/2
\varY = ((1+β)^
ln(1+β)σ_x^2 - 1)(1+β)^2μ_x(1+β)^ln(1+β)σ_x^2

In particular, for $\ln(1+\beta)\sigma_{x}^{2}$ close to 0, $Y$ is highly concentrated at $(1+\beta)^{\mu_{x}}$.

Proof .11.

Observe that $Y=e^{\ln(1+\beta)X}$, so it is clearly lognormal. So, define $\tilde{X}=\ln(1+\beta)X\sim\mathcal{N}(\ln(1+\beta)\mu_{x},\ln(1+\beta)^{2}\sigma_{x}^{2})$. Then, by the standard formulae for the moments of a lognormal variable, we have
\begin{align*}
\mathbb{E}(Y) &= \exp\left(\mathbb{E}(\tilde{X})+\tfrac{1}{2}\var\tilde{X}\right)\\
&= \exp\left(\ln(1+\beta)\mu_{x}+\tfrac{1}{2}\ln(1+\beta)^{2}\sigma_{x}^{2}\right)\\
&= (1+\beta)^{\mu_{x}}(1+\beta)^{\ln(1+\beta)\sigma_{x}^{2}/2}
\end{align*}
and
\begin{align*}
\var Y &= \left(\exp(\var\tilde{X})-1\right)\exp\left(2\mathbb{E}(\tilde{X})+\var\tilde{X}\right)\\
&= \left(\exp(\ln(1+\beta)^{2}\sigma_{x}^{2})-1\right)\exp\left(2\ln(1+\beta)\mu_{x}+\ln(1+\beta)^{2}\sigma_{x}^{2}\right)\\
&= \left((1+\beta)^{\ln(1+\beta)\sigma_{x}^{2}}-1\right)(1+\beta)^{2\mu_{x}}(1+\beta)^{\ln(1+\beta)\sigma_{x}^{2}}
\end{align*}
Furthermore, if $\ln(1+\beta)\sigma_{x}^{2}\approx 0$, then $(1+\beta)^{\ln(1+\beta)\sigma_{x}^{2}}\approx 1$ and we obtain the concentration.
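The concentration claim is easy to see numerically; the sketch below (with arbitrary parameters) shows the relative spread of $Y$ around $(1+\beta)^{\mu_{x}}$ shrinking as $\ln(1+\beta)\sigma_{x}^{2}$ approaches 0:

# Sketch: the relative standard deviation of Y = (1+beta)^X shrinks as
# ln(1+beta) * sigma_x^2 -> 0, illustrating the concentration in Lemma .10.
import numpy as np

rng = np.random.default_rng(2)
mu_x, sigma_x = 50.0, 5.0
for beta in [0.5, 0.1, 0.01]:
    X = rng.normal(mu_x, sigma_x, 1_000_000)
    Y = (1 + beta) ** X
    center = (1 + beta) ** mu_x
    print(f"beta = {beta}: ln(1+beta)*sigma^2 = {np.log(1 + beta) * sigma_x**2:.3f}, "
          f"std(Y)/center = {Y.std() / center:.3f}")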

For learning a linear threshold function with an assembly (Theorem 3.6), we will require an additional lemma.

Lemma .12.

Let $X_{1},\ldots,X_{n}$ and $Y_{1},\ldots,Y_{n}$ be independent Bernoulli variables, and let $Z=\sum_{i=1}^{n}X_{i}Y_{i}$. Then for any $t\geq 0$,

\mathbb{E}(Y_{i}|Z\geq\mathbb{E}(Z)+t)\geq\mathbb{E}(Y_{i})
Proof .13.

Bayes’ rule gives

\mathbb{E}(Y_{i}|Z\geq\mathbb{E}(Z)+t)=\Pr(Y_{i}=1|Z\geq\mathbb{E}(Z)+t)=\frac{\Pr(Z\geq\mathbb{E}(Z)+t|Y_{i}=1)\Pr(Y_{i}=1)}{\Pr(Z\geq\mathbb{E}(Z)+t)}

Then observe that the events $Z\geq\mathbb{E}Z+t$ and $Y_{i}=1$ are positively correlated, as $Z$ is a nondecreasing function of $Y_{i}$, and so

\Pr(Z\geq\mathbb{E}Z+t|Y_{i}=1)\geq\Pr(Z\geq\mathbb{E}Z+t)

Then substituting gives
\begin{align*}
\mathbb{E}(Y_{i}|Z\geq\mathbb{E}(Z)+t) &\geq \frac{\Pr(Z\geq\mathbb{E}(Z)+t)\Pr(Y_{i}=1)}{\Pr(Z\geq\mathbb{E}(Z)+t)}\\
&= \mathbb{E}(Y_{i})
\end{align*}
as required.
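A small simulation makes Lemma .12 concrete (a sketch; the Bernoulli parameters and the threshold $t$ are arbitrary):

# Sketch: conditioning on Z = sum_i X_i Y_i being above its mean can only
# increase the conditional expectation of each Y_i (Lemma .12).
import numpy as np

rng = np.random.default_rng(3)
trials, n = 500_000, 50
X = rng.random((trials, n)) < 0.5                # X_i ~ Bernoulli(0.5)
Y = rng.random((trials, n)) < 0.3                # Y_i ~ Bernoulli(0.3)
Z = (X & Y).sum(axis=1)
high = Z >= Z.mean() + 2                         # the event {Z >= E[Z] + t}, t = 2
print("E[Y_1]             =", Y[:, 0].mean())
print("E[Y_1 | Z >= EZ+t] =", Y[high, 0].mean())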

Lastly, the following lemma allows us to translate a desired bound on the weight of certain synapses into a bound on the number of rounds (or samples) required.

Lemma .14.

Consider a neuron $i$, connected by a synapse of initial weight 1 to a neuron $j$, and equipped with a plasticity parameter $\beta$. Assume that $j$ fires with probability $p$ and $i$ fires with probability $q$ on each round, and that there are at least $T$ rounds, with

T\geq\frac{1}{pq}\cdot\frac{\ln\gamma}{\ln(1+\beta)}

Then the synapse will have weight at least $\gamma$ in expectation. (Indeed, the expected number of rounds on which both neurons fire is $pqT$, and by Jensen's inequality the expected weight is at least $(1+\beta)^{pqT}\geq\gamma$.)
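For intuition about the scales involved, the bound of Lemma .14 is easy to evaluate; a sketch with illustrative (not paper-specific) parameters:

# Sketch: number of rounds T after which the expected synaptic weight
# reaches gamma, per Lemma .14: T >= ln(gamma) / (p * q * ln(1 + beta)).
import math

def rounds_needed(p: float, q: float, beta: float, gamma: float) -> int:
    return math.ceil(math.log(gamma) / (p * q * math.log(1 + beta)))

print(rounds_needed(p=0.1, q=0.5, beta=0.1, gamma=4.0))   # -> 291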

We are now equipped to prove the theorems.

Proof of Theorem 3.1

Let $\mu_{t}$ be the fraction of first-timers in the cap on round $t$. The process has stabilized when $\mu_{t}<1/k$, as then no new neurons have entered the cap.

For a given neuron $i$, let $X(t)$ and $Y(t)$ denote the input from connections to the $k$ neurons in $S_{A}$ and the $n-k$ neurons outside of $S_{A}$, respectively, on round $t$. For a neuron which has never fired before, these are distributed approximately as

X(t)\sim\mathcal{N}(kpr,kpr)\qquad Y(t)\sim\mathcal{N}(kpq,kpq)

which follows from Lemma .2, for a total input of $X(t)+Y(t)\sim\mathcal{N}(kpr+kpq,kpr+kpq)$. (Note that we ignore small second-order terms in the variance.) To determine which neurons will make the cap on the first round, we need a threshold that roughly $k$ of $n$ draws from $X(1)+Y(1)$ will exceed, with constant probability. In other words, we need the probability that $X(1)+Y(1)$ exceeds this threshold to be about $k/n$. Taking $L=2\ln(n/k)$ and using the tail bound in Lemma .4, we find the threshold for the first cap to be at least

C_{1}=kp(r+q)+\sqrt{kp(r+q)L}
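To make the threshold concrete, the following sketch (with arbitrary illustrative values of $n,k,p,r,q$) computes $C_{1}$ and counts how many of $n$ simulated inputs exceed it; by the choice of $L$, the surviving fraction is of order $k/n$, up to the lower-order $1/(\sqrt{2\pi}t)$ factor in the tail bound of Lemma .4:

# Sketch: compute the first-round threshold C_1 = kp(r+q) + sqrt(kp(r+q)L)
# and check that the fraction of inputs exceeding it is of order k/n.
import numpy as np

rng = np.random.default_rng(4)
n, k, p, r, q = 100_000, 1_000, 0.01, 0.9, 0.1
L = 2 * np.log(n / k)
mean = k * p * (r + q)
C1 = mean + np.sqrt(mean * L)
inputs = rng.normal(mean, np.sqrt(mean), n)      # X(1) + Y(1) ~ N(mean, mean)
print(f"C_1 = {C1:.2f}, neurons above threshold: {(inputs > C1).sum()} (k = {k})")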

On subsequent rounds, there is additional input from connections to the previous cap, distributed as $\mathcal{N}(kp,kp(1-p))$. Using $\mu_{t}$ as the fraction of first-timers, a first-time neuron must be in the top $\mu_{t}k$ of the $n-k\sim n$ neurons left out of the previous cap. The activation threshold is thus

C_{t}=kp(1+r)+kpq+\sqrt{kp(1+r+q)(L+2\ln(1/\mu_{t}))}

Now consider a neuron $i$ which fired on the first round. We know that $X(1)+Y(1)\geq C_{1}$, so using Lemma .6,

X(1)|(X(1)+Y(1)=C_{1})\sim\mathcal{N}\left(\frac{r}{r+q}C_{1},kp\frac{rq}{r+q}\right)

If $X(1)=x$, Lemma .8 indicates that the true number of connections with stimulus neurons is distributed roughly as $\tilde{X}|(X(1)=x)\sim\mathcal{N}(x+(k-x)p(1-r),(k-x)p(1-r))$. Conditioning on $X(1)+Y(1)=C_{1}$, ignoring second-order terms, and bounding the variance by $kp$, we have

\tilde{X}|(X(1)+Y(1)=C_{1})\sim\mathcal{N}\left(kp(1-r)+\frac{r}{r+q}C_{1},kp\right)

On the second round, the synapses between neuron $i$ and the stimulus neurons which fired have had their weights increased by a factor of $1+\beta$, and these stimulus neurons will fire again on the second round with probability $r$. An additional $\tilde{X}-X(1)$ stimulus neurons have a chance to fire for the first time. Neuron $i$ also receives recurrent input from the $k$ other neurons which fired on the previous round, to each of which it is connected with probability $p$. So, the total input to neuron $i$ is roughly

\mathcal{N}\left((1+\beta)\frac{r^{2}}{r+q}C_{1}+kpr(1-r),kp\left(1+\frac{rq}{r+q}\right)\right)+\mathcal{N}(kp(1+q),kp(1+q))

In order for $i$ to make the second cap, its input must exceed the threshold for first-timers, i.e.

(1+\beta)\frac{r^{2}}{r+q}C_{1}+kp(1+r(1-r)+q)+Z\geq C_{2}

where $Z\sim\mathcal{N}(0,kp(1+r+q))$. Taking $\mu=\mu_{2}$, we have the following:
\begin{align*}
\Pr(i\in C_{2}\,|\,i\in C_{1}) &= 1-\mu\\
&\geq \Pr\left(Z\geq C_{2}-(1+\beta)\frac{r^{2}}{r+q}C_{1}-kp(1+r(1-r)+q)\right)\\
&\geq \Pr\left(Z\geq -\beta kpr^{2}-(1+\beta)\frac{r^{2}}{\sqrt{r+q}}\sqrt{kpL}+\sqrt{kp(1+r+q)(L+2\ln(1/\mu))}\right)
\end{align*}
Now, normalizing $Z$ to $\mathcal{N}(0,1)$ we have (again by the tail bound)

1-\mu\geq 1-\exp\left(-\frac{\left(\beta\sqrt{kp}r^{2}+(1+\beta)\frac{r^{2}}{\sqrt{r+q}}\sqrt{L}-\sqrt{(1+r+q)(L+2\ln(1/\mu))}\right)^{2}}{2(1+r+q)}\right)

Rearranging, this means

\sqrt{2(1+r+q)\ln(1/\mu)}+\sqrt{(1+r+q)(L+2\ln(1/\mu))}\leq\beta\sqrt{kp}r^{2}+(1+\beta)\frac{r^{2}}{\sqrt{r+q}}\sqrt{L}

Then taking

\beta\geq\beta_{0}=\frac{\sqrt{r+q}}{r^{2}}\cdot\frac{\left(\sqrt{1+r+q}-\frac{r^{2}}{\sqrt{r+q}}\right)\sqrt{L}+\sqrt{2(1+r+q)}}{\sqrt{kp}+\sqrt{L}}

gives $\mu\leq 1/e$, i.e. the overlap between the first two caps is at least a $1-1/e$ fraction.

Now, we seek to show that the probability of a neuron leaving the cap drops off exponentially in the number of consecutive rounds it has been included. Suppose that neuron $i$ makes it into the first cap and stays for $t$ consecutive caps. Each of its connections with a stimulus neuron is strengthened according to the number of times that stimulus neuron fired, roughly $\mathcal{N}(tr,tr(1-r))$ times. Using Lemma .10, the weight of the connection with a stimulus neuron is highly concentrated around $(1+\beta)^{tr}$. Furthermore, we know that $i$ has at least $\tilde{X}|(X(1)+Y(1)=C_{1})\sim\mathcal{N}\left(kp(1-r)+\frac{r}{r+q}C_{1},kp\right)$ such connections, of which $\mathcal{N}\left(kpr(1-r)+\frac{r^{2}}{r+q}C_{1},kpr\right)$ will fire. So, the input to neuron $i$ will be at least

(1+\beta)^{tr}\left(kpr(1-r)+\frac{r^{2}}{r+q}C_{1}\right)+kp(1+q)+Z

where $Z\sim\mathcal{N}(0,kp(1+(1+\beta)^{2tr}r+q))$.

To stay in the $(t+1)$th cap, it suffices that this input is greater than $C_{t+1}$, the threshold for first-timers. Using $\mu=\mu_{t+1}$ and reasoning as before:
\begin{align*}
\Pr(i\in C_{t+1}\,|\,i\in C_{1}\cap\ldots\cap C_{t}) &= 1-\mu\\
&\geq \Pr\left(Z>C_{t+1}-(1+\beta)^{tr}\left(kpr(1-r)+\frac{r^{2}}{r+q}C_{1}\right)-kp(1+q)\right)\\
&= \Pr\left(Z>-t\beta kpr^{2}-(1+tr\beta)\frac{r^{2}}{\sqrt{r+q}}\sqrt{kpL}+\sqrt{kp(1+r+q)(L+2\ln(1/\mu))}\right)\\
&\geq 1-\exp\left(-\frac{\left(t\beta r^{2}\sqrt{kp}+(1+tr\beta)\frac{r^{2}}{\sqrt{r+q}}\sqrt{L}-\sqrt{(1+r+q)(L+2\ln(1/\mu))}\right)^{2}}{2(1+r+q)}\right)
\end{align*}
where in the last step we approximately normalized $Z$ to $\mathcal{N}(0,1)$. Then

\beta\geq\frac{1}{tr^{2}}\cdot\frac{\sqrt{(1+r+q)(L+2t^{2})}-r^{2}\sqrt{L}+t\sqrt{2}}{\sqrt{kp}+\sqrt{L}}

will ensure $\mu\leq e^{-t^{2}}$; this requirement on $\beta$ is no more than $\beta_{0}$.

Now, let neuron $i$ be a first-time winner on round $t$. Let $X\sim\mathcal{N}(kpr,kpr)$ denote the input from stimulus neurons, $Y\sim\mathcal{N}(kp,kp)$ the input from recurrent connections to neurons in the previous cap, and $Z\sim\mathcal{N}(kpq,kpq)$ the input from nonstimulus neurons. Then conditioned on $X+Y+Z=C_{t}$, Lemma .6 indicates that
\begin{align*}
X\,|\,(X+Y+Z=C_{t}) &\sim \mathcal{N}\left(\frac{r}{1+r+q}C_{t},\,kp\frac{r(1+q)}{1+r+q}\right)\\
Y\,|\,(X+Y+Z=C_{t}) &\sim \mathcal{N}\left(\frac{1}{1+r+q}C_{t},\,kp\frac{r+q}{1+r+q}\right)
\end{align*}
So, the input on round $t+1$ is at least

(1+\beta)\frac{1-\mu_{t}+r^{2}}{1+r+q}C_{t}+kpr(1-r)+kp\mu_{t}+kpq+Z

where $Z\sim\mathcal{N}\left(0,kp\left((1+\beta)^{2}\frac{r^{2}(1+q)+(1-\mu_{t})(r+q)}{1+r+q}+r(1-r)+q\right)\right)$. By the usual argument we have
\begin{align*}
\Pr(i\in C_{t+1}\,|\,i\in C_{t}) &= 1-\mu_{t+1}\\
&\geq \Pr\left(Z\geq C_{t+1}-(1+\beta)\frac{1-\mu_{t}+r^{2}}{1+r+q}C_{t}-kpr(1-r)-kp\mu_{t}-kpq\right)
\end{align*}
So, we will have $\mu_{t+1}<e^{-1}\mu_{t}$ when

\beta\geq\frac{1}{1-\mu_{t}+r^{2}}\cdot\frac{\frac{r(1-r)+q+\mu_{t}}{\sqrt{1+r+q}}\sqrt{L+2\ln(1/\mu_{t})}+\sqrt{2\ln(1/\mu_{t})}}{\sqrt{kp}+\sqrt{\frac{L+2\ln(1/\mu_{t})}{1+r+q}}}

which is smaller than $\beta_{0}$. Assuming $r+q\sim 1$, we may simplify $\beta_{0}$, so that

\beta_{0}=\frac{1}{r^{2}}\cdot\frac{\left(\sqrt{2}-r^{2}\right)\sqrt{L}+\sqrt{6}}{\sqrt{kp}+\sqrt{L}}

So, if $\beta\geq\beta_{0}$, the probability that a neuron leaves the cap after having been in it for $t$ consecutive rounds drops off exponentially in $t$. We can conclude that no more than $\ln(k)$ rounds will be required for convergence. Additionally, for a neuron which enters the cap at time $t$, let $1-p_{\tau}$ denote the probability that it leaves after $\tau$ rounds. Then its probability of staying in the cap on all subsequent rounds is

\prod_{\tau\geq 1}p_{\tau}\geq\prod_{\tau\geq 1}\left(1-\exp\left(-\tau^{2}\left(\frac{\beta}{\beta_{0}}\right)^{2}\right)\right)\geq 1-\exp\left(-\left(\frac{\beta}{\beta_{0}}\right)^{2}\right)

Thus, every neuron that makes it into the cap has probability at least $1-\exp(-(\beta/\beta_{0})^{2})$ of making every subsequent cap, so the expected total support of all caps together is no more than $k/(1-\exp(-(\beta/\beta_{0})^{2}))$. $\blacksquare$
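The qualitative conclusion of Theorem 3.1 — that the cap stabilizes within a few rounds, with total support not much larger than $k$ — can be observed directly in a compact simulation. The sketch below is a simplified variant of the process (small arbitrary parameters; stimulus neurons fire i.i.d. with probability $r$ each round, and the background rate $q$ is omitted), not the experimental setup of the paper:

# Sketch: a random stimulus drives a recurrently connected area; the top-k
# neurons by total input fire each round (the cap), and Hebbian plasticity
# multiplies the weight of each synapse from a neuron that fired to a
# neuron in the new cap by (1 + beta). First-timers should shrink rapidly.
import numpy as np

rng = np.random.default_rng(5)
n, k, p, r, beta, rounds = 2_000, 100, 0.1, 0.9, 0.5, 10

stim_w = (rng.random((k, n)) < p).astype(float)    # stimulus -> area synapses
rec_w = (rng.random((n, n)) < p).astype(float)     # recurrent synapses
prev_cap = np.zeros(n, dtype=bool)
support = set()

for t in range(rounds):
    fired = rng.random(k) < r                      # stimulus neurons firing this round
    inp = fired.astype(float) @ stim_w + prev_cap.astype(float) @ rec_w
    cap_idx = np.argsort(inp)[-k:]                 # the k-cap
    cap = np.zeros(n, dtype=bool)
    cap[cap_idx] = True
    stim_w[np.ix_(fired, cap)] *= 1 + beta         # Hebbian updates
    rec_w[np.ix_(prev_cap, cap)] *= 1 + beta
    new = len(set(cap_idx.tolist()) - support)
    support |= set(cap_idx.tolist())
    print(f"round {t}: first-timers = {new}, total support = {len(support)}")
    prev_cap = cap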

Proof of Theorem 3.3

Let $\mu$ denote the fraction of newcomers in the cap. A neuron in $A^{*}$ can expect an input of

X_{a}=\gamma kpr+kpq+Z_{a}

where $Z_{a}\sim\mathcal{N}(0,\gamma^{2}kpr+kpq)$, while neurons outside of $A^{*}$ can expect an input of

X=kpr+kpq+Z

where $Z\sim\mathcal{N}(0,kpr+kpq)$. Then the threshold is roughly

C_{1}=kpr+kpq+\sqrt{kp(r+q)(L+2\ln(1/\mu))}

For a neuron $i$ in $A^{*}$ to make the cap, its input needs to exceed this threshold. We have
\begin{align*}
\Pr(i\in C_{1}\,|\,i\in A^{*})