Embodied Neuromorphic Vision with Continuous Random Backpropagation

The brain outperforms computer architectures in terms of energy efficiency, robustness and adaptivity. Brain computations are modeled in silico with spiking neural networks and neuromorphic hardware. Recently, three-factor synaptic plasticity rules approximating backpropagation have been derived. Suited to neuromorphic hardware, these rules can learn online with asynchronous updates. In this paper, we present Continuous Random Backpropagation (cRBP), a continuous version of Event-Driven Random Backpropagation. This learning rule performs comparably to state-of-the-art rules on the DvsGesture dataset. We additionally show that the accuracy can be significantly increased with a simple attention mechanism. By exploiting the sparsity of event streams, this mechanism provides translation invariance at a low computational cost compared to convolutions. Subsequently, we integrate cRBP in a real robotic setup, where a robotic hand grasps objects according to the detected visual affordances. In this setup, visual information is actively sensed by a Dynamic Vision Sensor (DVS) mounted on a robotic head performing microsaccadic eye movements. Our results suggest that advances in neuromorphic technology and plasticity rules enable the development of learning robots operating at high speed and low power.


I. INTRODUCTION
The brain outperforms computer architectures in terms of energy efficiency, robustness and adaptivity. The computational paradigms of the brain are vastly different from those of modern computer architectures. Biological neural networks base their computations on local information and communicate asynchronously with spikes. Understanding how these paradigms can be implemented in hardware would enable the design of autonomous learning robots operating at high speed for a fraction of the energy budget of current solutions.
Learning in the brain is believed to be based on synaptic plasticity. Unlike conventional machine learning methods, synaptic plasticity rules characterize weight updates in terms of information local to the synapse. This locality enables efficient neuromorphic hardware implementations, asynchronous updates and online learning.
Recently, a family of synaptic plasticity rules for training multi-layer spiking neural networks has been proposed [1]-[4]. These rules implement variations of backpropagation by approximating gradients as a multiplication of three factors related to the input, output and error of a synapse [5]. In this paper, we evaluate the ability of three-factor rules to efficiently learn spatio-temporal visual representations when embodied in a robotic setup.
We present the Continuous Random Backpropagation (cRBP) rule, a continuous version of Event-Driven Random Backpropagation (eRBP) [1], following the derivation of Deep Continuous Local Learning (DECOLLE) [3]. As in DECOLLE, synaptic weights have continuous-time dynamics, unlike eRBP, which updates synaptic weights only on pre-synaptic spikes. As in eRBP, the error signals for the hidden neurons are computed at the network output, unlike DECOLLE, which computes errors locally at each layer. These rules learn in an online fashion, in the sense that synaptic weights are updated while the input is streamed into the network, by propagating the information required to compute the gradient forward, as in Real-Time Recurrent Learning (RTRL) [6]. This keeps the space complexity of these rules constant with respect to time. In contrast, rules such as SLAYER [7], based on Backpropagation-Through-Time (BPTT), require storing a history of the past neural activity. In that case, memory consumption increases with the length of the time sequence, an important limitation, as reported in [8]. On the other hand, DECOLLE and cRBP can learn spatio-temporal patterns on long sequences with a fine temporal resolution. This makes them well suited to neuromorphic vision sensor data. Additionally, unlike conventional artificial neural networks, which require integrating event streams into frames [9]-[11], our method is suitable for low-latency applications.
We show that the accuracy of cRBP is comparable to state-of-the-art methods on the IBM DvsGesture dataset [12]. A covert attention mechanism is introduced which further improves the efficiency and accuracy of the learning rules by providing translation invariance at a low computational cost compared to convolutions. Inspired by receptive field remapping in the visual cortex, this attention mechanism is tailored to the sparsity of visual event streams. Finally, we integrate cRBP in a real-world closed-loop robotic grasping setup involving a robotic head, an arm and a 5-finger hand. The spiking network learns to classify different types of affordances based on visual information obtained with microsaccades, and communicates this information to the arm for grasping. This real-world task has the potential to advance neuromorphic and neurorobotics research, since many functional components such as reaching, grasping and depth perception can be easily segregated and implemented with brain models. This work paves the way towards the integration of brain-inspired computational paradigms into the field of robotics.
A barrier to embodied learning robots is the offline (batch learning) nature of conventional implementations of backpropagation. In comparison, our model can learn from events streamed from the DVS [26] with little loss in accuracy. This enables continual updates without a separation of training and testing phases. However, such life-long learning setups require addressing the forgetting problem that results from learning on temporally correlated input data.

II. METHOD

A. Continuous Random Backpropagation
The backpropagation algorithm computes the gradient of a synaptic weight with respect to an arbitrary loss function defined on the network's output. The credit of a neuron (how a change in its output affects the loss) therefore depends on the synaptic weights of the subsequent layers. This weight transport problem is solved by Direct Feedback Alignment [13], an instance of random backpropagation [14], which computes the credit for a neuron i as a linear combination of the network errors e_k with fixed, random coefficients g_ik. This solution also enables asynchronous weight updates by decoupling the conventional forward and backward phases of backpropagation. An adaptation of Direct Feedback Alignment to spiking networks of Leaky Integrate-and-Fire neurons was derived in [1]. Named eRBP, this synaptic plasticity rule can be formulated as:

Δw_ij ∝ Θ'(u_i) s_j Σ_{k∈rdout} g_ik e_k,    (1)

with w_ij the synaptic weight from neuron j to neuron i, Θ' the derivative of the spike function Θ, u_i the membrane potential of neuron i and s_j the pre-synaptic spiketrain (either 0 or 1 at a given time t). The set rdout contains the indices of the readout neurons y_k. The spike function Θ is the non-differentiable Heaviside function (hard threshold), but its derivative can be approximated with a surrogate gradient [15]. As in eRBP and DECOLLE, we approximate this derivative with the boxcar function: Θ'(x) ≈ Boxcar(x) = 1 if −0.5 < x < 0.5, otherwise 0. The symbol ∝ denotes a proportionality relation; the multiplicative constant is the learning rate, which can be chosen freely.
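For illustration, Equation (1) can be sketched in Python as follows. This is a minimal sketch, not our actual implementation: the array shapes, the learning rate and the helper names are illustrative assumptions, and the errors e_k are assumed to be given (they are defined in Equation (2) below).

import numpy as np

def boxcar(u, width=0.5):
    # Surrogate derivative of the spike function: 1 inside (-width, width), else 0.
    return ((u > -width) & (u < width)).astype(float)

def erbp_update(w, u, s_pre, g, err, lr=1e-4):
    # One event-driven eRBP step (Equation (1)). Illustrative shapes:
    # w: (n_post, n_pre) synaptic weights, u: (n_post,) membrane potentials,
    # s_pre: (n_pre,) binary pre-synaptic spikes at the current time step,
    # g: (n_post, n_rdout) fixed random feedback coefficients g_ik,
    # err: (n_rdout,) readout errors e_k.
    credit = g @ err  # sum over k of g_ik * e_k: random feedback credit per neuron
    dw = boxcar(u)[:, None] * credit[:, None] * s_pre[None, :]
    return w - lr * dw  # descend the loss; lr is the free proportionality constant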
In the special case of a Mean Square Error (MSE) loss, the errors e_k are computed as the difference between the network readouts y_k and the network targets ŷ_k:

e_k = y_k − ŷ_k, ∀k ∈ out,    (2)

with out the set containing the indices of the output neurons.
Since the weight update in Equation (1) is proportional to s_j, weight updates in eRBP are triggered by pre-synaptic spikes, in an event-driven fashion. However, this formulation does not account for the dynamics of the post-synaptic potentials. More recent three-factor rule derivations incorporate an eligibility trace to account for these dynamics [2]-[4] (see Equation (4) in [2]). We can integrate this term directly into Equation (1), yielding the cRBP rule:

Δw_ij ∝ Θ'(u_i) (ε ∗ s_j) Σ_{k∈rdout} g_ik e_k,    (3)

where ∗ denotes a temporal convolution and ε is the post-synaptic potential kernel. This new rule describes continuous synapse dynamics rather than event-driven updates; we therefore refer to it as cRBP. The main difference with SuperSpike is the loss function: SuperSpike relies on a van Rossum distance to a target spiketrain. This leads SuperSpike to require one eligibility trace per synapse, whereas cRBP requires only one eligibility trace per neuron. Additionally, the computation of this eligibility trace can be factored into the neural dynamics, as presented in DECOLLE. The main difference with DECOLLE is that DECOLLE relies on local readouts y_k^l and local targets ŷ_k^l for every layer l to compute the errors e_k^l. Instead, hidden layers in cRBP are updated with respect to the global network loss.
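The temporal convolution ε ∗ s_j can be maintained online as a leaky trace, so no history of spikes has to be stored. The sketch below assumes a first-order (exponential) post-synaptic potential kernel with time constant tau_syn and a discrete simulation step dt; as stated above, only one trace per pre-synaptic neuron is required.

import numpy as np

def boxcar(u, width=0.5):
    return ((u > -width) & (u < width)).astype(float)

def crbp_step(w, u, s_pre, trace, g, err, tau_syn=5e-3, dt=1e-3, lr=1e-4):
    # One continuous cRBP step (Equation (3)). `trace` approximates (epsilon * s_j)(t)
    # under an assumed exponential PSP kernel, updated online from incoming spikes.
    trace = trace * np.exp(-dt / tau_syn) + s_pre  # leaky integration of pre-synaptic spikes
    credit = g @ err  # random feedback credit, as in eRBP
    dw = boxcar(u)[:, None] * credit[:, None] * trace[None, :]
    return w - lr * dt * dw, trace  # continuous-time update applied at every step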
The simulations presented in this paper rely on the same neuron model as DECOLLE, introduced in Equation (4) of [3]. Note that this neuron model does not account for synaptic delays, and that the refractory period is approximated with a self-inhibition.

B. Network Architecture
The network is presented in Figure 1. It learns from event streams provided by a DVS. Since spikes are not signed events, we associate two neurons with each pixel to convey ON- and OFF-events separately. This distinction is important since event polarities carry information about the direction of motion (see Figure 3). Since events are emitted only upon changes in illumination, two different setups are analyzed: a dataset where changes originate from motion in the scene, and a dataset where changes originate from fixational eye movements. The evaluation on these two types of datasets can lead to different performance [16].
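To make this input encoding concrete, the sketch below bins a stream of DVS address events into two-channel binary tensors, one channel per polarity. The event field layout, resolution and time step are assumptions for illustration.

import numpy as np

def events_to_frames(events, n_steps, dt_us=1000, height=128, width=128):
    # Bin DVS address events into (n_steps, 2, height, width) binary tensors.
    # `events` is an array of (t_us, x, y, polarity) rows; channel 0 carries
    # ON-events and channel 1 OFF-events, preserving polarity for the network.
    frames = np.zeros((n_steps, 2, height, width), dtype=np.float32)
    for t_us, x, y, pol in events:
        step = int(t_us // dt_us)
        if step < n_steps:
            frames[step, 0 if pol > 0 else 1, int(y), int(x)] = 1.0
    return frames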
Only spikes are propagated from one layer to another. However, the errors computed at the network output (e_k in Equation (2)) are communicated to the layers as analog values. A previous implementation of the work presented in this paper relied on eRBP and was implemented with Auryn [17] (https://github.com/HBPNeurorobotics/auryn). This implementation computed and communicated errors using only neural dynamics and spikes, as in [1]. The newer implementation is based on PyTorch, which offers more tools for learning, such as auto-differentiation, convolutions, max pooling and advanced optimization methods.

C. Covert Attention Window
It was shown in biology that the receptive fields of frontal eye field neurons are constantly remapped [18]-[20]. Inspired by this insight, we introduce a simple covert attention mechanism which consists of continuously moving an attention window across the input stream as new events are received. Covert attention, as opposed to overt attention, refers to an attention shift that is not accompanied by eye movements. Particularly suited to the sparsity of event streams, the center of the attention window is computed online as the median address of the last n_attention events, see Figure 3. By remapping receptive fields relative to the center of the motion, this technique enables translation invariance at a low computational cost compared to convolutions. Indeed, convolutions process all regions of the image identically and require a weight-sharing mechanism that is complicated to implement on neuromorphic hardware. Our method also reduces the dimension of the event stream without rescaling, thus decreasing the size of the neural network.
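The mechanism reduces to keeping the addresses of the last n_attention events and re-addressing each incoming event relative to their median, as in the following sketch (the class and parameter names are ours, for illustration):

from collections import deque
import numpy as np

class AttentionWindow:
    # Covert attention: center a crop on the median address of recent events.
    def __init__(self, n_attention=1000, size=32):
        self.addresses = deque(maxlen=n_attention)  # last n_attention event addresses
        self.size = size
        self.half = size // 2

    def __call__(self, x, y):
        # Re-address one event; returns window coordinates, or None if the
        # event is eccentric (outside the window) and should be discarded.
        self.addresses.append((x, y))
        cx, cy = np.median(np.array(self.addresses), axis=0)  # online median center
        xw, yw = int(x - cx + self.half), int(y - cy + self.half)
        if 0 <= xw < self.size and 0 <= yw < self.size:
            return xw, yw
        return None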
A similar method was introduced in [21] for classifying a dataset of three human motions (bend, sit/stand, walk) recorded with a DVS. Their approach consists of remapping the addresses of their feature neurons (C1) with respect to their mean activation before feeding them to the classifier. Instead, our method remaps the address events directly, with respect to the median address of the last events. Unlike the median, the mean activation can result in an event-less attention window when multiple objects are in motion, as in two-hand gestures. Additionally, since our attention window is smaller than the event stream, eccentric events are not processed by the network. We show in this paper how this biologically motivated technique boosts performance, even on DvsGesture, where multiple body parts are simultaneously in motion. We note that a similar mechanism could be integrated in a robotic head such as the one used in this paper to perform saccadic eye movements (see Figure 2). In this case, an additional mechanism to discard events resulting from the ego-motion would be required.

D. Microsaccadic Eye Movements
For our real-world grasping experiment, address events are sensed from static scenes by performing microsaccadic eye movements. This technique was already used to convert images to event streams [22], essentially extracting edge features [16]. To this end, we mounted the DVS on the robotic head presented in [23], see Figure 2. One Dynamixel MX-64AT servo tilts both DVS simultaneously, while two Dynamixel MX-28AT servos pan each DVS independently. The center of all rotations approximately coincides with the optical center of each DVS. In this work, only the events of the right DVS are processed. The microsaccadic motion consists of an isosceles triangle in joint space, with each motion lasting 0.2 s. The motions are a negative tilt of α and a negative pan of α/2, followed by a positive tilt of α and a negative pan of α/2, finalized with a return to the initial position. We chose the angle α = 1.833°. This angle is much smaller in biology, but DVS pixels are much larger than the photoreceptors of the retina [24]. The precise microsaccadic motion is not relevant for learning, but similar motions should be used for training and testing.
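The triangular joint-space motion can be expressed as three pan/tilt waypoints, as sketched below (the function and the absolute joint representation are illustrative assumptions; each segment is executed in 0.2 s by the servos):

ALPHA = 1.833  # microsaccade amplitude in degrees

def microsaccade_waypoints(tilt0, pan0, alpha=ALPHA):
    # Isosceles triangle in (tilt, pan) joint space, one 0.2 s motion per segment:
    # negative tilt with half negative pan, tilt back up with half negative pan,
    # then return to the initial position.
    return [
        (tilt0 - alpha, pan0 - alpha / 2),
        (tilt0,         pan0 - alpha),
        (tilt0,         pan0),
    ]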
The microsaccades are triggered either manually when recording training data, or automatically in a loop at test time. We allow events to flow through the network only while a microsaccade is performed. No information about the properties of the microsaccade is passed as an input to the network.

III. EVALUATION
Two different network architectures are used in this work: a convolutional network and a dense network. Both architectures expect input dimensions of 2x32x32 at a given time step. The first architecture is the same 3-layer convolutional network used in DECOLLE [3]. The convolutional layers consist of 64, 128 and 128 kernels of size 7x7, respectively, interleaved with max pooling and spike dropout operations. The max pooling operation is applied before the spike function Θ.
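The sketch below illustrates the layer structure of this convolutional architecture in PyTorch. It is a structural sketch only: the DECOLLE neural dynamics are collapsed into a static hard threshold, and the padding and pooling details are assumptions.

import torch.nn as nn

class ConvSNNSketch(nn.Module):
    # Structural sketch of the 3-layer convolutional architecture: 64, 128 and 128
    # kernels of size 7x7, with max pooling applied before the spike function.
    def __init__(self, channels=(2, 64, 128, 128), kernel=7, p_drop=0.5):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i + 1], kernel, padding=kernel // 2)
            for i in range(3)
        )
        self.pool = nn.MaxPool2d(2)
        self.drop = nn.Dropout(p_drop)  # stands in for spike dropout

    def forward(self, x):  # x: (batch, 2, 32, 32), one time step
        for conv in self.convs:
            u = self.pool(conv(x))          # max pooling before the spike function
            x = self.drop((u > 0).float())  # hard-threshold spikes; training would
                                            # use the boxcar surrogate gradient
        return x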

A. DvsGesture
We evaluate cRBP on the DvsGesture dataset, following the same training procedure as the DECOLLE rule [3], which currently achieves state-of-the-art accuracy on this dataset. DvsGesture is an action recognition dataset recorded by IBM using a DVS [12], [26]. It consists of 1342 recordings of 29 subjects performing 11 different actions under three illumination conditions. This dataset is loaded into PyTorch using the torchneuromorphic library developed in [3]. Specifically, training samples consist of 500 ms-long event streams, and test samples of 1800 ms-long event streams. These samples are sliced at random locations in the recordings, while ensuring that the motion is present during the whole sequence. This procedure maximizes the use of the dataset, leading to 1176 training samples and 288 test samples. The sequences were presented to the network in mini-batches of 72 samples.
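The slicing procedure can be illustrated as follows (a sketch, not the torchneuromorphic implementation): a window of fixed duration is cut at a random position, constrained to lie inside the labeled portion of a recording so that the motion spans the whole slice.

import numpy as np

def random_slice(events, label_start_us, label_end_us, duration_us):
    # Slice a duration_us-long window at a random position within the labeled
    # interval; assumes label_end_us - label_start_us > duration_us.
    start = np.random.randint(label_start_us, label_end_us - duration_us)
    mask = (events[:, 0] >= start) & (events[:, 0] < start + duration_us)
    sliced = events[mask].copy()
    sliced[:, 0] -= start  # re-reference timestamps to the slice onset
    return sliced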
The event streams recorded from the DVS have two channels (ON- and OFF-events) of 128x128 pixels. We compare the accuracy of the network on downsized streams and with the covert attention window mechanism described in Section II-C. The downsize operation reduces the event stream to 32x32 by grouping neighboring pixels, and was used in [3]. The attention window re-addresses the events into a 32x32 window with respect to the median event address. It was implemented as an alternative to the downsize operation in the torchneuromorphic library. The number of events used to compute the position of the attention window was set to n_attention = 1000.
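For comparison, the downsize operation can be sketched as a max over pixel groups: each 4x4 neighborhood of the 128x128 stream is collapsed into one of the 32x32 pixels (a sketch of the grouping; the exact pooling used in [3] may differ).

import numpy as np

def downsample_events(frames, factor=4):
    # Reduce (T, 2, 128, 128) event frames to (T, 2, 32, 32): a grouped pixel
    # is active whenever any pixel of its factor-by-factor neighborhood is.
    t, c, h, w = frames.shape
    grouped = frames.reshape(t, c, h // factor, factor, w // factor, factor)
    return grouped.max(axis=(3, 5))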
The final accuracies for the different experiments on DvsGesture are reported in Table I. The evaluation on DvsGesture shows that cRBP efficiently learns spatio-temporal patterns to classify motions from raw event streams. With the same convolutional architecture and training procedure as DECOLLE, cRBP reaches 92.48% accuracy, close to state-of-the-art accuracy, see Figure 4. When replacing the downsampling operation of the event stream with the attention mechanism, the accuracy further increases to 95.34% for cRBP and to 96.37% for DECOLLE. This improvement is more significant for the dense network architecture. In this case, the attention window mechanism leads to a substantial improvement from 77.93% to 90.80% accuracy compared to the downsampling approach. This confirms our assumption that the attention window mechanism provides translation invariance with respect to the performed gestures. The performance of the convolutional networks only slightly improves because convolution and max pooling operations already provide translation invariance, which results from the same kernel being convolved over the whole image to form a feature map. We therefore expect that locally connected layers (a convolutional topology where kernels are not shared across the image [27]) coupled with the presented attention mechanism could drastically reduce the amount of computation while retaining the accuracy of a convolutional network. Such networks are also more biologically plausible than convolutional networks, since no mechanism in the brain is known to support weight sharing.
The improvement of the attention mechanism over downsampling is also reflected in the classification output of a test sample. Indeed, with the attention mechanism, the network unambiguously and correctly classifies the test sample early in the sequence, see Figure 5. With the downsampling approach, the confusion in the output of the network is higher, see Figure 6. We note that many neurons in the hidden layers spike at very high rates, up to the maximum rate of 1000 Hz imposed by the simulation time step of 1 ms (neglecting dropout). Indeed, the weak refractory term in the neural dynamics decreases the membrane potential after a spike, but does not prevent subsequent spikes. Lower spiking rates can be favored by adding a regularization term to the loss function, as mentioned in [3]. This is shown in the grasping experiment, see Figure 9.

B. Grasp-type Recognition
In this experiment, we embody cRBP in the real-world robotic grasping setup shown in Figure 7. The spiking network is trained to recognize four labels corresponding to four different grasps: ball-grasp, bottle-grasp, pen-grasp or do nothing [28]. During training, an object of a particular class is placed on a table at a specific position. The robotic head performs microsaccadic eye movements (similar to the N-MNIST dataset [22]) to extract visual information from the static object. Only the event stream of one DVS is recorded, together with the corresponding object affordance. In this experiment, the attention window of dimension 32x32 is fixed to match the position of the objects on the table, see Figure 8 for example samples. During testing, a microsaccade is performed and the detected object affordance triggers the corresponding predefined reaching and grasping motion on a Schunk LWA4P arm equipped with a Schunk SVH 5-finger hand. This demonstrator was implemented with the ROS framework [29] and the ROS DVS driver introduced in [30].

Fig. 7: Real-world grasp-type recognition experiment setup integrating a Schunk LWA4P arm equipped with a Schunk SVH 5-finger hand and a DVS head. The DVS head performs microsaccadic eye movements to sense event streams from static scenes. We recorded a small four-class dataset (ball, bottle, pen, background) of 50 samples per class. At test time, the detected grasp-type triggers the corresponding predefined reaching and grasping motion.
With only 50 samples per class and 10 epochs, the network was capable of learning the four visual affordances (see the supplementary video at https://neurorobotics-files.net/index.php/s/sBQzWFrBPoH9Dx7, recorded with the previous eRBP implementation). Example spiketrains and classification results at test time are shown in Figure 9. Spike rates are kept lower than in the DvsGesture experiment by using regularization in the loss function. Specifically, the loss function becomes:

L_total = L + Σ_l ( λ_1 [−⟨U_i^l⟩_i]_+ + λ_2 ⟨[U_i^l]_+⟩_i ),    (4)

where L is the network loss, U_i^l is the membrane potential of neuron i in layer l, ⟨·⟩_i denotes averaging over the index i, [·]_+ is a linear rectification, λ_1 = 2.5·10^−2 and λ_2 = 1.5·10^−4 for both layers. The term with the λ_1 factor favors a minimum firing rate, and the term with the λ_2 factor keeps the membrane potential below threshold on average. This regularization decreases the average spiking rate, although we note that individual neurons can still spike at rates of up to 500 Hz (neglecting dropout) in the second layer, see Figure 9. The network readout for the correct class is high (> 0.66) shortly after microsaccade onset: 43 ms for the ball, 58 ms for the bottle, 35 ms for the pen and 33 ms for the background, see Figure 9. These numbers are coherent with behavioral experiments quantifying human reaction times to visual stimuli [31], [32]. This resemblance should be further investigated on tasks identical to those used in the behavioral experiments. To this end, different neural dynamics enforcing plausible spike rates and including synaptic delays should be used.
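The regularized loss of Equation (4) can be sketched in PyTorch as follows (a sketch consistent with the description above; the tensor names are ours):

import torch

def regularized_loss(task_loss, membrane_potentials, lam1=2.5e-2, lam2=1.5e-4):
    # `membrane_potentials` holds one tensor U^l per hidden layer. The lam1 term
    # penalizes a negative mean potential (favoring a minimum firing rate); the
    # lam2 term penalizes positive potentials (keeping them below threshold on
    # average, with the threshold at zero).
    reg = 0.0
    for u in membrane_potentials:
        reg = reg + lam1 * torch.relu(-u.mean()) + lam2 * torch.relu(u).mean()
    return task_loss + reg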
Since the DVS does not sense colors, the network relies only on shape information, which is crucial for affordances. This allowed the network to generalize moderately despite the small number of training samples. The learned weights projected onto the input are displayed in Figure 9. A single object per affordance was used during training, but the network could recognize objects of the same shape with different colors. Recognition also worked when the objects were slightly moved from the reference point used for grasping. However, the network was not robust to changes in background or to unexpected background motion occurring during the microsaccade. This is due to the background being learned as an additional class for the "do nothing" affordance.
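The test-time decision logic described above can be sketched as a simple threshold on the readouts (the 0.66 threshold follows the observation above; the function and the triggering interface are illustrative assumptions):

def detect_affordance(readout_stream, classes=("ball", "bottle", "pen", "background"),
                      threshold=0.66):
    # Scan per-millisecond readouts (sequences of 4 values) and return the first
    # class whose readout exceeds the threshold, together with its detection time.
    for t_ms, readouts in enumerate(readout_stream):
        for label, value in zip(classes, readouts):
            if value > threshold:
                return label, t_ms  # e.g. ("ball", 43) triggers the predefined grasp
    return "background", None  # no confident detection: do nothing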

IV. CONCLUSION
Neuromorphic engineering technology enables the design of autonomous learning robots operating at high speed for a fraction of the energy consumption of current solutions. Until recently, the advantages of this technology were limited by the lack of synaptic plasticity rules for training multi-layer spiking networks. This bottleneck has been addressed by the derivation of three-factor rules approximating backpropagation. In this paper, we demonstrated the ability of cRBP to learn spatio-temporal representations from event streams provided by a DVS. With the addition of a simple biologically inspired covert attention mechanism, we have shown that cRBP and DECOLLE further improve their accuracy on the DvsGesture benchmark in comparison to classical rescaling approaches. This attention mechanism provides translation invariance at a low computational cost compared to convolutions. Lastly, we integrated cRBP in a real-world robotic grasping experiment, where affordances are detected from microsaccadic eye movements and conveyed to a robotic arm and hand for execution. Real robot learning experiments are challenging because of the difficulty and time required to collect relevant training data.
Our results show that correct affordances are detected within about 40 ms of microsaccade onset, which is coherent with behavioral findings in humans. In future work, these results should be further investigated by replicating the behavioral experiments presented in [31], [32]. Additionally, other components of the grasp-type recognition experiment could be implemented with spiking networks, such as reaching motions [33], [34], grasping motions [35] and depth perception [23]. It was already shown in [8] that spiking networks can learn regression tasks from event streams. This would enable a wider variety of computational brain models to be compared against behavioral experimental results in real-world scenarios. This work paves the way towards the integration of brain-inspired computational paradigms into the field of robotics.