Classical Machine Learning architecture

Machine learning has dramatically changed the world around us. But classical machine learning has bottlenecks:
– it needs huge computational resources to achieve good enough results;
– it cannot cover truly deep learning tasks.

Brain-inspired computation

Image from Mike Davies’s presentation of the new Intel chip Loihi at NICE 2019

The new approach is more powerful and more power-efficient than the classical machine learning approach. It is based on four points:

  • Fine-grained parallelism with massive fan-out. Integrating memory and computation removes the bus bottleneck between the processing units and memory: memory is distributed alongside the compute units.
  • Event-driven computation in time. Time becomes a fundamental quantity of the system: it is an independent variable, which makes the system a dynamical one, and computation happens through the evolution of that system over time.
  • Low-precision and stochastic operation.
  • Adaptive, self-modifying operation. This is one of the biggest advantages of the new concept, and it leads to a huge growth of parallelism in the system.

The NEST simulator is a good first step into neuromorphic computation.
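As a minimal sketch of what getting started with NEST looks like (assuming a working NEST installation; the model name iaf_psc_alpha is one of NEST's built-in integrate-and-fire models, and the weight and spike times below are just illustrative values):

```python
import nest

# One leaky integrate-and-fire neuron with alpha-shaped synaptic currents.
neuron = nest.Create("iaf_psc_alpha")

# A spike generator drives the neuron at two fixed times (ms).
stimulus = nest.Create("spike_generator", params={"spike_times": [10.0, 20.0]})
nest.Connect(stimulus, neuron, syn_spec={"weight": 500.0})

# A voltmeter records the membrane potential over time.
voltmeter = nest.Create("voltmeter")
nest.Connect(voltmeter, neuron)

nest.Simulate(100.0)  # run for 100 ms of simulated time
print(nest.GetStatus(voltmeter, "events")[0]["V_m"][:5])
```

This already exercises the event-driven, time-based view described above: the "program" is a network plus stimuli, and the result emerges from simulating the system's dynamics over time.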

What follows is Mike Davies's deep-dive talk on Loihi from NICE 2019.

First I'll give a little bit of motivation. This is not yet about the chip itself, but about the philosophy behind why we architected Loihi the way we did. We started this program at Intel a little over four years ago, and for the most part nobody in the program had any prior experience with neuromorphic computing or neuroscience. My team specifically had been doing Ethernet switches, of all things, which is about as far from neural chips as you can get. So at Intel we naturally approached this from an engineering perspective, and engineers like specs, gathering customer data, and letting that inform the definition of the architecture. We could look at nature, see the fantastic 20-watt brain out there, and take a biomimicry approach: read a bunch of neuroscience papers, construct a big laundry list of features, call that our architectural spec, and go implement it. Unfortunately it's not that simple, because we're in a really different design regime than nature. We have different tools, semiconductors and CMOS transistors, as opposed to the ion channels, membranes, and lipids that nature is using.

For the most part we're at a disadvantage today if we want to reach the scale of the human brain. Take the neocortex, make a planar projection of it, and compare the neuron and synapse density you achieve there against state-of-the-art neuromorphic implementations with a roughly equivalent feature set (we have planar technology, and the neocortex can be viewed as a planar system): we're maybe about 20 times off in neuron density, 400 times off in synaptic density, and in synaptic op energy, the energy per primitive operation, we're even further off, maybe thousands of times. So we're at a pretty significant disadvantage. From what we've seen so far, the area density disadvantages are the more troubling ones; that's the critical constraint, because it limits the types of problems we can run. As you saw in my Wednesday presentation, it's as you scale up that we start to see really compelling gains for this type of fine-grained parallel architecture, so we need to get to scale, and we're at a significant disadvantage there. The energy differential is even bigger, but we are so far off from everything a brain is doing, compared to what we can even conceive of approaching with chips today, that I don't regard that 2000x as the critical issue; it's the density that we have to offset.

Luckily we have some advantages. I've listed the maximum firing rate in nature: as you all know, neurons fire at really slow speeds compared to what we're used to in computing, on millisecond timescales. A period of about 10 milliseconds is about as fast as a neuron will spike, compared to gigahertz frequencies, nanosecond periodicity, which is really not hard at all with modern CMOS circuits. That's a really good advantage we have.
In particular, we can trade multiplexing against that area disadvantage. If you multiplex, using the best-known design methods of today, pipelining and time multiplexing, you can reuse the same circuit many, many times and effectively shrink the area by that multiplexing factor. So we immediately recognized that there are different routes we can take to achieve the same functionality while adjusting for the different design regime we're in.

We can also design reliable circuits. Nature can achieve reliable operation, mostly; we roll out of bed every morning and we don't fall over, and animals can be quite reliable, but they achieve that through a certain amount of redundancy: the more reliable the operation you need, the more neural resources you have to spend to achieve it. So that's another advantage we don't necessarily want to give up, and among other reasons it leads us to stick to a digital design paradigm in our program, at least for the time being, because that gives us fully deterministic, reliable operation; that's a key advantage we don't want to give up too quickly.

There are a bunch of other differences that maybe aren't as widely appreciated. A really important one is being able to reprogram the system. We don't want to wait 20 years for our chip to get fully programmed before we can use it; we want to reprogram it on a really fast timescale to solve one problem after the next. That's actually a hard feature; nature hasn't figured that one out, so in some sense we're overachieving compared to nature. For all these reasons, although we regard the systems we build as neuromorphic, meaning we're directly inspired by the form of the solutions we see in nature to the computational problems we're interested in, we end up finding different specific solutions. The key point is that the objectives are the same for nature and for our artificial systems: we want very good energy efficiency, we want very fast response times, which is critical for survival and critical for solving problems quickly, and we want cheap manufacturing. Broadly speaking, the drivers of six hundred million years of brain evolution should be completely applicable to the systems we want to build, and the answers nature has discovered over that period should be applicable in some way to the systems we build. That means we have to understand the principles that have arisen over millions of years and adapt them into the systems we're building. That's our philosophy, the perspective we brought into the Loihi neuromorphic research program, and it's good to keep in mind.

Now this question of spikes comes up: are spikes efficient or not? There's a ton of confusion around this, often from not thinking about it from the right perspective, so here's a slide I'll walk through that provides a completely different, intuitive perspective on spikes and on the neuromorphic architectures that arise to support spike-based communication.
Largely, you'll see, this is about exploiting sparsity. One thing we see in brains is sparse activity, in space and in time, and that's critical for efficiency. We can argue intuitively all day long; I think we're accumulating the evidence to show, in a rigorously benchmarked way, that an artificial neuromorphic system actually provides this, but let me just convey an intuitive view that may change your perspective on what this architecture is all about.

In nature you have an axon and a spike, and it's a digital signal that's sparse in time: the spike is just moving along that axon at a relatively slow speed, the wire is normally off, preferring the zero state, and then there's just a ripple of "one", with no information other than the fact that a spike occurred on that wire. Now, when we implement that, it doesn't look like that. We have time multiplexing; we have this 2D mesh routing infrastructure sending time-multiplexed spike messages around; we're forced to represent spikes as packetized messages of some kind. That message moves through the router interconnect, taking 16 wires in Loihi's case for a 32-bit message, and it goes through relatively large router circuits compared to the biological model, which is just a thin wire. Some may look at this and instinctively say: how can this be efficient, you're so far off from what nature is doing? But recognize that in today's process technology, the average diameter of a myelinated axon is one micron, and we can fit sixteen wires in one micron with no problem, so your sense of scale is perhaps not what you think. Furthermore, because of the time multiplexing, that router structure (which is actually not that big) is being multiplexed by a factor of a thousand or more compared to biology, which shrinks its effective area down to practically nothing. Half of the brain's area is white matter, the routing interconnect, and we're doing roughly similar in Loihi, probably even a little less; it's not so much the routing infrastructure as the tables associated with mapping the connectivity that drive area in the chip, but the routing infrastructure, area-wise, is not at all dissimilar to a network of axon wiring. And when you step back and look at the whole mesh of the neuromorphic system, what matters is the macroscopic activity: which groups of neurons are active and when. You just get this ripple of communicating sets of neurons, and it's sparse, just a trickle of activity that occurs when there's important computation to be done.
Now compare that to the von Neumann processor architecture, of which GPUs are basically a variant; from a system architecture perspective they follow pretty much the same general structure. You have a memory hierarchy and you have computing elements: highly multiplexed, highly flexible circuits, but driven by cyclic operations, a continual stream of reading instructions, decoding them, reading the data, changing it, writing it back to register files, then the first cache, the second cache, the third cache, then DRAM, then non-volatile memory. There are all these cycles of memory access, and it's just very different from what you get in a neuromorphic, brain-inspired architecture. That's not to say that one or the other is more or less efficient; they're just very different, so we should expect there to be different types of problems that one solves better than the other.

To look at this from yet another perspective: what is important in a neuromorphic architecture? If you really look at the computation that's happening, more than anything else it's these little white boxes, which are not what we tend to focus on. The white boxes are table lookups. Once you're in this multiplexed digital environment (in fact all the analog chips have these white boxes as well), table lookups dominate the information transfer in the system. This is the routing: determining, okay, I have this event, now who do I communicate it to, again and again and again. You get this transmission, routing, and distribution of events, while the synaptic operation itself, accumulating weights, decaying membrane potentials and so on, is a relatively small piece of what's happening, even though we tend to focus on it; mathematically, for the computational model, it matters a lot, but if you just look at what the chip is doing, it's much more about determining where and when to send these events. Compare that to a conventional model: you have essentially one fixed function (I'm simplifying a little), and so much of the activity is about how to very intelligently and cleverly reuse a flexible single computing element, an ALU or a multiplier, with instruction decoding, dispatching, sequencing, and resolving data dependencies, iterating through memory with a single computing element, and you get the memory bottleneck and all the things we all know about. So these are just very different architectures overall, and we're exploring this realm to find out what it's good for. It's clearly going to be good for things that are different from conventional computing, and our hope is that it covers a nice broad space of really interesting computation, similar to what the brain's adoption of this kind of architecture can do.

Okay, now on to the chip; that's done with philosophy, and we'll talk about bits and wires and all that good stuff. This is what the chip looks like, and under the covers it's pretty boring really, because it's mainly just this neuromorphic mesh.
Each core implements up to 1,024 neurons, again in a time-multiplexed way, so there's a collection of memories inside it which collectively represent all the state and configuration for those neurons. We gang the cores together into a unit we call a tile, four cores arranged with rotational symmetry so that we can create a nice symmetric unit, and then we just extend that tile outward. It's all done with an asynchronous design methodology, which means the tile is a completely self-contained unit; there's no clock to be distributed across it at all, so we can create as big or as small a chip as we like, literally by changing the parameters of an array function in our layout tool, with nothing further to verify for the whole neuromorphic mesh. It's just local asynchronous communication across this 2D mesh, with dimension-order routing of spike messages going from core to core.

Beyond that, you can see the off-chip interfaces: in each planar direction we have parallel I/O off-chip interfaces, also asynchronous. You've probably heard of AER, address event representation; this is a similar protocol, but packetized and message-based. Spike messages are all sent locally inside the chip; if they need to go off-chip, there's a multi-chip router unit which encapsulates the spikes with an off-chip header, which then lets us address across up to 16,000 chips, again just by tiling the chips together without any additional circuitry. The off-chip interfaces also convey management traffic, general read and write operations, so all of the architectural state in the mesh and the cores is visible to this protocol. That means your host CPU, the ARM or the FPGA off-chip that we talked about, can issue reads and writes just like it would for any memory-mapped I/O device, configure the chip, monitor what's going on, perform the computation, and feed data in and out.
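The talk mentions dimension-order routing of spike messages across the 2D mesh. As a rough illustration of what X-then-Y dimension-order routing means in general (not Loihi's actual implementation; the message fields and coordinates below are assumptions for the sketch):

```python
from dataclasses import dataclass

@dataclass
class SpikeMessage:
    dst_x: int    # destination core's mesh column (illustrative field)
    dst_y: int    # destination core's mesh row (illustrative field)
    axon_id: int  # which input axon of the destination core receives the spike

def next_hop(cur_x, cur_y, msg):
    """X-then-Y dimension-order routing: move along X until the column matches,
    then along Y. Deterministic and deadlock-free on a 2D mesh."""
    if cur_x != msg.dst_x:
        return (cur_x + (1 if msg.dst_x > cur_x else -1), cur_y)
    if cur_y != msg.dst_y:
        return (cur_x, cur_y + (1 if msg.dst_y > cur_y else -1))
    return None  # arrived: deliver axon_id to the local core

# Example: route a spike from core (0, 0) to core (3, 2).
pos, msg = (0, 0), SpikeMessage(dst_x=3, dst_y=2, axon_id=17)
while (hop := next_hop(*pos, msg)) is not None:
    pos = hop
print(pos)  # (3, 2)
```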
Now, off-chip communication is expensive, so we don't always want to be driving this from some off-chip CPU. We found there is a critical need for von Neumann CPUs (they're not going away anytime soon), so in fact we've integrated three of them deeply, right next to the mesh, so that we can have very tight communication loops with the neuromorphic cores. At a minimum this is useful for encoding data into spikes: sending spikes off-chip is actually not necessarily the most efficient thing to do, and if you're expanding some conventional data stream into spikes, that may be a relatively high-bandwidth operation which is much better done inside the chip, where the internal mesh offers much higher bandwidth. That's the default function of these integrated x86 processors, which we call Lakemont processors; you can see they're labeled LMT, and you may hear us refer to them as Lakemonts later. They're very simple processors, with literally 64 kilobytes of data RAM each, so these are really just simple embedded processors; you're not going to boot Linux on them (Linux runs elsewhere), it's bare-metal programming for tight interaction. There was a question about how they communicate: they interact with the cores just as the cores interact with each other, over the mesh; we'll get into that as we dive deeper. I think that's pretty much all to mention here.

Okay, so looking inside the core: architecturally, this is what a neuron core looks like, and all of these boxes are basically individual SRAMs. The take-home is that it's definitely not a von Neumann architecture inside the core either; it's not that we have a memory and then some logic block that just accesses all the state. What we have is an optimized pipeline that performs all the neural computation in a high-performance, pipelined manner, with a couple of independent asynchronous loops or processes that interlock and communicate as necessary but are otherwise quite decoupled. The green is illustrating the pathway of spike handling as spikes arrive into the core; the purple is showing the updating of neuron state, which responds to the spike input but in a decoupled way; as those neuron models generate spikes, they invoke the blue output process, which directs and addresses the spikes out to the NoC; and then there's the red loop, which is the plasticity, where we periodically monitor and inspect all of the activity that has accumulated in the core, in the neurons and synapses, and then apply changes to the weights and synaptic parameters. That's the general flow. I won't go much deeper into the design; instead I'll walk through some of the features to build up some background awareness. (There was a question about the labels: yes, they're just letters.)

We implement a discrete-time leaky integrate-and-fire neuron model, specifically CUBA (current-based), if you're familiar with that. There's a current response to every spike, typically an exponentially decaying postsynaptic current, which then gets integrated in a leaky manner into a voltage value. That's the basic model. It's possible to go simpler: you can drop the current response and treat each spike as an impulse, but we found plenty of algorithmic need for this somewhat more complicated leaky integrate-and-fire CUBA model. We didn't go as far as COBA, the conductance-based model, which introduces extra nonlinearities; we don't have Izhikevich neurons; we don't have extra synaptic variables. We found this to be the sweet spot: we tried to make it as simple as possible, but no simpler, and this is what we came up with. Maybe a case could be made that we need a bit more complexity, but in any case this seems to serve a broad enough need.
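As a rough sketch of what a discrete-time CUBA leaky integrate-and-fire update looks like (illustrative only; the time constants, reset behavior, and fixed-point arithmetic used on the chip are not specified here):

```python
import math

# Illustrative time constants (ms) and threshold; real networks tune these.
TAU_I, TAU_V, DT = 10.0, 20.0, 1.0
ALPHA_I = math.exp(-DT / TAU_I)   # synaptic-current decay per timestep
ALPHA_V = math.exp(-DT / TAU_V)   # membrane-voltage leak per timestep
V_TH = 1.0                        # firing threshold

def step(i_syn, v_mem, weighted_input):
    """One discrete timestep of a CUBA LIF neuron: a decaying synaptic current
    is integrated by a leaky membrane voltage, then compared to a threshold."""
    i_syn = ALPHA_I * i_syn + weighted_input   # first first-order filter
    v_mem = ALPHA_V * v_mem + i_syn            # second first-order filter
    spike = v_mem >= V_TH
    if spike:
        v_mem = 0.0                            # reset on spike
    return i_syn, v_mem, spike

# Drive the neuron with a constant weighted input and watch it fire periodically.
i, v = 0.0, 0.0
for t in range(50):
    i, v, fired = step(i, v, weighted_input=0.1)
    if fired:
        print("spike at t =", t)
```

This is exactly the "two first-order filters plus a nonlinear threshold" view that comes back later in the talk when the multi-compartment extension is described.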
We also found interesting ways to adapt this model, which I'll talk about in a minute. For example, we support multi-compartment dendritic trees. We did this not necessarily to support pyramidal cells and apical or distal dendrites and all that; mainly we did it because it ended up being really simple, surprisingly easy to integrate into the architecture, and I'll share a bit later how it gets implemented. It's interesting to think that there may be some parallels here to why real neurons have dendrites and do graded, continuous-valued computation through the dendrites and then spike at the soma: the dendritic tree is the extent over which you can do efficient analog computation, and perhaps that's what we found here too. There's a certain domain within the core where we can very efficiently do continuous, graded computation and aggregate it up in a tree structure. The one real practical reason we needed it while architecting the chip (we've found others since then) is the simplest possible join operation as you aggregate that tree: addition. If you want different compartments to have different exponential time responses per synapse, which you commonly do (you may have an inhibitory response with a longer current-integration time constant than an excitatory spike), then rather than making a more complicated neuron model that carries this explicitly on every neuron, we just treat it as two different compartments whose currents you add together. A more general and simpler architecture lets us address these less common but still important use cases where you want a more diverse, complex neuron. That's not what a neuroscientist would normally think of as compartments, but once we went that far it was very easy to add a couple of other possibilities for how you join and build the dendritic tree, and maybe people will find uses for them, as some teams already have.

There was a question about handshakes: the handshakes are everywhere, we just don't show them. In our asynchronous design paradigm, handshakes are on every little pipeline stage and every SRAM; they're automatic. In our design methodology we think in terms of tokens of data, atomic units that get passed around, and when two processes have to coordinate, say when the input spike process has to coordinate with the neuron update process, they handshake, stop, and synchronize; otherwise they're decoupled and operating as fast as they can. Yes, it's completely local, stage-to-stage handshaking. And yes, one way to think about it is as a digital signal processing system: you have to service the state updates dynamically according to this discrete-time neuron model.

We also have homeostasis, a simple mechanism to adapt the threshold over time in response to the activity of the neuron. That's a feature inspired by neuroscience, of course. You may have heard about the LSNN model (I think I briefly mentioned it on Wednesday), the simple adaptive neuron model that Wolfgang Maass's team at Graz has shown can implement LSTM-like functionality; it requires a sort of adaptive threshold mechanism. Unfortunately it's not exactly the one we implemented, but we found a way to work it into the architecture. So modifying the threshold over time is definitely something algorithm researchers are finding a use for, and we support it.
We also support random noise sources. As you've seen in a variety of contexts, noise in these neural systems is not always a bug; it's as much a feature as a limitation. So we insert pseudo-random noise; we still preserve determinism in the operation, which is really good for software and algorithm development, and random noise sources are really cheap, so it's not a difficult feature at all. You can add noise on the membrane potential, you can add it on the currents, and you can even add it in the refractory period. Something else we have is axon delay, so you can add a delay when a spike is generated, and the neuron then goes into a refractory period; you can add a random number to that refractory period too, which is an efficient place to introduce some stochasticity into the system.

And we have our output routing tables. As a neuron spikes, you have to determine where the spike goes, and that's a series of table lookups. You'll see here a general principle we apply throughout the design: as much as possible, we share the available resources rather than creating rigid per-neuron constraints. For example, we don't have a crossbar architecture at all, in any of our connectivity; I think every prior neuromorphic system uses a crossbar connectivity architecture, which is hugely limiting. Deep under the hood it may be a crossbar (an SRAM is basically a crossbar), but to support the broadest possible diversity of network topologies we found you have to get away from that rigid crossbar constraint. So we don't think in terms of crossbars at all; we think in terms of routing tables, which can be compressed and encoded in very clever ways to get the maximum efficiency out of this precious memory resource on the chip. In the output stage we have a pointer-to-list type of architecture: any neuron can in theory communicate with as many other cores in the system as it needs to, subject only to the constraint that all the lists of connections have to add up and fit within the fixed memory size available. The same applies on the input side: as spikes come in, we have a pooled synaptic memory, so again it's not a rigid crossbar that says you get some fixed number of connections and that's it; you can have more or fewer depending on the network's needs, and it all just has to add up to 128 kilobytes in a given core.

Now, I mentioned we have clever compression schemes; there are basically four different compression types in the chip, depending on the properties of the network. If it's a completely dense network, if you happen to have a crossbar-like connectivity structure, then we have a completely dense encoding with the minimal number of pointers interspersed in the memory: you just have a list of weights, as you would in a crossbar structure. On the other hand, if you have a highly sparse network, you don't want to have to list tons of zeros, which is what a crossbar architecture gives you; instead you have a pointer-and-value style of encoding, so you can efficiently represent just the specific connections that exist in your network.
We also have encodings in between, and then we have something that's quite novel: a hierarchical connectivity model. If you have redundancy, a repeated weight pattern in your network, you can store it once, separately, and then just refer to it for whole populations of neurons. The simplest form of this is the well-known convolutional network idea, where a kernel of weights gets applied to patches across an image; that's the simplest possible way to use a feature like this, but there are much more complicated kinds of redundancy you can extract and compress in Loihi. There was a question about randomly generating the connectivity: no, we don't, although we've thought about it and it could be useful in some contexts. The thing is, unless you also randomly generate the weights, you still have to store weight state, so there's a limit, and certainly with learning you need specific weights; so there's limited use for it, but it could be useful.

Most people think of synapses as being weights, almost synonymously, but it's really more complicated than that. In particular, we support synaptic delays, a variable associated with each synapse. Recognize that we're in a temporal computing architecture, where delay is computationally significant; you want to be able to add a delay per connection to get the best value out of that capability. We have algorithms that definitely use it; you heard something about polychronous networks, which need that feature, and there are a number of cases where it comes up. If you think about coincidence detection, you need to align all the spikes into the particular pattern the neuron is going to recognize, and for that you need synaptic delays, not axon delays. Previous designs have had axon delays, but we make a distinction here: there's something called a synaptic delay, and that's important.

We also have synaptic eligibility traces. This is an idea that's been around for a long time, I think since the 1960s in neuroscience: some state arises from the operation of the network, a fading memory of a provisional change that decays away over time, and based on later feedback, a reward or punishment, you then apply that eligibility state to implement the plasticity, the change in the weights. That's a variable associated with each synapse. So generally speaking, we treat our synapses as a 3-tuple: a weight, a delay, and a tag. The tag is what you might use for a synaptic eligibility trace, but really it's just a scratch variable, and in fact you can decouple the delay (not everybody wants synaptic delays), so what you really have is a delay and a tag as two dynamic variables available to the learning processes, so that you can create a complex dynamical system in each synapse. All of those are variable precision. Some networks don't need a lot of precision per weight; you can get by with a single bit that just says whether the connection exists or not. In other cases, if you're doing learning, gradient-based learning, you typically need some precision in those weights.
So again, to get maximum flexibility and the maximum use of our fixed resource, we allow the weights to be variable precision. If you only have one-bit weights, you can fill up that memory with single-bit connections and get on the order of a hundred and thirty thousand synapses; if you need signed nine-bit weights, you'll get about nine times fewer than that. There was a question about whether this involves multiplication: no, it's just addition (I'll talk about the operation in a minute), because the input spike carries no information; all you're doing is reading the weight and accumulating it into the synaptic activation that has arrived. Having just said that, we also have these things called graded reward spikes that do carry a graded value with them, specifically to support reinforcement-learning types of algorithms, where you want to convey some sense of how good or bad the state the whole system has reached is. You can communicate that with an 8-bit graded number embedded in the spike message, and it's then available to the learning processes; it roughly follows a dopamine-like model of neuromodulation.

Then we have a bunch of learning features, which I previewed on Wednesday in terms of what our learning architecture looks like. We have microcode-programmed learning rules; "microcode" is a big word for what is really just building equations, specifying the mathematical equation of the learning rule applied at each synapse. It deals with these things we call traces, filtered spike trains with different time constants and different impulse values, and generally it adds all of this up in a sum-of-products form. You can apply these rules to any of the variables, weight, delay, or tag; they're all the same as far as the learning engine is concerned, so you can couple and combine them however you like.

Okay, so that's a high-level walkthrough of the chip's feature set. For those who are more design-oriented and wondering what the words, not just the letters, were (maybe you can read them now), I'm not going to go into much more detail; this is just to give you a glimpse of what it looks like in the design. From an architecture perspective this isn't really relevant, but for your curiosity: we have to arbitrate among a number of different operations here, depending on whether spikes are arriving, you're servicing neurons, or you have management or learning operations, and for deterministic, simple operation we have a single point of arbitration, as we call it, which keeps things simple. We also have some internal parallelism in the design, which is maybe not something you would think about; you think mainly about the parallelism in the mesh itself, but to balance the bandwidths a neuromorphic design needs to support, a single unified pathway is not the most efficient solution, so we had to think about where we need additional parallelism within the core to optimize the performance of the overall system: we have it in the synaptic handling pathway as well as in the learning pathway.
The microcode itself is stored in a tiny little memory, and there's a somewhat elaborate scheme, which we won't really go into, for associating which learning rules apply to each synapse. Obviously, if you think about it, you're not going to specify an equation per synapse; that's way too much state, and you'd have more state for programming your learning rules than for the weights themselves. Instead we have an indirection mechanism: you can associate the learning rules with the input axon, with the postsynaptic neuron, or with the class of synapse that the synapse belongs to. Any of those sources can ultimately derive a pointer, or a profile as we call it, and that looks up in this memory the particular microcode equations to apply to those synaptic variables.

I think I already mentioned that all our management operations are sent in-band; we reuse the same multiplexed network-on-chip fabric for efficiency. We actually have two physical fabrics, for deadlock-avoidance reasons: if you know anything about system-on-chip design, you normally need virtual channels, and we just have two physical channels, which is the minimum needed to avoid deadlock. For spike communication there are actually no ordering constraints whatsoever, so although we need those two physical fabrics, we can balance our spike traffic across both of them, which ends up being a very efficient way to do things. Generally, in the spirit of overloading our SRAMs, we have a lot of different configuration modes, so if you were to see a register spec of the chip it would be really confusing: depending on various configuration settings, other registers get interpreted in different ways. So we generally don't give you register-level visibility into the chip; it's too confusing and complicated to think about, and we abstract it through the software layer instead.

Deep under the hood, although these are architectural SRAMs I'm showing here, you can see there are all these read-modify-write loops. That's a pretty fundamental characteristic of these spiking neural network systems: there are tight, dynamic modifications happening to all these parameters and all this state, which is very different from the conventional microarchitecture of traditional CPUs or other ASICs, and it's a bit of a challenge to manage. What we have is a kind of pseudo-multi-ported memory design pattern that we reuse all the way through, with pseudo-random bank striping to achieve good efficiency, so it's a little complicated. Plus there are a bunch of serial processes: if you think about exponential-decay filtering, that's not trivial to compute with digital logic, and there are various tricks that can make it efficient, but some of them involve serialization. So generally speaking, there's a huge amount of variability in how much time different functions may take, whether you're doing multi-porting in the memories or exponential decays or other processes that may be serialized from time to time.
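One common digital trick for the exponential-decay filtering mentioned above (purely illustrative, not necessarily the scheme used in Loihi) is to approximate the decay with an integer multiply and shift, so no floating-point hardware is needed:

```python
# First-order exponential decay in fixed-point arithmetic.
# decay is a 12-bit fraction: state *= (4096 - decay) / 4096 each step.
DECAY_BITS = 12

def decay_step(state: int, decay: int) -> int:
    """Approximates state *= exp(-dt/tau), with decay ≈ 4096 * dt / tau for
    small dt/tau. Integer-only, so it maps onto simple digital logic."""
    return state - ((state * decay) >> DECAY_BITS)

state, decay = 1000, 410            # decay/4096 ≈ 0.1 per timestep
for t in range(5):
    state = decay_step(state, decay)
    print(t, state)                 # roughly a 10% decay each step
```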
For all that reason, and maybe not the reason you'd expect, this is a really good match for asynchronous design: the handshaking embedded throughout the design can completely transparently accommodate this big diversity of frequencies and performance cycles. That's as much as I'll say about the design, since that's really not the point of this tutorial, but some of you may find it interesting.

Back to functionality. To make it crystal clear what these cores are doing when they service spikes and how they generate spikes (never mind learning for now), this is the operation of the core. We've said it's multiplexed, so here it's unrolled in time, vertically across the screen, as if it were implemented as a discrete circuit; it's a little clearer that way. Spikes arrive, effectively on one of these virtual channels identified by an axon ID; it's multiplexed, but unrolled it looks like a physical wire. That ID goes through a complicated routing function, which we can encapsulate as just a white box, and it expands into a set of weight-delay pairs. This is what happens to those weights, back to the earlier question: they get scheduled to be serviced at some future point in time based on that delay value. The current time is T, and there's a set of buckets of accumulated synaptic activity that has arrived in the past; the servicing process consumes the accumulated synaptic weight sum for the current time T, while all the other spikes coming in add weight to future time buckets, T+1 up to the maximum delay, in a circular FIFO, a circular buffer. The values for time T are stable and not being modified, which gives us determinism. Then there's the dendrite process, which is generalized from a single neuron (this is our multi-compartment dendrite structure, but you can think of these as basically neurons). It walks through all the active neurons and services them, updating their dynamic state: it reads the accumulated weights, reads the configuration, reads the current state values, applies the neuron evolution model (the CUBA model), and updates the state, and if the neuron is sufficiently activated, it generates a spike and sends it out to the NoC. Simple as that; that's basically inference in the chip.
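Here is a minimal sketch of the delay-bucket idea just described, a circular buffer of future synaptic accumulations indexed by delay (the buffer size and servicing order are assumptions for illustration):

```python
MAX_DELAY = 8                    # number of future time buckets (assumed)
buckets = [0.0] * MAX_DELAY      # circular buffer of accumulated input
t = 0                            # current timestep

def deliver_spike(weight, delay):
    """Schedule a synaptic weight to take effect `delay` steps in the future."""
    buckets[(t + delay) % MAX_DELAY] += weight

def service_current_step():
    """Consume the accumulation for the current timestep and clear the slot
    so it can be reused MAX_DELAY steps later."""
    global t
    total = buckets[t % MAX_DELAY]
    buckets[t % MAX_DELAY] = 0.0
    t += 1
    return total

deliver_spike(0.5, delay=1)
deliver_spike(0.2, delay=3)
print(service_current_step())    # 0.0  (nothing was scheduled for t = 0)
print(service_current_step())    # 0.5  (the delay-1 spike lands at t = 1)
```

The buckets for the current timestep are read-only while being serviced, which is what preserves the deterministic behavior mentioned above.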
There was a question about whether fault tolerance is incorporated into this. Not explicitly. We do it in a very conventional way in some sense: our SRAMs have ECC protection, which is not very "neural" (it's plain redundancy), but to have chips come back from the fab and operate the way we want them to, ECC is what you need. As for pursuing the potential feature of resiliency to other faults, problems that may arise in manufacturing, soft errors and that sort of thing, that's more of an unexplored regime. In theory, yes, you can provide some resiliency to that whole class of defects and errors, but that would be explored more from an algorithmic, programming perspective, and we haven't really prioritized that direction much. Clearly, though, the architecture offers that long-term advantage, in the sense that we can simply choose not to use a core: just about every chip we've manufactured is useful in some way, because of course we get SRAM defects, and at worst we say we're not going to use that core and push the software to not map neurons into it. So there are a variety of ways that the nice homogeneous nature of the architecture can provide good properties. (Another question: yes, that's embedded in the configuration; I'll briefly talk about the dendrite configuration, but let me move through this, since we should probably pick up the pace.)

I covered this on Wednesday, but hopefully the basic idea is clear to everybody: this notion of local learning and how it maps into the entire architecture. Learning is embedded in every core; we don't have a separate sideband processor that would become a sequential bottleneck for updating all the neural parameters. It's embedded, and it only has access to the local state, in addition to the reward spikes I mentioned, which convey something more global, though not completely global: the reward spikes have a certain distribution, and we have four of these axons per core, so you have access to four different neuromodulatory channels, you could say, in these learning rules.

Here's the slide I showed earlier about the trace-based programmable model. To briefly recap: you have spikes arriving on the input, and spikes being generated at the output neuron; those are the red and the blue traces. You can configure the filtering to be in a very short-time-constant correlation regime, which gives you what we think of as STDP, or in a very long, time-averaged regime, where the trace values correspond to long-term firing rates. Then you build out your equations in this manner; you'll see a bit more detail later, with some concrete examples of how you program this into the chip. Basically, all of the variables we talked about, the weight, the delay, the tag, the input traces, the output (red and blue) traces, and the reward traces, are accessible in the equations you build out, and beyond that you have constants you can add or scale by. Yes, when I was talking about microcode earlier, that's exactly what it is: just building out these equations.

Okay, so here are some actual examples. If you're coming from a computational neuroscience perspective these may make complete sense; if you're not, they may seem a little mysterious, but basically this is pairwise STDP expressed in equation form. Our convention is that the subscripts indicate the time scale of the filtering.
A bigger subscript means a longer time scale of filtering, and a zero subscript means no filtering at all: just the direct impulse, which is 1 when there is a spike and 0 otherwise. So in this STDP rule, x0 means that when a presynaptic spike arrives, x0 is 1 and it samples the postsynaptic trace y1, which is a filtered version of the postsynaptic spikes. If the y1 trace is positive, non-zero, that means the spike arrived shortly after the postsynaptic neuron spiked; that's the depressive, anti-causal term: that spike could not have contributed to the neuron firing, therefore it's useless, so we get rid of it and decrease the weight (the minus term). On the other hand, you have the symmetric case of the postsynaptic neuron spiking and sampling the input trace, and that's the potentiating term, because it means the input spike arrived shortly before the neuron spiked. That's simple pairwise STDP.

From there you can quickly go to more complicated rules. You may have heard of triplet STDP, which adds a longer-term averaging of the firing rate of the postsynaptic neuron, y2; otherwise it's exactly the same as pairwise STDP. In this case it also has a heterosynaptic decay term to keep weights from running away and saturating, which tends to be a characteristic of STDP-based networks: there's a penalty term based on a yet longer-term rate measure, so the more active a particular neuron is, the more all the weights associated with it tend to decay, in a slow exponential way, which keeps the weights from saturating. That's just an example of a more complex learning rule that people study and model in computational neuroscience, find useful, and that corresponds to actual processes in real neurons.

You can also do other things. You can apply rules to delays; here's a delay-based learning rule that will tend to enforce coincidence detection in some cases. And you can have multivariate, more complicated synaptic systems, such as the tag rule. Here's the distal-reward, reinforcement-learning case: T is the tag, and it's basically capturing an eligibility trace. This is the STDP pairwise term, provisionally capturing the STDP change it might apply to the weight; it decays away over time (there's just an exponential decay term on the tag), and that variable has no impact on inference or on general operation until you get the reward input, the r1 trace. If that goes positive, whatever state is left in the eligibility trace gets applied to the weight. That's one example of how you could use this for a reinforcement-learning type of rule; it comes from an Izhikevich paper, I think from 2006 or so, which quantified this whole framework. And then, just to illustrate an extreme of complexity: we haven't actually used this last one for any algorithmic purpose yet, it's more speculative, based on ideas people are exploring, but it uses the tag as more of an inertial term on the weight. It may be helpful for, say, catastrophic forgetting to have a longer-term, slower-moving weight value, while in the short term you can buffet the weight around based on short-term statistics. So that one is just speculative, just to show the range.
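Here is a small sketch of what this trace-based formulation computes, using the same naming convention (x for presynaptic, y for postsynaptic, subscript 0 for the unfiltered impulse, subscript 1 for the filtered trace). The decay constant and learning rates are illustrative assumptions, not Loihi microcode:

```python
import math

TAU_TRACE, DT = 20.0, 1.0
DECAY = math.exp(-DT / TAU_TRACE)
A_PLUS, A_MINUS = 0.01, 0.012          # illustrative learning rates

def run(pre_spikes, post_spikes, steps, w=0.5):
    """Pairwise trace-based STDP:  dw = A+ * x1*y0  -  A- * x0*y1."""
    x1 = y1 = 0.0                      # filtered pre- and postsynaptic traces
    for t in range(steps):
        x0 = 1.0 if t in pre_spikes else 0.0   # unfiltered impulses
        y0 = 1.0 if t in post_spikes else 0.0
        x1 = DECAY * x1 + x0           # first-order filtered spike trains
        y1 = DECAY * y1 + y0
        w += A_PLUS * x1 * y0 - A_MINUS * x0 * y1
    return w

# Pre fires just before post -> potentiation; post just before pre -> depression.
print(run(pre_spikes={10}, post_spikes={15}, steps=30))  # ends above 0.5
print(run(pre_spikes={15}, post_spikes={10}, steps=30))  # ends below 0.5
```

For the distal-reward case described above, the same products would accumulate into the tag variable instead of the weight, the tag would decay each step, and a positive reward trace would later transfer whatever remains of the tag into the weight.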
Okay, then a little bit about this hierarchical connectivity feature I mentioned, so that it hopefully becomes a little clearer. This generalizes the idea of convolution from convolutional neural networks in deep learning. You can think of it in an image sense: you have a patch walking through the image in some overlapped way, and you apply the same set of weights to every single patch, so you definitely don't want to store all those redundant weights in a flat way. Of course, nature does: V1 is all these repetitive receptive fields, because that's what nature has to do. We, again, are in a different design regime; we have the option of using multiplexing and designing in a little more cleverness, and that's where this feature comes in: we store the weights associated with that patch just once. Here's an illustration in a one-dimensional sense of what it looks like: you have features associated with each patch being processed (how active is that particular kernel), and you have connectivity arcs from different positions of the patch, the different orientations of the segments or components of the patch; maybe you have one input from one side of the patch and one from the other side, and that would be simple convolution. Now, for, say, that LCA example I showed, we have more than just linear feed-forward weights: we have lateral inhibition, and that complicates things beyond the simple convolutional structure, because now we have lateral connections from these feature neurons to other feature neurons; in the one-dimensional sense, you'll have ones to your left and ones to your right. But in general we haven't done anything specific to convolution in Loihi; what we really have is a template-based connectivity mechanism. Thought of in a very generic way, you have a set of population types of neurons, and you define connectivity between those population types using all of the compression mechanisms available to you: you define in some way how the neurons in this population connect to the neurons in that population, you store that once in the synaptic memory, and then, as spikes arrive dynamically during operation, that connectivity is applied to the specific population instances that the spikes correspond to. It's a way to compress out redundancy of whatever kind in your network. A simple case beyond convolution would be, say, a winner-take-all network, where you have a stereotypical lateral connection across all the winner-take-all neurons and you don't want to represent it for every winner-take-all group. So this is a pretty powerful technique if you have that kind of redundancy in the network.
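A toy sketch of the template idea (a hypothetical data layout; the actual Loihi encoding isn't shown in the talk): a weight template is stored once and shared by many population instances, and an incoming spike is expanded through the template for whichever instance it lands on.

```python
# One shared weight template (e.g., a 1-D convolution kernel or a stereotyped
# winner-take-all lateral pattern), stored exactly once.
templates = {
    "kernel_A": [(-1, 0.3), (0, 0.9), (+1, 0.3)],   # (relative target, weight)
}

# Many population instances refer to the same template by name instead of
# storing their own copy of the weights.
instances = [
    {"template": "kernel_A", "base_neuron": 0},
    {"template": "kernel_A", "base_neuron": 10},
    {"template": "kernel_A", "base_neuron": 20},
]

def expand_spike(instance_idx, source_offset):
    """Resolve a spike arriving at one population instance into concrete
    (target neuron, weight) pairs using the shared template."""
    inst = instances[instance_idx]
    base = inst["base_neuron"] + source_offset
    return [(base + rel, w) for rel, w in templates[inst["template"]]]

print(expand_spike(1, source_offset=3))
# [(12, 0.3), (13, 0.9), (14, 0.3)]  -- same pattern, different instance
```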
And then on to multi-compartment neurons; this may be a little elaborate to go through in detail at this point, but basically, if you recognize that what we're doing in the core is walking sequentially through the active neuron states, we can decide that these aren't necessarily neurons: what if we just call them compartments, and we want to create a tree of these compartments? All we need to do is preserve a tiny bit of state associated with the dynamical state variables of each compartment, propagate it forward by some amount, and aggregate, join, that state together as we proceed sequentially through the compartments. The state that needs to be stored temporarily just corresponds to the cross-sectional width of that tree, so it's not that much. If you want an incredibly complicated neuron, a very broad neuron as opposed to a deep one, it gets expensive, but if you just want some simple aggregation, and certainly if you just want to add serially in a chain to create a bunch of different types of synaptic responses, it's incredibly cheap and easy, and you can build some interesting tree structures this way.

We then slightly modify the CUBA neuron model to support joining this dynamic state at the join points of the dendritic tree. Here's the CUBA neuron model expressed in a signal-processing manner; if you're a computational neuroscientist this may look strange to you, and if you're an electrical engineer it looks like "wow, I didn't know a neuron model was so simple": it's just two first-order filters with a nonlinear threshold function. But that's what a CUBA spiking neuron is; it's just a filter. So what we do is augment it slightly to say, okay, you have dynamic state coming from another place in this tree, from the other branch, and we join it in with a couple of possible operations: we can either join on the current state variable, or we can join the spiking condition with some boolean operations, so that one neuron can, say, gate the spiking of another based on where it is relative to its threshold. That's our multi-compartment model. You can see we just had to make a small modification, and we potentially get a big expansion of functionality; it's up to you to think about how to use it, as they say. Here are some of the operations we support: ADD is the one that comes up a lot for sure, and it's the reason we did the whole thing, and then we somewhat speculatively put in some of these other features. The SLAM example you briefly saw from Konstantinos Michmizos's group has actually found some creative ways to use some of those other features in their SLAM model, so it's nice to see that they've received some use. I won't go through all the detail, but this is basically how it gets implemented: we use a stack to keep track of the state, and simple stack operations provide support for building out these trees of dendrites, so you can get quite complicated neuron structures.
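Here is a sketch of how a stack can evaluate such a compartment tree in one sequential pass (the instruction set and the ADD join below are purely illustrative guesses; only the general idea of a stack-based join is described in the talk):

```python
# Each entry is (op, input_current). "PUSH" finishes a branch and parks its
# state for a later join; "ADD" joins the running branch with the most
# recently pushed one. The sequence encodes: soma = dendrite_a + dendrite_b.
program = [
    ("PUSH", 0.2),   # dendrite_a: compute its current, park it on the stack
    ("NONE", 0.5),   # dendrite_b: start a fresh running value
    ("ADD",  0.1),   # soma: add its own input, then join with the pushed branch
]

def evaluate(program):
    stack, running = [], 0.0
    for op, current in program:
        if op == "PUSH":
            stack.append(current)        # finish this branch, save its state
            running = 0.0
        elif op == "NONE":
            running = current            # start an independent branch
        elif op == "ADD":
            running = current + running + stack.pop()  # join at a tree node
    return running

print(evaluate(program))   # 0.8 -> value integrated at the soma compartment
```

The amount of stack needed grows only with the width of the tree being joined, which matches the point above that narrow or chain-like dendritic structures are nearly free.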
Okay, so that's multi-compartment. We also have homeostasis, which I won't go through in detail, but basically the homeostasis feature corresponds to watermarks of activity. Again we have a filtered trace of the neuron's spiking, which gives us a measure of long-term activity. If the activity is too low, below the low-water mark (the a_min value), it will drop the threshold, and if it's above the high-water mark it will raise the threshold, so it tries to target a certain range of activity you want the neuron to operate in. That comes up; there are other ways to do homeostasis too. This is just a plot showing the behavior over time, but it's pretty intuitive.

Then, other synaptic features. These are somewhat minor features, but it's good to keep in mind that they exist. There's something we call box synapses. The plot shows the postsynaptic current response everyone thinks about by default: some delay after a spike arrives you have a weight activation, an impulse, and that decays away over time as it gets integrated onto the current and then the voltage of the neuron. In some cases, in some theoretical frameworks and a variety of surprising places, you actually want something a little simpler, what we call a box synapse, where the weight is still the height of the response, but the delay now corresponds to the duration of a fixed application of that weight. For example, this comes up in the constraint-satisfaction stochastic-neuron framework that Wolfgang Maass and his team have explored; it turns out the math is nice and tractable if you have this box type of synaptic response, corresponding to a box refractory-period response. In that case the whole system is mathematically cleaner and more well-defined, but it has come up in a variety of other cases too, so it's just good to be aware that the feature is available.

We also have something we call weight scaling. If you think about it, we have these variable-width weights, and if you want to really compress the storage of your weights you may end up with a mixture of single-bit weights and 8-bit weights, and in order for all of these to ultimately live on the same weight scale you need to scale them separately, so we have that feature. Say you have a learning process that treats plastic synapses as 8-bit values, and after some period of time they saturate at either zero or the maximum, and then you want to remove the plasticity. You can compress that away and replace it with a single-bit weight, but then you need to apply a scale factor so that that class of weights is scaled back up to 255 and can still coexist in the network. That's an example of why we have this particular feature. There's also a nonlinearity in this function, a weight limit that can be applied. You may be familiar with the notion of permanence of weight values, where you may actually want your learning variable to exceed some maximum strength, not to drive a stronger response but just to record a more strongly connected, more permanent synapse. That's what the weight limit provides: the learning variable can exceed the maximum value, which simply lets the connection become more strongly entrenched as far as the learning process is concerned.
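As a rough sketch of the watermark-based homeostasis rule mentioned above, assuming hypothetical parameter names (a_min, a_max, the trace decay, and the threshold step size are placeholders of mine, not Loihi register values):

```python
def homeostasis_step(threshold, activity_trace, spiked,
                     trace_decay=0.01, a_min=0.05, a_max=0.20, vth_step=0.01):
    """One step of watermark-based homeostasis (illustrative parameters).

    activity_trace is a low-pass filtered measure of the neuron's long-term
    spiking activity; the threshold is nudged so the trace stays between
    the low watermark (a_min) and the high watermark (a_max).
    """
    # update the long-term activity trace: first-order filter of the spike train
    activity_trace = (1.0 - trace_decay) * activity_trace + trace_decay * (1.0 if spiked else 0.0)
    if activity_trace < a_min:        # too quiet: make the neuron easier to fire
        threshold -= vth_step
    elif activity_trace > a_max:      # too active: make the neuron harder to fire
        threshold += vth_step
    return threshold, activity_trace
```

In practice you would also clamp the threshold to a sensible range and pick the watermarks around the firing rates your application needs.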
Okay, then just a reminder about the barrier synchronization process. I went through it on Wednesday and it's pretty intuitive, so I won't go over it again, but the key is to recognize that we have this synchronization emerging asynchronously in the system on every time step. You may see this as a potential performance weak point of the whole system, because it all has to synchronize, and we're not using a clock; we don't distribute this with a wire, we distribute it through in-band messaging in our network-on-chip, so the question is how high-performance this is. I won't go through all of it, I assume you remember, but I can share the results we have from our 32-chip Nahuku system. I blew through this on Wednesday, so I'll describe it in a little more detail here so you have a quantitative feel for what this architecture provides in terms of performance.

This is the graph-search example of propagating spike wavefronts through the system to find the shortest path. We've benchmarked the barrier synchronization process on the 32-chip system with this 3D lattice graph structure. It's a very simple network structure, with binary connections, but when you map it onto the planar mesh of a Nahuku system the traffic patterns get messy and complex; it's not a simple wavefront, it's all kinds of activity that has to propagate through the network. I wouldn't say it's a torture test, but it's definitely not a trivial example.

If you look at this 50 by 50 by 50 network, that's 125,000 neurons, and we can map that onto one chip. When we map it onto one chip and solve it, we find this breakdown in performance: the barrier synchronization, the blue bar, takes about a microsecond on every time step; then we have our sequential functions, which are the neuron updates, updating all the dynamical state; and we also have learning occurring, which is how it actually remembers what the shortest path is, by changing the weight connections as the wavefront propagates. That is also a sequential function happening in every core, so it's a big contributor to the time to perform the computation. The green here is the extra time required, some of it concurrent, to distribute and handle all the spike traffic in the network. That's what you get mapping this problem onto a single chip.

Now what we can do is distribute the same number of cores across all 32 chips, so we have the same amount of sequential computation, because we haven't really increased the parallelism, we've just spread the cores out across the 32 chips. You can see that the time to execute each time step slows down, because the spike traffic has longer distances to travel, and, importantly, the barrier synchronization process now has to span all 32 chips, so this shows quantitatively how it slows down at that point. The barrier sync now takes more time; in fact we've since optimized it (barrier sync is not a simple process and there are various tricks to it), and it's actually down to about four microseconds now, but in any case you can see it increases. Otherwise the sequential elements don't change at all, and the spike traffic increases a little, but not really perceptibly.
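To give a feel for what that graph-search benchmark is computing, here is a tiny software analogue: a spike-like wavefront expanding one hop per time step over a 3D lattice until it reaches the target, which is just breadth-first search. The lattice size, node encoding, and function names are my own; on the chip the wavefront is carried by spikes and the path is recorded through synaptic plasticity rather than a visited dictionary.

```python
from collections import deque

def lattice_neighbors(node, n):
    """6-connected neighbors of a node in an n x n x n lattice."""
    x, y, z = node
    for dx, dy, dz in ((1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)):
        nx, ny, nz = x + dx, y + dy, z + dz
        if 0 <= nx < n and 0 <= ny < n and 0 <= nz < n:
            yield (nx, ny, nz)

def wavefront_shortest_path(n, source, target):
    """Propagate a 'spike wavefront' one hop per time step (plain BFS)."""
    visited = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        if node == target:
            return visited[node]          # number of time steps = path length
        for nbr in lattice_neighbors(node, n):
            if nbr not in visited:        # a node only "spikes" the first time
                visited[nbr] = visited[node] + 1
                frontier.append(nbr)
    return None

# A 50 x 50 x 50 lattice has 125,000 nodes, as in the benchmark above.
print(wavefront_shortest_path(50, (0, 0, 0), (49, 49, 49)))   # -> 147
```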
Now what you can do is start spreading the network out and using the extra parallelism that's available across all of these 32 chips. You can increase the number of cores from 128 to 256, then to 1024, and more and more, to exploit the parallelism. Now you have fewer neurons per core, so the sequential processing in every core is less, and you can see the sequential components of the computation, the learning and the neuron updates, squeezing down and taking less time; you're getting a benefit from that extra parallelization. But at some point the spike traffic actually takes more time, because the chip-to-chip links are slower than the on-chip links, so by spreading out and using more cores you're generating more spike traffic, and at some point that starts increasing. So there's a sweet spot, and it corresponds to about nine chips, I think, in this case. As I say, this is somewhat old data, so today this would all drop down; we can run this network on 32 chips at under ten microseconds per time step, so comfortably a hundred times faster than real time, notionally real time anyway. Okay, so that's it on the architecture, and I think we're going to take a quick break now, but I suspect we're a bit over time.

[Audience question about the statistics of the noise source.] For the most part, the biggest use we've found for it is to add it additively to the membrane potential; you can also add it to the refractory delay or to the input current you've accumulated from synaptic input, but adding it to the membrane potential is the most useful. Okay, so maybe we'll shorten that to a quick five-minute bio break, and if you have any specific questions on the architecture I can answer them now; otherwise we'll move on to the next part.