Neuromorphic computing systems use non-volatile memory (NVM) to implement
high-density and low-energy synaptic storage. Elevated voltages and currents
needed to operate NVMs cause aging of CMOS-based transistors in each neuron and
synapse circuit in the hardware, causing the transistors' parameters to drift from
their nominal values. Aggressive device scaling increases power density and
temperature, which accelerates aging and threatens the reliable operation of
neuromorphic systems. Existing reliability-oriented techniques periodically
de-stress all neuron and synapse circuits in the hardware at fixed intervals,
assuming worst-case operating conditions, without actually tracking their aging
at run time. To de-stress these circuits, normal operation must be interrupted,
which introduces latency in spike generation and propagation, impacting the
inter-spike interval and, hence, performance (e.g., accuracy). We propose a new
architectural technique to mitigate the aging-related reliability problems in
neuromorphic systems by designing an intelligent run-time manager (NCRTM),
which dynamically de-stresses neuron and synapse circuits in response to the
short-term aging in their CMOS transistors during the execution of machine
learning workloads, with the objective of meeting a reliability target. NCRTM
de-stresses these circuits only when it is absolutely necessary to do so,
and otherwise reduces the performance impact by scheduling de-stress operations
off the critical path. We evaluate NCRTM with state-of-the-art machine learning
workloads on neuromorphic hardware. Our results demonstrate that NCRTM
significantly improves the reliability of neuromorphic hardware, with marginal
impact on performance.
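
The on-demand policy described above can be illustrated with a minimal sketch.
This is not the NCRTM implementation. It assumes a hypothetical normalized aging
estimate per circuit, two hypothetical thresholds (soft_limit and hard_limit)
standing in for the reliability target, and a fixed recovery amount per
de-stress operation. None of these names or values come from the source.

```python
from dataclasses import dataclass


@dataclass
class Circuit:
    """One neuron or synapse circuit with a tracked aging estimate."""
    aging: float = 0.0  # hypothetical normalized parameter drift


def run(stress_trace, busy_trace, soft_limit, hard_limit, recovery=1.0):
    """Simulate on-demand de-stressing over a trace of time steps.

    stress_trace[t]: aging added at step t (assumed to be observable).
    busy_trace[t]:   True if the circuit is on the critical path at step t.
    Returns (interruptions, off_path_destresses, peak_aging).
    """
    circuit = Circuit()
    interruptions = 0  # de-stress operations that interrupted spiking
    off_path = 0       # de-stress operations scheduled while idle
    peak = 0.0
    for stress, busy in zip(stress_trace, busy_trace):
        circuit.aging += stress
        peak = max(peak, circuit.aging)
        if circuit.aging >= hard_limit:
            # Reliability target at risk: de-stress now, even if busy.
            if busy:
                interruptions += 1
            else:
                off_path += 1
            circuit.aging = max(0.0, circuit.aging - recovery)
        elif circuit.aging >= soft_limit and not busy:
            # Aging is elevated and the circuit is idle: de-stress early,
            # off the critical path, to avoid a later interruption.
            off_path += 1
            circuit.aging = max(0.0, circuit.aging - recovery)
    return interruptions, off_path, peak


if __name__ == "__main__":
    busy = [True, True, False, True, True, True, True]
    print(run([0.25] * 7, busy, soft_limit=0.5, hard_limit=1.0))
```

In this toy trace the idle slot at the third step absorbs one de-stress
operation, so only one interruption occurs on the critical path, whereas a
circuit that is never idle is interrupted each time the hard limit is reached.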