The increasing number of dedicated accelerators in modern Systems on Chip (SoCs) has led to large regions of dark silicon. Although highly efficient, these accelerators (ASICs) are inflexible. CPUs are flexible in time and FPGAs are flexible in space, but both architectures suffer from limited efficiency (in terms of GOPS/mm^2 and GOPS/mW). The challenge, then, is to come up with architectures that implement the "right" amount of flexibility (in both time and space) while simultaneously delivering near-ASIC performance. In addition, these architectures should have a short design time, be easy to program with simple compilers, and support efficient in-field deployment of new algorithms.
As a solution, this thesis presents a UDSP architecture that addresses each of these criteria using hardware/software co-design. It demonstrates how to build a UDSP array of individual heterogeneous cores, well connected by a high-speed routing network and accompanied by a software mapper that eases the overall programming process. As an example implementation, this thesis presents an 81-core chip implemented in a TSMC 16nm FF process, with support for FIR/IIR filters, matrix-vector multiplication, small hardware neural networks, FFTs, and 1 GHz single-cycle MACs.