Scholarly Works (13 results)

Thesis
Peer Reviewed

Approximate and Bit-width Configurable Arithmetic Logic Unit Design for Deep Learning Accelerator

Chen, Xiaoliang
Advisor(s): Kurdahi, Fadi J

UC Irvine Electronic Theses and Dissertations (2020)

As key building blocks for digital signal processing, image processing, and deep learning, adders, multi-operand adders, and multiply-accumulate (MAC) units have drawn considerable attention recently. Two popular ways to improve arithmetic logic unit (ALU) performance and energy efficiency are approximate computing and precision-scalable design. Approximate computing achieves better performance or energy efficiency by trading away some accuracy. Precision-scalable design allocates just enough hardware resources to meet the application's requirements.

In this thesis, we first present a correlation-aware predictor (CAP) based approximate adder, which uses spatial-temporal correlation information of input streams to predict carry-in signals for sub-block adders. CAP uses fewer prediction bits to reduce the overall adder delay. For highly correlated input streams, we found that CAP can reduce adder delay by ~23.33% and save ~15.9% area at the same error rate compared to prior works.
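
The carry-prediction idea can be illustrated with a minimal Python sketch. It is not the thesis's CAP circuit: the class name `SpeculativeBlockAdder`, the block sizes, and the last-carry predictor (each sub-block reuses the carry it should have received on the previous addition) are assumptions chosen only to show how temporal correlation in an input stream can stand in for a rippled carry. The exact carry is computed in software here purely to train the predictor.

```python
class SpeculativeBlockAdder:
    """Approximate adder built from independent sub-blocks (illustrative only)."""

    def __init__(self, width=16, block=4):
        if width % block:
            raise ValueError("width must be a multiple of block")
        self.block = block
        self.n_blocks = width // block
        # Hypothetical predictor state: predicted carry-in for each sub-block.
        self.predicted = [0] * self.n_blocks

    def add(self, a, b):
        mask = (1 << self.block) - 1
        result = 0
        exact_carry = 0
        next_predicted = [0] * self.n_blocks
        for i in range(self.n_blocks):
            shift = i * self.block
            xa = (a >> shift) & mask
            xb = (b >> shift) & mask
            # Speculative path: use the predicted carry, never wait for a ripple.
            cin = 0 if i == 0 else self.predicted[i]
            result |= ((xa + xb + cin) & mask) << shift
            # Reference path: the exact carry trains the predictor for the next call.
            exact_carry = (xa + xb + exact_carry) >> self.block
            if i + 1 < self.n_blocks:
                next_predicted[i + 1] = exact_carry
        self.predicted = next_predicted
        # Sum modulo 2**width; wrong whenever a predicted carry was wrong.
        return result
```

Feeding the same operand pair twice shows the effect of correlation: the first call starts with all predictions at zero and can miss carries, while the second call reuses the carries learned from the first and returns the exact sum.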

Inspired by the success of approximate multipliers using approximate compressors, we propose a pipelined, approximate-compressor-based speculative multi-operand adder (AC-MOA). All compressors are replaced with approximate ones to reduce the overall delay of the bit-array reduction tree. An efficient error detection and correction block is designed to compensate for the errors with one extra cycle. Experimental results show that the proposed 8-bit, 8-operand AC-MOA achieves a 1.47X to 1.66X speedup over a conventional baseline design.

Recent research on deep learning algorithms has shown that bit-width can be reduced without losing accuracy. Because the bit-width requirement varies across deep learning applications, bit-width configurable designs can be used to improve hardware efficiency. In this thesis, a bit-width configurable MAC (BC-MAC) is proposed. BC-MAC uses a spatial-temporal approach to support variable precision requirements for both activations and weights. The basic processing element (PE) of BC-MAC is a multi-operand adder. Multiple multi-operand adders can be combined to support input operands of any precision. Bit-serial summation is used to accumulate partial addition results to perform MAC operations. Booth encoding is employed to further boost the throughput. Synthesis results on TSMC 16nm technology and simulation results show that the proposed MAC achieves higher area efficiency and energy efficiency than state-of-the-art designs, making it a promising ALU for deep learning accelerators.
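
The bit-serial summation described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the BC-MAC design itself: the function name `bit_serial_mac` is hypothetical, operands are unsigned, and Booth encoding is not modeled. Each iteration handles one weight bit-plane, uses a multi-operand addition to sum the activations whose weight has that bit set, and accumulates the shifted partial result.

```python
def bit_serial_mac(activations, weights, weight_bits):
    """Dot product computed one weight bit-plane at a time (unsigned operands)."""
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    if any(w < 0 or w >= (1 << weight_bits) for w in weights):
        raise ValueError("each weight must fit in weight_bits unsigned bits")

    acc = 0
    for bit in range(weight_bits):
        # Multi-operand addition: sum the activations selected by this bit-plane.
        partial = sum(a for a, w in zip(activations, weights) if (w >> bit) & 1)
        # Bit-serial accumulation: weight the partial sum by 2**bit.
        acc += partial << bit
    return acc
```

Because the loop runs once per weight bit, a lower-precision weight format needs fewer iterations, which is the property a bit-width configurable design exploits.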

Thesis
Peer Reviewed

Specification and Runtime Verification of Distributed Multiprocessor Systems: Languages, Tools and Architectures

Nassar, Ahmed
Advisor(s): Kurdahi, Fadi J.

UC Irvine Electronic Theses and Dissertations (2016)

Post-deployment runtime verification (RV) has recently emerged as a complementary technology to extend the coverage of conventional software verification and testing methods. This thesis attempts to tackle three major barriers that must be surmounted before RV technologies come into widespread use:

Barrier-1: Lack of an expressive, yet efficiently monitorable, specification language. Distributed software behavior is projected onto an observation interface consisting of data-carrying (or parameterized) events, such as Linux system calls including argument values, and self-replicating deterministic finite automata (SR-DFAs) are introduced for RV purposes as well as anomaly-based intrusion detection in embedded and general-purpose software systems based on these parametric traces.

Barrier-2: The substantial performance and power overhead of pure software RV frameworks. NUVA (nonuniform verification architecture) is a distributed automata-based RV architecture for software specifications in the form of SR-DFAs. NUVA has been implemented over a cache-coherent nonuniform-memory-access (ccNUMA) multiprocessor and can be deployed on the FPGA fabric that will reside on all next-generation processor chips. The core of NUVA is a coherent distributed automata transactional memory (ATM) that efficiently maintains the states of a dynamic population of automata checkers organized into a rooted dynamic directed acyclic graph (DAG) concurrently shared among all processor nodes.

Barrier-3: Formal specifications are hard to formulate and maintain for evolving, complex embedded and general-purpose software systems. Specification mining has therefore long been envisioned to play a key role in software verification, modification, and documentation. However, to scale beyond simple library/API-level properties with short temporal spans, specification mining tools need to support more expressive specification languages that can capture complex, application-level properties. This thesis introduces a bio-inspired, complete specification mining methodology for SR-DFAs using an iterative and interactive mining tool called ParaMiner. ParaMiner relies on novel mining algorithms that invoke multiple-sequence alignment (MSA) techniques to learn specifications from temporal slices of software behavior while overcoming the initial-state uncertainty problem.

SR-DFAs and ParaMiner have been leveraged in a new specification-based intrusion detection (ID) framework that protects distributed, reactive computing systems against cyberattacks having very sparse signatures, arbitrarily long time spans and wide attack fronts. Such attacks lie outside the scope of conventional anomaly-based ID methods which typically work with short event windows and ignore manipulated data objects, such as files and sockets. We demonstrate the effectiveness of the constructed SR-DFAs at classifying as well as resolving subtle behaviors typical of cyberattacks with varying evasion parameter values.

Article
Peer Reviewed

Layout-driven RTL binding techniques for high-level synthesis

ICS Technical Reports (1996)

The importance of effective and efficient accounting of layout effects is well established in High-Level Synthesis (HLS), since it allows more realistic exploration of the design space and the generation of solutions with predictable metrics. This feature is highly desirable in order to avoid unnecessary iterations through the design process. In this paper, we address the problem of layout-driven register-transfer-level (RTL) binding, as this step has a direct bearing on the final performance of the design. By producing not only an RTL design but also an approximate physical topology of the chip-level implementation, we ensure that the solution will perform at the predicted metric once implemented, thus avoiding unnecessary delays in the design process.

Article
Peer Reviewed

ChipEst-FPGA : a tool for chip level area and timing estimation of lookup table based FPGAs for high level applications

ICS Technical Reports (1995)

The importance of efficient area and timing estimation techniques for hierarchical design methodology is well established in High-Level Synthesis (HLS), since estimation allows more realistic exploration of the design space, and hierarchical design methodology matches well with the HLS paradigm. In this paper, we present ChipEst-FPGA, a chip-level estimator for designs implemented using a hierarchical design methodology for lookup-table-based FPGAs. In FPGAs, wire delay may contribute up to 60% of the overall design delay. ChipEst-FPGA uses a realistic model that takes component area and delay as well as wiring effects into account. We tested ChipEst-FPGA on several benchmarks, and the results show that it produces accurate area and timing estimates efficiently.

Thesis
Peer Reviewed

Efficient Acceleration of Computation Using Associative In-memory Processing

Yantir, Hasan Erdem
Advisor(s): Kurdahi, Fadi J

UC Irvine Electronic Theses and Dissertations (2018)

The complexity of computational problems is rising faster than the capabilities of computational platforms. This forces researchers to find alternative paradigms and methods for efficient computing. One promising paradigm is accelerating compute-intensive kernels using in-memory computing accelerators, since memory is the major bottleneck that limits the parallelism and performance of a system and dominates the energy consumed in computation. Leveraging the memory-intensive nature of big data applications, an in-memory computation system can replace logic with memory structures, virtually eliminating the need for memory load/store operations during computation. The massive parallelism enabled by such a paradigm results in highly scalable structures.

This thesis is set against that background. The objective is to conduct broad research on in-memory computing. For this purpose, associative computing architectures (i.e., Associative Processors, or APs) are built with both traditional (SRAM) and emerging (ReRAM) memory technologies, together with their corresponding software frameworks. For ReRAM-based APs, the reliability concerns that come with emerging memories are resolved. Architectural innovations are developed to increase energy efficiency. Furthermore, an approximate computing approach is introduced for APs to perform efficient, low-power approximate in-memory computing for tasks that can tolerate some accuracy loss. The work also proposes a novel two-dimensional in-memory computing architecture to address the deficiencies of traditional one-dimensional AP architectures.

Thesis
Peer Reviewed

Resource Aggregation for Collaborative Projected Video from Multiple Mobile Devices

UC Irvine Electronic Theses and Dissertations (2016)

We explore and develop an embedded real-time system and associated algorithms that enable an aggregation of resource-limited, low-quality, projection-enabled mobile devices to collaboratively produce a higher-quality video stream for a superior viewing experience. Such resource aggregation across multiple projector-enabled devices can lead to per-unit resource savings while moving the cost to the aggregate.

The pico-projectors that are embedded in mobile devices such as cell phones have a much lower resolution and brightness than standard projectors. Tiling (putting the projection area of multiple projectors in a rectangular array overlapping them slightly around the boundary) and superimposing (putting the projection area of multiple projectors right on top of each other) multiples of such projectors, registered via automated registration through the cameras residing within those mobile devices, result in different ways of aggregating resources across these multiple devices. Evaluation of our proof-of-concept system shows significant improvement for each mobile device in two primary factors of bandwidth usage and power consumption when using a collaborative federation of projection-embedded mobile devices.

Portable, low-power, lightweight, small pico-projectors are key components of future projection-enabled mobile devices. Because of their reduced weight and dimensions and their portable nature, calibrated integrated systems built from projector-enabled mobile devices are prone to physical destabilization of the projected image during a presentation. Automatic re-calibration and projected-video stabilization during the presentation therefore become essential requirements for a good user experience. The design, algorithms, and implementation methods for these features are presented in the second part of the dissertation.

Thesis
Peer Reviewed

Low Power Reliable Design using Pulsed Latch Circuits

UC Irvine Electronic Theses and Dissertations (2017)

System-on-Chip (SoC) designs have faced many challenges over the past decade. With today's applications centered around the Internet-of-Everything (IoE), these challenges are expected to become more critical. They include reducing power consumption for better energy efficiency, overcoming different sources of variation to ensure reliable operation, and reducing design area to lower cost and increase integration. As a result, chip designers must build reliable systems that integrate complex functionality on a minimal die size and within a limited power budget. Among the circuit components in every chip, memory components are of great concern. They consume the majority of the chip area and power, and they affect the performance and reliability of the entire chip. These components include large memory arrays, caches, register files, and the sequential elements in the logic paths.

Sequential elements play a critical role in modern synchronous CMOS circuits. They can represent up to 50% of the standard cells used in a chip, and the power consumption of the clock tree, including these elements, can be more than half of the total chip power. They are also second only to memory in how strongly they are affected by different sources of variation. Hence, efficient implementation of these elements is of great importance for the design of energy-efficient and reliable integrated circuits. Pulsed latches have been proposed as an efficient replacement for flip-flops in the implementation of sequential elements. They can achieve higher performance than traditional flip-flops and can be designed to be smaller in area and more power efficient. However, the operation of a pulsed latch is more sensitive to process, voltage, and temperature (PVT) variations.

In this thesis, we propose a methodology to study the reliability of pulsed latches and use it to evaluate the effect of PVT variations on their behavior. In addition, novel approaches that enhance the reliability of pulsed latches without significant degradation in performance, area, or power are presented. Since sequential elements can be used to build small register files, pulsed latch implementations of register files are discussed and compared to traditional implementations, including SRAM and flip-flops. Finally, since multiport register files benefit quite a few applications, novel implementations of multiport register files are also presented. The proposed implementation is shown to greatly reduce the significant overhead in area, power, and latency associated with the traditional way of designing multiport register files.

Creative Commons 'BY' version 4.0 license

Article
Peer Reviewed

State Dependent Statistical Timing Model for Voltage Scaled Circuits

UC Irvine Previously Published Works (2014)

This paper presents a novel statistical state-dependent timing model for voltage over-scaled (VoS) logic circuits that accurately and rapidly finds the timing distribution of output bits. Using this model, erroneous VoS circuits can be represented as error-free circuits combined with an error injector. A case study of a two-point DFT unit employing the proposed model is presented and compared to HSPICE circuit simulation. Results show an accurate match, with significant speedup gains. © 2014 IEEE.
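
The "error-free circuit plus error injector" view can be sketched in Python as follows. This is only an illustration of the abstraction, not the paper's model: the function names and the per-bit flip probabilities are hypothetical, and the probabilities are taken as fixed inputs rather than derived from a state-dependent timing distribution.

```python
import random


def inject_errors(exact_value, bit_error_probs, rng):
    """Flip each output bit independently with its own (assumed) probability."""
    out = exact_value
    for bit, prob in enumerate(bit_error_probs):
        if rng.random() < prob:
            out ^= 1 << bit
    return out


def vos_adder(a, b, bit_error_probs, rng):
    """Model an over-scaled adder as an exact adder followed by an error injector."""
    width = len(bit_error_probs)
    exact = (a + b) & ((1 << width) - 1)  # error-free circuit
    return inject_errors(exact, bit_error_probs, rng)
```

Passing a seeded `random.Random` instance makes a simulation run repeatable, and setting every probability to zero recovers the error-free circuit.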

Article
Peer Reviewed

The effects of variations in component styles and shapes on high-level synthesis

ICS Technical Reports (1992)

High-level synthesis (HLS) has long relied on point models for RT-components that assume fixed area and delay values for a given component style. However, aspect ratio variations alone can result in substantially different area-delay characteristics for a component. In this work, we explore the combined effect of style and aspect ratio variations on the area and delay of individual RT-components, as well as on complete RT-level designs produced by HLS. We describe the results of extensive experiments which indicate that point models are inadequate for use in the synthesis process. We believe that our results have deep implications for the formulation of HLS algorithms that attempt to realistically incorporate physical design information early in the design process.

Article
Peer Reviewed

Joint Power Management and Adaptive Modulation and Coding for Wireless Communications Systems with Unreliable Buffering Memories

UC Irvine Previously Published Works (2014)

To guard against process variability in advanced semiconductor nodes, especially for high-density memories, designers resort to overdesign policies that increase power consumption. A promising approach to saving power is Voltage over-Scaling (VoS). However, VoS results in unreliable buffering memories, in which a statistically predictable number of errors is introduced. The goal is to trade off channel-dependent SNR slack against hardware-induced errors to achieve predetermined quality metrics at reduced power consumption. By design, modern communication systems attempt to minimize channel-dependent SNR slack via adaptive modulation and coding (AMC) schemes, thus reducing the gains of on-chip power management. This paper investigates the interaction between on-chip power management via VoS on embedded memories and network-based AMC techniques. A novel mathematical approach that analytically describes the system packet error rate (PER) performance under VoS-induced noise is presented. Based on this model, different AMC and power management algorithms are presented that use the received SNR estimates to find the AMC mode and memory voltage that achieve the performance goals at reduced power consumption. Simulation results show that the proposed algorithms can achieve up to 58% energy efficiency for the memory subsystems compared to a conventional AMC algorithm with perfect memories.
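
The joint selection of an AMC mode and a memory voltage can be pictured as a constrained search, sketched below in Python. This is a generic illustration, not one of the paper's algorithms: `OperatingPoint`, `select_operating_point`, and the caller-supplied `per_model` function are hypothetical names, and the analytical PER model from the paper is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class OperatingPoint:
    amc_mode: str          # label for a modulation and coding mode
    memory_voltage: float  # supply voltage of the buffering memory
    power: float           # relative power cost, arbitrary units


def select_operating_point(
    snr_db: float,
    candidates: Iterable[OperatingPoint],
    per_model: Callable[[float, OperatingPoint], float],
    per_target: float,
) -> Optional[OperatingPoint]:
    """Return the lowest-power candidate whose predicted PER meets the target."""
    feasible = [p for p in candidates if per_model(snr_db, p) <= per_target]
    # None signals that no candidate satisfies the PER target at this SNR.
    return min(feasible, key=lambda p: p.power, default=None)
```

The search is exhaustive over the candidate list, which is reasonable when the set of modes and voltage levels is small; the quality of the result depends entirely on the PER model supplied by the caller.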
