# **UC San Diego** # **UC San Diego Electronic Theses and Dissertations** ## **Title** Power network analysis and optimization ## **Permalink** https://escholarship.org/uc/item/19w0627c ## **Author** Zhang, Wanping # **Publication Date** 2009 Peer reviewed|Thesis/dissertation # UNIVERSITY OF CALIFORNIA, SAN DIEGO ## Power Network Analysis and Optimization A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science by Wanping Zhang # Committee in charge: Professor Chung-Kuan Cheng, Chair Professor Scott B. Baden Professor Yuan Taur Professor Michael B. Taylor Professor Deli Wang 2009 Copyright Wanping Zhang, 2009 All rights reserved. | The Dissertation of Wanping Zhang is approved, and it is acceptable in quality and form for publication on microfilm and electronically: | | | |------------------------------------------------------------------------------------------------------------------------------------------|--|--| | | | | | | | | | | | | | | | | | | | | | | | | | Chair | | | University of California, San Diego 2009 # **DEDICATION** To my parents. # TABLE OF CONTENTS | Si | gnature Page | iii | |----|-------------------------------------------------|------| | De | edication | iv | | Li | st of Tables | viii | | Li | st of Figures | ix | | Αc | cknowledgments | xi | | Vi | ta | xiv | | Αł | ostract of the Dissertation | xvii | | 1. | Introduction | 1 | | | 1.1 Power Network Background | 3 | | | 1.1.1 Power Network Model | 3 | | | 1.1.2 Circuit Simulation | 4 | | | 1.1.3 Power Network Noise | 5 | | | 1.2 Power Network Analysis | 6 | | | 1.2.1 Efficient Transient Analysis | 6 | | | 1.2.2 Worst-Case Voltage Variation Analysis | 7 | | | 1.3 Power Network Optimization | 9 | | | 1.3.1 Power-Up Sequence | 9 | | | 1.3.2 Optimization with Decoupling Capacitor | 11 | | | 1.4 Dissertation Organization | 12 | | 2. | Efficient Circuit Simulation | 14 | | | 2.1 Analysis Flow | 14 | | | 2.1.1 Basic Idea | 14 | | | 2.1.2 Laplace Transform of Input Current Source | 15 | | | 2.1.3 Obtain the Frequency-Domain Response | 17 | | | 2.1.4 Convert the Frequency-Domain Response to Time-Domain Waveform | 19 | |----|---------------------------------------------------------------------------|----| | | 2.1.5 Computational Complexity and Parallelism | 20 | | | 2.2 Experimental Results | 21 | | | 2.2.1 Accuracy of the Proposed Simulation Method | 22 | | | 2.2.2 Efficiency of the Proposed Simulation Method | 25 | | | 2.3 Summary | 27 | | 3. | Power Network Worst Case Noise Analysis | 29 | | | 3.1 Problem Statement | 29 | | | 3.2 Worst-Case Voltage Drop | 32 | | | 3.2.1 The Voltage Response for an Arbitrary Clock Gating Pattern | 33 | | | 3.2 2 Find the Maximum Voltage Variation Considering Multi-Domain Clock | | | | Gating | 35 | | | 3.2.3 Experimental Results | 38 | | | 3.3 Worst-Case Violation Area | 41 | | | 3.3.1 Analysis Flow Based on Superposition | 42 | | | 3.3.2 Models to Predict the Worst-Case Violation Area Considering Leakage | | | | Current | 45 | | | 3.3.3 ILP Based Algorithm | 47 | | | 3.3.3.1 General Model | 47 | | | 3.3.3.2 Simplified Model | 49 | | | 3.3.4 Experimental Results | 50 | | | 3.4 Summary | 54 | | 4. | Power Network Optimization | 55 | | | 4.1 Noise Minimization during Power-Up Stage | 55 | | | 4.1.1 Problem Statement | 56 | | | 4.1.2 The Simulated Annealing Based Method | 60 | | | 4.1.2.1 Domain Ordering | 60 | | | 4.1.2.2 Greedy Initial Solution | 63 | | | 4.1.2.3 Simulated Annealing Based Algorithm | 63 | | | 4.1.3 Experimental Results | 65 | |----|------------------------------------------------------------------------|-----| | | 4.1.3.1 A Case with Eight Power Domains | 65 | | | 4.1.3.2 More Results with Different Test Cases | 68 | | | 4.2 Noise Minimization with Decap and Controlled-ESR | 68 | | | 4.2.1 Power Network Model With Controlled-ESR | 70 | | | 4.2.2 Problem Statement | 72 | | | 4.2.2.1 Power Network Noise Considering Voltage Overshoot | 72 | | | 4.2.2.2 Formulation of Optimization Problem | 73 | | | 4.2.3 Revised Sensitivity Computation | 75 | | | 4.2.3.1 Circuit Sensitivity Analysis with State Equation | 75 | | | 4.2.3.2 Sensitivity Computation with the Merged Adjoint Network Method | .77 | | | 4.2.2.3 Revised Sensitivity Computation Considering Voltage Overshoot | 78 | | | 4.2.4 SQP Based Optimization | 80 | | | 4.2.5 Experimental Results | 81 | | | 4.2.5.1 Effect of Considering Voltage Overshoot | 82 | | | 4.2.5.2 Effect of Optimizing with Controlled-ESR | 82 | | | 4.2.5.3 Relationship between Decap Budget and Noise | 84 | | | 4.3 Summary | 86 | | 5. | Conclusion | 87 | | | 5.1 Summary of Contributions | 87 | | | 5.2 Future Work | 88 | | 6. | Bibliography | 90 | # LIST OF TABLES | Number Page | |------------------------------------------------------------------------------------| | Table 1-1: ITRS 2008 technology parameters | | Table 1-2: Comparison of IR drop and $\Delta I$ noise6 | | Table 2-1: Comparison of HSPICE, MSPICE and the proposed frequency-domain | | based method | | Table 2-2: CPU time of parallel computation27 | | Table 3-1: The worst cases of voltage variation considering only one clock domain, | | and all clock domains41 | | Table 3-2: Computational results with 1% leakage53 | | Table 3-3: Violation area comparison with different leakage percentage for a four- | | domain network53 | | Table 4-1: Comparison between the enumerating method and SA based method for | | small cases68 | | Table 4-2: Comparison between the enumerating method and SA based method for | | large cases68 | | Table 4-3: Test cases description and the noise underestimation due to neglecting | | voltage overshoot85 | | Table 4-4: Comparison among three methods for the minimization of power network | | noise85 | # LIST OF FIGURES | Number | |-----------------------------------------------------------------------------------------| | Figure 1-1: Power network model | | Figure 1-2: Voltage violation area | | Figure 2-1: The proposed method for time-domain simulation | | Figure 2-2: A PWL waveform can be decomposed into several ramp segments17 | | Figure 2-3: A frequency-domain response and its fitting result with the vector fitting | | technique. (a) with linear vertical axis, (b) with log vertical axis24 | | Figure 2-4: Comparison of HSPICE and the proposed simulation method25 | | Figure 2-5: Relative errors and CPU times vs. the number of frequency samples25 | | Figure 3-1: The model of power supply network with multi-domain clock gating30 | | Figure 3-2: An example of voltage response $y_0(t)$ | | Figure 3-3: The one-cycle portions of $y_0(t)$ are arranged for superposition | | Figure 3-4: Parameters in the algorithm finding the worst case of voltage variation37 | | Figure 3-5: The proposed algorithm for finding the worst case of positive voltage | | variation | | Figure 3-6: The voltage responses of the first case, corresponding to one-cycle current | | sources in a single clock domain40 | | Figure 3-7: The voltage responses of the large-scale industrial case, corresponding to | | one-cycle current sources in a single clock domain41 | | Figure 3-8: Analysis flow in one domain considering leakage current | | Figure 3-9: Analysis with multiple domains | | Figure 3-10: Parameters description for ILP formulations | | Figure 3-11: Voltage responses with each domain working respectively | | Figure 3-12: Worst case voltage violation area | | Figure 4-1: Illustration of power-up sequence for multiple domains58 | | Figure 4-2: Parameter description for power-up sequencing problem. | 58 | |-------------------------------------------------------------------------------------|----| | Figure 4-3: The power-up sequencing problem. | 59 | | Figure 4-4: A constraint graph modeling the inter-domain relationships | 62 | | Figure 4-5: Parameter description for the ordering algorithm. | 62 | | Figure 4-6: The domain ordering algorithm. | 63 | | Figure 4-7: A greedy algorithm to find the initial solution. | 63 | | Figure 4-8: Simulated annealing based algorithm. | 65 | | Figure 4-9: Voltage waveforms with only one domain working. | 67 | | Figure 4-10: Relationship between the power-up deadline and the minimum violation | n | | area | 68 | | Figure 4-11: Power network model with controlled-ESR. | 71 | | Figure 4-12: A simple case to show the effect of controlled-ESR. | 71 | | Figure 4-13: Effect of controlled-ESR on reducing the noise. | 72 | | Figure 4-14: Voltage violation area with both voltage drop and overshoot considered | 73 | | Figure 4-15: Problem formulation of noise minimization of power network via | a | | allocation of decaps and controlled-ESRs | 75 | | Figure 4-16: The algorithm description for revised sensitivity computation | 80 | | Figure 4-17: SQP based optimization method. | 81 | | Figure 4-18: Voltage waveforms obtained with different optimization methods | 84 | | Figure 4-19: Relationship between decap budget and the noise. | 85 | #### **ACKNOWLEDGMENTS** First of all, I would like to thank my advisor Professor Chung-Kuan Cheng for his invaluable advice, thoughtful criticism, encouraging words. It was his step-by-step guide to lead me how to identify interesting topics and formulate research problems. He is the driven force behind my accomplishment. It was my great honor to work for and learn from him. I would also like to thank my dissertation committee members, Professor Scott Baden, Professor Yuan Taur, Professor Michael Taylor, and Professor Deli Wang for their insightful review of my dissertation. It was my greatest pleasure to be a member of UCSD VLSI CAD group. I lived a happy life in such a friendly, enjoyable environment and I am glad to work with so many talented colleagues. I would like to thank Yi Zhu and Renshen Wang for their sparkling ideas on algorithm. I appreciate the discussions with Ling Zhang, Rui Shi, He Peng, Xiang Hu, Amirali Shayan and Yulei Zhang on the circuit theory. Especially I am grateful to Professor Wenjian Yu at Tsinghua University for his help on academic writing, and constructive suggestions in many of my research projects. Without the help from these colleagues, I can not overcome all the difficulties and arrive at this point that I am proud of. Finally, I am deeply indebted to my parents for their love and encouragement during the long years of my education. I would like to give my special thanks to my girlfriend Kejun Xu for her five-year continuous caring, love, and intellectual support ever since my PhD application. It is her who encourages me to work hard, play hard, and enjoy my PhD life. Chapter 2, in part, is a reprint of the paper "Efficient Power Network Analysis Considering Multi-Domain Clock Gating", co-authored with Wenjian Yu, Xiang Hu, Ling Zhang, Rui Shi, He Peng, Zhi Zhu, Lew Chua-Eoan, Rajeev Murgai, Toshiyuki Shibuya, Noriyuki Ito, and Chung-Kuan Cheng, in *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)*. The dissertation author was the primary investigator and author of this paper. Chapter 3, in part, is a reprint of the paper "Efficient Power Network Analysis Considering Multi-Domain Clock Gating", co-authored with Wenjian Yu, Xiang Hu, Ling Zhang, Rui Shi, He Peng, Zhi Zhu, Lew Chua-Eoan, Rajeev Murgai, Toshiyuki Shibuya, Noriyuki Ito, and Chung-Kuan Cheng, in *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)*. It is also a reprint of the paper "Predicting the Worst-Case Voltage Violation in a 3D Power Network ", co-authored with Wenjian Yu, Xiang Hu, Amirali Shayan, A. Ege Engin, Chung-Kuan Cheng, in *IEEE/ACM International Workshop on System Level Interconnect Prediction (SLIP 2009)*. The dissertation author was the primary investigator and author of these papers. Chapter 4, in part, is a reprint of the paper "Noise Minimization During Power-Up State for a Multi-Domain Power Network", co-authored with Yi Zhu, Wenjian Yu, Amirali Shayan, Renshen Wang, Zhi Zhu, Chung-Kuan Cheng, in *Proc. IEEE / ACM Asia and South Pacific Design Automation Conference (ASP-DAC 2009)*, pp. 391-396. It is also a reprint of the paper "On-Chip Power Network Optimization with Decoupling Capacitors and Controlled-ESRs", co-authored with Ling Zhang, Amirali Shayan, Wenjian Yu, Xiang Hu, Zhi Zhu, A. Ege Engin, and Chung-Kuan Cheng, under submission. The dissertation author was the primary investigator and author of these papers. #### **VITA** | 2005 | B.E. in Computer Science<br>University of Science and Technology of China | |------|---------------------------------------------------------------------------| | 2007 | M.S. in Computer Science<br>University of California, San Diego | | 2009 | Ph.D. in Computer Science<br>University of California, San Diego | #### **PUBLICATIONS** Wanping Zhang, Wenjian Yu, Xiang Hu, Ling Zhang, Rui Shi, He Peng, Zhi Zhu, Lew Chua-Eoan, Rajeev Murgai, Toshiyuki Shibuya, Noriyuki Ito, and Chung-Kuan Cheng, "Efficient Power Network Analysis ConsideringMulti-Domain Clock Gating", *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)*, accepted. **Wanping Zhang**, Ling Zhang, Amirali Shayan, Wenjian Yu, Xiang Hu, Zhi Zhu, A. Ege Engin, and Chung-Kuan Cheng, "On-Chip Power Network Optimization with Decoupling Capacitors and Controlled-ESRs", under submission. **Wanping Zhang,** Wenjian Yu, Xiang Hu, Amirali Shayan, A. Ege Engin, Chung-Kuan Cheng, "Predicting the Worst-Case Voltage Violation in a 3D Power Network", *IEEE/ACM International Workshop on System Level Interconnect Prediction (SLIP 2009)*, accepted. **Wanping Zhang,** Yi Zhu, Wenjian Yu, Amirali Shayan, Renshen Wang, Zhi Zhu, Chung-Kuan Cheng, "Noise Minimization During Power-Up State for a Multi-Domain Power Network", in *Proc. IEEE / ACM Asia and South Pacific Design Automation Conference (ASP-DAC 2009)*, pp. 391-396. Amirali Shayan, Xiang Hu, He Peng, **Wanping Zhang**, Chung-Kuan Cheng, "Parallel Flow to Analyze the Impact of the Voltage Regulator Model in Nanscale Power Distribution Network", in *Proc. International Symposium on Quality Electronic Design* (*ISQED 2009*), pp. 576-581. Shan Zeng, Wenjian Yu, **Wanping Zhang**, Jian Wang, Xianlong Hong, Chung-Kuan Cheng, "Efficient Power Network Analysis with Complete Inductive Modeling", in *Proc. International Symposium on Quality Electronic Design (ISQED 2009)*, pp. 770-775. **Wanping Zhang**, Yi Zhu, Wenjian Yu, Ling Zhang, Rui Shi, He Peng, Zhi Zhu, Lew Chua-Eoan, Rajeev Murgai, Toshiyuki Shibuya, Nuriyoki Ito, Chung-Kuan Cheng, "Finding the Worst Voltage Violation in Multi-Domain Clock Gated Power Network", in *Proc. IEEE / ACM Design, Automation & Test in Europe (DATE 2008)*, pp. 537-540. Amirali Shayan, Xiang Hu, He Peng, Mikhail Popovich, **Wanping Zhang**, Chng-Kuan Cheng, Lew Chua-Eoan, Xiaoming Chen, "3D Power Distribution Network Co-design for Nanascale Stacked Silicon Ics", in *Proc. IEEE Conference on Electrical Performance of Electronic Packaging (EPEP2008)*, pp. 11-14. Ling Zhang, Wenjian Yu, Haikun Zhu, **Wanping Zhang**, Chung-Kuan Cheng, "Clock Skew Analysis via Vector-Fitting in Frequency Domain", in *Proc. International Symposium on Quality Electronic Design (ISQED 2008)*, pp. 476-479. Yi Zhu, Amirali Shayan, **Wanping Zhang**, Tong Lee Chen, Tzuyy-Ping Jung, Juan-Ren Duann, Scott Makeig, Chung-Kuan Cheng, "Analyzing High-Density Human Heart Signals using ICA", *IEEE Transactions on Biomedical Engineering*, Vol. 55, No. 11, Nov. 2008, pp. 2528-2536. **Wanping Zhang**, Ling Zhang, Rui Shi, He Peng, Zhi Zhu, Lew Chua-Eoan, Rajeev Murgai, Toshiyuki Shibuya, Nuriyoki Ito, Chung-Kuan Cheng, "Fast Power Network Analysis with Multiple Clock Domains", in *Proc. IEEE International Conference on Computer Design (ICCD 2007)*, pp. 456-463. **Wanping Zhang**, Chung-kuan Cheng, "Incremental Power Impedance Optimization Using Vector Fitting Modeling", in *Proc. IEEE International Symposium on Circuits and Systems (ISCAS 2007)*, pp. 2439-2442. Amirali, Yi Zhu, **Wanping Zhang**, Tzyy-Ping Jung, Jeng-Ren Duann, Scott Makeig, Chung-Kuan Cheng, "Spatial Density Reduction in the Study of the ECG Signal Using ICA", in *Proc. International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2007)*, pp. 5497 – 5500. Yi Zhu, Tong Lee Chen, **Wanping Zhang**, Tzyy-Ping Jung,Jeng-Ren Duann, Chung-Kuan Cheng, "Noninvasive Study of the Human Heart using Independent Component Analysis", in *Proc. IEEE International Conference on BioInformatics & BioEngineering (BIBE 2006)*, pp. 340-347. # FIELDS OF STUDY Major Field: Computer Science Studies in VLSI CAD Professor Chung-Kuan Cheng ### ABSTRACT OF THE DISSERTATION Power Network Analysis and Optimization by ### Wanping Zhang Doctor of Philosophy in Computer Science University of California, San Diego, 2009 Professor Chung-Kuan Cheng, Chair Power networks supply power from the P/G pads on a chip to the circuit modules. With the rapid increase of working frequency and continuous scaling of VLSI technology, the power supply network is experiencing unprecedented noise, which causes significant delay variation of devices, or even logic failure. Therefore, robust and reliable power supply network has increasing importance for high-speed circuit performance. In this dissertation, we study the methodologies and algorithms to perform the power networks analysis and optimization. We design an efficient circuit simulation flow based on frequency domain computation, which serves as a helpful tool for analysis and optimization. Then, we explore approaches to make the worst case noise analysis considering clock gating with multiple domains. The worst case voltage drop and violation area are studied in this analysis work. After power network analysis, the optimization is to confine the voltage fluctuation to meet with a target noise tolerance. The power-up sequencing problem and the noise minimization with decoupling capacitors (decap) and controlled-ESRs are studied. In the circuit simulation work, a frequency domain based simulation method is proposed to obtain the time domain voltage response. With the vector fitting technique, the frequency-domain responses are approximated by a partial fraction expression, which can be easily converted to time-domain waveform. Numerical results show that the proposed simulation method is up to several hundred times faster than commercial fast simulators, like HSPICE and MSPICE. And, the proposed method is able to analyze large-scale power networks that the commercial tools are not able to afford. The worst case voltage drop and violation area analysis are both studied in a multi-domain clock gated power network. We describe a linear time complexity algorithm to find the worst case voltage drop and the corresponding clock gating pattern. An efficient integer linear programming (ILP) based approach is proposed to find the worst voltage violation area. Leakage current is taken into consideration to accurately estimate the violation noise. The optimization work covers two pars. Firstly, an efficient heuristic algorithm is introduced to arrange the power-up sequence in a multi-domain power network to minimize the noise. Secondly, we propose a sequential quadratic programming (SQP) based algorithm to optimize power network with both decap and controlled-ESR. A revised sensitivity computation is derived to consider both voltage drop and overshoot. Experimental results shows the controlled-ESR reduces the noise by 25% with the same decap budget. # 1. Introduction With aggressive technology scaling, power ground network has become one of the major concerns in VLSI design. The trend of increasing power and clock frequency while reducing power supply voltage causes the power supply network to experience larger noise. Therefore, efficient power network analysis and optimization is of more importance. Table 1-1: ITRS 2008 technology parameters | Year | Gate Length (nm) | Frequency<br>(GHz) | Vdd<br>(V) | Size (mm <sup>2</sup> ) | Power Density (W/mm <sup>2</sup> ) | |------|------------------|--------------------|------------|-------------------------|------------------------------------| | 2008 | 29 | 5.06 | 0.8 | 140 | 0.86 | | 2009 | 27 | 5.45 | 0.7 | 140 | 0.90 | | 2010 | 24 | 5.88 | 0.6 | 140 | 0.96 | | 2011 | 22 | 6.33 | 0.6 | 140 | 1.13 | | 2012 | 20 | 6.82 | 0.6 | 140 | 1.11 | | 2013 | 18 | 7.34 | 0.5 | 140 | 1.10 | | 2014 | 17 | 7.91 | 0.5 | 140 | 1.17 | The 2008 International Technology Roadmap for Semiconductors (ITRS) [50] predicts the feature size would be 17nm, the clock frequency to be 7.91GHz, and power density to be 1.17 W/mm2 in 2014 as shown in Table 1-1. The increasing clock frequency will result in more sudden current demands. As supply voltage Vdd is reduced, there will be less noise margin. And the increasing power density leads to more noise with technology scaling down. Therefore, power network needs to be analyzed and optimized to confine the voltage fluctuation. The power network background including models, simulation, and noise problem is discussed in section 1.1. Accurate and fast circuit simulation method plays a fundamental role in power network analysis. We review the previous works on efficient simulation algorithm in section 1.2.1. They mainly focus on two aspects to speed up the simulation: one is the reduction of circuit, which makes the problem simpler, and the other is the improvement of matrix solvers. The noise issue is crucial in VLSI design. Because if the circuit fails, it makes no sense to talk about how small the chip is, how fast it runs and how little the power consumption is [1]. In modern deep-submicron technologies, the power supply voltage variation will not only introduce additional signal delay, but also may cause false switching of logic gates [2]. In section 1.2.2, we review the research works on the worst case noise analysis. The rush current in the power-up stage may result in excessive noise in chip initialization. Therefore, the power up sequence needs to be carefully scheduled in a multi-domain power network to reduce noise. In circuit working stage, adding decoupling capacitors (decaps) between power and ground is a traditional way to reduce the power ground network impedance and therefore eliminate the supply noise. We introduce the previous power network optimization works on both power-up sequence and decap allocation in section 1.3. In this dissertation, we propose a frequency domain based simulation flow which is accurate and efficient. We also introduce the approaches to analyze the worst case noise for multi-domain power networks. Our optimization works include the power-up sequence arrangement and the noise minimization with both decaps and controlled-ESRs. The dissertation organization is presented in section 1.4. # 1.1 Power Network Background ## 1.1.1 Power Network Model The power network is usually modeled as a circuit including resistance, capacitance and packaging inductance, like that shown in Figure 1-1. Time-varying current sources are connected to some circuit nodes, characterizing the behavior of active circuit instances. These current sources draw current from the power network and cause voltage fluctuations [3]. The waveform of current source is usually described as a piecewise linear (PWL) function. Figure 1-1: Power network model ## 1.1.2 Circuit Simulation Suppose there are n nodes and b branches in a circuit. We can use equation (1-1) [4] to model the whole circuit. $$\begin{pmatrix} C & \\ & L \end{pmatrix} \begin{pmatrix} v \\ i \end{pmatrix} = \begin{pmatrix} -G & -E \\ E^T & -R \end{pmatrix} \begin{pmatrix} v \\ i \end{pmatrix} + BU$$ (1-1) where C is the capacitance matrix, L is the inductance matrix. V is the voltage in each node and i is the current through every branch with inductance. G and R are conductance and resistance matrix, respectively. U is the input vector such as current source. We can get every node voltage and branch current by solving this equation. The same circuit can also be modeled in frequency domain. We apply Laplace Transform on equation (1-1) and get: $$s \begin{pmatrix} C \\ L \end{pmatrix} \begin{pmatrix} V(s) \\ I(s) \end{pmatrix} = \begin{pmatrix} -G & -E \\ E^T & -R \end{pmatrix} \begin{pmatrix} V(s) \\ I(s) \end{pmatrix} + BU(s)$$ (1-2) Frequency domain voltage and current responses can be solved from equation (1-2). ### 1.1.3 Power Network Noise The ideal VDD should be a straight line with constant value. However, when the circuits are switching, currents flow through the power supply lines will cause the power supply voltage to fluctuate. When we consider the package and board Power Ground noise, it is given by $IR + L \times \frac{\Delta I}{\Delta t}$ , where I is the branch current and can be computed from equation (1-1) There are two main sources of power supply noise: IR drop and $\Delta I$ noise [1] as shown in Table 1-1. Traditionally, these two kinds of noise are considered separately. However, with increasing circuit switching frequency, the on-chip interconnect impedance may have a significant inductive component jwL that is comparable to the resistive component R and thus can not be ignored. Table 1-2: Comparison of IR drop and $\Delta I$ noise | IR drop | $\Delta I$ noise | |--------------------------------|-----------------------------------| | $\Delta V = IR$ | $\Delta V = L\Delta I / \Delta t$ | | Because of the wire resistance | Because of wire inductance | | Often occurs on the chip | Mostly occurs on the package | These two kinds of noises will not reach the worst case at the same time. The reason is that the maximum $\Delta I$ noise occurs during switching when the current change $\Delta I$ reaches maximum and the maximum IR drop occurs when the current I is at its peak. # 1.2 Power Network Analysis ## 1.2.1 Efficient Transient Analysis Accurate power network verification and analysis require full-chip simulation of large-scale circuits. The basic SPICE simulator solves the system state equation (1-1) at every time step. However, the power network in modern integrated circuits such as microprocessors can easily include millions of nodes, which makes huge burdens on computation and memory storage. Many previous works focused on the efficient time-domain transient analysis of large-scale power networks. They generally pursue in two directions: one is the circuit size reduction and the other is more efficient numerical matrix solvers. The circuit size can be reduced by using methods such as circuit partitioning [5], multigrid-like technique [6], and hierarchical model reduction [7]. Therefore, the size of matrix in state equation is reduced, and the computation is more efficient. However, circuit reduction sacrifices the accuracy. In others, the simulation is accelerated by fast linear equation solvers. They include the direct solver "KLU" [8], iterative solvers like the preconditioned conjugate gradient (PCG) [9] method and generalized minimal residual (GMRES) [10] method. The direct solver is fast for small circuit, but due to the cost of LU factorizations, it is not applicable for large cases. The conjugate gradient (CG) algorithm is good for solving sparse symmetric positive definite (S.P.D) linear systems, such as power network with R, L, and C. GMRES can handle non-symmetric matrix, which makes it possible to solve problems containing elements other than R, L, C. In chapter 2, we describe a novel simulation flow based on frequency domain computation. With the vector fitting technique, the frequency-domain responses are approximated by a partial fraction expression, which can be easily converted into time-domain waveform. ## 1.2.2 Worst-Case Voltage Variation Analysis The voltage variation in the power network can have an adverse impact on the performance and the reliability of chip, package and board such as longer signal delay and even logic failure. Therefore, the accurate estimation of the worst-case voltage variations is of more importance. The voltage violation area is shown in Figure 1-2, where $V_{\min}$ is the allowed voltage drop. Figure 1-2: Voltage violation area The previous analysis works focused on the estimation of maximum current, which then leads to the worst-case voltage variation. The traditional way to find the maximum current is to simulate all possible patterns at these inputs including primary inputs and pseudo primary inputs. However, for a circuit with n inputs, this method requires simulation of $4^n$ patterns, which is of exponential time complexity. Recent research works are either pattern dependent or pattern independent. Pattern dependent techniques are based on searching to generate a small set of patterns to produce high power supply noise [11][12]. However, as pointed in [13], they can only generate a lower bound estimation of the maximum current envelope. The pattern independent approaches estimate the upper bound envelope of all possible current waveforms. Ref [14] proposed a linear time algorithm (iMax), and ref [15] introduced the MIMAX algorithm to estimate the maximum current envelopes. Later on, the full-chip vetorless method was developed for dynamic power integrity analysis [3]. Shi et al. introduced an idea to predict the worst-case logical timing correlations among the cells which cause the voltage resonance [16]. All the above works do not consider the power networks with multi-domains, which make the voltage variation analysis more complicated. Certain clock-gating patterns may induce the voltage resonance of the power network. Chapter 3 presents the approaches to predict the worst-case clock gating pattern which leads to the worst voltage drop and violation area. # 1.3 Power Network Optimization Based on the circuit simulation and the worst-case analysis, we further investigate the power network optimization. Robust and reliable on-chip power supply network has increasing importance for high-speed circuit performance. Optimizing the power network to confine the voltage fluctuation so as to meet a target of noise tolerance (typically 5%~10% of nominal Vdd) becomes an essential step of on-chip circuit design. In this section, we first discuss the power-up sequence problem during initialization. Then the allocation of decap to reduce noise is introduced. ## 1.3.1 Power-Up Sequence Multiple power domain (MPD) is becoming popular in the modern SoC design. In order to handle different performance objectives and constraints among different blocks, a new approach is to partition the internal logic of the chip into multiple power domains [17]. According to [21], the benefits to introduce MPD design can be summarized in three aspects as follows. Firstly, separated system development can be performed in each domain. Secondly, as different domains work independently, we are able to apply various power gating schemes based on the functionality of a particular block, in order to reduce leakage power consumption. Thirdly, clock frequency for each domain could also be changed for the sake of dynamic power reduction. One of the most important issues to power up all domains is the stability of VDD/GND lines. Turning on the power switches may cause a large rush current on the power lines. Those large rush currents will make the inductance components more significant and therefore more switching noise will be introduced. Previous research has shown that poor rush current management or power supply noise can potentially corrupt retention registers, which may lead to unsafe state [17]. The power-up sequencing is one of the major challenges in MPD design for noise reduction. It is not practical to bring up all the power supplies at the same time, because excessive noise will be introduced due to the rush current. Hence, it is beneficial to design a power-up sequence to enable different power domains in a well-defined order, which results in less noise and therefore assures correct function [17]. Lots of previous work discussed the importance of the power-up sequence in the initialization stage in order to minimize noise. Salmon and Dour [18] showed that the voltage level shifting circuitry associated with the core logic is able to initialize properly only when the core logic voltage supply lines are ramped prior to the I/O voltage supply lines. Ranjan [19] designed a circuit that can turn each transistor stage on and off in order, so as to avoid drawing huge current which leads to excessive voltage violation. A power switch design was developed to minimize rush current [20], and a sequential power-up scheme was established [21]. The above techniques consider the power-up sequence in transistor or logic gate level. We will extend this sequencing problem into multi-domain power network. In section 4.1, we present a simulated annealing based algorithm with preprocessing to arrange the power-up sequence in a multi-domain power network to minimize noise. ## 1.3.2 Optimization with Decoupling Capacitor Adding decoupling capacitors (decap) between the power network and the ground is an effective and widely adopted approach to reduce the power network impedance and therefore reduce the power network noise. However, the decap consumes die area and affects die yield adversely [22]. To control its negative impact, the total amount of decaps needs to be restricted while the decap locations are determined optimally to reduce the noise. Most of existing research works for on-chip power noise reduction optimized the location and/or the amount of decaps. The optimization approach can be generally classified into two groups: the charge based algorithms and the sensitivity based algorithms. The charge based approaches estimate the total charge drawn from the power network during the worst-case switching scenario, and then determine the amount of decaps needed [1][23][24]. However, it is difficult to accurately estimate the voltage drop and electric charge for power networks [25]. Recent works focused more on the sensitivity based approach. An adjoint network method is applied to calculate the sensitivity of the violation area for circuit node with respect to decap change [26][27]. The sensitivity is used as the gradient in the nonlinear optimization solver. Because the number of simulations required for each iteration step is proportional to the number of nodes checked for violation, the computational complexity could be very high. To reduce the computational complexity, the merged adjoint network method was introduced and applied to calculate the sensitivity of the overall violation area with respective to decap [28][29]. The idea is based on the superposition principle of linear circuits. ## 1.4 Dissertation Organization In this dissertation, we describe methodologies and approaches to simulate circuit efficiently, analyze worst-case voltage noise, and optimize power networks. This work makes contributions in both algorithm theory and system designs. Chapter 2 presents a frequency domain based simulation method to obtain the time domain voltage response. With the vector fitting technique, the frequency domain responses are approximated by a partial fraction expression, which can be easily converted to time domain waveform. The simulation flow can be parallelized to achieve more speedup. Chapter 3 describes the worst-case analysis approaches for both voltage drop and violation area. The analysis work predicts the clock gating pattern for the worst-case noise in a multi-domain power network. The algorithm for the worst-case voltage drop analysis is of linear time complexity. We introduce the integer linear programming (ILP) to formulate the worst-case voltage violation area problem. Leakage current is taken into consideration for the modeling. In chapter 4, we discuss the algorithms to minimize the noise. Firstly, an efficient heuristic framework is proposed to arrange the power-up sequence in a multi-domain power network. The framework consists of domain ordering, a greedy initial solution and the simulated annealing optimization algorithm. Secondly, for circuit itself, we proposed to use both decap and controlled-ESR for the on-chip power network optimization, which is different from traditional ways using decap only. A revised sensitivity computation is derived to consider both voltage drop and overshoot. The sequential quadratic programming (SQP) is adopted to solve the optimization problem where the revised sensitivity is regarded as the gradient. Finally, chapter 5 summarizes this dissertation and sketches some promising research directions. # 2. Efficient Circuit Simulation In this chapter, we present the efficient frequency domain based simulation method. Section 2.1 shows the analysis flow on how to compute the voltage response in frequency domain, approximate with vector fitting and then convert back to time domain. This simulation flow is also parallelized. Section 2.2 demonstrates the experimental results for both accuracy and efficiency of proposed flow. The summary will be given in section 2.3. # 2.1 Analysis Flow In this section, a method based on frequency-domain analysis and vector fitting technique is proposed to calculate the time-domain voltage waveform. ### 2.1.1 Basic Idea Fig. 3 describes the flow of the frequency-domain based simulation method. We firstly convert the current sources from time-domain waveform to frequency-domain expression with Laplace transform. Since each input current source I(t) is described as a PWL function, its frequency-domain expression can be derived analytically. Then, a linear equation system A(s)V(s)=I(s) is formulated for frequency-domain analysis. After solving the frequency-domain equation, we obtain the voltage response at specified frequency. The vector fitting technique is adopted to fit the voltages V(s) at frequency samples with a partial fractional expression $\Re(s)$ . Finally, the partial fractional expression can be easily converted to the time-domain waveform v(t). Figure 2-1: The proposed method for time-domain simulation With little sacrifice on accuracy, this method provides an alternative for time-domain transient simulation. Since it is based on frequency-domain analysis, one can easily obtain the natural frequency information of the power network, which is useful for comprehensive knowledge of power noise. Further- more, with efficient techniques discussed below, this method demonstrates large speedup comparing to the conventional time-domain simulation for large-scale power networks. ## 2.1.2 Laplace Transform of Input Current Source We apply Laplace transform to the PWL function. Suppose r(t) denotes the unit ramp function: $$r(t) = tu(t), (2-1)$$ where u(t) is the unit step function. The frequency-domain expression of the ramp function is: $$R(s) = 1/s^2$$ (2-2) A PWL function f(t) can be regarded as the superposition of several ramp functions, as shown in Figure 2-2: $$f(t) = \sum_{i} a_i r(t - t_i)$$ , (2-3) where $t_i$ is the starting time point of the *i*th ramp segment, and $a_i$ is the difference between the slopes of two adjacent segments (see Figure 2-2). Because of the linearity of Laplace transform, the frequency-domain expression of f(t) is obtained: Figure 2-2: A PWL waveform can be decomposed into several ramp segments. For the DC component, i.e. when s=0, Equation (2-4) is not applicable. Instead, the area under the time-domain PWL waveform f(t) is calculated to give F(0). It should be pointed out that if the PWL function includes a ramp segment with infinite slope, a shifted step function would be added in the superposition expression (2-3). The step function u(t) corresponds to 1/s in frequency domain, and F(s) can be calculated with a little modification on equation (2-4). ### 2.1.3 Obtain the Frequency-Domain Response For a specified frequency, $s=j\omega$ , where $\omega$ is the angular frequency, we generate a complex-valued linear equation system: $$A(s)V(s) = I(s) (2-5)$$ Here I(s) is generated with the frequency-domain expressions of current sources, and V(s) consists of the unknown frequency-domain voltages. In the model of power network, there are R, L, C elements, constant voltage source and time-varying current source. For frequency-domain analysis, the voltage source is considered as short circuit. Thus, (2-5) can be easily formulated with the nodal analysis approach, which utilizes admittances of forms 1/R, $1/j\omega L$ , and $j\omega C$ . Thus, the number of unknowns in (2-5) equals to the number of circuit nodes, and A(s) is the admittance matrix of circuit. Even though there is no unknown of branch current, the order of the linear equation system (2-5) could be very huge for a large-scale power network. Efficient and scalable linear equation solver is required to perform the frequency-domain analysis. PETSc is a suite of data structure and routines for the scalable (parallel) solution of large-scale scientific applications [30]. It includes various Krylov subspace equation solvers and preconditioners. They can be easily used in application codes written in C or C++, while users have detailed control over the solution process. For the large-scale complex-valued equation (2-5), we choose the conjugate gradient square (CGS) method with incomplete LU (ILU) preconditioner from PETSc. Numerical results show that a power network with more than one million nodes can be easily analyzed by the efficient solver from PETSc. To describe the complete spectrum of a voltage response, we need choose some frequency sampling points. For each frequency sample, equation (2-5) is solved to get the voltage response. The highest frequency in the spectrum is related with the input current sources and the nature of power network. In practical applications, the upper bound of frequency spectrum is usually not more than several tens of GHz. Then, the logarithmic scale sampling is adopted to make a moderate value of frequency samples. The number of frequency sample is obtained by an empirical formula based on a lot of testing of industrial power networks, to make the tradeoff of accuracy and efficiency. With this technique, the number of frequency points is of $O(log f_{max})$ , where $f_{max}$ is the upper bound of frequency. ### 2.1.4 Convert the Frequency-Domain Response to Time-Domain Waveform The vector fitting (VF) technique is a general method for the fitting of frequency-domain responses with rational function approximations [33]. It converts a nonlinear problem of least squares approximation to a linear problem in two stages, where the pole locations are determined in an iterative manner. And, it is guaranteed that the resulting approximation has stable poles. The VF technique has been developed into a robust numerical package shared in public domain [31][32]. With the frequency-domain responses at a given node, the VF technique is used to fit the voltage points with a partial fractional expression: $$\theta_{k}(s) = \sum_{i=1}^{N_{a}} \frac{r_{i}}{s - p_{i}} , \qquad (2-6)$$ where $\mathcal{H}$ stands for the voltage at node k. Residues $r_i$ and poles $p_i$ are obtained with the VF algorithm, and are either real quantities or come in complex conjugate pairs. With (2-6), the time-domain response can be easily derived: $$v_k(t) = \sum_{i=1}^{N_a} [r_i e^{p_i t} u(t)]$$ (2-7) The major computation in vector fitting is to solve a linear least square (LLS) problem, whose coefficient matrix is of $2m\times2N_a$ . Here m is the number of frequency samples, and $N_a$ is the order of approximation. The LLS problem can be solved by the method of normal equation, or the QR decomposition, whose computational complexity is $O(mN_a^2)$ . Usually, the response of power network does not include many resonance peaks, and a low order $N_a$ could give the approximation with sufficient accuracy. ### 2.1.5 Computational Complexity and Parallelism In the frequency-domain based simulation method, the computational time is mostly spent on solving the frequency-domain equation and performing the vector fitting. The time complexity for solving the frequency-domain linear equation system is about $O(N^{\alpha} \log f_{\max})$ , where N is the node number of the power network and item $logf_{\max}$ represents the number of frequency samples. We assume the complexity of solving one equation is $O(N^{\alpha})$ , where $\alpha$ is a quantity between 1 and 2 if using the efficient CGS solver. The time complexity of vector fitting is $O(N_a^2 \log f_{\max})$ , where $N_a$ is the order of approximation. For large-scale power network, the time for solving equation dominates the total computational time, because the node number N is much larger than $N_a$ . If the voltage responses of multiple nodes on power network are considered, the time for vector fitting will be multiplied by the number of output nodes $N_{out}$ . For analysis of maximum voltage variation, only some nodes at the lowest level of P/G grid are considered. Therefore, $N_{out}$ is a small number. The computational time of the frequency-domain based simulation method is not related with the number of time steps in a conventional transient simulation. Furthermore, since the nodal analysis approach is sufficient for generating the frequency-domain equations, solving each linear equation system is easier than that in conventional transient simulation. The latter usually involves larger linear equation system generated with the modified nodal analysis. In addition, the proposed simulation method can be easily parallelized. Because solving (2-5) for frequency samples are independent from each other, the work can be distributed to multiple processors. These three points indicate the advantage of proposed method over the conventional time-domain simulation methods. The numerical results in Section 2.2.2 validate the above analysis. ### 2.2 Experimental Results The proposed simulation method is implemented in C language. The CGS solver from PETSc [30] is used to solve the frequency-domain circuit equation (2-5), with an ILU preconditioner. A Matlab program is written to take in the frequency-domain responses and convert them to the time-domain voltage waveform with the help of vector fitting [31]. A parallel program using the message passing interface (MPI) is also implemented to show the parallelizability of the proposed simulation method. We firstly demonstrate the accuracy of the proposed frequency-domain based simulation method. Then, the numerical results showing the efficiency of the proposed method are presented, including the comparison with two commercial simulators: HSPICE and MSPICE. MSPICE is a fast SPICE simulator from Fastrack, which utilizes an iterative equation solver and is claimed to be two to ten times faster than other SPICE simulators [34]. The test cases of power network are provided by our industry partner. They are of mesh structure, similar to that in Figure 1-1, including R, L, C elements and current sources. All experiments are run on a four-core machine with 16GB memory. Each core has a 3.0 GHz Intel Xeon processor. #### 2.2.1 Accuracy of the Proposed Simulation Method With one of the test cases, we demonstrate the accuracy of our proposed simulation method. For this case, the upper bound of frequency $f_{max}$ is set to 4 GHz, and the number of frequency samples is 36. The frequency-domain responses are fitted with the vector fitting technique, where the fitting order $N_a$ is 9. Figure 2-3 shows the result of vector fitting for one output voltage. The root mean square (RMS) error is found to be $4.6 \times 10$ -12, which means the frequency-domain response is well approximated by a partial fractional function. In Figure 2-4, the time-domain voltage waveform converted from the partial fractional expression is compared with that obtained from transient simulation of HSPICE. The waveforms from both methods match very well. Let $V_i$ denote the voltage simulated from HSPICE at the ith time sampling point, and $\hat{V_i}$ is the corresponding voltage simulated from the proposed method. We utilize the 1-norm $|\cdot|_1$ to measure the average error ratio (AER) of the voltage response waveform: $$AER = \frac{\sum_{i} |\hat{V}_{i} - V_{i}|}{\sum_{i} |V_{i}|}$$ (2-8) For the accuracy on the maximum voltage drop, the peak error ratio (PER) is defined as: $$PER = \frac{\max(\left|\hat{V}_i - V_i\right|)}{\max(\left|V_i\right|)} \tag{2-9}$$ The accuracy of proposed simulation method relies on the number of sampling frequency points. The more frequency samples, the more accuracy will be achieved. We manually vary the number of frequency samples from 28 to 40, and draw the corresponding waveforms in Figure 2-4. We can see that the waveform with fewer frequency samples has less accuracy. The relative errors (AER and PER) for these waveforms are plotted in Figure 2-5, vs. the number of frequency points. This figure shows good accuracy of the proposed simulation method, and verifies the correlation between the accuracy and the number of frequency points. Figure 2-3: A frequency-domain response and its fitting result with the vector fitting technique. (a) with linear vertical axis, (b) with log vertical axis. The computational time of proposed simulation method is proportional to the number of frequency points. In Figure 2-5, the curve of CPU time is also plotted with "star" marks. Figure 2-4: Comparison of HSPICE and the proposed simulation method. Figure 2-5: Relative errors and CPU times vs. the number of frequency samples. ### 2.2.2 Efficiency of the Proposed Simulation Method Seven test cases of power network with the node number ranging from 5678 to above one million are used to demonstrate the efficiency of the proposed simulation method. The simulation time are compared with those of HSPICE and MSPICE, as listed in Table 2-1. The time for proposed method just includes that for solving the frequency- domain equation on the frequency samples. Because only additional 0.2 second per output node is needed for the vector fitting and converting to time-domain waveform, the total CPU time of the proposed method would be a little more than that in Table 2-1. From Table 2-1, we see that the proposed method is about 100 times faster than HSPICE, and the speedup to MSPICE is about ten or more. For large test cases, the speedup ratios are larger. The AER and PER of the voltage responses obtained from the proposed method are also listed in Table 2-1. They show the errors are all less than 1%, thus the proposed method has good accuracy. Table 2-1: Comparison of HSPICE, MSPICE and the proposed frequency-domain based method | Name of test case | # nodes | Time of<br>proposed<br>method (s) | Time of<br>Hspice (s) | Speedup to<br>HSPICE | | PER to<br>HSPICE | Time of<br>MSPICE (s) | Speedup to<br>MSPICE | |-------------------|---------|-----------------------------------|-----------------------|----------------------|-------|------------------|-----------------------|----------------------| | Ckt1 | 5678 | 1.8 | 63.0 | 35.0 | 0.02% | 0.2% | 11.6 | 6.4 | | Ckt2 | 11479 | 4.1 | 268.4 | 20.0 | 0.04% | 0.4% | 52.8 | 12.9 | | Ckt3 | 23011 | 8.3 | 622.3 | 75.0 | 0.03% | 0.3% | 70.0 | 8.4 | | Ckt4 | 46090 | 17.1 | 1636.5 | 95.7 | 0.03% | 0.3% | 152.6 | 8.9 | | Ckt5 | 92155 | 39.5 | 11126.5 | 281.7 | 0.09% | 0.3% | 428.7 | 10.8 | | Ckt6 | 369983 | 196.7 | N.A. | N.A. | N.A. | N.A. | 3798.1 | 19.3 | | Ckt7 | 1156220 | 815.5 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | The proposed method is able to handle the power network with millions of nodes, as shown in Table 2-1. In contrast, HSPICE can not afford the case with more than one hundred thousand nodes. MSPICE is not valid for the case with one million nodes, either. With the CPU time in Table 2-1, we can also validate the computational complexity of the employed CGS equation solver, that shows it is of $O(N^{1.14})$ for these test cases. Table 2-2: CPU time of parallel computation | | Time of | Time of | | |--------------|---------------|---------------|---------| | Name of test | proposed | proposed | Speedup | | case | method with 1 | method with 4 | ratio | | | CPU (s) | CPU (s) | | | Ckt1 | 1.8 | 1.0 | 1.8 | | Ckt2 | 4.1 | 1.8 | 2.2 | | Ckt3 | 8.3 | 3.4 | 2.4 | | Ckt4 | 17.1 | 6.6 | 2.6 | | Ckt5 | 39.5 | 15.2 | 2.6 | | Ckt6 | 196.7 | 78.7 | 2.5 | | Ckt7 | 815.5 | 339.8 | 2.4 | On the machine with four cores, we carried out the experiment of parallel computation. For the seven cases, the computational times of the parallel program and the corresponding serial program are listed in Table 2-2. The speedup ratio is about 2.5 in this experiment. Because it is hard to fully balance the workload and there are other overheads, an ideal 4X speedup with 4 CPUs is not achieved. Nevertheless, with the parallel computation the speedup of proposed method to HSPICE or MSPICE becomes even larger. And due to the independence of solving (2-5) for different frequency point, more speedup would be achieved with more CPUs. ### 2.3 Summary A frequency-domain based transient simulation method is proposed. With the application of vector fitting technique and iterative equation solver from PETSc library, the frequency-domain based method is much more efficient than the conventional time-domain simulation method while preserving good accuracy. Numerical results show that the proposed simulation method is up to several hundred times faster than commercial simulators like HSPICE and MSPICE. And, the analysis flow is able to handle larger industrial cases with one million nodes. Preliminary results of parallel computation demonstrate larger speedup to the conventional simulation methods and the ease of parallelizing the proposed method. Chapter 2, in part, is a reprint of the paper "Efficient Power Network Analysis Considering Multi-Domain Clock Gating", co-authored with Wenjian Yu, Xiang Hu, Ling Zhang, Rui Shi, He Peng, Zhi Zhu, Lew Chua-Eoan, Rajeev Murgai, Toshiyuki Shibuya, Noriyuki Ito, and Chung-Kuan Cheng, in *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)*. The dissertation author was the primary investigator and author of this paper. ## 3. Power Network Worst Case # **Noise Analysis** In this chapter, we first present the problem statement of worst-case noise in a multi-domain clock gated power network. In section 3.2, the linear time complexity algorithm to find the worst-case voltage drop is introduced. In section 3.3, we discuss the integer linear programming (ILP) based analysis flow for the worst-case voltage violation area. The summary will be given at last. ### 3.1 Problem Statement The analysis of voltage variation becomes more complicated for the low-power circuits with multiple clock domains. The technique of multi-domain clock gating has been used to reduce unnecessary power dissipation, by disabling the clock signals for some modules [35]. However, certain clock-gating patterns may induce the voltage resonance of the power network. Predicting the worst-case voltage variation caused by multi-domain clock gating is not only necessary, but also a challenge for the power integrity analysis of multi-clock-domain circuits. In this chapter, we propose an efficient framework to analyze the power network considering multi-domain clock gating. The worst-case clock gating patterns and the corresponding maximum voltage variation are predicted. Figure 3-1: The model of power supply network with multi-domain clock gating. The power network is usually modeled as a circuit including resistance, capacitance and packaging inductance, like that shown in Figure 3-1. Time-varying current sources are connected to some circuit nodes, characterizing the behavior of active circuit instances. These current sources draw current from the power network and cause voltage fluctuations [3]. The waveform of current source is usually described as a piecewise linear (PWL) function. In the low-power design with multiple clock domains, the circuit instances belong to different clock domains. The circuit instances within the same clock domain work synchronously. Each domain is governed by one clock controlling signal. Figure 3-1 shows such a configuration with four clock domains. The value of clock signal indicates whether the instances in the domain work or sleep at current clock cycle, and the sequence of clock signal is called clock gating pattern. Bit "1" in the pattern means that the instances in the domain work for this cycle, while bit "0" represents the sleep mode of the instances. The clock gating pattern affects the voltage fluctuation of power network, because it determines the behavior of current sources in the model. We consider two kinds of worst-case situation. One is the worst-case voltage drop which is the largest voltage drops away from the Vdd. The other is the maximum violation area, which describes the accumulating effect of the noise. The violation area at node i is defined as: $$A_{i} = \int_{0}^{T} \max(V_{\min} - v_{i}(t), 0) dt$$ (3-1) where $V_{\min}$ is the allowed voltage drop. The main purpose of the analysis work is to determine the clock gating patterns for all of the domains that cause the maximum voltage variation (either voltage drop or violation area) at given nodes of power network. Below we summarize the assumptions taken in this work: - (1) The current profiles of current sources are known, and described as PWL functions. - (2) We assume that a current source has the same waveform for different working cycles. This scenario is supposed to correspond to the worst case of voltage fluctuation. - (3) If a circuit instance is under the sleep mode, the corresponding current source is assumed to have zero current. This assumption omits the leakage current, but the proposed method can be easily extended to consider it. - (4) The clock gating patterns for different domains are independent from each other. To consider the influence of multi-domain clock gating on supply voltage variation, we divide the task into two steps. Firstly, the time-domain voltage response of the power network is simulated with a single clock domain working for one clock cycle. Then, with the simulation results of all clock domains, we propose algorithms to find the maximum voltage variation and corresponding worst-case clock gating patterns. ### 3.2 Worst-Case Voltage Drop We firstly derive the voltage response for an arbitrary clock gating pattern, using the response corresponding to the current sources working for one cycle. Then, the algorithm to predict the maximum voltage variation considering multi- domain clock gating is presented. ### 3.2.1 The Voltage Response for an Arbitrary Clock Gating Pattern We firstly consider the situation where there is only one clock domain. Suppose all current sources only work for the first clock cycle. The voltage of power network will fluctuate for several cycles before reaching steady state, due to the resonance in circuit. We use $y_0(t)$ to denote the voltage response at a given node. For the situation where the current sources work for multiple cycles with an arbitrary clock gating pattern, the voltage response can be derived using $y_0(t)$ and the principle of superposition. Suppose $f_i(t)$ denotes the first-cycle waveform of the *ith* current source. Its waveform within the first k cycles can be expressed as: $$g_i(t) = \sum_{l=0}^{k-1} b_l f_i(t-lT), i = 1,L, N_s,$$ (3-1) where T is the clock cycle time, and sequence $\{b_l\}$ represents the clock gating pattern. $N_s$ is the total number of current sources. If the clock domain is enabled at the *lth* cycle, $b_l$ =1, otherwise $b_l$ =0. Because all current sources in the domain work synchronously and the power network is a linear circuit, the voltage response corresponding to the arbitrary clock gating pattern becomes: $$y(t) = \sum_{l=0}^{k-1} b_l y_0(t - lT) \quad . \tag{3-2}$$ If $y_0(t)$ reaches its steady state after n cycles, the l in (3-3) needs to satisfy 0 < t - lT < nT to contribute non-zero value to the summation. That is, $$\frac{t}{T} - n < l < \frac{t}{T} \tag{3-3}$$ This means we just need to check at most n bits of clock signal (value of $b_l$ ) for calculating y(t). For the voltage response during the kth cycle, these bits are the controlling signal for the kth cycle and the preceding n-1 cycles. Figure 3-2 shows an example of $y_0(t)$ . Suppose one clock cycle is 5ns, we find out that the waveform takes 6 cycles to reach the steady state. For this example, we depict the waveforms for the 6 cycles separately, and arrange from top to bottom in Figure 3-3. According to (3-3) and (3-4), the voltage response y(t) within a given clock cycle can be obtained by selectively superimiposing these waveforms. To generate the voltage response during a specified cycle, the 6 sequent bits of clock signal are needed, which correspond to the six waveforms in Figure 3-3 respectively. For each enabled clock bit, the corresponding waveform is kept. Finally, summing up all kept waveforms together gives the result of y(t) for the specified cycle. Figure 3-3: The one-cycle portions of $y_0(t)$ are arranged for superposition. Above derivation only considers the current sources. The obtained waveform y(t) needs to be added with the initial value of voltage to consider the effect of supply voltage source. ### 3.2 2 Find the Maximum Voltage Variation Considering Multi-Domain Clock Gating We firstly consider the problem with only one clock domain. With above deduction, we know that the output y(t) is calculated by superimposing different portions of the waveform $y_0(t)$ . Given a time point, if only portions having positive voltage at this point are selected, the superimposed result y(t) must reach the largest value at this point. Then, sweeping all time points in one cycle with the above manipulation, we can find the maximum positive voltage variation and the corresponding clock gating pattern. The situation is similar for finding the maximum negative voltage variation, where we select the waveform portions contributing negative voltage. For the problem with multiple clock domains, the above strategy is still valid with little modification. Suppose yi(t), i=1, ..., D denotes the voltage response at the given node if only current sources in the ith domain work for the first cycle while other domains are sleeping. Here D is number of clock domains. Then, the voltage response y(t) corresponding to arbitrary clock gating pattern becomes: $$y(t) = \sum_{i=1}^{D} \sum_{l=0}^{k-1} b_l^{(i)} y_i(t - lT), \qquad (3-4)$$ where $b_l^{(i)}$ is the *lth* bit of the clock gating pattern for the *ith* domain. If each $y_i(t)$ takes n cycles to reach the steady state, we need to arrange all $n \cdot D$ waveform portions in the manner shown in Figure 3-3. Then, the maximum voltage variation can be found like what is done for the single-domain problem. D: the number of clock domains. $C_i$ : the number of cycles for $y_i(t)$ to get saturated. $y_i^P(t)$ : the positive function for the *i*th domain. $y_i^P(t)$ $= y_i(t)$ , when $y_i(t) > 0$ , when $y_i(t) \le 0$ , $\forall i \in \{1, 2, ..., D\}$ $y_i^N(t)$ : the negative function for the *i*th domain. $y_i^N(t)$ $= y_i(t)$ , when $y_i(t) \le 0$ , when $y_i(t) > 0$ , $\forall i \in \{1, 2, ..., D\}$ P(t): the resulted curve by superimposing $y_i^P(t)$ for all clock domains $(t \in [0,T]).$ N(t): the resulted curve by superimposing $y_i^N(t)$ for all clock domains $(t \in [0,T]).$ $P_{G}[i][j]$ : the selection status of the jth portion of $y_{i}(t)$ for finding positive variation. If it is selected in the superposition, $P_{G}[i][j] = 1$ , otherwise $P_{c}[i][j] = 0.$ $N_{G}[i][j]$ : the selection status of the jth portion of $y_{i}(t)$ for finding negative variation. If it is selected in the superposition, $N_G[i][j]=1$ , otherwise $N_{G}[i][j]=0.$ Figure 3-4: Parameters in the algorithm finding the worst case of voltage variation To describe the algorithm of finding the maximum voltage variation considering multi-domain clock gating, we list the relevant parameters in Figure 3-4. And, the algorithm for the maximum positive voltage variation is presented in Figure 3-5. It is straightforward to give a similar algorithm description for finding the maximum negative voltage variation. The computational complexity of the algorithm in Figure 3-5 is about $O(N_p nD)$ , where $N_p$ is the number of discrete time points within interval [0, T] and n is the average number of cycles for $y_i(t)$ to saturate. The value of $N_p$ is affected by the desired precision of the obtained time-domain waveform, and is usually much larger than n or D. With this analysis, we know that the algorithm finding the worst-case voltage variation has the linear computational complexity. And, the algorithm consumes little time because it does not involve any complex calculation. ### Algorithm for finding the worst case of positive voltage variation: - 1. Superimpose the positive functions $y_i^p(t)$ , $y_i^p(t-T)$ , ..., $y_i^p(t-(C_i-1)T)$ on the interval $t \in [(C_i-1)T, C_iT]$ . - 2. Perform Step 1 for all domains, then shift and superimpose the curves together. The result curve is: $$P(t) = \sum_{i=1}^{D} \sum_{l=0}^{C_i-1} y_i^P (t + (C_i - 1 - l)T), \ t \in [0, T].$$ - 3. Find the maximum of P(t): $P(t_{MP}) = \max(P(t))$ , which is the maximum positive voltage variation. For time point $t_{MP}$ , assign the values of $P_c[i][i]$ . - 4. The worst-case clock gating patterns $\{b_i^{(i)}\}$ are: $$\{b_l^{(i)} = P_G[i][C_i + 1 - l], l = 1, 2, K C_i\}, i = 1, ..., D.$$ Figure 3-5: The proposed algorithm for finding the worst case of positive voltage variation. Section 2.1.5 discussed that the complexity for the proposed simulation flow is $O(N^{\alpha} \log f_{\max})$ . For the whole analysis flow including simulating voltage response and finding the worst-case clock-gating patterns, the dominant computation is the simulation for each domain, and therefore the complexity is about $O(N^{\alpha}D\log f_{\max})$ , where D is the number of domain and $\alpha$ is a quantity between 1 and 2. #### 3.2.3 Experimental Results We analyze two power network cases with multi-domain clock gating. The first one has four clock domains. The current sources are uniformly distributed in each clock domain and are synchronized with each other. The clock frequency is 200 MHz and the voltage response for one-cycle current sources takes six cycles to reach the steady state. The four curves in Figure 3-6 show the voltage responses with only one clock domain working for one cycle. The node which we are concerning is at the center of domain 1. In Table 3-1, we present the worst cases of voltage variation considering each clock domain respectively, and considering all clock domains. The second column gives the peak voltage in the response waveform caused by current sources working for only one cycle. The third column shows the worst case of voltage variation caused by a sequence of clock signal. The last column includes the corresponding clock gating patterns. For example, the voltage response caused by one-cycle current sources in domain 1 has a minimum value of -10.5 mV. Then, if the clock pattern for domain 1 is { 1, 1, 1, 0, 1, 1}, the resulting maximum voltage drop would be 11.8 mV. This means that the variation will be worse if the clock gating technique is used. The last row in Table 3-1 shows the worst case for the actual circuit with four clock domains. The maximum voltage drop will be 15.7 mV, worse than any result considering single clock domain. The second case is a large-scale industrial case. This case includes about $3\times10^5$ nodes, and $10^4$ current sources. The circuit is divided into four clock domains, and the clock frequency is 2 GHz. For the proposed simulation method, the $f_{max}$ is set to 10 GHz. The simulation results show that the voltage response for one-cycle stimulus takes 30 cycles to reach the steady state. Figure 3-7 shows the voltage responses with only one clock domain working for one cycle. The maximum voltage drops caused by each domain are 0.15mV, 1.9mV, 0.41mV and 23.4mV, respectively. If considering all the four domains together, the worst-case voltage drop is 45.5mV. This result suggests again that the power network voltage variation will be much larger in circuit with multi-domain clock gating. Figure 3-6: The voltage responses of the first case, corresponding to one-cycle current sources in a single clock domain Table 3-1: The worst cases of voltage variation considering only one clock domain, and all clock domains | | Maximum | Worst-case | Worst-case pattern | | |----------|---------------|------------|--------------------|--| | | vari- | voltage | | | | | ation of | variation | worst-ease pattern | | | | $y_0(t)$ (mV) | (mV) | | | | Domain 1 | -10.5 | -11.8 | {1 1 1 0 1 1} | | | Domain 2 | -3.56 | -4.06 | {1 1 1 0 1 1} | | | Domain 3 | -2.26 | -2.68 | {1 1 1 1 1 1} | | | Domain 4 | -2.86 | -3.47 | {1 1 1 1 1 1} | | | | | | Domain 1: {011101} | | | All | | -15.7 | Domain 2: {011101} | | | domains | | -13./ | Domain 3: {111011} | | | | | | Domain 4: {111011} | | Figure 3-7: The voltage responses of the large-scale industrial case, corresponding to one-cycle current sources in a single clock domain. ### 3.3 Worst-Case Violation Area With the continuous shrinking of feature size, leakage is becoming a major challenge for VLSI design [36]. In this section, we discribe the worst-case violation area analysis flow considering leakage current in a multi-domain power network. We propose a general model to identify the worst-case gating pattern and the maximum variation area with arbitrary leakage current. For low power wireless chips, we introduce another simplified model, which treats the leakage to be a DC current. The two models are formulated with integer linear programming (ILP). We consider the worst case with the maximum voltage violation area, which presents the accumulating effect of the noise as formulated in equation (3-1). The objective is to predict the sequence of the clock gating signal for each domain causing the maximum voltage violation area at observing nodes. ### 3.3.1 Analysis Flow Based on Superposition The idea is based on linear circuit superposition as explained in section 3.2.1. We first consider the case with one clock domain, and then extend into multiple domains. The analysis flow considering leakage current for one domain is described in Figure 3-8. The voltage variation waveforms at observation node are calculated with active current source and leakage current source working respectively in one clock cycle. The voltage variation may take $n_k$ cycles to reach the steady state. The dissected waveforms in $n_k$ cycles need to be superimposed. However, for each cycle, if the clock is enabled, the voltage variation caused by active current is selected. On the other hand, if the clock is disabled, the voltage variation caused by leakage current is chosen. The goal is to determine the clock gating pattern for each cycle that makes the superimposed variation waveform in one cycle have maximum violation area. If there are P domains, the total voltage variation for the observation node is the summation of the variation contributed by each domain. Assume the working frequencies of all the domains are the same which means the period T is the same. We superimpose these dissected variation waveforms from each domain as shown in Figure 3-9. Then the clock gating pattern ("1" or "0") for each domain needs to be determined to maximize the violation area of the overall superimposed variation waveform. The enumeration method exhaustively tries all the possible clock gating patterns. Figure 3-8: Analysis flow in one domain considering leakage current. Figure 3-9: Analysis with multiple domains. *P*: the number of domains the number of cycles that voltage variation $n_k$ : waveform reaches steady state in the kth domain. the number of cycles needed in superposition of n: all the domains, which is $\sum_{n=1}^{p} n_n$ ; the number of sample voltage response in each m: cycle; nominal voltage; minimal voltage requirement. Voltage $V_{\min}$ : considered to be violation if below this value; $V_{ii}^{A}, V_{ii}^{L}$ : voltage responses at the jth sampling point of the *i*th cycle, $(1 \le i \le n, 1 \le j \le m)$ , for active and leakage current respectively; The active current includes the dynamic and leakage currents; $v_{\parallel}^{\prime\prime}$ , $v_{\parallel}^{\prime\prime}$ : voltage variation from $v_{\perp}$ for the active and leakage current, i.e. $V_{dd} - V_{ii}^A$ and $V_{dd} - V_{ii}^L$ ; *V*: voltage variation from $V_{Ad}$ with only dynamic current; the allowed minimal value of $\psi_{i}$ . If $\psi_{i}$ is larger than *cutoff*, voltage violation occurs; time interval between adjacent sample points; $d_i$ : *M*: a sufficiently large constant. Figure 3-10: Parameters description for ILP formulations. ### 3.3.2 Models to Predict the Worst-Case Violation Area Considering Leakage #### Current With the reduction of power supply voltage and increasing operating frequency, the threshold voltages have to scale aggressively, and result in higher subthreshold leakage currents [37]. On the other hand, the gate leakage is becoming larger with the reduction of the gate oxide thickness [36]. Therefore, we can not ignore the leakage current in noise estimation. In the general model, both the dynamic current and leakage can have arbitrary waveform. When the clock is enabled, the overall active current includes both dynamic and leakage current. While the clock is disabled, which means in sleep or idle mode, there is only leakage current. In order to take the leakage effect into consideration when clock is disabled, we need to have two voltage variations contributed by the active current and leakage current respectively as Figure 3-8. The two waveforms are superimposed. When the clock is enabled, we select the dissected waveform from the active response. Otherwise, select the dissected waveform from the leakage response. Then, the overall superimposed voltage waveform can be obtained considering both active and leakage current effect. This general model covers both dynamic and leakage current effect to analyze the worst case violation area. For general processors such as IBM processors, the leakage currents vary in active and idle mode because of circuit activity and temperature [36]. The separation voltage response computation for active and leakage current in the proposed model provides the capability to handle this case. The simplified model considers the leakage current to be a DC constant. Unlike the general processors whose leakage current takes a large portion of active current (i.e. 30%), the low power wireless chips have a limited percentage of leakage (i.e. 1%), and the overall current is small (i.e. 300mA) [38]. Therefore, the leakage current can be approximated to be a DC constant, and we assume that the current value keeps the same in either active or idle mode. We can then simplify the general model to deal with the DC leakage current sources. Since the voltage response for DC current sources is still a DC constant, the voltage variation contributed by leakage is simply a DC bias. This will simplify the ILP formulation which will be explained in section 3.3.3.2. ### 3.3.3 ILP Based Algorithm We proposed an Integer Linear Programming (ILP) based method to determine the clock gating pattern for the worst-case voltage violation area. The objective function is to maximize the violation area of the superimposed voltage variation waveforms. The decision variables are the clock gating signal for every cycle at each domain. Thus, this problem can be solved optimally by commercial ILP solver such as CPLEX [39]. We will describe the formulations for both general model and simplified model below. #### 3.3.3.1 General Model These parameters used in the proposed models are listed in Figure 3-10. We sample the voltage waveform in each cycle with m time points, whose intervals are $d_j$ seconds $(1 \le j \le m)$ . The following variables are used in the ILP: (1) $x_i^A \in \{0,1\}, 1 \le i \le n, x_i^L \in \{0,1\}, 1 \le i \le n$ : binary variables to indicate the status of clock gating signal for the ith cycle. If the clock is enabled, $x_i^A$ is "1", $x_i^L$ is "0", and the dissected waveform from $V_{ij}^M$ will be selected. If the clock is disabled, $x_i^A$ is "0", $x_i^L$ is "1", and the dissected waveform from $V_{ij}^M$ will be selected. - (2) $y_j \in \{0,1\}, 1 \le j \le m$ : binary variables to indicate whether the *jth* voltage sampling violates the allowed amount. These are intermediate variables used to compute the violation amount in the *jth* sampling point. - (3) $u_j \in [0, \infty), 1 \le j \le m$ : continuous auxiliary variables to represent the total violated amount for the jth voltage sampling. The ILP formulation is then presented as follows: Maximize: $$\sum_{j=1}^{m} d_j u_j$$ Subject to: $$y_{j} \cdot M \ge \sum_{i=1}^{n} V_{ij}^{M} x_{i}^{A} + \sum_{i=1}^{n} V_{ij}^{M} x_{i}^{L} - cutoff, \ 1 \le j \le m$$ (3-5) $$(y_j - 1) \cdot M \le \sum_{i=1}^n V_{ij}^{\emptyset} x_i^A + \sum_{i=1}^n V_{ij}^{\emptyset} x_i^L - cutoff, \ 1 \le j \le m$$ (3-6) $$u_{j} \leq \sum_{i=1}^{n} V_{ij}^{A} x_{i}^{A} + \sum_{i=1}^{n} V_{ij}^{A} x_{i}^{L} - cutoff + M(1 - y_{j}), \ 1 \leq j \leq m \ (3-7)$$ $$u_j \le M \cdot y_j, \ 1 \le j \le m \quad . \tag{3-8}$$ $$x_i^A + x_i^L = 1, \quad 1 \le i \le n$$ (3-9) The objective is the total violation area, which is the summation of the area in each sampling. Constraints (3-6) and (3-7) describe the how yi works: (3-6) enforces $y_i$ to be 1 if $\sum_{i=1}^{n} V_{ij}^{\mathcal{A}} x_{i}^{A} + \sum_{i=1}^{n} V_{ij}^{\mathcal{A}} x_{i}^{L} > cutoff$ , which means the cutoff is violated in this point and the area should be counted to be violation; (3-7) makes $y_{j}$ be 0 if $\sum_{i=1}^{n} V_{ij}^{\mathcal{A}} x_{i}^{A} + \sum_{i=1}^{n} V_{ij}^{\mathcal{A}} x_{i}^{L} < cutoff$ . Constraints (3-8) and (3-9) restrict $u_{j}$ by using $y_{j}$ : $u_{j} \leq \sum_{i=1}^{n} V_{ij}^{\mathcal{A}} x_{i}^{A} + \sum_{i=1}^{n} V_{ij}^{\mathcal{A}} x_{i}^{L} - cutoff$ when $y_{j}=1$ according to (3-8), and $u_{j} \leq 0$ when $y_{j}=0$ according to (3-9). Since the objective function needs to be maximized, constraints (3-8) and (3-9) are actually equivalent to the following conditional assignment: $u_{j} = \sum_{i=1}^{n} V_{ij}^{\mathcal{A}} x_{i}^{A} + \sum_{i=1}^{n} V_{ij}^{\mathcal{A}} x_{i}^{L} - cutoff$ if $y_{j}=1$ , and $u_{j}=0$ otherwise. Constraint (3-10) ensures either the active waveform or the leakage waveform will be selected. The above formulation presents the violation area maximization problem with only 2n+m binary variables. Thus it can be efficiently solved by CPLEX. #### 3.3.3.2 Simplified Model In simplified model, the DC leakage current exists in both active and idle mode, and the DC current will contribute a DC bias $V_{DC}$ to the voltage response with dynamic current. The ILP formulation for simplified model is presented as follows: Maximize: $\sum_{j=1}^{m} d_j u_j$ Subject to: $$y_{j} \cdot M \ge \sum_{i=1}^{n} V_{ij}^{QP} x_{i} - cutoff + V_{DC}, \ 1 \le j \le m$$ (3-10) $$(y_j - 1) \cdot M \le \sum_{i=1}^n V_{ij}^{0} x_i - cutoff + V_{DC}, \ 1 \le j \le m$$ (3-11) $$u_{j} \le \sum_{i=1}^{n} V_{ij}^{0} x_{i} - cutoff + V_{DC} + M(1 - y_{j}), \ 1 \le j \le m$$ (3-12) $$u_i \le M \cdot y_i, \ 1 \le j \le m \quad . \tag{3-13}$$ Because the leakage is simplified to be a DC constant, we just add the DC bias $V_{DC}$ to the dynamic voltage response in constraints (3-11)-(3-14), instead of two voltage variation curves together like constraints (3-6)-(3-8). #### 3.3.4 Experimental Results We implement the ILP based method with the ILOG CPLEX9.1.10. We also implement an enumeration method for comparison, which exhaustively tries all possible clock gating patterns. The experiments are run on a 3.2GHz Pentium 4 machine with 1GB memory. The test cases are simplified industrial power networks for low power wireless chips. Therefore, we apply the simplified model to estimate the worst case voltage violation area. These power networks are of mesh structures with on-chip R, C and inductive components from package. The dynamic current for the whole chip is about 360mA, and the percentage of the leakage current varies from 1% to 5%. The VDD is 1V, and the cutoff to determine violation is 5% of VDD which is 0.05V. The number of clock domains in the test cases varies from 4 to 10. A node at the center of a clock domain is selected as the observation point, whose voltage response is simulated. The number of cycles required for superimposition is 6. We first show the proposed ILP based method via the four-domain example with 1% leakage in Table 3-2. Figure 3-11 shows the voltage responses with each domain working respectively. The worst violation area clock gating pattern given by the proposed algorithm in this example is {110011, 110001, 110011, 110011}, with each group for a clock domain. And the worst case violation area under that pattern is displayed in Figure 3-12. The dotted line is the worst case voltage response, the VDD is 1V, and the cutoff voltage to determine violation is 0.05V which is represented by the dashed line. Therefore, the area below this dashed line is the violation area whose value is 51.78 mV·ns. Then we compare the computational time between the enumeration method ("T\_enum.") and the proposed ILP based method ("T\_ILP"). The total leakage current is 1%. The number of decision variables is proportional to the number of domains. So the complexity of enumeration method grows exponentially as the number of domains grows. Hence, it only works for these cases with small numbers of clock domains. For the four-domain case, the enumeration method consumes 21 seconds which is over 200 times slower than the ILP based method. The proposed ILP based method works efficiently for complicated cases with more domains, and provides an optimal solution. The simulation time is not included in the computational time of Table 3-2. We also show the worst case violation area with different percentage of leakage in Table 3-3. The test case is the first one in Table 3-2 which is a four-domain power network. If we do not consider leakage current, the violation area is 51.685 mV·ns. When the percentage of leakage is increased from 1% to 5% which are some typical rates in low power wireless chips, the violation area keeps increasing as shown in the third column in Table 3-3. For general processors, the leakage could take more percentage of total current (i.e. 30%) and therefore, the violation area considering leakage would be much larger than the area without considering it. This implies the importance to have a worst case violation area analysis model that takes the leakage current into consideration. Figure 3-11: Voltage responses with each domain working respectively. Figure 3-12: Worst case voltage violation area. Table 3-2: Computational results with 1% leakage | Test | # Clock | T_enum. | T_ILP | A_ILP | |------|---------|---------|------------|-----------------| | case | Domain | (s) | <b>(s)</b> | $(mV \cdot ns)$ | | 1 | 4 | 21 | <0.1s | 51.78 | | 2 | 6 | N.A. | <0.1s | 46.05 | | 3 | 8 | N.A. | <0.1s | 56.87 | | 4 | 10 | N.A. | <0.1s | 62.67 | Table 3-3: Violation area comparison with different leakage percentage for a four-domain network | % of<br>Leakage | ViolationArea<br>(mV·ns) | % Violation area increased | | | |-----------------|--------------------------|----------------------------|--|--| | | ` , | nici cascu | | | | No leakage | 51.685 | U | | | | 1% | 51.778 | 0.18% | | | | 2% | 51.873 | 0.36% | | | | 3% | 51.971 | 0.55% | | | | 4% | 52.069 | 0.74% | | | | 5% | 52.171 | 0.94% | | | #### 3.4 Summary We present algorithms to analyze the worst-case voltage drop and violation area in power networks with multi-domain clock gating. A linear time complexity algorithm is introduced for the worst-case voltage drop. The ILP formulation is used to model the worst-case voltage violation area. The analysis results suggest that the power network noise would be much larger if considering the clock gating signals of multiple domains, and the leakage current is not negligible. Chapter 3, in part, is a reprint of the paper "Efficient Power Network Analysis Considering Multi-Domain Clock Gating", co-authored with Wenjian Yu, Xiang Hu, Ling Zhang, Rui Shi, He Peng, Zhi Zhu, Lew Chua-Eoan, Rajeev Murgai, Toshiyuki Shibuya, Noriyuki Ito, and Chung-Kuan Cheng, in *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)*. It is also a reprint of the paper "Predicting the Worst-Case Voltage Violation in a 3D Power Network ", co-authored with Wenjian Yu, Xiang Hu, Amirali Shayan, A. Ege Engin, Chung-Kuan Cheng, in *IEEE/ACM International Workshop on System Level Interconnect Prediction (SLIP 2009)*. The dissertation author was the primary investigator and author of these papers. ## 4. Power Network Optimization In this chapter, we discuss the approaches for power network optimization. Section 4.1 introduces the noise minimization algorithm during power-up stage for a multi-domain power network. Section 4.2 presents the optimization using both decaps and controlled-ESRs. We summarize these methodologies in section 4.3. ### 4.1 Noise Minimization during Power-Up Stage In current multiple power domain (MPD) designs, we need to take into account of the sequence of different power domains for minimizing the overall power noise. Also, we make sure all power domains are completely powered up before proceeding to other tasks. For example, the CPU may wait until the rest of the chip is powered up before booting [17]. As a result, all domains need to start powering up before a particular time point, which is referred as "deadline" in this paper. Turning on transistors in sequence can avoid drawing large amounts of current. The switches are grouped into several sets, and were turned on with delay in between [17]. Similarly, the power-up sequence for all domains is also critical for limiting the rush current, so that there may not cause voltage spikes that could corrupt registers. There are some timing relationships between domains due to the signal or data transition. For example, if domain A wants to get data in the tenth cycle after powering up from domain B, B needs to be turned on in time so that its data will be available when A acquires it. More applicably, there was a real industry hierarchical power distribution design with a power tree [20]. Those domains on lower levels of the hierarchy can be in the powered-on state only if the domains on higher levels are on. Therefore, when designing the power sequence for all domains, we need to consider the inter-domain timing relationships as constraints. To analyze the total noise when powering up all domains, the idea of superposition is utilized. We assume that the power network is a linear time invariant system, the voltage drop at one node is the superposition of those voltage drops caused by all domains individually. In this sense, we divide the analysis work into two steps. Firstly, we simulate the voltage response at the observation node with each domain working respectively, and then obtain the voltage drops. Secondly, we analyze the voltage noise with the superposition of all the voltage drops. #### **4.1.1 Problem Statement** Figure 4-1 illustrates the power-up sequence for multiple domains. Each row corresponds to the power status for a domain, and there are D domains. For each domain, one square represents the power status in one clock cycle. The blank square denotes power off, while the dark square denotes power on. $X_i$ ( $1 \le i \le D$ ) is the cycle when the ith domain switches to power on. Therefore, the voltage response contributed by the ith domain keeps zero during the previous $X_i$ -1 cycles. The nonzero voltage waveform can be generated by shifting the response powered up at the $X_i$ cycle. Based on the superposition idea, the overall voltage drop would be the summation of the drops contributed by all the domains. The noise minimization problem during power-up stage can be formulated to be an optimization work as shown in Figure 4-3. This problem is to find a power-up time sequence for multiple domains, denoted by $X_1$ , $X_2$ , ..., $X_D$ , to minimize the voltage violation area for a given observation node at power network. The related parameters are listed in Figure 4-2. We sample the voltage waveform for each domain with P time points, whose intervals are $d_i$ nano-seconds ( $1 \le i \le P$ ). Based on the violation area definition, the violated amount for the ith sampling point is the amount exceeding a tolerable cutoff. Then the violation area can be approximated by the multiplication of violated voltage and sampling interval. The inter-domain timing relationship will become the constraints as shown in Figure 4-3. Constraint (1) means domain k can start powering up no earlier than $a_{jk}$ cycles and no later than $b_{jk}$ cycles after domain j. The deadlines for powering up also impose additional constraints. Figure 4-1: Illustration of power-up sequence for multiple domains. - D: the number of domains; - T: the time period of clock; - P: the number of time samples to describe the voltage response contributed by one domain; - d: the interval between adjacent time sample points; - $X_i$ : the starting cycle when domain *i* bocomes power on; - $L_i$ : the last power-up cycle (deadline) for domain i; - $V_{\rm dd}$ : nominal high-level voltage; - $V_{\min}$ : minimal voltage requirement; Voltage is considered to be violation if below this value; - Cutoff: the allowed maximum voltage drop, i.e. $V_{\rm dd}$ - $V_{\rm min}$ . - $V_{\text{sup}}^{i}$ : superimposed voltage drop for the *i*th sampling point. - $V_{\text{violate}}^{i}$ : violated voltage amount for the *i*th sampling point. Figure 4-2: Parameter description for power-up sequencing problem. Power-up sequencing problem statement Objective function: $$\min \sum_{i=1}^{P} V_{violate}^{i} \cdot d$$ where $V_{violate}^{i} = \begin{cases} V_{sup}^{i} - cutoff, & \text{if } V_{sup}^{i} - cutoff > 0\\ 0, & \text{otherwise} \end{cases}$ . Constraints: - (1) Inter-domain timing relationships, i.e., $X_j + a_{jk} \le X_k \le X_j + b_{jk}$ ; - (2) Deadline to start powering on, i.e., $0 \le X_i \le L_i$ . Decision Variables: $$X_1, X_2, ..., X_D$$ . Figure 4-3: The power-up sequencing problem. This problem is NP-complete, and the proof is given as follows. *Lemma*: The Power-up sequencing problem (Figure 4-3) is NP-complete. *Proof*: The power-up sequencing problem is in NP, because verifying whether the violation area corresponding to a given sequence is less than a particular value or not can be done in polynomial time. To prove it is NP-hard, we reduce a known NP-complete problem, i. e. the partition problem [40], to the power-up sequencing problem. The partition problem is to decide whether a given set of m integers $A_1$ , ..., $A_m$ with the total sum S can be partitioned into two subsets that have the same sum S/2. We would like to reduce it to the decision version of the power on sequence problem, i.e. can we find a sequence such that the total violation area is less than or equal to a constant K. From an instance of the partition problem, we construct an instance of the powerup sequence with m domains. Each domain has two cycles and one sampling point per cycle. The voltage drop at this two sampling points for domain i are $V_{drop}^{i}[1] = A_{i}$ and $V_{drop}^{i}[2] = 0$ . Let us suppose Cutoff = S/2. Because there are only two cycles with one sampling point per cycle, each domain can shift 0 or 1 cycle, which means $X_{i} = 0/1$ . We can show that the partition problem has a solution if and only if the power-up sequence has a solution with violation area 0. When a partition problem has a solution of two subsets with equal sum S/2, we can start the domains which correspond to the first subset in the first cycle, and start those domains which correspond to the second subset in the second cycle. Thus we have $V_{\text{sup}}^1 = S/2$ , and $V_{\text{sup}}^2 = S/2$ . As the Cutoff = S/2, the violation area is 0. Conversely, when there is a solution to the power-up sequencing problem with violation area 0, we can partition the set based on the cycle where each domain is started. Hence we have proved the power-up sequencing problem is NP-complete #### 4.1.2 The Simulated Annealing Based Method The proposed method consists of three parts: domain ordering, a greedy algorithm to get an initial solution, and the simulated annealing (SA) based searching. We first order the position of domains in the solution to speed up the searching. Then, a greedy initial solution is obtained. The SA based searching algorithm solves the optimal powering up sequence for minimizing the total violation area. #### 4.1.2.1 Domain Ordering We order the domains as a sequence to generate the feasible solution. Because in one solution $X_1, X_2, ..., X_D$ , the domain which appears earlier will restrict the possible range for these domains that appear later based on the constraints. Therefore, if we have determined the starting cycles of those domains which have more constraint relationships with others, the following domains will have less search space. As a result, the searching efficiency will be greatly improved. The inter-domain timing constraints are modeled as a directed graph, shown in Figure 4-4. Every node represents one domain. A directed edge (*A*, *B*) points from one domain to another, which means domain *B* depends on domain *A*. The number on each edge is the range of cycles that domain *B* can choose based on the constraints given by domain *A*. For example, if one of the inter-domain constraints are: Domain 2 needs to power up after 10 cycles but before 14 cycles of domain 1, which means the freedom of domain 2 based on domain 1 is 4. Following the same rule, we construct the directed inter-domain relationship graph as show in Figure 4-4. The ordering algorithm is described in Figure 4-6, and Figure 4-5 shows the related parameters, where sequence P is the output of the algorithm. The main idea is: we want to firstly select the domain which can constrain more other domains and let these domains as less freedom as possible. The evaluation expression in (4-1) means the average freedom of the domains controlled by domain i. For a particular domain i, its value becomes smaller if domain i gives less freedom per domain that it controls. As the example in Figure 4-4, the first domain to choose is domain 1, because it is the only one with zero input degree. After we delete domain 1 and its output edge, the next domain to choose is domain 2, because its evaluation value is (5+14)/2=9.5, while the evaluation value for domain 6 is 12. If we continue this algorithm, the ordering result for the six domain power network in Figure 4-4 is: 1, 2, 6, 3, 4, 5. Figure 4-4: A constraint graph modeling the inter-domain relationships. S: the set of domains that have not been ordered; P: the ordered sequence of domains; Freedom<sub>ij</sub>: the selection range of domain j based on the constraint from domain i; OutDegree<sub>i</sub>: the output degree for domain i, which is the number of domains constrained by domain i. Figure 4-5: Parameter description for the ordering algorithm. Ordering Algorithm: Given the constraint graph $P=\varnothing; S=$ all the domains; While $S!=\varnothing$ do If (there is no domain in S which has output degree) Add S to the end of P; break; EndIf; Among the domains in S without input degree and with output degree, choose the domain i with $\frac{\sum_{j} Freedom_{ij}}{OutDegree_{i}}; \qquad (4-1)$ Delete domain i and its out edges from S; Add domain i to the end of P; EndWhile. Figure 4-6: The domain ordering algorithm. #### 4.1.2.2 Greedy Initial Solution We propose a greedy algorithm to find the initial solution for the SA based algorithm. As shown in Figure 4-7, the algorithm scans the ordered domains one by one. For each domain, it makes the local optimal decision which causes the minimal superimposed violation area. Finally, it outputs the initial values of X1, X2, ..., XD. ``` Greedy Initial Solution Algorithm: For i = 1 to D, According to the values of X_j, j < i, and the freedoms associated with the constraints on domain i, determine the possible values for X_i; Choose the value of X_i such that the superimposed violation area caused by domains from 1 to i is minimal; EndFor. ``` Figure 4-7: A greedy algorithm to find the initial solution. #### 4.1.2.3 Simulated Annealing Based Algorithm The simulated annealing based algorithm is presented in Figure 4-8. The cost function is the total violation area for a given power-up sequence. The voltage waveforms at observation node caused by each domain have been simulated in advance, and then the voltage drops by each domain will be obtained. So, the voltage drops are shifted with $X_i$ time cycles and then superimposed to easily produce the actual voltage drops. Then, the violation area below $V_{\min}$ , i.e. the cost function, is computed. The neighbor search plays a crucial role in a SA based algorithm. Given the current power-up sequence, we need to perturb it to produce a new sequence. We randomly choose a domain, and determine its possible range based on the constraints and current starting cycle of the other domains. The new starting cycle for this domain will be chosen within the possible range. This neighbor search method guarantees that all the constraints are met without further checking. The conventional cooling schedule and stopping criterion are adopted in our algorithm. With higher temperature, the algorithm has high probability to accept the current solution even though it is not better than the current best solution. Therefore, the algorithm searches within a larger space. However, when temperature becomes lower, the algorithm will have high probability to accept the solution that is better than the current best. The related parameters are determined experimentally. ``` Simulated Annealing Based Algorithm: Ordering(); Seq = GreedyInitialSolution(); Temp = Initial_Temperature; Iteration = 0; Repeat Neighbor(Seq, Seq'); Cost = ViolationArea(Seq'); dif = Cost - ViolationArea(Seq); if( Cost < minCost) minCost = Cost; minSeq = Seq'; end if r = Random(0, 1); if (r < \exp(-dif / Temp)) Seq = Seq'; Temp= Temp * Temperature_Adjustment; end if Iteration ++; Until Temp == Freezing Point or Iteration > maxIter; End. ``` Figure 4-8: Simulated annealing based algorithm. #### **4.1.3** Experimental Results We have implemented the SA based algorithm and an enumerating method for comparison in C language. The experiment environment is a PC with 3.2GHz Pentium 4 processor. Firstly, we give a case to show how the whole SA based method works, and analyze the relationship between the power-up deadline and the minimum violation area. Then, ten test cases with different domain numbers are discussed, for which the computational results from different methods are compared. #### 4.1.3.1 A Case with Eight Power Domains We consider a power network for a 7mm×7mm chip with eight domains. The power network is modeled with a RLC netlist. The $V_{\rm dd}$ is 1.2V. For a given observation node, we simulate its voltage responses with only one domain working. The voltage waveforms obtained with HSPICE are shown in Figure 4-9. The clock cycle is 5ns, and the simulation spans 100 cycles. From the figure, we can see all domains power up during the first 30 cycles, while the rush current leads to very sharp voltage drop. After each domain is fully charged and works normally, the voltage drop becomes much smaller. For this case, we assume the maximal allowed voltage drop is 0.1V, and there are several inter-domain timing relationships that need to be considered as constraints. With the simulated responses, the proposed method can be used to search the power-up sequence for the domains. If the power-up deadlines are all 50 cycles, the obtained minimal violation area with the proposed SA based algorithm is 1143.6 mV·ns. The corresponding power-up sequence is: 0, 14, 1, 22, 3, 32, 46, 50, which means the first domain becomes power on at the 0th cycle, the second domain becomes power on at the 14th cycle, and so forth. In order to discuss the relationship between power-up deadline and the minimum violation area, we analyze the eight-domain case with different power-up deadlines. The deadline increases from 20 cycles to 60 cycles with step size of 5 cycles. The minimum violation areas obtained by the proposed method are shown in Figure 4-10, where the optimal values from the enumerating method are also shown for comparison. From the figure, we can see that the minimum violation area decreases as the deadline increases. This means, the less tight deadline will give every domain more choices to make power-up sequence arrangement, and therefore reduces the voltage noise on power network. Figure 4-10 also helps designers to make tradeoff between the power-up schedule and the induced power noise. Long power-up stage introduces less power noise, but defers following tasks. In Figure 4-10, the curves of the SA based method and the enumerating method match with each other very well. This suggests the high accuracy of the proposed SA based algorithm. Figure 4-9: Voltage waveforms with only one domain working. Figure 4-10: Relationship between the power-up deadline and the minimum violation area. Table 4-1: Comparison between the enumerating method and SA based method for small cases | Circuit<br>Name | # of<br>Domain | A_Enum<br>(mV·ns) | A_SA<br>(mV·ns) | A_Greedy (mV·ns) | T_Enum (s) | T_SA(s) | Speed Up<br>(SA over<br>Enum.) | |-----------------|----------------|-------------------|-----------------|------------------|------------|---------|--------------------------------| | Ckt 1 | 4 | 196.6 (1) | 196.6 (1.00) | 675.9 (3.44) | 36 | 2 | 18.0 | | Ckt 2 | 4 | 489.6 (1) | 497.5 (1.02) | 1133.1 (2.31) | 37 | 2 | 18.5 | | Ckt 3 | 8 | 1356.8 (1) | 1377.9 (1.01) | 11289 (8.32) | 4192 | 10 | 419.2 | | Ckt 4 | 8 | 1143.6 (1) | 1143.6 (1.00) | 1734.4 (1.52) | 2313 | 9 | 257 | Table 4-2: Comparison between the enumerating method and SA based method for large cases | Circuit Name | # of Domain | A_Enum (mV·ns)<br>after 10 hours | A_SA (mV·ns) | A_Greedy (mV·ns) | T_SA(s) | |--------------|-------------|----------------------------------|--------------|------------------|---------| | Ckt 5 | 12 | 4452.9 | 1128.5 | 3002.8 | 28 | | Ckt 6 | 12 | 4650.5 | 1240.2 | 3001.4 | 34 | | Ckt 7 | 16 | 3712.3 | 2572.6 | 4411.3 | 56 | | Ckt 8 | 16 | 2537.1 | 1310.4 | 3064.9 | 65 | | Ckt 9 | 20 | 4081.4 | 1859.3 | 14136.1 | 114 | | Ckt 10 | 20 | 3616.9 | 1713.3 | 13419.5 | 96 | ## 4.2 Noise Minimization with Decap and Controlled-ESR A concept of controlled equivalent series resistor (controlled-ESR) was recently proposed for on-chip [41] and off-chip power network design [42][43]. It has been demonstrated that the controlled-ESR is effective to suppress the resonance, and reduce the simultaneous switching noise (SSN) as well [41][42]. In [41], the decap with controlled-ESR was implemented in a CMOS process, and the controlled-ESR was adaptively changed to reduce the on-chip SSN. However, this work only considers fixed decap allocation, and the adaptive scheme may not suit general VLSI circuits. In this work, we optimize the general on-chip power network with the usage of controlled-ESR. The noise of on-chip power network is minimized via the allocation of both decaps and controlled-ESRs. And different from previous works, we also consider the voltage overshoot when evaluating power network noise. Actually, excessive overshoot can cause system failure, and damages both the power supply and the loads coupled with it [47][48]. We formulate the optimization problem with a constraint of decap budget, instead of minimizing the total amount of decap. Since in most power network designs it is impossible or unnecessary to completely remove the voltage violation, our formulation is more practical. The optimization problem is solved with the sequential quadratic programming (SQP) algorithm. Experimental results show that the optimization algorithm is efficient, and the noise is reduced by 25% on average via the allocation of both decaps and controlled-ESRs. The main contributions are summarized as follows: (1) We propose to allocate decaps and controlled-ESRs simultaneously to suppress the resonance and reduce SSN of power network. - (2) We consider both voltage drop and overshoot for voltage violation. A revised sensitivity approach is presented to calculate the sensitivity the overall violation area with respective to both decap and controlled-ESR. - (3) An optimization formulation with the objective function of minimizing the voltage violation area and a constraint of decap budget is presented, and solved with an efficient SQP algorithm. #### 4.2.1 Power Network Model With Controlled-ESR The power network is usually modeled as a circuit including resistance, capacitance and packaging inductance, as shown in Figure 4-11. The controlled-ESRs and decaps are connected between power grid nodes and the ground. Time-varying current sources are connected to some circuit nodes, characterizing the supply current for active circuit instances. These current sources draw current from the power network and cause voltage fluctuations [49]. The waveform of current source is described as a piecewise linear (PWL) function. The controlled-ESR is able to effectively reduce the SSN especially when there is resonance phenomena caused by both the off-chip inductance and the on-chip capacitance [41]. However, if there are excessive controlled-ESRs, the impedance of power network will increase causing large SSN. A simplified model of power network with controlled-ESR is shown in Figure 4-12. Figure 4-13 shows the effect of controlled-ESR with different values on reducing the noise. As the value of controlled-ESR increases from 10 mOhm to 1 Ohm, the noise is gradually reduced. But when there are excessive controlled-ESRs such as 10 Ohm, the noise (dash line) will become even larger. Therefore, we need to decide the adequate amount of controlled-ESRs. Figure 4-11: Power network model with controlled-ESR. Figure 4-12: A simple case to show the effect of controlled-ESR. Figure 4-13: Effect of controlled-ESR on reducing the noise. #### **4.2.2 Problem Statement** We formulate the noise minimization problem as a nonlinear optimization problem. In this section, we first introduce the noise definition considering both voltage drop and overshoot. Then, the problem formulation including the objective function and constraints are presented. #### 4.2.2.1 Power Network Noise Considering Voltage Overshoot Most existing works consider the power network noise as the violation area at circuit node. The violation area at node j is defined as: $$g_{j} = \int_{0}^{T} \max(V_{\min} - v_{j}(t), \ 0)dt$$ , (4-2) where $V_{min}$ is a pre-defined threshold voltage, and $v_j(t)$ is the voltage curve of node j. The violation area $g_j$ equals to the shaded area below $V_{min}$ . (see Figure 1-2) This definition only considers the violation area below the nominal Vdd, as shown in Figure 1-2, and neglects the voltage overshoot. The voltage overshoot is a transient rise of output voltage beyond Vdd. Excessive overshoot will lead to logic failure or the reliability issues. Therefore, the existing works underestimated power network noise due to the neglect of voltage overshoot. In this paper, we consider both voltage drop and overshoot as the power network noise. The total violation area is defined as: $$g_{j} = \int_{0}^{T} \max[\max(V_{\min} - v_{j}(t), 0), \max(v_{j}(t) - V_{\max}, 0)]dt, \quad (4-3)$$ where $V_{\text{max}}$ is a pre-defined threshold voltage above Vdd. With this definition, the violation area equals to the shaded area shown in Figure 4-14. Figure 4-14: Voltage violation area with both voltage drop and overshoot considered. #### 4.2.2.2 Formulation of Optimization Problem For the power network optimization, the objective function is to minimize the total power network noise, which is the total violation area caused by both voltage drop and overshoot. The decision variables are the amount of decap and controlled-ESR at each candidate location: $c_i$ and $CtrlESR_i$ , $1 \le i \le M$ . Here M denotes the number of candidate locations. We assume the power network stimulus is given. There are four constraints. The first one is that the voltages on nodes satisfy the circuit equation with given stimulus. With the transient voltage response, the violation area is then determined by equation (4-3). The second constraint is the total decap budget. The summation of all decap amounts should be not larger than a pre-defined budget Q. The third and forth constraint are the white space limitation for each candidate position. They mean that the amount of decap and controlled-ESR can not exceed the maximum allowed value. The formulation of the optimization problem is shown in Figure 4-15. Some previous work, such as [29], put the noise and total decap together into the objective function with different weights. However, such formulation could be misleading because the optimization direction may be decided by the weight rather than the sensitivity [28] Some other work directly set the total decap amount to be the objective function to be minimized, with the constraint that the violation area is zero. In most on-chip power network designs, it is almost impossible to avoid voltage violation. So, we minimize the violation area, and set the total decap budget as a constraint. We do not set the budget for the controlled-ESR because it is relatively cheaper than decap. And, the decap budget can be adjusted to evaluate the tradeoff between decap investment and noise elimination. Thus, the proposed formulation is reasonable and flexible. Objective function: $$\operatorname{Min} \sum_{i=1}^{N} g_{j}$$ Constraints: - (1) Voltage response satisfies the circuit equation with given stimulus; - (2) Total decap budget: $\sum_{i=1}^{M} c_i \leq Q$ ; - (3) Space constraint for each decap location: $0 \le c_i \le c_{\text{max}i}$ ; - (4) Space constraint for each controlled-ESR location: $0 \le CtrlESR_i \le CtrlESR_{maxi}$ . Figure 4-15: Problem formulation of noise minimization of power network via allocation of decaps and controlled-ESRs #### 4.2.3 Revised Sensitivity Computation To solve the problem of power network noise minimization, the sensitivity value of the objective function to each decision variable is needed as the gradient in the procedure of nonlinear optimization. In this section, we firstly derive the voltage variation due to decap and controlled-ESR from the view of circuit state equation, and then review the merged adjoint network method to calculate the sensitivity of voltage violation area with respect to decap. Lastly, the revised sensitivity computation is presented, which considers the voltage overshoot and includes the controlled-ESR as the decision variable. #### 4.2.3.1 Circuit Sensitivity Analysis with State Equation From Kirchhoff's Current Law (KCL) and Kirchhoff's Voltage Law (KVL), we have the state equations: $$\begin{bmatrix} C & 0 \\ 0 & L \end{bmatrix} \begin{bmatrix} \mathcal{R} \\ \mathcal{R} \end{bmatrix} = \begin{bmatrix} -G & -E \\ E^T & -R \end{bmatrix} \begin{bmatrix} v \\ i \end{bmatrix} + BU$$ (4-4) where C is the capacitance matrix, L is the inductance matrix. v is the voltage in each node and i is the current through every branch with inductance. G and R are conductance and resistance matrix, respectively. U is the input vector such as current source. We can get every node voltage and branch current by solving this equation. We denote equation (4-4) to be $$C = Ax + Bu (4-5)$$ If we add some extra decap $\Delta C$ and controlled-ESR $\Delta A$ to the circuit, the solution vector x will be updated by $\Delta x$ . And then, the state equation (4-5) becomes $$(C + \Delta C)(x + \Delta x) = (A + \Delta A)(x + \Delta x) + Bu \qquad (4-6)$$ By subtracting (4-5) from (4-6), we further get $$(C + \Delta C)\Delta \mathcal{R} = (A + \Delta A)\Delta x + (\Delta A x - \Delta C \mathcal{R})$$ (4-7) The solution to the above differential equation (4-7) is $$\Delta x = e^{\frac{\partial^{r_0} \mathcal{N}(t-t_0)}{2}} \Delta x_0 + \int_{t_0}^{t} e^{\frac{\partial^{r_0} \mathcal{N}(t-\tau)}{2}} \mathcal{O}(\tau) d\tau \quad , \tag{4-8}$$ where $$\mathcal{C}' = C + \Delta C$$ , $\mathcal{A}' = A + \Delta A$ , $\mathcal{C}' = \Delta Ax - \Delta Cx$ , and $$e^{A6} = I + A6 + \frac{A6}{2!} + \frac{A6}{3!} + \dots$$ With (4-8), it is clear that the change of decap or controlled-ESR will cause the variation of voltage response. #### 4.2.3.2 Sensitivity Computation with the Merged Adjoint Network Method In existing works, the sensitivity computation only considers the violation area below Vdd as shown in Figure 1-2. The sensitivity $s_{ij}$ is defined to be the contribution of decap added at node i to remove violation at node j: $$s_{ij} = \frac{\partial g_j}{\partial c_i} \quad , \tag{4-9}$$ where $c_i$ is the decap value at node i, and $g_i$ is the violation area defined in (4-2). Because the computational complexity for standard sensitivity is very high, which requires N times of simulations (N is the number of violation nodes), the merged adjoint sensitivity is introduced to speed up the computation. The merged adjoint sensitivity is defined to be the contribution of decap added at node i to remove the violation for all nodes. An adjoint network is with the same topology as original network but with all of the voltage sources in the original network shorted and current sources open. The merged adjoint network has a current source $u(t-t_s)-u(t-t_e)$ applied at every node j if it is a node with noise defined in Figure 1-2, where $t_s$ is the starting time and $t_e$ is the ending time of violation and u(t) is the step function. With this method, the merged adjoint sensitivity is calculated with $$s_{i} = \sum_{j=1}^{N} s_{ij} = \int_{0}^{T} (\mathcal{Y}_{i,all}(T-t)) \times \mathcal{X}(t) dt, (i = 1, 2, ..., M) , \qquad (4-10)$$ where $\theta_{f,all}(T-t)$ is the voltage waveform at node i from the adjoint network with combined step current sources, $\mathcal{L}(t)$ is the derivative of the voltage waveform at node i in the original network. M is the number of candidate nodes to put decap. We need to set the initial condition of merged adjoint network to be zero and analyze backward in time. #### 4.2.2.3 Revised Sensitivity Computation Considering Voltage Overshoot To consider voltage overshoot, the voltage violation area is defined as that shown in Figure 4-14. We use the time points $t_{sk}$ , $t_{ek}$ to denote the starting and ending time of the kth violation, respectively. We need to calculate the sensitivity of the violation area defined in (4-3) to decap $c_i$ and controlled-ESR $CtrlESR_i$ , respectively. It is convenient to extract all independent sources to form a multiport circuit. We denote the port currents and voltages by vectors $I_p$ and $V_p$ . Denote the non-source branch currents and voltages by vectors $I_b$ and $V_b$ . From Tellegen's theorem, we have $$\int_{0}^{T} \left[ -\hat{i}_{p}(\tau) \Delta v_{p}(t) + \hat{v}_{p}(\tau) \Delta i_{p}(t) \right]_{\tau=T-t} dt$$ $$= \int_{0}^{T} \left[ \hat{i}_{b}(\tau) \Delta v_{b}(t) + \hat{v}_{b}(\tau) \Delta i_{b}(t) \right]_{\tau=T-t} dt$$ , (4-11) where lowercase variables of i or v stand for the quantities of a specified port or branch. The symbols with hat ( $^{\wedge}$ ) are for an adjoint network, those without hat are for the original network. We set all voltage sources in the adjoint network to zero and apply a current source for each violation node: $$I_{s} = \sum_{k=1}^{N_{v}} D_{k} [u(t - t_{sk}) - u(t - t_{ek})]$$ , (4-12) where $N_{\nu}$ is the number of violation stages on the voltage waveform (see Figure 4-14), whose time intervals are $(t_{s1}, t_{e1}), (t_{s2}, t_{e2}), ..., (t_{s,N\nu}, t_{e,N\nu})$ . $D_k$ is defined as: $$D_k = \begin{cases} -1, & \text{if } v(t_{sk}) > \text{Vdd} \\ 1, & \text{if } v(t_{sk}) < \text{Vdd} \end{cases}, k=1, ..., N_v.$$ Then, the left hand side of equation (4-11) becomes $$\Delta g = \int_{0}^{T} \left\{ -\sum_{k=1}^{N_{v}} D_{k} [u(t - t_{sk}) - u(t - t_{ek})] \Delta v_{p}(t) \right\}_{\tau = T - t} dt . \tag{4-13}$$ This is exactly the derivative of violation area for voltage drop and overshoot. The evaluation of the right hand side of equation (4-11) follows the same way as [46]. Then the sensitivity component for the decap is: $$s_{C} = \frac{\Delta g}{\Delta C} = \int_{0}^{T} -[\hat{v}_{c}(\tau) \&(t)]_{\tau=T-t} dt , \qquad (4-14)$$ and for the controlled-ESR is: $$s_{R} = \frac{\Delta g}{\Delta R} = \int_{0}^{T} [\hat{i}_{R}(\tau) i_{R}(t)]_{\tau = T - t} dt \quad . \tag{4-15}$$ The revised sensitivity computation method is described in Figure 4-16. The input is the power network which needs to be optimized. The output is the violation area sensitivity with respect to each controlled-ESR and decap. #### Algorithm for the revised sensitivity computation: - 1. Simulate input circuit, and obtain a violation node set. - 2. Obtain starting and ending time for each violation node shown in as Figure 4-14. - 3. Obtain current $i_R(t)$ across controlled-ESR, and voltage derivation $\mathcal{E}(t)$ across decap in the original network. - 4. Construct adjoint network, and apply the excitation described in Section 4.2.2.3 to the adjoint network. - 5. Simulate the adjoint network. - 6. Obtain current $\hat{i}_{R}(\tau)$ across controlled-ESR, and voltage $\hat{v}_{c}(\tau)$ across decap in the adjoint network. - 7. Evaluate the expression (4-14) and (4-15) to compute the sensitivity for each decap and controlled-ESR. Figure 4-16: The algorithm description for revised sensitivity computation. #### **4.2.4 SQP Based Optimization** We apply sequential quadratic programming (SQP) to solve the problem formulated in Figure 4-15. The SQP is the state-of-the-art nonlinear programming method, which closely mimics Newton's method for constrained optimization just as is done for unconstrained optimization [44][45]. For each iteration, the quasi-Newton updating method is used for the approximation of the Hessian matrix of the Lagrangian function. This approximation is then used to generate a quadratic programming (QP) subproblem. The solution for the subproblem is used to form a search direction for a line search procedure. The SQP based optimization method is described in Figure 4-17. The parameter $X^{(t)}$ is the solution of decap and controlled-ESR values in the t'th iteration. In our implementation, the noise sensitivities with respect to all the decap and controlled-ESR are computed and are provided into the SQP algorithm as the first-order gradient, which is used to construct the QP sub-problem at each iteration. After solving this QP problem, the step size $d^{(t)}$ is determined to update the solution $X^{(t+1)}$ . The termination criterion takes the value change of objective function, the step size, and the number of iteration into account. In the optimization, the total amount of controlled-ESRs is not restricted since it is relatively cheaper than decap. Therefore, higher priority should be made on allocating adequate amount of controlled-ESR than decap. Due to the sensitivity of violation area with respect to controlled-ESR is much smaller than decap, we need to scale the controlled-ESR sensitivity to get higher priority. #### Algorithm for the SQP based optimization: - 1. Select the intrinsic capacitance and controlled-ESR to be the initial solution $X^{(0)}$ . - 2. Simulate the power network circuit, and compute the sensitivity as gradient using the algorithm in Figure 4-16. - 3. Use the gradient to approximate the problem in Figure 4-15 with a linearly constrained QP subproblem at $X^{(t)}$ . - 4. Solve for the step size $d^{(t)}$ to move. - 5. If meet with termination condition, stop; Else, let $X^{(t+1)} = X^{(t)} + d^{(t)}$ . - 6. Increase *t* and return to step 2. Figure 4-17: SQP based optimization method. #### **4.2.5 Experimental Results** The proposed SQP based optimization algorithm with revised sensitivity computation is implemented by Matlab and Perl. HSPICE is used to simulate the circuit. All experiments run on a 3GHz Pentium machine with 4GB memory. Six cases of simplified industry on-chip power networks are tested. They are of mesh structure with R, C and packaging L. The number of nodes for each case varies from 858 to 14K, and is listed in the second column of Table 4-3. The nominal Vdd is 1.0V, and the allowable voltage fluctuation is 5%~10% of Vdd. In all experiments, we set the same noise tolerance ratio for voltage drop and overshoot. #### 4.2.5.1 Effect of Considering Voltage Overshoot For each case, there are some intrinsic capacitors and controlled-ESRs. We firstly simulate these cases to reveal the effect of considering voltage overshoot on the voltage violation area. The value of noise, i.e. the voltage violation area and the number of violation nodes are listed in Table 4-3, with overshoot considered or not. As shown in the column 4 and 6 of Table 4-3, the number of violation nodes is almost the same for the both situations. However, the total noise is on average underestimated by 4.8% due to neglecting the voltage overshoot (compare column 3 and 5). Although the voltage drop noise still takes the major part of overall noise, the voltage overshoot should not be neglected in the noise model because it can lead to race condition and even logic failure. #### 4.2.5.2 Effect of Optimizing with Controlled-ESR We then compare different optimization methods for the minimization of power network noise, with the same decap budget. In some industry design, decaps are evenly distributed at all candidate nodes. This method is straightforward, but definitely could not get the optimal solution. In the second method, the SQP based optimization described in Figure 4-17 is employed but only the decaps are allocated. The third one is the proposed SQP based method allocating both decaps and controlled-ESRs. The comparison results are listed in Table 4-4. It is demonstrated that the number of violation nodes (column 3, 5, 7) is reduced, compared with the results in Table 4-3. And, the third method optimizing both decap and controlled-ESR yields the best solutions. For the noise, the third method also achieves the best solutions. The reduction on the minimized noise over the second method is listed in column 9 of Table 4-4, which shows that the improvement brought by considering the controlled-ESRs is 25% on average. With the third method, the average allocated controlled-ESR ranges from 0.038 Ohm to 0.083 Ohm for different cases. Figure 4-18 shows the voltage waveforms of one violation node in circuit CKT1 with the optimization methods applied or not. The dot-dash line is the waveform without optimization. The dash line is the waveform with even distribution of decap, which is better than the original waveform but still has higher noise than the SQP based optimization. The SQP method considering both decap and controlled-ESR is the best one which produces voltage response with less noise than the optimization considering decap only. In Table 4-4, the overall CPU time of the proposed method and the time spent on SQP are listed. Therefore, the simulation time is the difference between those two values. For the largest case with 14K nodes the total computational time is about 42 minutes. For each iteration, two HSPICE simulations run for the original and adjoint network, respectively, while the number of iteration is always less then twenty for the test cases. It is found out that the computational time is mainly spent by HSPICE simulation. With a faster simulation method for power network, the method would achieve faster computational speed. #### 4.2.5.3 Relationship between Decap Budget and Noise To study the relationship between the minimal noise the optimization method can reach and the decap budget, we optimize the circuit CKT1 with different values of decap budget. The results are plotted in Figure 4-19. The decap constraint ranges from 5nF to 50nF with a 5nF increment. The results show that larger decap budget leads to smaller noise. As more decaps are added, the benefit would increase slower (see Figure 4-19). This relationship will help the designer to make the tradeoff between the noise reduction and the decap investment. Figure 4-18: Voltage waveforms obtained with different optimization methods. Figure 4-19: Relationship between decap budget and the noise. Table 4-3: Test cases description and the noise underestimation due to neglecting voltage overshoot | Circuit # Node | | Consider voltage drop only | | Consider both vo | ltage drop and overshoot | Noise underestimation due | |----------------|-------|----------------------------|------------------|------------------|--------------------------|---------------------------| | | | Noise (V*ps) | # Violation node | Noise (V*ps) | # Violation node | to neglecting overshoot | | CKT1 | 858 | 306.7 | 46 | 324.8 | 46 | 5.6% | | CKT2 | 1794 | 7406.3 | 325 | 8127.8 | 347 | 8.9% | | CKT3 | 2006 | 5111.3 | 309 | 5324.9 | 309 | 4.0% | | CKT4 | 3634 | 9770.1 | 268 | 10049.6 | 268 | 2.8% | | CKT5 | 8330 | 5608.1 | 2470 | 5829.1 | 2470 | 4.0% | | CKT6 | 14852 | 31420.8 | 2243 | 32477.3 | 2243 | 3.4% | Table 4-4: Comparison among three methods for the minimization of power network noise | Circuit | Evenly allocate the decaps | | Allocate decaps only with<br>the SQP-based method | | Allocate both decaps and controlled-ESRs with<br>the SQP-based method | | | | |---------|----------------------------|-------------|---------------------------------------------------|-------------|-----------------------------------------------------------------------|-------------|-------------|---------------------------| | Circuit | Noise | # Violation | Noise | # Violation | Noise | # Violation | | Noise reduction compared | | | (V*ps) | node | (V*ps) | node | (V*ps) | node | time (s) | to allocating decaps only | | CKT1 | 229.5 | 20 | 113.4 | 23 | 82.1 | 16 | 26.2/9.9 | 27.6% | | CKT2 | 6137.1 | 156 | 2538.0 | 104 | 2023.4 | 47 | 172.8/46.0 | 20.3% | | CKT3 | 4597.9 | 141 | 2308.3 | 88 | 2077.2 | 78 | 116.7/42.9 | 10.0% | | CKT4 | 8939.0 | 235 | 2212.5 | 96 | 1527.0 | 47 | 141.6/34.8 | 30.8% | | CKT5 | 5352.3 | 1245 | 1694.3 | 617 | 1073.0 | 333 | 1195.2/66.2 | 36.7% | | CKT6 | 26916.9 | 1191 | 6538.1 | 390 | 4853.6 | 204 | 2564.1/35.3 | 25.8% | #### 4.3 Summary In this chapter, we present the power-up sequence optimization and the noise minimization with both decap and controlled-ESR. We formulate the power-up sequencing problem in the power domain level, and prove its NP-Completeness. Hense, an SA based algorithm with ordering and greedy initial solution is proposed. The experimental results show the SA based approach is as accurate as optimal solution, but much more efficient than enumerating. We optimize the circuit itself with both decap and controlled-ESR instead of considering decap only. The SQP algorithm is adopted to solve the optimization problem where the revised sensitivity is regarded as the gradient. We demonstrate the controlled-ESR is able to reduce the noise by 25% with the same decap budget. Chapter 4, in part, is a reprint of the paper "Noise Minimization During Power-Up State for a Multi-Domain Power Network ", co-authored with Yi Zhu, Wenjian Yu, Amirali Shayan, Renshen Wang, Zhi Zhu, Chung-Kuan Cheng, in *Proc. IEEE / ACM Asia and South Pacific Design Automation Conference (ASP-DAC 2009)*, pp. 391-396. It is also a reprint of the paper "On-Chip Power Network Optimization with Decoupling Capacitors and Controlled-ESRs", co-authored with Ling Zhang, Amirali Shayan, Wenjian Yu, Xiang Hu, Zhi Zhu, A. Ege Engin, and Chung-Kuan Cheng, under submission. The dissertation author was the primary investigator and author of these papers. ## 5. Conclusion ### **5.1 Summary of Contributions** In this dissertation, we investigate the algorithms and methodologies for the power network analysis and optimization. We propose an efficient simulation flow, describe the models of worst-case noise analysis, and design algorithms for power-up sequence optimization and noise minimization. The contributions are summarized as follows. - (1) We propose an efficient transient simulation flow based on frequency domain computation as presented in Chapter 2. With the vector fitting technique, the frequency-domain responses are approximated by a partial fractional expression. Then, it can be easily converted to the time-domain transient waveform. With the techniques of log-scale frequency sampling and efficient iterative linear equation solver, the simulation method is orders of magnitude faster than the conventional time-domain simulation methods, while preserving sufficient accuracy. The simulation flow can be parallelized to achieve more speedup. - (2) We describe the framework for the worst-case noise analysis in a multidomain clock gated power network. The framework utilizes the time-domain voltage response corresponding to a single clock domain working for one cycle. With the voltage responses from all individual domains, the worst-case clock patterns and the corresponding maximum voltage drop and violation area are predicted using the principle of superposition. Chapter 3 introduces a linear time complexity algorithm for the worst-case voltage drop estimation, and an ILP based approach for the worst-case violation area prediction. (3) We design efficient optimization algorithms for power-up sequence and noise minimization with decap and controlled-ESR. In chapter 4, the power-up sequencing problem at power-domain level is formulated and is proved to be NP-complete. The simulated annealing based algorithm with domain ordering and initial greedy solution is proposed to find the power-up sequence with minimal noise. We extend the conventional power network optimization approaches to include both decap and controlled-ESR. A revised sensitivity computation considering voltage overshoot is derived. SQP is used to solve the optimization formulation where the revised sensitivity is regarded as the gradient. #### **5.2 Future Work** Our study improves the existing power network analysis and optimization algorithms. In the future, more optimization approaches can be investigated. For example, how to allocate voltage regulator modular (VRM) to reduce power network noise. The VRM is helpful to suppress voltage fluctuation, but costs more areas and consumes more power. Therefore, we need to design an algorithm to eliminate power network noise by allocating VRMs with minimal costs. We may also extend the on chip power network optimization via decap and controlled-ESR to multi-domain networks. Most existing power network optimization works assume single domain. They do not consider the multi-domain case which would be more complex and may lead to new results. Because the sensitivity computation plays a key role in the optimization work, the difficulty is to derive the decap and ESR sensitivity contributed by each domain. SQP may not be the best candidate for the power network case, and then more efficient optimization algorithms can be applied or developed. # 6. Bibliography - [1] H. Chen, and J. Neely. "Interconnect and circuit modeling techniques for full-chip power supply noise analysis" *IEEE Transactions on Components, Packaging, and Manufacturing Technology- part B*, vol. 21, No.3, August 1998, pp. 209-215. - [2] H. Chen, S. Schuster, "On-chip decoupling capacitor optimization for high-performance VLSI design" in *Proc. International Symposium on VLSI Technology, Systems, and Applications,* 1995, pp. 99 103. - [3] S. Lin, M. Nagata, K. Shimazake, K. Satoh, M. Sumita, H. Tsujikawa, A. Yang, "Full-chip vectorless dynamic power integrity analysis and verification against 100uV/100ps-resolution measurement," in *Proc. IEEE Custom Integr. Circuits Conf.*, Oct. 2004, pp. 509-512. - [4] T. Pillage, R. Rohrer, "Electronic Circuit and System Simulation Methods" C. Visweswariah, McGraw-Hill, 1994. - [5] H. Li, Z. Qi, S. X.-D. Tan, et al., "Partitioning-based approach to fast on-chip decap budgeting and minimization," in *Proc. Design Automation Conf.*, June 2005, pp. 170-175. - [6] S. R. Nassif, and J. N. Kozhaya, "Fast power grid simulation," in *Proc. Design Automation Conf.*, June 2000, pp. 156-161. - [7] H. Su, K. Gala, and S. Sapatnekar, "Analysis and optimization of structured power/ground networks," *IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems*, vol. 22, no. 11, pp. 1533-1544, Nov. 2003. - [8] E. Natarajan, "KLU A high performance sparse linear solver for circuit simulation problems," MS dissertation, University of Florida, 2005. - [9] T. Chen and C. Chen, "Efficient large-scale power grid analysis based on preconditioned Krylov-subspace iterative methods," in Proc. Design Automation Conf., 2001, pp. 559-562. - [10] L. Zhao and C. Shi, "An efficiently preconditioned GMRES method for fast parasitic-sensitive deep-submicron VLSI circuit simulation," in Proc. Design Automation and Test Europe Conf. and Exhibition, Mar. 2005, pp. 752-757. - [11] S. Chowdhury and J. Barkatullah, "Estimation of maximum currents in MOS IC logic circuits", *IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems*, vol. 9, no. 6, pp. 642-654, Jun. 1990. - [12] Y. Jiang and K. Cheng, "Vector generation for power supply noise estimation and verification of deep submicron designs", *IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems*, vol. 9, no. 2, pp. 329-339, Apr. 2001. - [13] G. Bai, S. Bobba, and I. N. Hajj, "Simulation and optimization of the power distribution network in VLSI circuits," in *Proc. IEEE/ACM Int. Conf. Computer Aided Des.*, Nov. 2000, pp. 481-486. - [14] H. Kriplani, F. N. Najm and I. N. Hajj, "Pattern independent maximum current estimation in power and ground buses of CMOS VLSI circuits: Algorithms, signal correlations, and their resolution", *IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems*, vol. 14, no. 8, pp. 998-1012, Aug. 1995. - [15] S. Bobba and I. N. Hajj, "Estimation of maximum current envelope for power bus analysis and design", in *Proc. of ISPD*, pp. 141-146, Apr. 1998. - [16] J. Shi, Y. Cai, S. X.-D. Tan, and X. Hong, "Efficient early stage resonance estimation techniques for C4 package," in *Proc. Asia South Pacific Design* Automation Conf., Jan. 2006, pp. 826-831. - [17] M. Keating, D. Flynn, R. Aitken, A. Gibbons, K. Shi, Low Power Methodology Manual, Springer, 2007. - [18] J. Salmon, N. Dour, "Circuit for independent power-up sequence of a multi-voltage chip," US Patent 6236250, 2001. - [19] N. Ranjan, "Mixed voltage, multi-rail, high drive, low noise, adjustable slew rate input/output buffer," US Patent 5862390, 1999. - [20] Y. Kanno, et al., "Hierarchical power distribution with 20 power domains in 90nm low-power multi-CPU processor," in *Proc. ISSCC*, Feb. 2006, pp. 2200-2209. - [21] T. Hattori, et al., "A power management scheme controlling 20 power domains for a single-chip mobile processor," in *Proc. ISSCC*, Feb. 2006, pp. 2210-2219. - [22] M. Zhao, R. Panda, S. Sundareswaran, S. Yan, Y. Fu, "A fast on-chip decoupling capacitance budgeting algorithm using macromodeling and linear programming", in *Proc. DAC*, 2006, pp. 217-222. - [23] L. Smith, R. Anderson, D. Forehand, T. Pelc, T. Roy, "Power distribution system design methodology and capacitor selection for modern CMOS technology", *IEEE Transactions on Advanced Packaging*, Vol. 22, No. 3, pp. 284-291, Aug. 1999. - [24] M. Pant, P. Pant, D. Wills, "On-chip decoupling capacitor optimization using architectural level prediction" *IEEE Transactions on VLSI*, Vol. 10, No. 3, pp. 319-326, June 2002. - [25] Z. Qi, H. Li, S. Tan, Y. Cai, X. Hong, "On-chip decoupling capacitor budgeting by sequence of linear programming", in *Proc. ASIC*, Oct. 2005, pp. 98-101. - [26] H. Su, K. Gala, S. Sapatnekar, "Analysis and optimization of structured power/ground networks", *IEEE Transactions on CAD*, Vol. 22, No. 11, pp. 1533-1544, Nov, 2003. - [27] H. Su, S. Sapatnekar, S. Nassif, "Optimal decoupling capacitor sizing and placement for standard-cell layout designs," *IEEE Transactions on CAD*, Vol. 22, No. 4, pp. 428-436, Apr. 2003. - [28] H. Li, Z. Qi, S. Tan, L. Wu, Y. Cai, X. Hong, "Partitioning-based approach to fast on-chip decap budgeting and minimization," *IEEE Transactions on CAD*, Vol. 25, No. 11, pp. 2402-2412, Nov. 2006. - [29] J. Fu, Z. Luo, X. Hong, Y. Cai, S. Tan, Z. Pan, "A fast decoupling capacitor budgeting algorithm for robust on-chip power delivery", in *Proc. ASP-DAC*, Jan. 2004, pp. 505-510. - [30] S. Balay, K. Buschelman, W. D. Gropp, et al., PETSc Web page, 2001. http://www.mcs.anl.gov/petsc. - [31] B. Gustavsen, User's Manual for Vector Fitting Version 2.1 for Matlab. <a href="http://www.energy.sintef.no/produkt/VECTFIT/index.asp">http://www.energy.sintef.no/produkt/VECTFIT/index.asp</a> - [32] B. Gustavsen, "Improving the pole relocating properties of vector fitting," *IEEE Trans. Power Delivery*, vol. 21, no. 3, pp. 1587-1592, July 2006. - [33] B. Gustavsen and A. Semlyen, "Rational approximation of frequency domain responses by vector fitting," *IEEE Trans. Power Delivery*, vol. 14, no. 3, pp. 1052-1061, July 1999. - [34] http://www.fastrack-design.com/mspice.php - [35] H. Li, S. Bhunia, Y. Chen, et al., "Deterministic clock gating for microprocessor power reduction," in Proc. Int. Symp. on High-Performance Computer Architecture, Feb. 2003, pp. 113-122. - [36] H. Su, F. Liu, A. Devgan, E. Acar, S. Nassif, "Full chip leakage estimation considering power supply and temperature variations", in *Proc. ISLPED 2003*, pp. 78 83. - [37] S. Borkar, "Low power design challenges for the decade", in *Proc. Asia South Pacific Design Automation Conf.*, 2001, pp. 293-296. - [38] M. Bickerstaff, L. Davis, C. Thomas, D. Garrett, C. Nicol, "A 24Mb/s radix-4 LogMap turbo decoder for 3GPP-HSDPA mobile wireless", in *Proc. ISSCC* 2003, pp. 150-151. - [39] "ILOG CPLEX: High performance software for mathematical programming and optimization", <a href="http://www.ilog.com/products/cplex/">http://www.ilog.com/products/cplex/</a> - [40] M. R. Garey, D. S. Johnson, *Computers and Intractability: A Guide to the Theory of NP-Completeness*, Mathematics-Worth Publishers, Incorporated, 1979. - [41] J. Shim, M. Shin, H. Kim, Y. Kim, K. Park, J. Cho, J. Kim, "An adaptive on-chip ESR controller scheme in power distribution network for simultaneous switching noise reduction," In *Proc. EPEP*, Nov. 2008, pp. 169-172. - [42] I. Novak, S. Pannala, J. Miller, "Overview of some options to create low-Q controlled-ESR bypass capacitors", in *Proc. EPEP*, Nov. 2004, pp. 55-58. - [43] I. Novak, L. Noujeim, V. Cyr, N. Biunno, A. Patel, G. Korony, A. Ritter, "Distributed matched bypassing for board-level power distribution networks", *IEEE Transactions on Advanced Packaging*, Vol. 25, No. 2, pp. 230-243, May 2002. - [44] S. P. Han, "A globally convergent method for nonlinear programming," *J. Optimization Theory and Applications*, Vol. 22, pp. 297, 1977. - [45] P. Gill, W. Murray, M. Saunders, "SNOPT: An SQP algorithm for large-scale constrained optimization", *SIAM Review*, Vol. 47, No. 1, pp. 99-131, 2005. - [46] L. Chua, P. Lin, *Computer-Aided Analysis of Electronic Circuits*, Prentice-Hall, Inc., Elglewood Cliffs, New Jersey, 1975. - [47] C. Liao, "Voltage overshoot reduction circuits," US patent 7023713B2, Apr. 2006 - [48] S. Ross, "Voltage overshoot protection circuit," US patent 6016245, Jan. 2000. - [49] K. Wang, M. Marek-Sadowska, "On-chip power supply network optimization using multigrid-based technique," in *Proc. DAC*, Jun. 2003, pp. 113-118.