# UC Santa Cruz Previously Published Works

# Title

DCMCS: Highly Robust Low-Power Differential Current-Mode Clocking and Synthesis

# Permalink

https://escholarship.org/uc/item/2hh0p6d0

## Journal

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(10)

**ISSN** 1063-8210

## **Authors**

Islam, Riadul; Fahmy, Hany A; Lin, Ping Y; et al.

# **Publication Date**

2018-10-01

# DOI

10.1109/tvlsi.2018.2837681

Peer reviewed

# DCMCS: Highly Robust Low-Power Differential Current-Mode Clocking and Synthesis

Riadul Islam, Member, IEEE, Hany A. Fahmy, Student Member, IEEE, Ping Y. Lin, Student Member, IEEE, and Matthew R. Guthaus, Senior Member, IEEE

Abstract—In this paper, we present a new differential current-mode pulsed flip-flop (DCMPFF) for low-power clock distribution using a representative 45nm CMOS technology. Experimental results show that the DCMPFF has a 47% faster clock-to-output (CLK-Q) delay than a traditional voltage-mode (VM) pulsed flip-flop. When the DCMPFF is integrated with a differential current-mode H-tree clock distribution, the differential technique saves 53% and 26% power compared to conventional VM and previous current-mode (CM) clock networks, respectively. In addition, we propose the first differential CM clocking and synthesis (DCMCS) methodology to improve the robustness and overall clock power of a network. The proposed DCMCS-based electromigration-aware clocking saves 79% and 51% average power with 7.7ps and 11.3ps lower clock skew when the DCM scheme is applied to ISPD 2009 and 2010 testbenches, respectively.

*Index Terms*—Differential clocking, low-power design, current-mode clocking, clock distribution network, flip-flop.

#### I. INTRODUCTION

The clock distribution network (CDN) is the most crucial network in synchronous VLSI design, as it is the basic signaling network for every synchronous block and seriously affects overall system power and performance. In terms of signaling type, clocking can be either voltage-mode (VM) or current-mode (CM). Although VM clocking is widely used due to its compatibility with standard VM logic networks, CM clocking can play an important role in low-power systems. CM signaling offers many potential advantages such as higher operating speed [1], [2], low voltage operation [3], and ease of processing [4] compared to VM techniques.

Global interconnect power and latency are increasing in traditional VM signaling schemes [5]. Systems-on-chips (SOCs) add more functionality, which means chip sizes are roughly constant while wire length increases relative to its planar dimensions. Because of this, the latency of RC lines grows linearly with wire length [5] despite using properly sized repeaters. An immediate solution is to use wide wires, but this results in higher energy per bit because of the large rail-to-rail voltage swing. An alternative signaling scheme such as CM, however, can eliminate transmission line repeaters while also decreasing the necessary voltage swing, which significantly reduces power [6]–[9].

We can categorize signaling as differential or nondifferential (single-ended). Differential clocks use two wires to send a pair of complementary clock signals. Differential signaling has higher reliability under electromagnetic interference, supply voltage fluctuations, and other sources of common-mode noise compared to single-ended signaling [10]–[13]. Differential CM (DCM) signaling has better noise immunity compared to a single-ended CM scheme [8], [14], [15]. However, this comes at the cost of double wiring resources and increased wiring complexity. As a result, the traditional clock routing techniques are limited to single-ended clocking [16]–[19].

In the early years, CM signaling was applied to off-chip interconnects [20]. However, over the past decade, increasing attention has been paid to on-chip CM signaling. Researchers have shown tremendous power-performance improvement over VM signaling by applying CM signaling into a symmetric network [6]–[8], [21].

In this paper, we extend the *de novo* CM clocking concept [6] to implement and analyze the first DCM clock distribution and a new DCM pulsed D-type flip-flop (FF). The clock (CLK) input to the FF is a CM receiver and the data input (D) and output (Q) are VM. In addition, we propose the first electromigration (EM) aware DCM clock synthesis (DCMCS) methodology applicable to any network (symmetric or asymmetric). In particular, the key contributions of this paper are:

- The first demonstration of a differential current-mode clocked FF.
- The first demonstration of a symmetric H-tree differential current-mode CDN.
- The effective integration of the DCM FF with VM CMOS logic.
- The first demonstration of DCM clocking on industrial testbenches.
- The first demonstration of EM aware wire-sizing for DCM clocking.

The rest of the paper is organized as follows: Section II gives a brief overview of some existing signaling schemes. Section III and Section IV propose our DCM FF and CDN, respectively. Section V introduces the automatic DCM CDN generation technique. Section VI compares our new FF and CDN with existing schemes. Section VII investigates the noise and reliability of the proposed system. Finally, Section VIII concludes the paper.

R. Islam is with the Department of Electrical and Computer Engineering, University of Michigan-Dearborn, MI 48128 USA (e-mail: riaduli@umich.edu).

H. Fahmy, P. Lin, and M. Guthaus are with the Department of Electrical and Computer Engineering, University of California Santa Cruz, CA 95064 USA (e-mail: {plin11, hfahmy, mrg}@ucsc.edu).

Copyright (c) 2018 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.



Fig. 1: A self-level-converted driver circuit transmits two low-swing voltages and the Rx circuit amplifies the difference between them to reproduce the full-swing output voltage [8].



Fig. 2: The clamped bit-line sense amplifier Rx based DCM scheme uses factor of four sizing rule in cascaded inverters that drive the long interconnect [14].

#### II. OVERVIEW OF EXISTING SIGNALING SCHEMES

Unlike traditional buffer-based interconnect signaling, DCM signaling uses a differential CM transmitter (Tx) that sends complementary current pulses at a very low-voltage swing into a pair of interconnect wires. The interconnect is held at roughly the same voltage and is unbuffered. At the receiving end, a differential CM receiver (Rx) senses the two complementary currents and ideally converts them into two differential voltages or a single-ended, full-swing output voltage. A typical non-clock differential CM signaling scheme is shown in Figure 1 [8]. This scheme uses a self-level-converted driver circuit that limits the output voltage swing. Finally, two diode-connected transistor pairs drive the interconnect. However, this kind of driver does not provide sufficient driving capability for large loads and is highly sensitive to noise [22]. This scheme uses a low-swing differential CM Rx circuit [8]. In order to increase the robustness of the design, the Rx uses both a common-gate and a common-source amplifier configuration. However, the Rx consumes a significant amount of static power due to double current-mirror stages.

Another prior strategy that uses differential current-sensing for interconnect signaling is shown in Figure 2 [14]. The scheme is based on a modified clamped bit-line sense amplifier (MCBLSA) Rx [14]. It utilizes the traditional "fanout of four" (FO4) sizing rule for a CMOS buffer chain to design the driver. However, there is no real guideline to design the Tx for different sized interconnects. Moreover, the Tx drives static current into the interconnect while the current is useful during only a fraction of the cycle, which results in additional power consumption. The Rx circuit requires an equalizing (EQ) signal that creates a metastable phase, while the differential input currents break this metastability and help the Rx to produce two complementary outputs. However, this scheme suffers significant static power loss in the metastable phase and also may switch the next stage's buffer or latches [23].

The previous differential current-mode schemes, however, were one-to-one data connections whereas clock networks are, by definition, a one-to-many signal distribution. A one-to-many CM clocking scheme based on a CM current-pulsed FF [6] offers a large CDN power savings compared to a VM scheme. However, it consumes high static power and is highly susceptible to noise. Our differential CM scheme addresses these issues.

#### III. DIFFERENTIAL CURRENT-MODE PULSED FLIP-FLOP

We propose the first differential CM pulsed FF (DCMPFF) in Figure 3a. The DCMPFF extends the previous single input current CM pulsed FF (CMPFF) [6], [24] to have two complementary input currents, I(IN+) and I(IN-). These inputs can be either positive or negative depending on the current direction; however, the DCMPFF is sensitive only when I(IN+) has a push-current and I(IN-) has a pull-current to mimic an edge-triggered behavior.

The DCMPFF has a current-comparator (CC) with two reference voltage generators, an inverter-amplifier (amp), an output stage, and a static storage cell. An enable  $(\overline{EN})$  signal activates the DCMPFF while the CC uses the push-pull current as an input clock to provide a full-swing output voltage depending on the data input.

A reference voltage generator is built using a diode-connected PMOS-NMOS pair (or polysilicon resistors) as shown in Figure 3a. The two reference voltage generators create two static currents in PMOS M2 and NMOS M3 and also provide a low-impedance input. The CC compares the differential current using an inverting amp (M6-M7) at node C. After the two-stage amplification, a buffer provides the required drive to generate a full-swing local clock pulse (CLKP) that activates the output stage. A feedback connection to M5 limits the CLKP pulse to less than 50% of duty cycle. A transmission gate output stage latches data into a storage cell.

The use of a differential input current is more robust to noise compared to a single-ended scheme, which will be discussed and analyzed further in Section VII. The complementary push-pull currents also help simplify the design of the current Tx, which can generate the currents from a single input voltage.

The CC compares two complementary currents which are combined using an inverter amplifier that enables smaller transistors in the CC (M2-M3) compared to the prior single-ended CMPFF CC [6]. Due to the lower logical effort of M2-M3, the DCMPFF requires less input current and consumes less power.

The representative simulation waveforms of the proposed DCMPFF are shown in Figure 3b and confirm the internal current-to-voltage conversion. The internally generated CLKP signal triggers the data storage, which is enabled with  $\overline{EN}$ . The amplitude of the two input currents affects the FF performance by changing the operating point of M2-M3.

(a) The input stage compares the complementary input currents and amplifies the difference to generate a voltage pulse that triggers a register stage to store data.

(b) Simulation waveforms confirm the complementary current-to-voltage pulse generation (CLKP) that triggers the input data capture.

Fig. 3: The proposed DCMPFF and simulation results.

Clock gating is a common technique to reduce CDN power [25]. One of the major advantages of the DCMPFF is its embedded active-low  $\overline{EN}$  signal, which can be used to perform clock gating in a DCM CDN.

#### IV. DIFFERENTIAL PULSED CURRENT TRANSMITTER AND DISTRIBUTION

A differential clocking scheme requires a differential current transmitter (DCMTx) that can efficiently provide differential push-pull current into the interconnect and distribute enough current to each sink. The DCMTx is a voltage-to-current converter that receives a traditional voltage-mode clock (CLK) from a PLL and converts it into a complementary push-pull current signal with minimal voltage swing in the interconnect line. The entire proposed scheme with the DCMPFF, DCMTx, and CDN is shown in Figure 4a. The DCM scheme is based on a CDN that has similar impedance at each branch, resulting in equal current to each DCMPFF.



(a) The proposed DCMTx and CDN convert a VM input signal to complementary pulse currents with minimal interconnect voltage swing and distribute current equally to the DCMPFFs.



(b) Simulation waveforms confirm a VM input is converted to constant CDN voltages and representative complementary current distribution.

Fig. 4: The Proposed DCM CDN and simulation results.

The proposed DCMTx extends the previously reported pulsed current Tx [6] by using two extra inverters and an extra driver circuit (M3-M4) to generate two complementary currents. The second (differential) current has the same amplitude with one inverter delay of phase difference.

In order to have equal differential current, the DCMTx uses similar sizes for the M1-M2 and M3-M4 drivers. The driver sizes are adjusted for current loss in the long transmission line so that they supply the required amount of current to each sink. Appropriate wire sizing is important for both the reliability and the performance of the CDN. A narrow or highly resistive network will produce distorted output current, while a wide network has low resistance and avoids electromigration problems.

#### V. DCM CLOCKING AND SYNTHESIS (DCMCS)

Fig. 5: The proposed EM aware DCMCS methodology is applicable to any symmetric or asymmetric network and returns minimum global clock skew and corresponding Tx sizing.

The existing CM and DCM clocking schemes are applicable only to symmetric H-tree networks, while researchers very recently demonstrated a single-ended current-mode clock synthesis (CMCS) methodology [26] and efficiently applied it to CM clocking in asymmetric networks. However, it ignores the electromigration (EM) effect in wire sizing. Similar to CMCS, the proposed DCM clocking and synthesis (DCMCS) methodology utilizes DCM Tx sizing by computing the total admittance  $(Y_T)$  of an entire clock network with the DCMPFFs as

$$Y_T = \beta \left( \sum_{i \in sinks} \alpha_i C_{ox} + \sum_{j \in wires} C_{w,j} \right) \tag{1}$$

where  $C_{w,j}$  is the wire capacitance of wire j,  $\alpha_i$  is the admittance factor of sink/FF *i*, and  $\beta$  is a constant. The first part of Equation 1 represents the total input admittance of each DCMPFF, while the latter part represents the total wire admittance of the network. In addition, the proposed DCMCS methodology incorporates EM aware wire sizing to improve the reliability of the design. Figure 5 shows the DCM CDN generation methodology.
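As a minimal sketch of how Equation 1 could be evaluated, the Python snippet below sums the sink and wire terms and scales them by $\beta$. The function name `total_admittance` and all numeric values are hypothetical placeholders chosen for illustration; they are not taken from the paper.

```python
def total_admittance(alphas, wire_caps, c_ox, beta):
    """Evaluate Equation 1: Y_T = beta * (sum_i alpha_i * C_ox + sum_j C_w,j)."""
    sink_term = sum(alpha * c_ox for alpha in alphas)   # DCMPFF input admittance term
    wire_term = sum(wire_caps)                          # wire admittance term
    return beta * (sink_term + wire_term)


# Hypothetical example values (not from the paper).
alphas = [1.0, 1.0, 1.2, 0.8]          # admittance factor of each sink/FF
wire_caps = [20e-15, 35e-15, 15e-15]   # wire capacitances in farads
c_ox = 1e-15                           # C_ox in farads
beta = 2.0                             # constant beta

y_t = total_admittance(alphas, wire_caps, c_ox, beta)
print(f"Y_T = {y_t:.3e}")
```

In the DCMCS flow, a value of this kind is what the initial Tx sizing step would consume.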

Algorithm 1 presents pseudocode of our DCMCS flow for the entire clock network. The algorithm takes as inputs any clock network, the EM constraint, i.e., the maximum current density  $(J_{max})$  from the International Technology Roadmap for Semiconductors (ITRS) [27] for the corresponding technology, the initial wire width ( $Wire_{width}$ ) [28], and the minimum wire width ( $Wire_{min}$ ), and returns an EM-aware DCM CDN. In order to implement the testbench/asymmetric networks, the clock tree is routed using the zero-skew DME methodology [17], while the final tree nodes are connected to DCMPFFs (Line 6). The DCM clocking scheme uses a single differential current Tx to drive the clock network and the DCMPFFs. The DCMCS algorithm calculates the  $Y_T$  of the network (Line 7) in the totalAdmittance(Tree) method, which applies Equation 1. Then it determines the initial Tx sizing  $(T_{init})$  of the network (Line 8) using  $sizeTx(Y_T)$ . It runs a transient simulation (simulateTransient()) and uses calculateSkew() to measure the initial skew  $(S_{init})$  (Lines 9–10).  $T_{best}$  and  $(S_{best}, S_{new})$  are set to the initial values of  $T_{init}$  and  $S_{init}$ , respectively (Line 11). The initial Tx sizing value is also stored in two temporary variables  $(T_{newUp} \text{ and } T_{newDown})$ . Then we iteratively size up (increase the Tx size in steps of 1% of the initial sizing) and size down (decrease the Tx size in steps of 1% of the initial sizing)

#### Algorithm 1 DCM CDN generation

- 1–3: **Procedure:** DCM\_CDN( $Tree$ ,  $J_{max}$ ,  $Wire_{width}$ ,  $Wire_{min}$ )
- 4: **Input:** clock tree ( $Tree$ ), electromigration constraint  $(J_{max})$ , initial wire width ( $Wire_{width}$ ), minimum wire width ( $Wire_{min}$ )
- 5: **Output:** DCM CDN with properly sized DCM Tx
- 6:  $ZST = zeroSkewRoutedTree(Tree)$
- 7:  $Y_T = totalAdmittance(ZST)$
- 8:  $T_{init} = sizeTx(Y_T)$
- 9:  $simulateTransient()$
- 10:  $S_{init} = calculateSkew()$
- 11:  $S_{best} = S_{init}, S_{new} = S_{init}, T_{best} = T_{newUp} = T_{newDown} = T_{init}$
- 12: **while**  $S_{new} \leq S_{best}$  **do** ▷ repeat if improvement or equal
  - 13:  $T_{newUp} = T_{newUp} + \delta s$  ▷  $\delta s$  is 1% of  $T_{init}$ ; sizing up
  - 14:  $simulateTransient()$
  - 15:  $S_{new} = calculateSkew()$
  - 16: **if**  $S_{new} < S_{best}$  **then**
    - 17:  $S_{best} = S_{new}, T_{best} = T_{newUp}$
  - 18: **end if**
- 19: **end while**
- 20: **while**  $S_{new} \leq S_{best}$  **do** ▷ repeat if improvement or equal; sizing down
  - 21:  $T_{newDown} = T_{newDown} - \delta s$
  - 22:  $simulateTransient()$
  - 23:  $S_{new} = calculateSkew()$
  - 24: **if**  $S_{new} < S_{best}$  **then**
    - 25:  $S_{best} = S_{new}, T_{best} = T_{newDown}$
  - 26: **end if**
- 27: **end while**
- 28:  $J_{root} = calculateCurrentDensity()$
- 29: **if**  $J_{root} \leq J_{max}$  **then return** ▷ return if EM constraint is met
- 30: **else**
  - 31:  $Wire_{widthNew} = Wire_{width} + Wire_{min}$
  - 32:  $DCM\_CDN(Tree, J_{max}, Wire_{widthNew}, Wire_{min})$  ▷ repeat DCM\_CDN method with new wire sizing
- 33: **end if**

as shown in Figure 5 to extract the best clock skew and corresponding Tx size (Lines 12–27). After clock routing and Tx sizing, we compute the current density of the root wires (Line 28) and compare it with the ITRS-suggested maximum current density above which EM occurs [27] (Line 29). If the initial wire sizing  $(Wire_{width})$  does not meet the EM constraint, we increase  $Wire_{width}$  by  $Wire_{min}$  and invoke DCM\_CDN( $Tree$ ,  $J_{max}$ ,  $Wire_{width}$ ,  $Wire_{min}$ ) again with the new wire width (Lines 30–33). The algorithm terminates when there is no further improvement in skew and no violation of the EM constraint. The proposed algorithm works with any network, and Section VI presents the detailed experimental results.
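The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative outline only. The functions `toy_skew` and `toy_root_density` are hypothetical stand-ins for the transient simulation, skew measurement, and current-density calculation. The `max_steps` bound, the restart of the sizing-down sweep from the initial skew, and the use of a loop in place of the recursive call are assumptions of this sketch rather than details stated in the paper. Unlike a real flow, the toy skew model does not depend on the wire width.

```python
def sweep_tx_size(t_init, step, s_init, skew_of, max_steps=1000):
    """Walk the Tx size in one direction while the skew does not get worse.

    max_steps is an added safeguard (not part of Algorithm 1) so that a
    flat skew curve cannot loop forever.
    """
    t_best, s_best = t_init, s_init
    t_new, s_new = t_init, s_init
    steps = 0
    while s_new <= s_best and steps < max_steps and t_new + step > 0:
        t_new += step
        s_new = skew_of(t_new)  # stands in for simulateTransient() + calculateSkew()
        if s_new < s_best:
            s_best, t_best = s_new, t_new
        steps += 1
    return t_best, s_best


def dcm_cdn(t_init, j_max, wire_width, wire_min, skew_of, density_of):
    """Size the Tx for minimum skew, then widen wires until the EM limit holds."""
    while True:  # loop form of the repeated DCM_CDN call (Lines 30-33)
        s_init = skew_of(t_init)
        delta = 0.01 * t_init  # 1% of the initial Tx size
        t_up, s_up = sweep_tx_size(t_init, +delta, s_init, skew_of)
        # Assumption: the sizing-down sweep restarts from the initial skew.
        t_dn, s_dn = sweep_tx_size(t_init, -delta, s_init, skew_of)
        t_best, s_best = (t_up, s_up) if s_up <= s_dn else (t_dn, s_dn)
        if density_of(wire_width) <= j_max:  # EM check on the root wires
            return t_best, s_best, wire_width
        wire_width += wire_min  # widen by the minimum width and repeat


# Hypothetical stand-ins for the circuit simulator (arbitrary units).
def toy_skew(tx_size):
    return abs(tx_size - 0.97) * 100.0 + 5.0


def toy_root_density(wire_width):
    return 2.0 / wire_width


t_best, s_best, width = dcm_cdn(
    t_init=1.0, j_max=1.0, wire_width=1.0, wire_min=0.5,
    skew_of=toy_skew, density_of=toy_root_density,
)
print(f"Tx size {t_best:.2f}, skew {s_best:.2f}, root wire width {width:.1f}")
```

With these toy models the sizing-up sweep stops immediately, the sizing-down sweep finds the minimum, and the wire is widened twice before the current-density limit is satisfied.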

#### VI. SIMULATION RESULTS AND ANALYSIS

The circuits are simulated in HSPICE with a 45nm CMOS technology model [29]. In order to compare the power, performance, and area, we implemented several designs in layout: a master-slave D FF (MSDFF), a CMPFF [6], and the proposed DCMPFF. The layout areas, nominal CLK-Q delay, data-to-Q (D-Q) delay, and total power are listed in Table I. The performance of the FFs was evaluated considering clock frequencies from 1-5GHz and a 1V supply voltage. The power considers input data at 100% activity with a four FF load.

TABLE I: The proposed DCMPFF is 47% faster, consumes 9% less area compared to the Tra. PFF [30], and is more power efficient in the higher frequency range.

| Type of FF    | Normalized<br>area | CLK-Q delay<br>(ps) | D-Q delay<br>(ps) | Normalized power<br>(static + dynamic)<br>1 GHz | Normalized power<br>(static + dynamic)<br>2 GHz | Normalized power<br>(static + dynamic)<br>3 GHz | Normalized power<br>(static + dynamic)<br>4 GHz | Normalized power<br>(static + dynamic)<br>5 GHz |
|---------------|--------------------|---------------------|-------------------|------|------|------|------|------|
| MSDFF         | 1.00               | 37.0                | 58.0              | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Tra. PFF [30] | 1.49               | 75.5                | 29.5              | 1.50 | 1.57 | 1.41 | 1.40 | 1.40 |
| CMPFF [6]     | 1.45               | 45.0                | 15.0              | 3.50 | 3.37 | 2.47 | 1.91 | 1.61 |
| DCMPFF        | 1.36               | 39.7                | 19.7              | 1.66 | 1.65 | 1.21 | 1.09 | 0.94 |

#### A. DCMPFF Results

The DCMPFF consumes 6% less silicon area compared to the previous CMPFF and uses 23 transistors while the MSDFF and CMPFF use 20 and 25 transistors, respectively. Figure 6 shows the layout of the proposed DCMPFF. The CLK-Q delays of the FFs are measured under relaxed timing conditions for both the VM and CM instances. In other words, the data is stable sufficiently before the arrival of the VM clock edge or the CM input current pulse.

Table I shows the nominal CLK-Q delay for both high-to-low and low-to-high Q transitions. Whereas the previous single-ended CMPFF requires an input current of  $\pm 2.3\mu A$  amplitude, the DCMPFF achieves its nominal CLK-Q delay with only  $\pm 1.8\mu A$  and a 70ps pulse width. The DCMPFF has a lower CLK-Q delay than the CMPFF and is only slightly slower than the MSDFF. For each FF, we measured the setup time  $(t_s)$  and hold time  $(t_h)$ , using the common definition as the time margin that causes a CLK-Q delay increase of 10% beyond nominal. The  $t_s$  and  $t_h$  of the DCMPFF are -20ps and 95ps, respectively. The setup time of the DCMPFF is  $1.95 \times$  lower than that of the traditional MSDFF, while the  $t_h$  of the DCMPFF is  $1.34 \times$  higher than that of the CMPFF. We also measured the D-Q delay of each FF. The D-Q delay of the DCMPFF is 66% lower than that of the VM MSDFF.
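The area and delay percentages quoted above follow directly from the Table I entries. The short Python check below recomputes them; the dictionary layout and the helper name `reduction_pct` are illustrative choices, while the numbers are copied from Table I.

```python
# Table I entries: normalized area, CLK-Q delay (ps), D-Q delay (ps).
table_i = {
    "MSDFF":    {"area": 1.00, "clk_q": 37.0, "d_q": 58.0},
    "Tra. PFF": {"area": 1.49, "clk_q": 75.5, "d_q": 29.5},
    "CMPFF":    {"area": 1.45, "clk_q": 45.0, "d_q": 15.0},
    "DCMPFF":   {"area": 1.36, "clk_q": 39.7, "d_q": 19.7},
}


def reduction_pct(new, ref):
    """Percentage by which `new` is lower than `ref`."""
    return (1.0 - new / ref) * 100.0


dcm = table_i["DCMPFF"]
clk_q_vs_pff = reduction_pct(dcm["clk_q"], table_i["Tra. PFF"]["clk_q"])
area_vs_pff = reduction_pct(dcm["area"], table_i["Tra. PFF"]["area"])
area_vs_cmpff = reduction_pct(dcm["area"], table_i["CMPFF"]["area"])
d_q_vs_msdff = reduction_pct(dcm["d_q"], table_i["MSDFF"]["d_q"])

print(f"CLK-Q vs Tra. PFF: {clk_q_vs_pff:.1f}% lower")
print(f"Area vs Tra. PFF:  {area_vs_pff:.1f}% lower")
print(f"Area vs CMPFF:     {area_vs_cmpff:.1f}% lower")
print(f"D-Q vs MSDFF:      {d_q_vs_msdff:.1f}% lower")
```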

We measured the total power consumption of each FF considering the input clock and data switching. For VM FFs, we used a traditional approach [31]. For CM FFs, we used a CM Tx that can produce the required amount of current and the bias voltage to drive the CM FF. First, we measure the total power consumption, including the Tx and CM FFs. Then we remove the FFs to measure the Tx power. The difference between these two results is the CM FF power.

Fig. 6: The proposed DCMPFF is designed with standard cell height and consumes less silicon area than the previous CMPFF [6].

In the power measurement, we consider both the static and dynamic power of the VM and CM FFs. At a 1GHz clock frequency, the DCMPFF consumes 40% and 9.6% more power compared to the MSDFF and Tra. PFF, respectively. However, the power consumption of the DCMPFF is comparable to that of an MSDFF at 5GHz. At the same frequency, the DCMPFF consumes 33% and 41% less power compared to the Tra. PFF and CMPFF [6], respectively. At low frequencies, the DCMPFF consumes more power than the VM Tra. PFF and MSDFF due to a high static power overhead. However, the dynamic power of the CM FFs increases with frequency at a slower rate than that of the VM FFs, as shown in the bottom two rows of Table I.

#### B. H-Tree Distribution

In order to validate the functionality of the DCMTx and the proposed DCMPFF in a CDN, we implemented an equal-impedance binary-tree network spanning  $1mm \times 1mm$ . Each branch of the clock tree is modeled as a lumped three-component  $\Pi$ -model, and the branches are then connected together to make a distributed CDN model. The interconnect unit capacitance and resistance values are for 45nm CMOS technology [29]. The functional simulation results with the resulting output current are shown in Figure 4b.
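A lumped three-component model of one branch places half of the wire capacitance at each end of the wire resistance. The Python sketch below builds such a model from per-unit-length values and sums the wire capacitance of a binary tree. The unit resistance, unit capacitance, and branch lengths are hypothetical placeholders, not the 45nm values used in the paper, and the function names are chosen for illustration.

```python
def pi_model(length_um, r_per_um, c_per_um):
    """Return (C_near, R, C_far) for one branch: C/2, R, C/2."""
    r = r_per_um * length_um
    c = c_per_um * length_um
    return c / 2.0, r, c / 2.0


def binary_tree_wire_cap(level_lengths_um, c_per_um):
    """Total wire capacitance of a binary tree.

    level_lengths_um[k] is the branch length at depth k + 1, where the
    tree has 2 ** (k + 1) branches.
    """
    total = 0.0
    for k, length in enumerate(level_lengths_um):
        c_near, _, c_far = pi_model(length, 0.0, c_per_um)
        total += (2 ** (k + 1)) * (c_near + c_far)
    return total


# Hypothetical per-unit-length values (ohm/um and fF/um) and branch lengths.
R_PER_UM = 0.5
C_PER_UM = 0.2
branch = pi_model(10.0, R_PER_UM, C_PER_UM)
tree_cap = binary_tree_wire_cap([500.0, 250.0, 125.0], C_PER_UM)

print("C_near, R, C_far =", branch)
print("total wire capacitance (fF) =", tree_cap)
```

Connecting the far-end node of each branch to the near-end nodes of its two children yields the distributed model of the full tree.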

For initial results, our CDN analysis uses a 5-level H-tree distributed in  $7.69mm \times 7.69mm$  area for both the single-ended CM and VM CDN, but buffers drive the VM CDN instead of the CM Tx circuit. In order to minimize the later stages' short-circuit power and any timing violation, the VM buffered network is optimized for an output clock signal slew with less than 10% of minimum operating clock period. In the differential CDN, two such tree networks are routed. All CDNs drive 1024 FFs.

Table II shows the power breakdown of the VM, CM, and DCM CDN simulations for clock frequencies ranging from 1–5GHz. On average, our DCM CDN consumes less power than both the single-ended CM and VM CDN for all frequencies. The VM CDN consumes more power than the CM/DCM CDNs because of its full voltage swing (0-to-Vdd), whereas the CM/DCM CDN has negligible voltage swing, as shown in Figure 4b. The proposed DCM CDN consumes less power than the CM CDN due to the high static power consumption of the CMPFFs.

As expected, at low frequency the total power of the DCMPFF system is comparable to the VM cases, as shown in Figure 7. This is because, at low frequencies, the DCMPFF consumes more power than the VM FFs. However, at high frequencies, the power of the DCMPFFs is lower than that of both VM FFs, while the power of the CMPFFs is higher than that of the proposed DCMPFFs due to their large static power consumption. The VM interconnect power dominates the CM/DCM FF power even at low frequencies. The real advantage, however, is that the DCM CDN power does not increase with frequency like the VM CDN power. Since the fluctuation of

| Frequency (GHz)     | Normalized CDN power<br>VM | Normalized CDN power<br>CM | Normalized CDN power<br>DCM | Normalized FFs power<br>MSD | Normalized FFs power<br>Tra. P [30] | Normalized FFs power<br>CMP [6] | Normalized FFs power<br>DCMP | Normalized total power<br>MSD sys. | Normalized total power<br>Tra. P sys. [30] | Normalized total power<br>CMP sys. [6] | Normalized total power<br>DCMP sys. | % saving compared to<br>MSD | % saving compared to<br>Tra. P | % saving compared to<br>CMP |
|---------------------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 1                   | 1.65 | 0.33 | 1.00 | 0.60 | 0.90 | 2.1  | 1.00 | 1.1  | 1.2  | 1.31 | 1.00 | 4.6  | 17.4 | 21.9 |
| 2                   | 3.34 | 0.34 | 1.00 | 0.61 | 0.95 | 2.0  | 1.00 | 1.71 | 1.91 | 1.35 | 1.00 | 41.5 | 47.9 | 27.6 |
| 3                   | 4.84 | 0.37 | 1.00 | 0.82 | 1.16 | 2.0  | 1.00 | 2.39 | 2.60 | 1.39 | 1.00 | 58.1 | 61.5 | 29.9 |
| 4                   | 6.71 | 0.42 | 1.00 | 0.92 | 1.28 | 1.75 | 1.00 | 2.83 | 3.07 | 1.29 | 1.00 | 64.7 | 67.5 | 24.7 |
| 5                   | 8.37 | 0.44 | 1.00 | 1.06 | 1.48 | 1.71 | 1.00 | 3.31 | 3.60 | 1.27 | 1.00 | 69.8 | 72.2 | 23.5 |
| Average savings (%) |      |      |      |      |      |      |      |      |      |      |      | 47.7 | 53.3 | 25.5 |

TABLE II: The proposed DCM CDN saves 26% to 53% power on average compared to other VM and CM CDNs @ 1-5 GHz CLK.

common-mode voltage is relatively small, the dynamic power consumption of the DCM CDN is negligible. At 1GHz in particular, the DCM CDN system exhibits 5% to 22% total power savings compared to the single-ended CM and VM CDN systems. As expected, the power saving increases to 24% to 72% at the high 5GHz clock frequency.
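The average savings reported in the last row of Table II are the means of the per-frequency savings columns. The Python check below reproduces them from the table entries; the dictionary name and layout are illustrative.

```python
# "% saving compared to" columns of Table II at 1, 2, 3, 4, and 5 GHz.
savings_pct = {
    "MSD":    [4.6, 41.5, 58.1, 64.7, 69.8],
    "Tra. P": [17.4, 47.9, 61.5, 67.5, 72.2],
    "CMP":    [21.9, 27.6, 29.9, 24.7, 23.5],
}

# Arithmetic mean of each column.
average_pct = {name: sum(vals) / len(vals) for name, vals in savings_pct.items()}

for name, avg in average_pct.items():
    print(f"Average saving vs {name} system: {avg:.1f}%")
```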

#### C. ISPD Testbench Results

It is clear from Section VI-B and Section VI-A that the proposed DCMPFF and the DCM CDN consume lower power than the other VM FFs and VM CDN at higher frequencies (i.e., 5GHz clock). However, at low 1GHz clock frequency, the DCMPFF consumes higher power than the VM FFs, resulting in smaller power savings in an H-tree distribution. Hence, it is important to show the effectiveness of the proposed scheme at low 1GHz frequency on industrial testbenches. For this we used ISPD 2009 [32] and ISPD 2010 [28] testbenches.

The clock tree and the DCM FFs are driven by a single DCM Tx at the root. The DCM Tx, the tree, and the DCM FFs compose the entire DCM CDN. Figure 8a and Figure 8b show the resulting DME routed bufferless DCM CDN for the ISPD 2009 benchmark circuit f11 and the ISPD 2010 benchmark circuit 05, respectively. In the proposed DCMCS scheme, the total power consumption includes the DCM Tx power, the parasitic power, and the total DCM FF power.



Fig. 7: The proposed DCM CDN saves 5% to 72% power on average compared to other VM and CM CDNs @ 1-5 GHz CLK.



(a) Resulting routed DCM CDN for the ISPD 2009 benchmark circuit f11.



(b) Resulting routed DCM CDN for the ISPD 2010 benchmark circuit 05.

Fig. 8: The resulting clock networks after applying DCMCS methodology in ISPD 2009 and 2010 testbenches, respectively.

The VM clocking uses the same minimum wirelength DME network [17]; however, we inserted buffers to meet the slew and skew constraints [16]. In addition, the final tree nodes are connected with the VM FFs.

The proposed DCM clocking consumes lower power than the buffered VM MSDFF and Tra. PFF-based clocking scheme for all the ISPD 2009 and ISPD 2010 testbenches at 1GHz clock frequency, as shown in Table III and Table IV. In

| Benchmark | Sinks (#) | Chip area $(mm^2)$ | VM network<br>MSD sys. power $(mW)$ | VM network<br>Tra. P sys. [30] power $(mW)$ | VM network<br>skew (ps) | DCM network<br>power $(mW)$ | DCM network<br>skew (ps) | DCM compared to VM<br>MSD power saving (%) | DCM compared to VM<br>Tra. P power saving (%) | DCM compared to VM<br>$\Delta$ Skew (ps) |
|-----------|------|-------|-------|-------|------|------|------|------|------|------|
| s1r1      | 81   | 69.4  | 38.1  | 39.7  | 14.0 | 8.1  | 7.3  | 78.6 | 79.5 | 6.7  |
| s2r1      | 88   | 54.6  | 37.4  | 39.2  | 20.0 | 8.5  | 11.2 | 77.4 | 78.4 | 8.8  |
| s3r1      | 131  | 165.6 | 68.7  | 71.4  | 30.0 | 12.9 | 22.5 | 81.2 | 81.9 | 7.5  |
| s4r3      | 623  | 120.7 | 134.1 | 146.6 | 33.0 | 41.6 | 29.0 | 68.9 | 71.6 | 4.0  |
| f11       | 121  | 109.2 | 59.6  | 62.0  | 14.0 | 12.0 | 6.0  | 79.8 | 80.6 | 8.0  |
| f12       | 117  | 91.2  | 56.8  | 59.1  | 20.0 | 11.5 | 20.7 | 79.7 | 80.5 | -0.7 |
| f21       | 117  | 133.3 | 61.4  | 63.8  | 28.0 | 12.2 | 12.4 | 80.2 | 80.9 | 15.6 |
| f22       | 91   | 50.4  | 37.3  | 39.1  | 12.0 | 8.6  | 18.0 | 76.9 | 78.0 | -6.0 |
| f31       | 273  | 275.6 | 130.3 | 135.8 | 37.0 | 27.1 | 15.8 | 79.2 | 80.1 | 21.2 |
| f32       | 190  | 269.0 | 99.3  | 103.1 | 23.0 | 19.9 | 10.8 | 79.9 | 80.7 | 12.2 |
| Avg.      | 183  | 133.9 | 72.3  | 76.0  | 23.1 | 16.2 | 15.4 | 77.5 | 78.6 | 7.7  |

TABLE III: The proposed DCM clocking scheme enables 77.5% and 78.6% average power savings compared to the traditional VM-buffered MSDFF and Tra. PFF-based systems, respectively, with 7.7ps lower clock skew, using the ISPD 2009 benchmarks.

particular, the proposed DCM clocking saves more than 77% and 40% power compared to the MSDFF system using the ISPD 2009 and 2010 networks, respectively. In addition, the DCMPFF-based clocking saves 79% and 51% power compared to the Tra. PFF-based system using the ISPD 2009 and 2010 networks, respectively. As suggested in Section VI-B, we expect the proposed DCM clocking to save quadratically more power at higher frequencies.

In addition to power, the proposed DCM clocking has 7.7ps and 11.3ps lower average clock skew than the traditional buffered VM scheme on the ISPD 2009 and 2010 networks, respectively.

Table V shows the overall power-performance comparison of the existing VM and CM schemes and the proposed DCM clocking scheme. The proposed DCM clocking saves 43% and 62% average power compared to the CMPFF system using the ISPD 2009 and 2010 networks, respectively. This is primarily due to the large static power of the CMPFF. In addition to power, the proposed DCM clocking has 11.0ps and 15.1ps lower average clock skew compared to the previous CM scheme.
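The headline percentages can be reproduced from the average rows of Tables III to V. The short Python sketch below performs that arithmetic. The helper names (`power_saving_pct`, `skew_gain_ps`, `summarize`) are illustrative and not part of the original work, and small differences from the quoted figures come from rounding in the published tables.

```python
# Average power (mW) and skew (ps), copied from the "Avg." rows of Tables III-V.
AVERAGES = {
    "ISPD 2009": {
        "VM MSD": (72.3, 23.1),
        "VM Tra. PFF": (76.0, 23.1),
        "CM": (28.3, 26.4),
        "DCM": (16.2, 15.4),
    },
    "ISPD 2010": {
        "VM MSD": (133.0, 30.0),
        "VM Tra. PFF": (161.7, 30.0),
        "CM": (211.3, 33.9),
        "DCM": (79.4, 18.8),
    },
}


def power_saving_pct(baseline_mw, proposed_mw):
    """Percentage of the baseline power saved by the proposed scheme."""
    return 100.0 * (baseline_mw - proposed_mw) / baseline_mw


def skew_gain_ps(baseline_ps, proposed_ps):
    """Skew reduction in ps (positive means the proposed scheme is better)."""
    return baseline_ps - proposed_ps


def summarize(suite):
    """Map each baseline to (power saving %, skew gain ps) relative to DCM."""
    rows = AVERAGES[suite]
    dcm_power, dcm_skew = rows["DCM"]
    return {
        name: (round(power_saving_pct(p, dcm_power), 1),
               round(skew_gain_ps(s, dcm_skew), 1))
        for name, (p, s) in rows.items()
        if name != "DCM"
    }


for suite in AVERAGES:
    print(suite, summarize(suite))
```

For ISPD 2010, the skew gain computed from the averaged columns is 11.2ps, whereas averaging the per-benchmark $\Delta$ Skew column of Table IV gives 11.25ps, which rounds to the reported 11.3ps.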



Fig. 9: Monte-Carlo simulation results confirm the correct functionality and performance of the proposed DCMPFF.

#### VII. NOISE AND RELIABILITY

#### A. Jitter Analysis

In scaled technology, it becomes increasingly difficult to ensure the correctness of the multi-gigahertz clock signal. One of the main reasons is the presence of clock jitter. Depending on the measurement technique, jitter can be categorized as period jitter, cycle-to-cycle jitter, long-term jitter, phase error, and time-interval error. However, it has been shown that these jitter types are mathematically related to each other [33]; hence, we measured the period jitter to show the robustness of DCM clocking compared with the other clocking schemes. For this analysis, we considered supply voltage-induced noise in the voltage-controlled oscillator of the clock PLL and measured the clock period over 1000 random samples. The corresponding standard deviation ($\sigma$) of the period for the traditional buffered VM scheme is 1.55ps and the peak-to-peak jitter is 5.5ps. The $\sigma$ for the single-ended CM scheme is 1.47ps and the peak-to-peak jitter is 3.7ps. The proposed DCM scheme exhibits the lowest jitter, with a $\sigma$ of 1.46ps and a peak-to-peak jitter of 1.46ps.
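The two period-jitter metrics quoted above can be computed from a list of sampled clock periods as in the following minimal sketch. The Gaussian noise model, the 200ps nominal period, and the function names are assumptions made purely for illustration; the synthetic samples stand in for simulator output and are not the data behind the reported numbers.

```python
import random
import statistics


def period_jitter(periods_ps):
    """Return (sigma, peak-to-peak) of the sampled clock periods, in ps."""
    sigma = statistics.stdev(periods_ps)              # sample standard deviation
    peak_to_peak = max(periods_ps) - min(periods_ps)  # widest observed spread
    return sigma, peak_to_peak


def synthetic_periods(nominal_ps, sigma_ps, n=1000, seed=0):
    """Hypothetical stand-in for measured periods: Gaussian noise on a nominal period."""
    rng = random.Random(seed)
    return [rng.gauss(nominal_ps, sigma_ps) for _ in range(n)]


# Illustrative values only: 200 ps nominal period with 1.5 ps of Gaussian noise.
samples = synthetic_periods(nominal_ps=200.0, sigma_ps=1.5)
sigma, p2p = period_jitter(samples)
print(f"sigma = {sigma:.2f} ps, peak-to-peak = {p2p:.2f} ps")
```

With Gaussian noise and 1000 samples, the peak-to-peak value is typically several times $\sigma$, because it is set by the two most extreme samples rather than by the bulk of the distribution.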

#### B. Supply Voltage Fluctuation

We studied the response of the proposed DCM scheme to supply voltage variation. We considered a  $\pm 10\%$  voltage fluctuation from the nominal supply voltage. The delay variation for a traditional buffered VM scheme ranges from -21ps to 12ps compared to the nominal delay. The delay variation in a single-ended CM scheme ranges from -23ps to 28ps. The proposed DCM has delay variation from -23ps to 22ps compared to the nominal voltage delay.
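The reported ranges can be compared on a common footing with a few lines of Python. The sketch below derives the total span and the worst-case slowdown for each scheme from the numbers above. The function names are illustrative, and reading the relative improvement as a reduction in worst-case slowdown versus the single-ended CM scheme is an assumption on our part, not something stated in this section.

```python
# Delay variation (ps) relative to nominal under +/-10% supply fluctuation,
# copied from the text: (most negative, most positive).
DELAY_VARIATION_PS = {
    "VM": (-21.0, 12.0),
    "CM": (-23.0, 28.0),
    "DCM": (-23.0, 22.0),
}


def span_ps(scheme):
    """Total width of the delay-variation window."""
    low, high = DELAY_VARIATION_PS[scheme]
    return high - low


def worst_slowdown_ps(scheme):
    """Largest positive delay shift, i.e. the worst-case slowdown."""
    return DELAY_VARIATION_PS[scheme][1]


def slowdown_reduction_pct(baseline, proposed):
    """Relative reduction in worst-case slowdown of `proposed` versus `baseline`."""
    base = worst_slowdown_ps(baseline)
    return 100.0 * (base - worst_slowdown_ps(proposed)) / base


for name in DELAY_VARIATION_PS:
    print(name, span_ps(name), worst_slowdown_ps(name))
print(round(slowdown_reduction_pct("CM", "DCM"), 1))
```

Under this reading, the DCM scheme narrows the window from 51ps to 45ps relative to the single-ended CM scheme and reduces the worst-case slowdown by about 21%, while the buffered VM scheme retains the narrowest window at 33ps.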

#### C. Electromigration

Since we used homogeneous wires from root to sinks for all the clock networks, the root wire carries the maximum current. The VM CDN maximum current density is  $0.53MA/cm^2$ . As expected, the proposed DCM CDN requires less current compared to the single-ended CM CDN. The maximum current density of the DCM CDN in the root wire is  $0.24MA/cm^2$,

| Benchmark |        |           | VM Buffered network |                  |      | DCM network   |      | DCM compared to VM |           |               |
|-------|--------|-----------|---------------------|------------------|------|---------------|------|--------------------|-----------|---------------|
| Name  | Sink   | Chip area | MSD sys.            | Tra. P sys. [30] | Skew | Power         | Skew | MSD                | Tra. P    | $\Delta$ Skew |
|       | (#)    | $(mm^2)$  | power (mW)          | power (mW)       | (ps) | (mW)          | (ps) | power (%)          | power (%) | (ps)          |
| 01.in | 1107   | 64.0      | 152.0               | 174.1            | 32.0 | 67.8          | 25.9 | 55.4               | 61.0      | 6.1           |
| 02.in | 2249   | 91.0      | 294.6               | 339.5            | 32.0 | 137.2         | 42.4 | 53.4               | 59.6      | -10.4         |
| 03.in | 1200   | 1.4       | 84.8                | 108.8            | 33.0 | 63.6          | 3.2  | 25.0               | 41.5      | 29.8          |
| 04.in | 1845   | 5.7       | 118.9               | 155.8            | 33.0 | 98.4          | 30.6 | 17.3               | 36.9      | 2.4           |
| 05.in | 1016   | 5.8       | 61.9                | 82.2             | 26.0 | 53.3          | 3.0  | 13.8               | 35.1      | 23.0          |
| 06.in | 981    | 1.5       | 142.6               | 162.3            | 22.0 | 52.7          | 18.3 | 63.1               | 67.6      | 3.7           |
| 07.in | 1915   | 3.5       | 123.2               | 161.5            | 30.0 | 101.2         | 7.8  | 17.9               | 37.4      | 22.2          |
| 08.in | 1134   | 2.6       | 86.4                | 109.0            | 32.0 | 60.8          | 18.8 | 29.6               | 44.3      | 13.2          |
| Avg.  | 1431   | 21.9      | 133.0               | 161.7            | 30.0 | 79.4          | 18.8 | 40.4               | 50.9      | 11.3          |

TABLE IV: Using ISPD 2010 benchmarks, the proposed DCMPFF-based DCM clocking scheme enables more than 40% and 50% average power saving compared to the traditional VM-buffered MSDFF and Tra. PFF-based systems, respectively, with an additional 11.3ps global clock skew improvement.

less than that of the single-ended CM CDN,  $0.275MA/cm^2$ . This is well within the ITRS suggestion that current density be limited to  $1.5MA/cm^2$  and relieves the electromigration threat to the proposed CDN wire sizing.
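The electromigration check amounts to comparing each root-wire current density against the ITRS limit. The sketch below uses the values reported in this section; the helper `current_density_ma_per_cm2` and its wire-dimension parameters are hypothetical and only illustrate the unit conversion from a root current and a wire cross-section.

```python
ITRS_LIMIT_MA_PER_CM2 = 1.5  # suggested limit quoted in the text

# Maximum root-wire current densities (MA/cm^2) reported in this section.
ROOT_CURRENT_DENSITY = {"VM": 0.53, "CM": 0.275, "DCM": 0.24}


def current_density_ma_per_cm2(current_ma, width_um, thickness_um):
    """J = I / (w * t), converted to MA/cm^2.

    1 mA through 1 um^2 is 1e-3 A / 1e-8 cm^2 = 1e5 A/cm^2 = 0.1 MA/cm^2.
    """
    return 0.1 * current_ma / (width_um * thickness_um)


def em_margin(density_ma_per_cm2, limit=ITRS_LIMIT_MA_PER_CM2):
    """How many times below the limit the given current density sits."""
    return limit / density_ma_per_cm2


for scheme, density in ROOT_CURRENT_DENSITY.items():
    print(f"{scheme}: {density} MA/cm^2, margin {em_margin(density):.2f}x")
```

With the reported densities, the DCM root wire sits 6.25 times below the limit, compared with about 5.5 times for the single-ended CM CDN and about 2.8 times for the VM CDN.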

#### D. Process Sensitivity

It is intractable to analytically predict the behavior of a large network from the combined mismatch errors of individual devices; even a small SRAM cell or FF is difficult to model analytically under those variations. However, using Monte-Carlo (MC) simulation, the impact of these random parameter variations on FF functionality and performance can be studied. Hence, the resiliency of the proposed DCM scheme is demonstrated through non-uniform MC simulation of process variation and mismatch. The result of this experiment is shown in Figure 9. The proposed DCMPFF has a mean CLK-Q delay of 48ps with a standard deviation of 7ps over 1000 runs. This compares favorably with the recently reported CMPFF, which has a mean CLK-Q delay of 55ps with a standard deviation of 7.4ps over 1000 runs.
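The general shape of such a Monte-Carlo experiment is sketched below: each run draws random per-device parameter shifts, evaluates the delay, and the mean and standard deviation are collected over all runs. Everything in the delay model is a placeholder assumption (a linear sensitivity to the summed threshold-voltage shifts of four devices, with a 48ps nominal delay borrowed from the reported mean); a real study would replace `toy_clk_q_delay_ps` with a circuit simulation.

```python
import random
import statistics


def toy_clk_q_delay_ps(vth_shifts_mv, nominal_ps=48.0, sensitivity_ps_per_mv=0.1):
    """Hypothetical delay model: nominal delay plus a linear term in the Vth shifts."""
    return nominal_ps + sensitivity_ps_per_mv * sum(vth_shifts_mv)


def monte_carlo(delay_fn, n_devices=4, sigma_vth_mv=30.0, runs=1000, seed=1):
    """Sample independent Gaussian Vth shifts per device and collect delay statistics."""
    rng = random.Random(seed)
    delays = []
    for _ in range(runs):
        shifts = [rng.gauss(0.0, sigma_vth_mv) for _ in range(n_devices)]
        delays.append(delay_fn(shifts))
    return statistics.mean(delays), statistics.stdev(delays)


mean_ps, sigma_ps = monte_carlo(toy_clk_q_delay_ps)
print(f"mean CLK-Q = {mean_ps:.1f} ps, sigma = {sigma_ps:.1f} ps")
```

For this toy model the expected standard deviation is $0.1 \times 30 \times \sqrt{4} = 6$ps, so the sampled value should land close to that; the parameters were chosen only to give numbers of a plausible magnitude.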

#### E. Threshold Voltage Mismatch

In scaled technologies, the circuits are highly sensitive to intra-die (process) variation such as threshold voltage  $(V_{th})$  variation. The CDN can experience large delay variation or

TABLE V: The proposed DCMPFF-based DCM clocking scheme enables 43% and 62% average power saving compared to the CM CMPFF-based system, with additional 11.0ps and 15.1ps global clock skew improvements, using ISPD 2009 and ISPD 2010 benchmarks, respectively.

| Benchmark | Existing VM Avg. [30] |           | Existing CM Avg. [6] |           | Proposed DCM Avg. |           |
|-----------|-------------|-----------|-------------|-----------|-------------------|-----------|
|           | Power (mW)  | Skew (ps) | Power (mW)  | Skew (ps) | Power (mW)        | Skew (ps) |
| ISPD 2009 | 76.0        | 23.1      | 28.3        | 26.4      | 16.2              | 15.4      |
| ISPD 2010 | 161.7       | 30.0      | 211.3       | 33.9      | 79.4              | 18.8      |



Fig. 10: The testbench for delay variation due to  $V_{th}$  variation at ss-ff corner in (a) DCM CDN, and (b) buffered VM CDN.

skew due to  $V_{th}$  variation. In order to quantify this timing uncertainty, we analyzed the proposed DCM CDN and a traditional PFF-based buffered VM CDN, as shown in Figure 10(a) and Figure 10(b), respectively, at the ss-ff corners. Unlike a traditional skew computation, we considered delay variation at the FFs' outputs to include the FFs'  $V_{th}$  variation. The proposed DCM CDN has 41ps skew. The buffered VM scheme has 43ps skew due to the presence of buffers in the VM clock tree.
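The difference between the two skew definitions can be illustrated with a small sketch. Here the traditional skew is the spread of clock arrival times at the FF clock pins, while the output-referred skew adds each FF's CLK-Q delay before taking the spread. The function names and the three-sink example values are made up for illustration and do not come from the testbench of Figure 10.

```python
def clock_pin_skew_ps(clock_arrival_ps):
    """Traditional skew: spread of clock arrival times at the FF clock pins."""
    return max(clock_arrival_ps) - min(clock_arrival_ps)


def output_skew_ps(clock_arrival_ps, clk_q_delay_ps):
    """Output-referred skew: spread of (clock arrival + CLK-Q delay) at the FF outputs."""
    output_times = [a + d for a, d in zip(clock_arrival_ps, clk_q_delay_ps)]
    return max(output_times) - min(output_times)


# Hypothetical three-sink example; the CLK-Q delays differ because of Vth variation.
arrivals = [100.0, 104.0, 98.0]
clk_q = [50.0, 45.0, 60.0]
print(clock_pin_skew_ps(arrivals))      # 6.0
print(output_skew_ps(arrivals, clk_q))  # 9.0
```

In this made-up example the slowest FF sits on the earliest clock branch, so the output-referred skew exceeds the clock-pin skew; the two effects can also partially cancel.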



Fig. 11: The proposed DCMPFF CLK-Q delay and power increase linearly with FF load, which supports the scalability of the proposed design.

#### F. Loading effect

We studied the loading effect of different FFs by changing the driving load of each FF. For any reliable design, the FF delay and power are expected to increase linearly with FF load. Figure 11 shows the result of these experiments. Figure 11(a) and Figure 11(b) show the CLK-Q delay and power consumption of the proposed DCMPFF and Tra. PFF, respectively. Clearly, the proposed DCMPFF's CLK-Q delay and power increase linearly with FF load, which supports the scalability of the proposed design.
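One simple way to quantify how linear such a trend is involves fitting a straight line by least squares and inspecting the coefficient of determination $R^2$. The sketch below does this in plain Python. The load and delay values are invented for illustration only and are not read from Figure 11.

```python
def linear_fit(xs, ys):
    """Least-squares line y = intercept + slope * x, plus R^2 of the fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical sweep: FF load in fF versus CLK-Q delay in ps.
loads_ff = [1.0, 2.0, 4.0, 8.0]
delays_ps = [50.0, 52.0, 56.0, 64.0]
slope, intercept, r2 = linear_fit(loads_ff, delays_ps)
print(f"slope = {slope:.2f} ps/fF, intercept = {intercept:.2f} ps, R^2 = {r2:.3f}")
```

An $R^2$ close to 1 indicates that the delay (or power) tracks the load linearly over the swept range; the same function can be applied to the power data.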

#### VIII. CONCLUSION

In this paper, we presented a DCM distribution as an alternative to conventional repeater-based VM or CM distribution. The proposed DCM scheme uses a new DCMPFF, which is 47% faster, consumes 33% less power, and requires 9% less silicon area compared to a traditional PFF at 5GHz. When applied to a symmetric H-tree network, the proposed DCM scheme saves 5% to 72% power compared to a traditional single-ended VM clock at 1-5GHz and consumes 26% less power on average compared to a previously reported single-ended CM scheme. Over the same frequency range, the proposed scheme saves 48% and 53% average power compared to the MSD and Tra. PFF-based systems, respectively. In addition, we presented the highly robust low-power DCMCS methodology. The proposed scheme saves 79% and 51% average power compared to the traditional buffered synthesized VM scheme using the ISPD 2009 and ISPD 2010 testbenches, respectively. The DCMCS scheme also exhibits 7.7ps and 11.3ps lower average clock skew compared to a VM scheme using the ISPD 2009 and ISPD 2010 testbenches, respectively. Additionally, it has 21% less delay variation due to supply voltage fluctuation.

#### REFERENCES

- [1] Sang-Soo Lee and R.H. Zele and D.J. Allstot and Guojin Liang, "CMOS continuous-time current-mode filters for high-frequency applications," *JSSC*, Mar 1993, pp. 323-329.
- [2] Evert Seevinck and P. J. V. Beers and H. Ontrop, "Current-mode techniques for high-speed VLSI circuits with application to current sense amplifier for CMOS SRAM's," *JSSC*, Apr 1991, pp. 525-536.
- [3] R.H. Zele and D.J. Allstot, "Low-voltage fully-differential CMOS switched-current filters," CICC, May 1993, pp. 6.2.1-6.2.4.

- [4] Peter Real and D.H. Robertson and C.W. Mangelsdorf and T.L. Tewksbury, "A wide-band 10-b 20 Ms/s pipelined ADC using current-mode signals," JSSC, vol. 26, no. 8, pp. 1103-1109, Aug 1991.
- [5] A.P. Jose and G. Patounakis and K.L. Shepard, "Near speed-of-light on-chip interconnects using pulsed current-mode signalling," *VLSIC*, June 2005, pp. 108-111.
- [6] R. Islam and M. R. Guthaus, "Current-mode clock distribution," *ISCAS*, June 2014, pp. 1203–1206.
- [7] A. Katoch and H. Veendrick and E. Seevinck, "High Speed Current-mode Signaling Circuits for On-Chip Interconnects," *ISCAS*, May 2005, pp. 4138–4141.
- [8] A. Narasimhan, M. Kasotiya, and R. Sridhar, "A low-swing differential signalling scheme for on-chip global interconnects," *ICVD*, Jan 2005, pp. 634–639.
- [9] R. Islam and M. R. Guthaus, "Low-Power Clock Distribution Using a Current-Pulsed Clocked Flip-Flop," *TCASI*, vol. 62, no. 4, pp. 1156–1164, Apr 2015.
- [10] Chan, S.C. and Shepard, Kenneth L. and Restle, P.J., "Distributed Differential Oscillators for Global Clock Networks," *JSSC*, vol. 41, no. 9, pp. 2083-2094, Sep 2006.
- [11] T. C. Hsueh and G. Balamurugan and J. Jaussi and S. Hyvonen and J. Kennedy and G. Keskin and T. Musah and S. Shekhar and R. Inti and S. Sen and M. Mansuri and C. Roberts and B. Casper, "26.4 A 25.6Gb/s differential and DDR4/GDDR5 dual-mode transmitter with digital clock calibration in 22nm CMOS," *ISSCC*, Feb 2014, pp. 444–445.
- [12] S. Y. Huang and T. Y. Huang and C. T. Liu and R. B. Wu, "Ringing Noise Suppression for Differential Signaling in Unshielded Flexible Flat Cable," *TCPMT*, vol. 5, no. 8, Aug 2015, pp. 1152 – 1159.
- [13] R. Islam, H. Fahmy, Ping-Yao Lin, and M. R. Guthaus, "Differential current-mode clock distribution," *MWSCAS*, Aug 2015, pp. 1–4.
- [14] A. Maheshwari and W. Burleson, "Differential current-sensing for on-chip interconnects," *TVLSI*, vol. 12, no. 12, pp. 1321–1329, Dec 2004.
- [15] Sekar, D.C., "Clock trees: differential or single ended?," ISQED, Mar 2005, pp. 548–553.
- [16] G. E. Tellez and M. Sarrafzadeh, "Minimal buffer insertion in clock trees with skew and slew rate constraints," *TCAD*, vol. 16, no. 4, pp. 333–342, Apr 1997.
- [17] R.-S. Tsay, "Exact zero skew," ICCAD, Nov 1991, pp. 336-339.
- [18] P. P. Saha and S. Saha and T. Samanta, "Rectilinear Steiner Clock Tree Routing Technique with Buffer Insertion in Presence of Obstacles," *ICVD*, Jan 2015, pp. 447–451.
- [19] C. Deng and Y. Cai and Q. Zhou, "Fast synthesis of low power clock trees based on register clustering," *ISQED*, Mar 2015, pp. 303–309.
- [20] S.I. Long and J. Q. Zhang, "Low power GaAs current-mode 1.2 Gb/s interchip interconnections," JSSC, vol. 32, no. 6, pp. 890–897, Jun 1997.
- [21] N. Tzartzanis and W.W. Walker, "Differential current-mode sensing for efficient on-chip global signaling," *JSSC*, vol. 40, no. 11, pp. 2141–2147, Nov 2005.
- [22] J.C. Garcia and J.A. Montiel-Nelson and S. Nooshabadi, "Adaptive Low/High Voltage Swing CMOS Driver for On-Chip Interconnects," *ISCAS*, May 2007, pp. 881–884.
- [23] A. Maheshwari and W. Burleson, "Current-Sensing and Repeater Hybrid Circuit Technique for On-Chip Interconnects," *TVLSI*, vol. 15, no. 11, Nov 2007, pp. 1239–1244.
- [24] Guthaus, M. and Islam, R., "Current-mode clock distribution," US Patent, 9787293, Oct 2017.
- [25] Y. Liu and Y. Jin and P. Li, "Exploring sparsity of firing activities and clock gating for energy-efficient recurrent spiking neural processors," *ISLPED*, July 2017, pp. 1–6.
- [26] R. Islam and M. R. Guthaus, "CMCS: Current-Mode Clock Synthesis," TVLSI, vol. 25, no. 3, Mar 2017, pp. 1054–1062.
- [27] Semiconductor Industry Association, "The International Technology Roadmap for Semiconductor," 2012 Edition.
- [28] C. N. Sze, "ISPD 2010 High Performance Clock Network Synthesis Contest," *ISPD*, Mar 2010.
- [29] NCSU, "FreePDK45," http://www.eda.ncsu.edu/wiki/FreePDK45.
- [30] S. Kozu and M. Daito and Y. Sugiyama and H. Suzuki and H. Morita and M. Nomura and K. Nadehara and S. Ishibuchi and M. Tokuda and Y. Inoue and T. Nakayama and H. Harigai and Y. Yano, "A 100 MHz, 0.4 W RISC processor with 200 MHz multiply adder, using pulse-register technique," *ISSCC*, Feb 1996, pp. 140–141.
- [31] N. Nedovic and M. Aleksic and V.G. Oklobdzija, "Conditional techniques for low power consumption flip-flops," *ICECS*, 2001, pp. 803–806.
- [32] C. N. Sze, P. Restle, G. J. Nam, and C. J. Alpert, "Clocking and the ISPD'09 clock synthesis contest," *ISPD*, Mar 2009, pp. 149–150.

- [33] T. J. Yamaguchi and M. Soma and D. Halter and R. Raina and J. Nissen and M. Ishida, "A method for measuring the cycle-to-cycle period jitter of high-frequency clock signals," *VTS*, 2001, pp. 102–110.



**Riadul Islam** is currently an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Michigan-Dearborn. In his Ph.D. dissertation work at UCSC, Dr. Islam designed the first current-pulsed flip-flop/register that resulted in the first-ever one-to-many current-mode clock distribution networks for high-performance microprocessors. From 2007 to 2009, he worked as a full-time faculty member in the Department of Electrical and Electronic Engineering of the University of Asia Pacific, Dhaka, Bangladesh. He is a member of the IEEE and the IEEE Circuits and Systems (CAS) Society. He holds one US patent and several IEEE/ACM journal and conference publications in TVLSI, TCAS, ISCAS, MWSCAS, ISQED, and ASICON. His current research interests include digital, analog, and mixed-signal CMOS ICs/SOCs for a variety of applications; verification and testing techniques for analog, digital, and mixed-signal ICs; CAD tools for design and analysis of microprocessors and FPGAs; automobile electronics; and biochips.



**Hany A. Fahmy** received his B.Sc. degree in electronics and communications engineering from the Arab Academy for Science and Technology, Cairo, Egypt, in 2007, and the M.Sc. degree in electrical engineering from the Arab Academy for Science and Technology, Cairo, Egypt, in 2011. From 2007 to 2011, he worked as a full-time faculty member in the Department of Electronics and Communications Engineering of the Arab Academy for Science and Technology, Cairo, Egypt. Currently, he is working towards his Ph.D. at the University of California Santa Cruz in the Computer Engineering department. His research interests include low-power clock network design, variability-aware low-power/high-speed digital/mixed-signal circuit design, and adiabatic circuit design. His research projects are funded in part by the National Science Foundation (NSF).



**Ping Y. Lin** is currently a PhD candidate in the Electrical Engineering Department of the University of California Santa Cruz. Ping received his BSE in Material Science and Engineering in 2003 from Chiao Tung University, Taiwan, and his MSE in 2008 from the University of Pennsylvania. His research interests are in low-power resonant clock design, including resource minimization, dynamic frequency scaling, and optimal power design.



**Matthew R. Guthaus** is currently an Associate Professor at the University of California Santa Cruz in the Computer Engineering department. Matthew received his BSE in Computer Engineering in 1998, MSE in 2000, and PhD in 2006 in Electrical Engineering, all from The University of Michigan. Matthew is a Senior Member of ACM and IEEE and a member of IFIP Working Group 10.5. His research interests are in low-power computing, including applications in mobile health systems. This includes new circuits, architectures, and sensors along with their application to mobile and clinical health systems. Matthew is the recipient of a 2011 NSF CAREER award and a 2010 ACM SIGDA Distinguished Service Award.