# Low-Power Decimation Filter for 2.5 GHz Operation in Standard-Cell Implementation

Manfred Ley, Oleksandr Melnychenko

*Abstract*—A low-power decimation filter for very high-speed oversampling analog-to-digital converters, implemented in semi-custom design style, is presented. It is demonstrated that a deep-sub-micron digital standard cell library in a VHDL-based automated design flow can replace work-intensive full-custom design. The choice of a polyphase nonrecursive decimation filter structure is explained, and some aspects of synthesis and routing that influence the design are discussed. Post-layout results and their analysis are presented for different filter structures; they demonstrate correct operation of the designed decimation filter at the specified clock speed of 2.56 GHz with a power consumption of 1.2 mW.

*Index Terms*—deep-sub-micron digital standard cell, high-speed decimation filter, low power, semi-custom design

#### I. INTRODUCTION

One of the building blocks of $\Sigma\Delta$ ADCs is a digital low-pass decimation filter. The low-power decimation filter presented here was designed for a new third-order $\Sigma\Delta$ modulator structure with 1-bit output and 1-bit DAC, which can be used in wireless mobile communication systems. The high performance and accuracy of such $\Sigma\Delta$ modulators lead to a very high data rate at the decimation filter input; thus a first filter stage is usually included directly within the analog ADC block. The main goal of this investigation is to avoid a full-custom digital design for these fast circuit parts, in order to save costs and reduce time-to-market, especially when filter characteristics have to be adapted to new applications. Ideally, the outcome should be a configurable, reusable filter macro block described in VHDL whose physical implementation parameters, such as the actual shape of the layout, are easily changeable according to new requirements.

Since the $\Sigma\Delta$ modulator in this work operates at 2.56 GHz, the main problem of the filter design is to capture and process the input data stream at a speed that is at the edge of the operating frequency range of the target 65 nm CMOS standard cell library.

Section II of this paper describes the choice of decimation filter architectures at system level. VHDL circuit design is described in Section III. Section IV provides information on the results achieved, and Section V gives an overview of supplementary verification steps to ensure these results. Finally, conclusions and a summary are given in Section VI.

M. Ley is with the Department of Systems Engineering, Carinthia University of Applied Science, Villach, Austria (phone: +43-4242-90500-2119; fax: +43-4242-90500-2010; e-mail: m.ley@fh-kaernten.at).

O. Melnychenko was with the Department of Systems Engineering, Carinthia University of Applied Science, Villach, Austria. He is now with the Institute of Telecommunications, Odessa National Polytechnic University, Odessa, Ukraine.

#### II. FILTER ARCHITECTURE

#### A. System Level Design

The required filter transfer function and decimation factor are determined by the $\Sigma\Delta$ modulator noise properties, the signal bandwidth and the required output data rate. MatLab FFT simulations of the $\Sigma\Delta$ modulator and decimation filter with characteristic sine input signals give the modulator output and filter output in the frequency domain. The obtained spectrum (Fig. 1 (a)) contains both the applied signal (thin peak in the left part) and the $\Sigma\Delta$ modulator noise. The amplitude characteristic shows a maximum at 320 MHz because additional noise components generated inside the $\Sigma\Delta$ modulator appear at this frequency [1]. Fig. 1 (a) shows that the desired passband of 20 MHz for the ADC can be achieved after filtration and decimation: in this frequency range the input signal is not influenced by $\Sigma\Delta$ modulator noise, while nearly all of the modulator noise is shifted into the frequency range above 20 MHz and should be cut off.

To obtain the input signal in the passband without ADC noise and at a data rate appropriate for further signal processing stages, a chain of decimation filters is needed. The decimation filter discussed here represents only the very first processing stage: it cuts $\Sigma\Delta$ modulator noise and provides some data rate reduction so that the following processing stages can be implemented easily and efficiently with a usual standard cell design flow.

System level simulations including previous considerations lead to a decimation filter with a transfer function in *z*-domain according to (1) and a decimation factor of eight.

$$H(z) = (1+z^{-1})^3 (1+z^{-2})^3 (1+z^{-4})^4$$
(1)
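As a rough illustration of (1), the factors can be multiplied out to obtain the FIR coefficients and to evaluate the magnitude response at a few frequencies. The sketch below is plain Python and not part of the original design flow; the helper names `poly_mul`, `poly_pow` and `gain_db` are hypothetical. It relies only on (1) and the 2.56 GHz sample rate. Expanding (1) gives 26 integer coefficients with a DC gain of 1024, and because $(1+z^{-1})(1+z^{-2})(1+z^{-4})$ is an 8-tap moving sum, the response has nulls at multiples of 320 MHz, which are the frequencies that fold onto DC after decimation by eight.

```python
import cmath
import math

def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_pow(a, n):
    """Raise a polynomial to a non-negative integer power."""
    out = [1]
    for _ in range(n):
        out = poly_mul(out, a)
    return out

# Factors of (1): (1 + z^-1)^3, (1 + z^-2)^3, (1 + z^-4)^4
h = poly_mul(
    poly_mul(poly_pow([1, 1], 3), poly_pow([1, 0, 1], 3)),
    poly_pow([1, 0, 0, 0, 1], 4),
)

FS = 2.56e9  # modulator sample rate in Hz

def gain_db(f_hz):
    """Magnitude of H at f_hz relative to the DC gain, in dB."""
    w = 2 * math.pi * f_hz / FS
    resp = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(h))
    mag = abs(resp) / sum(h)
    return 20 * math.log10(max(mag, 1e-15))  # clamp to avoid log(0) at nulls

print("taps:", len(h), "DC gain:", sum(h))
# Passband edge, the null at 320 MHz, and the edges of the first alias band
for f in (20e6, 300e6, 320e6, 340e6):
    print(f"{f / 1e6:6.0f} MHz: {gain_db(f):8.1f} dB")
```

The values at 300 MHz and 340 MHz indicate how strongly noise at the edges of the first alias band is suppressed before it folds into the 20 MHz passband.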

The black curve (Fig. 1 (a)) is the amplitude characteristic of the designed transfer function. For mathematical analysis, it is possible to separate filtration and decimation. In this way, the spectrum after filtration with no decimation (Fig. 1 (b)) can be obtained and graphical analysis of filtration results can be done. Afterwards decimation should be applied (Fig. 1 (c)) and analysis of spectrum aliasing can be done. To enable the use of the target standard cell library, it is important to reduce the given input clock rate of 2.56 GHz as early as possible within the filter chain; thus decimation has to precede digital filtration. The key to achieving this requirement and avoiding signal aliasing is to apply a polyphase digital filter architecture [2]. Moreover, polyphase architectures save power, as the registers of the filter work at a lower clock frequency and all the arithmetic operations are performed at the lowest possible rate [3].

Manuscript received December 07, 2011. This work was done as contribution to the FIT-IT funded project ARDES (ADC for deep submicron technologies).

(a) $\Sigma\Delta$ modulator output and filter transfer function, Fs = 2.56 GHz.

(c) Spectrum after filtration and decimation, Fs = 320 MHz.

Fig. 1. FFT of signals at $\Sigma\Delta$ modulator output, after filtration, and decimation.

The polyphase decimation filter structure (Fig. 2) is based on the polyphase decomposition of an FIR transfer function $F(z)$ into the phase components $F_0(z)$ to $F_{d-1}(z)$. This decomposition should satisfy equation (2), where $d$ is the decimation factor (a positive integer).

$$F(z) = \sum_{i=0}^{d-1} z^{-i} F_i(z^{d})$$
(2)
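The decomposition can be checked numerically. The following sketch is an illustration only, not the authors' VHDL implementation, and assumes zero initial conditions and a 0/1-coded input. It splits the coefficients of (1) into $d = 8$ phase filters (phase $i$ receives every eighth coefficient starting at index $i$) and confirms that decimating first and then filtering in the phases gives the same output as filtering at the full rate and discarding seven of every eight results. The function names are hypothetical.

```python
import random

D = 8  # decimation factor

def fir_coeffs():
    """Coefficients of (1): (1+z^-1)^3 (1+z^-2)^3 (1+z^-4)^4."""
    def mul(a, b):
        out = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                out[i + j] += ai * bj
        return out
    h = [1]
    for factor, power in (([1, 1], 3), ([1, 0, 1], 3), ([1, 0, 0, 0, 1], 4)):
        for _ in range(power):
            h = mul(h, factor)
    return h

def filter_then_decimate(x, h, d):
    """Reference: evaluate the FIR at the input rate, keep every d-th output."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(0, len(x), d)
    ]

def polyphase_decimate(x, h, d):
    """Decimate first: phase i filters x[d*m - i] with coefficients h[i::d]."""
    phases = [h[i::d] for i in range(d)]
    y = []
    for m in range((len(x) + d - 1) // d):
        acc = 0
        for i, f_i in enumerate(phases):
            for j, c in enumerate(f_i):
                idx = d * (m - j) - i  # input sample seen by tap j of phase i
                if 0 <= idx < len(x):
                    acc += c * x[idx]
        y.append(acc)
    return y

random.seed(1)
x = [random.randint(0, 1) for _ in range(256)]  # stand-in for a 1-bit stream
h = fir_coeffs()
print(polyphase_decimate(x, h, D) == filter_then_decimate(x, h, D))
```

In the polyphase form every multiply-accumulate runs at the output rate, which is the property the text relies on for reducing the clock rate of the arithmetic.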

The requested transfer function (1) represents an FIR digital filter. IIR transfer functions are not used because they have several disadvantages in this case. First, FIR filters are always stable. Second, for any FIR filter and decimation factor it is easy to build the polyphase representation, while for IIR filters this is not straightforward and might result in an increased amount of hardware and power consumption. Finally, there is a well-developed theory of digital filtering for $\Sigma\Delta$ ADCs that enables the creation of FIR filter transfer functions that are optimal for cutting $\Sigma\Delta$ modulator noise.



Fig. 2. Polyphase decimation filter structure.

Optimal here means a filter transfer function that is complex enough to cut $\Sigma\Delta$ modulator noise according to application requirements but simple enough (in terms of computational complexity) not to require a large amount of hardware.

#### B. Decimation Filter Structures

To get more insight into implementation constraints, several filter structures were analyzed. The form of the transfer function (1) makes it possible to implement the filter in different decimation styles. It is usually recommended to split the decimation into several stages, making the system a multistage one [2]. For the filter in question several variants of the decimation structure are possible, but only the three most promising were investigated further (Fig. 3). Thus, this decimation filter is implemented as a one-, two- or three-stage variant, labeled as "8", "4-2" and "2-2-2".

In the one-stage implementation (Fig. 3 (a)), labeled as 8, one polyphase filter block contains a deserializer by 8, 8 phases of the FIR filter and a final adder. The output of the $H_1$-block is the filtered data with the data rate reduced by a factor of eight. The transfer function of the block is (1).

If the filter is implemented as a two-stage one (Fig. 3 (b)), labeled as 4-2, the circuit contains two polyphase decimation filter blocks in series. The first block consists of a deserializer by four, a 4-phase FIR filter and an adder. The second block consists of the same parts, but its decimation factor is two. The output of the second block is again the filtered signal with the data rate reduced by a factor of eight. The transfer functions of the filters in the first and second blocks are (3) and (4), respectively.

Fig. 3. Filter architectures with respect to decimation structure.

$$H_1(z) = (1+z^{-1})^3 (1+z^{-2})^3$$
(3)

$$H_2(z) = (1+z^{-2})^4$$
(4)

The three-stage implementation (Fig. 3 (c)), labeled as 2-2-2, consists of three blocks in series of the same kind as explained before. The deserializer in each block is by two. The output of the last block is again the required output with the data rate reduced by a factor of eight. Equations (5), (6) and (7) give the transfer functions of the blocks.

$$H_1(z) = (1+z^{-1})^{3}$$
(5)

$$H_2(z) = (1+z^{-1})^{3}$$
(6)

$$H_3(z) = (1+z^{-1})^{4}$$
(7)

It should be noted that these three variants of the circuit are not different filter solutions with similar parameters. They are different architectural solutions for the same transfer function (1): the outputs of the three circuits are mathematically equivalent, and (3), (4), (5), (6) and (7) are derived from (1).
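The equivalence can be illustrated for the three-stage variant. The sketch below assumes that each stage transfer function of the 2-2-2 variant, $(1+z^{-1})^3$, $(1+z^{-1})^3$ and $(1+z^{-1})^4$, is written in terms of that stage's own input sample rate, so that moving it in front of the preceding decimators replaces $z^{-1}$ by $z^{-2}$ or $z^{-4}$. For simplicity each stage is modeled as filter-then-decimate rather than in polyphase form, with zero initial conditions; all function names are hypothetical.

```python
import random

def mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def power(a, n):
    out = [1]
    for _ in range(n):
        out = mul(out, a)
    return out

def upsample(a, factor):
    """Replace z^-1 by z^-factor (insert factor-1 zeros between taps)."""
    out = [0] * ((len(a) - 1) * factor + 1)
    out[::factor] = a
    return out

def fir(x, h):
    """Causal FIR with zero initial state."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]

def stage(x, h):
    """One decimate-by-2 stage: filter at its input rate, keep every 2nd sample."""
    return fir(x, h)[::2]

h1, h2, h3 = power([1, 1], 3), power([1, 1], 3), power([1, 1], 4)
h_total = mul(mul(power([1, 1], 3), power([1, 0, 1], 3)),
              power([1, 0, 0, 0, 1], 4))  # transfer function (1)

# Polynomial check: H1(z) * H2(z^2) * H3(z^4) equals (1)
same_polynomial = mul(mul(h1, upsample(h2, 2)), upsample(h3, 4)) == h_total

# Signal check on a random 1-bit stream
random.seed(2)
x = [random.randint(0, 1) for _ in range(256)]
single_stage = fir(x, h_total)[::8]
three_stage = stage(stage(stage(x, h1), h2), h3)
print(same_polynomial, single_stage == three_stage)
```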

#### C. Clock Generation Issues

One essential circuit block has not been described so far: clock generation. For proper operation, each register in the deserializer and filter circuits has to be clocked with the appropriate clock signal. The main input clock signal, which is provided by the ADC clock generator and corresponds to the modulator data rate, is not suitable for the whole decimation filter. A deserializer by n in each decimation filter block needs n clock signals for n decimation phases. One of these signals should also clock the polyphase filter registers. In the last design block, one of the clock phases is needed to provide the following circuit with a clock signal synchronous to the output data.

In general, there are two ways to derive slower clocks from a fast base clock: use clock divider circuits directly, or generate appropriate clock gating signals to mask out single pulses of the base clock. Whereas dividers might relax clock buffer requirements due to symmetric duty cycle signals, clock gating has advantages regarding clock skew. For each clock signal, an individual decision has to be made on how to generate it. Furthermore, as all clock-generating elements for the whole decimation filter should be designed in semi-custom style as well, the complexity and reliability of the necessary tool constraints for synthesis and layout need to be considered at this point.
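The difference between the two options can be made concrete with an idealized waveform model. This is a simplified sketch with hypothetical function names, sampling each base-clock period as a high half and a low half and ignoring all delays. Both approaches produce one rising edge per eight base-clock cycles, but the divided clock has a symmetric duty cycle, while the gated clock keeps the narrow pulse width of the base clock.

```python
def divided_clock(n_cycles, n):
    """Divide-by-n clock, two samples per base-clock period.
    For even n the result has a 50 % duty cycle."""
    wave = []
    for cycle in range(n_cycles):
        level = 1 if (cycle % n) < n // 2 else 0
        wave += [level, level]
    return wave

def gated_clock(n_cycles, n, phase=0):
    """Base clock with every pulse masked except one per n cycles."""
    wave = []
    for cycle in range(n_cycles):
        enable = 1 if cycle % n == phase else 0
        wave += [enable, 0]  # base clock: high in the first half-period only
    return wave

def rising_edges(wave):
    """Count 0 -> 1 transitions, assuming the signal starts low."""
    return sum(1 for prev, cur in zip([0] + wave, wave) if prev == 0 and cur == 1)

def duty_cycle(wave):
    return sum(wave) / len(wave)

div8 = divided_clock(16, 8)
gate8 = gated_clock(16, 8)
print(rising_edges(div8), duty_cycle(div8))
print(rising_edges(gate8), duty_cycle(gate8))
```

The `phase` argument of `gated_clock` selects which of the n base-clock pulses is passed, which corresponds to the n clock phases a deserializer by n needs.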

#### III. CIRCUIT DESIGN

#### A. Implementation Constraints

To fulfill the aim of producing an industrial-quality reusable filter macro block, additional design constraints besides speed, area and power consumption apply. First, timing-clean interfaces are required, which means that input and output data registers are needed. Second, to ease testing, all registers have to include reset circuitry. Third, all output signals have to be provided with sufficient driving strength to ease chip top-level routing. These additional requirements degrade area and power consumption results but are mandatory.

ISBN: 978-988-19251-2-1

ISSN: 2078-0958 (Print); ISSN: 2078-0966 (Online)

#### B. Filter Design

The starting point for the filter implementation was the choice of three basic architectural variants: 8, 4-2 and 2-2-2. Each of these variants was investigated to determine the detailed structure of each block with the lowest power consumption. Several structural solutions were taken into account, such as canonical and non-canonical polyphase filter structures, additional pipelining registers and different types of deserializer and clock generator circuits. In addition, different styles of arithmetic implementation were checked, such as signed, unsigned and LUT (look-up table) implementations. The most efficient circuits were synthesized to be checked and compared in simulation.

The whole circuit design was done in VHDL. From the beginning it was clear that in multistage decimation filter architectures all the blocks are of the same type and the only differences are number of phases (decimation factor), filter coefficients and input data width. Taking into account previous estimation results it was decided to use a noncanonical polyphase structure for each filter block. Decimation within each filter was built as a parallel set of registers connected to the same input data but clocked with different clock phases. Outputs of these registers are the inputs for the FIR filter phases. According to this, one configurable template of a decimation filter stage was created and used as many times as it appears in the circuit. Additionally, a clock generator block was designed using clock dividers and clock gating cells.
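The input data width that distinguishes the stages can be estimated from the stage gains. The sketch below is an estimate under stated assumptions, not a figure from the design: it assumes the 1-bit modulator output is coded as unsigned 0/1 and that no truncation takes place between stages. Since all coefficients derived from (1) are positive, the largest possible output of a stage is its largest input multiplied by the sum of its coefficients. The function and variable names are hypothetical.

```python
def unsigned_width(max_value):
    """Number of bits needed to represent max_value as an unsigned integer."""
    return max(1, max_value.bit_length())

# Sum of coefficients (DC gain) of each stage, from the factors of (1):
# (1+z^-1)^3 and (1+z^-2)^3 each contribute 2^3, (1+z^-4)^4 contributes 2^4.
STAGE_GAINS = {
    "8": [2**3 * 2**3 * 2**4],
    "4-2": [2**3 * 2**3, 2**4],
    "2-2-2": [2**3, 2**3, 2**4],
}

def stage_output_widths(gains, input_max=1):
    """Worst-case unsigned output width after each stage."""
    widths = []
    peak = input_max
    for g in gains:
        peak *= g  # all coefficients are positive, so the peak scales by the gain
        widths.append(unsigned_width(peak))
    return widths

for name, gains in STAGE_GAINS.items():
    print(name, stage_output_widths(gains))
```

Under these assumptions every variant ends with an 11-bit output, while the intermediate words of the multistage variants are narrower. A signed coding of the input would change the exact numbers.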

The only exception to this configurable circuit structure is the first deserializer, for timing reasons. For this circuit, which runs at a speed well above the recommended clock frequencies for digital standard cells, synthesis is not a push-button process but needs careful setup and several iterations to analyze post-synthesis results and find the best constraints. The same procedure is required for the place and route steps, and again results have to be carefully analyzed and compared with synthesis results.

It became clear that, despite their similar structure, the filter blocks operate under different conditions, as the first block works at a much higher clock frequency than the others. The most critical parts of the first filter block are the deserializer and the clock generator, while the following polyphase filter stages operate without timing problems. The solution to the speed problem was to separate the circuit into two blocks at VHDL level, where the first block contains the very first deserializer and a separate clock generator for it. The second block represents the rest of the filter circuit and clock generator parts. For the first high-speed block, additional tool constraints are necessary to specify exact timing requirements while still leaving enough freedom for optimization and delay balancing.

#### C. Deserializer Design for First Stage

As discussed before, the first and fastest deserializer needs special attention, which in practice means additional pipelining of the circuit and separate implementation constraints. As an example, Fig. 4 shows the deserializer circuit for the decimation variant of eight.



Fig. 4. Deserializer circuit of design 8.

Initially, only flip-flop cells 9-16 were included in the deserializer structure, and they were clocked with phase-shifted clock signals. This violation of the "clean input timing" rule was accepted to save some fast and power-hungry registers, but this structure could not capture input data reliably under all operating conditions due to wire delay effects. To make the circuit independent of the actual layout, input shift register 1-8 was added [4]. Flip-flops 1-8 operate directly on the input clock and input data, so they are not sensitive to the clock phase delays in the clock generator circuit, and wire delays are much easier to control. A gated input clock pulse is applied to flip-flops 9-16 to form the input data streams at reduced clock frequency for the following polyphase filter block. The clock generator is implemented as a 3-bit counter. Cells 17 and 18 (register and clock gating cell) resynchronize the clock to provide registers 9-16 with the proper pulse of the input clock signal and also generate the clock reference signal for the following filter circuit. In this way, all timing constraints for the implementation tools refer to a single clock, which is handled properly by synthesis and layout algorithms.
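The behaviour of the final deserializer structure can be sketched as a cycle-based model. This is a simplified illustration only: it ignores all timing, the resynchronization cells 17 and 18, and reset, and the bit ordering within the output word is an assumption rather than something taken from Fig. 4. The function name is hypothetical.

```python
def deserialize_by_8(bits):
    """Cycle-based model of a 1:8 deserializer.

    An 8-stage shift register (flip-flops 1-8) runs on every input clock.
    A 3-bit counter enables one gated clock pulse per eight input clocks,
    which loads the parallel output register (flip-flops 9-16).
    """
    shift = [0] * 8   # flip-flops 1-8, clocked by the input clock
    counter = 0       # 3-bit counter of the clock generator
    words = []        # successive contents of flip-flops 9-16
    for b in bits:
        shift = [b] + shift[:-1]        # newest bit enters at position 0
        counter = (counter + 1) % 8
        if counter == 0:                # gated clock pulse for flip-flops 9-16
            words.append(list(shift))   # parallel word, newest bit first
    return words

# Use sample indices instead of bits to make the ordering visible
for word in deserialize_by_8(list(range(16))):
    print(word)
```

In each output word, element i is the input sample that arrived i clocks before element 0. This is the set of mutually delayed, decimated streams that the phases of the following polyphase filter expect.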

To improve confidence in the semi-custom design flow reports and timing simulations, analog Spectre simulations were used to crosscheck the critical circuit elements.

#### IV. RESULTS

#### A. Area

The layout was done in a rectangular shape with Magma Blast Fusion, while the synthesis step was done with Synopsys Design Compiler. Fig. 5 shows the layout view of design 8; chip area numbers are given in Table I.



Fig. 5. Layout view of design 8.

TABLE I CHIP AREA

| Design | Overall area (µm²) | Core area (µm²) |
|--------|--------------------|-----------------|
| 8      | 1710               | 985             |
| 4-2    | 2209               | 1343            |
| 2-2-2  | 2115               | 1500            |

To compare different designs regardless of filler and end cells, the core area parameter is used.

The increase in area when the filter is broken up into cascaded stages is mainly caused by the additional clock generation circuits and the distribution of different clocks.

#### B. Speed

Simulation of the post-layout design was done with Mentor Graphics ModelSim. To check whether the circuit gives correct results at the output, a set of simulation patterns was applied. Patterns were taken from the high-level model designed in MathWorks MatLab as well as from laboratory measurements of a $\Sigma\Delta$ modulator test chip. During the simulation, the input clock frequency was varied over a wide range and the input data stream was shifted relative to the clock signal to check the clock / data relation.

The simulations show a sufficient clock frequency margin (Table II). This additional safety margin should account for possible model uncertainties at maximum speed. The circuit speed was tested under worst-case conditions with slow corner technology parameters, but simulations with nominal and fast corner parameters were done as well.

TABLE II

| Design | Maximum clock rate (GHz) | Margin to specification (%) |
|--------|--------------------------|-----------------------------|
| 8      | 2.906                    | 13.5                        |
| 4-2    | 3.030                    | 18.3                        |
| 2-2-2  | 2.793                    | 9.1                         |

According to Table II, design 4-2 reaches the highest maximum clock rate, which indicates the best balance between clocking complexity and the complexity of the filter transfer functions (3) and (4) in terms of speed.
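The margin column of Table II can be reproduced from the maximum clock rates, assuming that the margin is defined relative to the specified clock of 2.56 GHz (the text does not state the definition explicitly). The short sketch below, with hypothetical names, also prints the corresponding minimum clock period.

```python
F_SPEC_GHZ = 2.56
F_MAX_GHZ = {"8": 2.906, "4-2": 3.030, "2-2-2": 2.793}  # from Table II

def margin_percent(f_max_ghz, f_spec_ghz=F_SPEC_GHZ):
    """Relative head-room of the maximum clock rate over the specification."""
    return (f_max_ghz / f_spec_ghz - 1.0) * 100.0

margins = {name: margin_percent(f) for name, f in F_MAX_GHZ.items()}
for name, m in margins.items():
    period_ps = 1e3 / F_MAX_GHZ[name]  # 1 / GHz = ns, times 1000 = ps
    print(f"design {name}: margin {m:.2f} %, minimum period {period_ps:.1f} ps")
```

With this definition the computed values agree with the table to within 0.1 percentage points; the small difference for design 4-2 is presumably due to rounding of the tabulated numbers.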

#### C. Power

For power estimation, Synopsys PrimeTime was used together with Mentor Graphics ModelSim. As power consumption is determined by the switching activity of nets and cells, the actual value depends on clock frequency and input data properties. To get realistic values for circuit power consumption, a set of data patterns was applied to the input. The check was done for signals from the $\Sigma\Delta$ modulator model in MatLab with sine inputs and for additional artificially created periodic and pseudo-random sequences. For comparison of the circuits, fast corner technology parameters, high supply voltage and nominal clock speed (2.56 GHz) were used (Table III).

TABLE III POWER CONSUMPTION

| Design | Maximum power consumption, mW |
|--------|-------------------------------|
| 8      | 1.2                           |
| 4-2    | 1.7                           |
| 2-2-2  | 2.1                           |

Power results clearly show the superiority of design 8. Although more flip-flops work at the highest speed in the deserializer and clock generator, the slower filter register clock and the single clock distribution tree reduce overall power consumption.
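One way to compare the variants is to convert the figures of Table III into energy per input sample and into power relative to design 8. The sketch below is a derived illustration, not a result reported in the paper; it simply divides the maximum power by the nominal clock rate of 2.56 GHz, and the names are hypothetical.

```python
F_CLK_HZ = 2.56e9
POWER_MW = {"8": 1.2, "4-2": 1.7, "2-2-2": 2.1}  # from Table III

def energy_per_sample_pj(power_mw, f_hz=F_CLK_HZ):
    """Energy spent per input sample in picojoules (P / f)."""
    return power_mw * 1e-3 / f_hz * 1e12

energy_pj = {name: energy_per_sample_pj(p) for name, p in POWER_MW.items()}
relative = {name: p / POWER_MW["8"] for name, p in POWER_MW.items()}

for name in POWER_MW:
    print(f"design {name}: {energy_pj[name]:.3f} pJ per sample, "
          f"{relative[name]:.2f} x design 8")
```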

#### V. SUPPLEMENTARY VERIFICATION

As already mentioned, the required clock frequency for this design is above the recommended design limit for the target standard cell library. To increase confidence in the semi-custom gate models and tool reports, additional verification by analog transistor-level simulations was done. Small critical parts of the whole filter, such as the first deserializer, were characterized standalone during the design process to prove the concept and the design margin.

As parasitic elements change considerably for the complete layout due to the distribution of gates during place and route optimization, the transistor-level netlist and parasitic elements for each of the final layouts were extracted and simulated with Cadence Spectre. The run time of the analog simulations required a restriction to short input patterns and a selected set of signals to check. To ease comparison with digital domain results, MatLab scripts were used to calculate rise / fall times and setup / hold times of data relative to the clock from the analog simulation traces.

These simulations showed good consistency with the digital design tools for speed and power consumption, with the digital tools being slightly on the pessimistic, safe side. Output data stability and symmetry with respect to the output clock edge confirmed that the relaxed setup / hold requirements for the following processing stages are fulfilled. In addition, signal rise / fall times conform to the applied constraints and driving strength requirements.

#### VI. CONCLUSION

The successful implementation of a 2.56 GHz digital decimation filter using a VHDL standard cell design flow in deep-sub-micron CMOS technology is shown. Possible multirate filter architectures were examined to explore the limits of the available 65 nm standard cell library. VHDL code templates and constraint template scripts are available. Furthermore, the design fulfills all essential industrial requirements for macro block reuse, such as straightforward testability, clear input / output timing specifications with relaxed setup / hold requirements relative to the output clock, and sufficient driving strength.

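Tables I, II and III show that design 8 is the best circuit solution for implementation: it has the smallest area and the lowest power consumption while meeting the speed specification with a margin of 13.5 %.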

To increase confidence in the results of the digital analysis and simulation steps, analog post-layout simulation with back-annotated parasitics was performed on the complete filter blocks at transistor level. The results of the analog simulation match the results of digital simulation and the semi-custom design tool reports well, confirming the validity of the digital standard cell models.

#### ACKNOWLEDGMENT

The authors want to thank all colleagues from Infineon Austria, Villach involved in this investigation for their valuable help and discussions, especially for essential inputs regarding filter design, physical tool design flow and layout work.

#### REFERENCES

- [1] L. H. Corporales, E. Prefasi, E. Pun, S. Paton, "A 1.2-MHz 10-bit Continuous-Time Sigma-Delta ADC Using a Time Encoding Quantizer", IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 56(1), pp. 16-20, Jan. 2009, ISSN: 1549-7747.
- [2] Ronald E. Crochiere, Lawrence R. Rabiner, "Multirate Digital Signal Processing", Prentice-Hall.
- [3] Chi Zhang, Erwin Ofner, "Low Power Non-Recursive Decimation Filters", in Proceedings of ICECS 2007, pp. 804-807, ISBN 1-4244-1378-8.
- [4] F. Tobajas, R. Esper-Chaín, R. Regidor, O. Santana, R. Sarmiento, "A Low Power 2.5 Gbps 1:32 Deserializer in SiGe BiCMOS Technology", 2006 IEEE Design and Diagnostics of Electronic Circuits and Systems.