Design And Analysis Of Modified Fast Compressors For Mac


MAC USING COMPRESSOR BASED MULTIPLIER AND CARRY SAVE ADDER

Nagamanohar Tenali, M.Tech Student, Department of ECE, Audisankara College of Engineering and Technology, Gudur, India. Email: mano09441@gmail.com.


Mukesh Gangala, Assistant Professor, Department of ECE, Audisankara College of Engineering and Technology, Gudur, India. Email: mukeshgangala442@gmail.com.

ABSTRACT: Power dissipation is recognized as a critical parameter in modern VLSI design. This paper presents a low-power, compressor-based Multiply-Accumulate (MAC) architecture for DSP applications.

In VLSI, arithmetic cells such as adders and multipliers are the most heavily used components. An efficient MAC architecture using a modified Wallace tree multiplier is proposed.

The proposed MAC uses a multiplier built from novel compressor designs and a carry save adder for fast, low-power operation. The proposed low-power compressor architecture was applied to the MAC unit and compared against conventional compressor-based MAC units; the proposed architecture showed a significant reduction in delay and power.

Index Terms: Multiply Accumulate, Compressor, Wallace Tree multiplier, CSA.

INTRODUCTION

The increasing demand for portable systems and the need to limit power consumption and heat dissipation in very-high-density chips have led to rapid developments in low-power design in recent times. Battery lifetime also depends on the overall power consumption of the portable system. Hence, reducing the power dissipation of integrated circuits through design improvements is a major challenge in portable systems design.


The need for low-power design is also an issue in high-performance digital systems such as microprocessors, digital signal processors (DSPs) and other applications. In digital VLSI circuits, computation is the critical part, and it determines the power consumption and operating speed of a design. Arithmetic circuits for computation involve adders and multipliers, which are the most heavily used components. Digital signal processors performing filtering, convolution and similar operations rely on efficient implementations of these adder, multiplier and MAC arithmetic units. A low-power compressor architecture is proposed in this brief to reduce the power consumption of the MAC architecture, since it contains a large number of compressors. The impact of circuit-level design and data path optimizations is addressed at the MAC level for DSP applications.

In the MAC, the carry-propagate additions of the multiplier and accumulate stages are additionally merged, which increases the number of compressors in the MAC architecture. Designs were implemented in the ASIC and FPGA domains following the standard design methodology.

CONVENTIONAL COMPRESSOR ALGORITHM

Multipliers consume a large amount of power and delay during partial product addition. At this stage, most multipliers are designed with different kinds of adders that can add two, three or at most four bits, the latter by using 4-2 compressors.

For higher order multiplications, a huge number of adders or compressors are used to perform the partial product addition. We have minimized the number of adders by introducing different compressors. The conventional 4-2 compressor structure actually compresses five partial product bits into three. The architecture can be implemented with two stages of full adder (FA) connected in series as shown in Fig.

The outputs of the 4-2 compressor consist of one bit in position j and two bits in position (j + 1). This straightforward approach has a delay of four XOR gates.

ISBN-13: 9697 www.iaetsd.in Proceedings of ICRMET-2016 ©IAETSD 2016, page 64.

Fig. 1 Conventional 4-2 compressor

This implementation is better, and its delay is that of three XOR gates. A 5-2 compressor follows similar logic.

The problems with this kind of conventional compressor are:
(i) The uneven delay profile of the outputs arriving from different input paths tends to generate many glitches.
(ii) Compressors perform the simple operation of adding several bits at a time, but the conventional 4-2 compressor requires one more half adder, whose two inputs are COUT and C (shown in Figure 2), to produce the final addition result. Example: if X1 = X2 = X3 = X4 = 1 and CIN = 0 (in Figure 1), the addition result should be four, i.e. 100, but the conventional architecture produces COUT = 1, C = 1 and S = 0. If COUT and C are then fed to a half adder, it produces the final result in exact binary form, as shown in Figure 2.

(iii) A 4-2 compressor requires a half adder, but a 5-2 compressor requires a full adder, because a 5-2 compressor is implemented as a series connection of three full adders that generates three carry output bits in position 'j+1' and one sum bit in position 'j', as shown in Figure 3. Thus this conventional logic not only increases the critical path delay but also increases the number of output bits.

Fig. 2 Modified 4-2 compressor
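The worked example in item (ii) can be checked with a small behavioural model. The sketch below is only an illustration in Python, not the authors' Verilog; the function names (`full_adder`, `half_adder`, `compressor_4_2`, `to_binary_form`) are hypothetical, and the wiring assumes the two-full-adder series structure described in the text.

```python
from itertools import product

def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    total = a + b + c
    return total & 1, total >> 1

def half_adder(a, b):
    """Return (sum, carry) for two input bits."""
    total = a + b
    return total & 1, total >> 1

def compressor_4_2(x1, x2, x3, x4, cin):
    """Conventional 4-2 compressor modelled as two full adders in series.
    Returns (s, c, cout): s has weight 1, c and cout both have weight 2."""
    s1, cout = full_adder(x1, x2, x3)
    s, c = full_adder(s1, x4, cin)
    return s, c, cout

def to_binary_form(s, c, cout):
    """Extra half adder that merges the two weight-2 carries.
    Returns bits of weight (4, 2, 1)."""
    b1, b2 = half_adder(c, cout)
    return b2, b1, s

# Exhaustive check: the weighted outputs always equal the number of ones.
for bits in product((0, 1), repeat=5):
    s, c, cout = compressor_4_2(*bits)
    assert s + 2 * (c + cout) == sum(bits)
    b2, b1, b0 = to_binary_form(s, c, cout)
    assert 4 * b2 + 2 * b1 + b0 == sum(bits)

print(compressor_4_2(1, 1, 1, 1, 0))  # (0, 1, 1): S=0, C=1, COUT=1
print(to_binary_form(0, 1, 1))        # (1, 0, 0): binary 100
```

The model is purely functional: it says nothing about gate delay or glitching, which are the properties the text is actually concerned with.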

Fig. 3 Conventional 5-2 compressor

Because the sum bit of a conventional compressor has weight 1 and the carry bits have weight 2, the results produced by these compressors are correct but not in proper binary form. When conventional compressors are used in a multiplier to achieve high speed, one half adder or full adder is required per compressor to process the carry bits, which slows the operation. The extra half adder or full adder needed to obtain the final result therefore adds delay and power to the reported results.
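The weight argument for the 5-2 compressor can be illustrated the same way. This is a minimal Python sketch under the assumption that the three full adders are chained as in the description above, with two carry inputs from the neighbouring column; the names `compressor_5_2` and `to_binary_form_5_2` are hypothetical.

```python
from itertools import product

def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    total = a + b + c
    return total & 1, total >> 1

def compressor_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    """Conventional 5-2 compressor modelled as three full adders in series.
    Returns (s, carry, cout1, cout2): s has weight 1, the three carries weight 2."""
    s1, cout1 = full_adder(x1, x2, x3)
    s2, cout2 = full_adder(s1, x4, cin1)
    s, carry = full_adder(s2, x5, cin2)
    return s, carry, cout1, cout2

def to_binary_form_5_2(s, carry, cout1, cout2):
    """Extra full adder that merges the three weight-2 carries.
    Returns bits of weight (4, 2, 1)."""
    b1, b2 = full_adder(carry, cout1, cout2)
    return b2, b1, s

# Exhaustive check over all 2**7 input combinations.
for bits in product((0, 1), repeat=7):
    s, carry, cout1, cout2 = compressor_5_2(*bits)
    assert s + 2 * (carry + cout1 + cout2) == sum(bits)
    b2, b1, b0 = to_binary_form_5_2(s, carry, cout1, cout2)
    assert 4 * b2 + 2 * b1 + b0 == sum(bits)
```

The raw outputs are four bits (one of weight 1, three of weight 2), and only the additional full adder turns them into a three-bit binary number.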


MULTIPLICATION LOGIC

Consider an example of 8-bit multiplication in which the 8-bit multiplicand is X7X6X5X4X3X2X1X0 and the multiplier is Y7Y6Y5Y4Y3Y2Y1Y0. The multiplication process is shown in Figure 4. It requires 64 AND gates.

First, Y0 is multiplied with X7X6X5X4X3X2X1X0, giving X0Y0, X1Y0, X2Y0, X3Y0, X4Y0, X5Y0, X6Y0 and X7Y0. Next, Y1 is multiplied with X7X6X5X4X3X2X1X0, giving X0Y1, X1Y1, X2Y1, X3Y1, X4Y1, X5Y1, X6Y1 and X7Y1. All remaining multiplications proceed in the same way, and each step is shifted by one binary position relative to the previous one. The AND terms are labelled sequentially from K0 to K63, as shown in Figure 5. The addition is then carried out by a tree built from 3:2, 4:2 and 5:2 compressors. The addition is possible with 3:2 compressors alone, but using 4:2 and 5:2 compressors reduces the latency and increases the speed.
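The partial product generation described above can be sketched in a few lines of Python. The mapping of the labels K0 to K63 onto the AND terms (K[8j + i] = Xi AND Yj) is an assumption made here for illustration, since the text only says the labels are assigned sequentially; the function names are hypothetical.

```python
def partial_products(x, y, n=8):
    """Return the n*n AND terms as a flat list k, where
    k[n*j + i] = X_i AND Y_j (assumed labelling of K0..K63)."""
    return [((x >> i) & 1) & ((y >> j) & 1)
            for j in range(n) for i in range(n)]

def multiply_from_partial_products(x, y, n=8):
    """Sum every AND term at its binary weight 2**(i + j).
    Row j is shifted left by j positions, as in the text."""
    k = partial_products(x, y, n)
    return sum(k[n * j + i] << (i + j) for j in range(n) for i in range(n))

k = partial_products(0b10110101, 0b01101110)
print(len(k))  # 64 AND terms for an 8x8 multiplication
assert multiply_from_partial_products(0b10110101, 0b01101110) == 0b10110101 * 0b01101110
```

Here the final summation uses ordinary integer addition; in hardware that step is what the compressor tree replaces.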

In this process, the sum output of an intermediate compressor is an input to the next compressor in the same column, and the carries generated by the adders are propagated to the adders of the next column. The result has 16 bits, represented by P15 to P0. Several popular and well-known schemes have been developed in the past with the objective of improving the speed of the parallel multiplier. Wallace introduced a very important iterative realization of the parallel multiplier. This advantage becomes more pronounced for multipliers larger than 16 bits. In the Wallace tree architecture, all the bits of all the partial products in each column are added together by a set of counters in parallel, without propagating any carries.
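The column-wise reduction can be sketched as follows. For brevity this Python illustration uses only 3:2 compressors (full adders), whereas the multiplier described here also uses 4:2 and 5:2 compressors; the function name `wallace_multiply` and the greedy grouping of bits are assumptions of this sketch, not the authors' exact tree.

```python
def wallace_multiply(x, y, n=8):
    """Reduce the partial product columns with 3:2 compressors until every
    column holds at most two bits, then add the two remaining rows.
    Returns (product, number_of_reduction_stages)."""
    # columns[w] holds every bit of weight 2**w
    columns = [[] for _ in range(2 * n)]
    for j in range(n):
        for i in range(n):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))

    stages = 0
    while any(len(col) > 2 for col in columns):
        nxt = [[] for _ in range(len(columns) + 1)]
        for w, col in enumerate(columns):
            k = 0
            # Each group of three bits goes through one 3:2 compressor:
            # the sum stays in column w, the carry moves to column w + 1.
            while len(col) - k >= 3:
                total = col[k] + col[k + 1] + col[k + 2]
                nxt[w].append(total & 1)
                nxt[w + 1].append(total >> 1)
                k += 3
            nxt[w].extend(col[k:])  # leftover bits pass through unchanged
        columns = nxt
        stages += 1

    # Final carry-propagate addition, modelled with integer arithmetic.
    result = sum(bit << w for w, col in enumerate(columns) for bit in col)
    return result, stages

product_value, stages = wallace_multiply(201, 173)
assert product_value == 201 * 173
```

Within one stage every column is processed independently of the others, which is the sense in which no carries are propagated until the final addition.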

The advantage of the Wallace tree is speed, because the addition of partial products is now O(log N).

Multiplier (8 bits)

Fig 5. Multiplication Wallace logic tree

CARRY SAVE ADDER (CSA)

The Carry Save Adder (CSA) is a type of digital adder used to compute the sum of three or more bits in binary form. A CSA has a lower propagation delay than a ripple carry adder (RCA), and it also avoids the glitching problem of the RCA.

The representation of an 8-bit CSA is shown in Figure 6. Here we compute the sum of two 8-bit binary numbers, so 8 half adders are required at the first stage instead of 8 full adders, since only the bits of two binary numbers are added. If P and Q are two 8-bit numbers, the first stage produces the partial sums $S_i$ and carries $C_i$, where $S_i = P_i \oplus Q_i$ and $C_i = P_i \cdot Q_i$. A CSA produces all of its output values in parallel, so the computation time is reduced compared to an RCA. A Parallel in Parallel out (PIPO) register is used in the accumulator stage.

Fig 6: A Typical 8 bit Carry Save Adder
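The first stage of half adders can be modelled with bitwise operations on whole words, which makes the "all outputs in parallel" property easy to see. This is a minimal Python sketch; the function name `half_adder_stage` is hypothetical.

```python
def half_adder_stage(p, q, n=8):
    """First CSA stage for two n-bit numbers: one half adder per bit position.
    Returns (s, c) with S_i = P_i XOR Q_i and C_i = P_i AND Q_i."""
    mask = (1 << n) - 1
    s = (p ^ q) & mask
    c = (p & q) & mask
    return s, c

s, c = half_adder_stage(0b10110101, 0b01101110)
# Every carry bit has twice the weight of its sum bit, so P + Q = S + 2*C.
assert 0b10110101 + 0b01101110 == s + (c << 1)
```

No bit of `s` or `c` depends on any other bit position, so there is no carry chain in this stage; the carries still have to be resolved by a later addition.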

MAC UNIT

The Multiplier-Accumulator (MAC) operation is the key operation not only in DSP applications but also in multimedia information processing and various other applications. As mentioned above, a MAC unit consists of a multiplier, an adder and a register/accumulator. In this paper, we use an 8-bit modified Wallace multiplier. The MAC unit takes its inputs from a memory location such as RAM and passes them to the multiplier block. This is very useful in an 8-bit digital signal processor.

The inputs fed from the memory location are 8 bits wide. When the inputs are given to the multiplier, it computes the product of the two 8-bit values, so its output is 16 bits wide. The multiplier output is given as an input to the carry save adder (CSA), which performs the addition.
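The data flow described above (8-bit operands, a 16-bit product, then an addition whose result is held in an accumulator) can be sketched behaviourally. This Python model is an illustration only: the class name `MacUnit` and its methods are hypothetical, and the accumulator is an unbounded Python integer, so hardware register widths are not modelled.

```python
class MacUnit:
    """Behavioural multiply-accumulate model for 8-bit unsigned operands."""

    def __init__(self):
        self.acc = 0  # accumulator register contents

    def step(self, a, b):
        """Multiply two 8-bit operands and add the product to the accumulator."""
        if not (0 <= a < 256 and 0 <= b < 256):
            raise ValueError("operands must be 8-bit unsigned values")
        product = a * b              # at most 255 * 255 = 65025, fits in 16 bits
        self.acc = self.acc + product  # adder stage, result fed back next step
        return self.acc

def mac(a_values, b_values):
    """Accumulate the products of paired operands."""
    unit = MacUnit()
    for a, b in zip(a_values, b_values):
        unit.step(a, b)
    return unit.acc

print(mac([1, 2, 3], [4, 5, 6]))  # 32
```

In the hardware, the multiplication inside `step` is performed by the compressor-based Wallace multiplier and the addition by the CSA.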

The function of the MAC unit is given by the following equation:

$$Y = \sum_{i=0}^{7} A_i \times B_i \qquad (1)$$

Fig 7: Block Diagram of MAC Unit

where $A_i$ and $B_i$ are two 8-bit input operands, Y is the output of the MAC unit, and i runs from 0 to 7. This equation performs the summation of the products. The Carry Save Adder (CSA) produces a 17-bit output.

One bit is for the carry (16 bits + 1 bit). The output of the CSA is then given to the accumulator register. The accumulator is designed as a Parallel in Parallel out (PIPO) register, because the CSA produces its output in parallel form and the word is wide. In a PIPO register the input bits are loaded in parallel and the output is taken in parallel. The output of the accumulator register is either taken out or fed back as one of the inputs to the CSA. Fig. 7 shows the basic architecture of the MAC unit.

RESULTS

[Figures: Block diagram; RTL schematic diagram]

[Figures: Technology schematic; Comparison table; Simulation output waveform]

CONCLUSION

A high-performance 8-bit MAC unit is designed and implemented using a compressor-based Wallace tree multiplier and a carry save adder. Compared with MAC units developed earlier using other combinations of multipliers and adders, the designed compressor-based Wallace tree multiplier offers lower delay and lower power dissipation, which increases the overall speed of the MAC unit. The MAC unit is designed in Verilog HDL and synthesized using Xilinx ISE 14.3.


BIOGRAPHIES

TENALI NAGAMANOHAR is currently a PG scholar in VLSI at Audisankara College of Engineering and Technology, Gudur (Autonomous), SPSR Nellore (Dist), affiliated to JNTU Anantapur. He received his B.Tech from Ramireddy Subbaramireddy Engineering College, Kadanuthala, SPSR Nellore (Dist). His current research interests include Analysis and VLSI System Design.

MUKESH GANGALA received his B.Tech from Narayana Engineering College, Gudur, SPSR Nellore (Dist), AP in 2009. He completed his M.Tech at PBR VITS, Kavali, SPSR Nellore (Dist), AP in 2013. He has 3 years of teaching experience at Audisankara College of Engineering & Technology (Autonomous), Gudur, SPSR Nellore, AP. His areas of interest are Embedded Systems and VLSI.

The multiplier is one of the most commonly used circuits in digital devices. Multiplication is one of the basic functions used in digital signal processing. Most high-performance DSP systems rely on hardware multiplication to achieve high data throughput. In this paper we implement a Gate Diffusion Input (GDI) based compressor. We compare CMOS and GDI implementations of the 5-3 compressor and present the design and implementation of a 15-4 compressor.

The simulation results show better performance in terms of power and delay. The simulations are run in Modelsim 10.1, and schematics and layouts are generated in the DSCH and Microwind tools.

The multiplier is one of the most commonly used circuits in digital devices. Multiplication is one of the basic functions used in digital signal processing.

Most high-performance DSP systems rely on hardware multiplication to achieve high data throughput. Various types of multipliers are available, depending on the application in which they are used. The full adder is the main source of power dissipation in a multiplier, so reducing the power dissipation of the full adder ultimately reduces the power dissipation of the multiplier. A compressor is simply an adder circuit.


It takes as inputs a number of equally-weighted bits, adds them, and produces as output the sum, in the form of a bit with the same weight as the inputs and one or more bits that have a value greater than that of the inputs. Compressors are commonly used to reduce a large number of inputs to a smaller number, such as in a multiplier, where they are used to reduce the many partial products to a final summed value. The reduction in the number of individual bits representing the value leads to the name compressor. For higher order multiplications, a huge number of adders or compressors are to be used to perform the partial product addition.

We have reduced the number of adders by introducing a special kind of adder that can add five, six or seven bits per decade. These adders are called compressors. Compressors are major components of present multiplier designs. In multipliers, the maximum amount of power is consumed during partial product addition. Using compressor adders that can add four, five, six or seven bits at a time, the number of full adders and half adders can be reduced, and thus the area and power consumed are also reduced. Many compressors are available, e.g. 3-2, 4-2, 5-2 and 5-3 compressors, for applications such as partial product summation in a multiplier. In this project a full adder 5-3 compressor is used. The full adder 5-3 compressor is faster and consumes less power. For such operations, small compressors like 5-2 and 7-2 would not give better performance in terms of speed and power. To overcome this drawback, Gate Diffusion Input technology is used.


The conventional 5-3 compressor architecture is shown in fig 1. The conventional 5-3 compressor has five inputs and three outputs: it compresses five partial products into three outputs. It consists of five XOR gates, two MUXes and one AND gate.
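Functionally, a 5-3 compressor counts how many of its five equally weighted inputs are 1 and reports that count as a three-bit number. The Python sketch below models only this behaviour; it does not reproduce the XOR/MUX/AND netlist of fig 1, and the function name and the output naming (O2, O1, O0, with O0 the least significant bit) are assumptions made for illustration.

```python
from itertools import product

def compressor_5_3(x0, x1, x2, x3, x4):
    """Behavioural 5-3 compressor: return (o2, o1, o0), the binary count
    of ones among five equally weighted input bits."""
    count = x0 + x1 + x2 + x3 + x4
    o0 = x0 ^ x1 ^ x2 ^ x3 ^ x4   # least significant bit is the parity
    o1 = (count >> 1) & 1
    o2 = (count >> 2) & 1
    return o2, o1, o0

# Exhaustive check over all 32 input combinations.
for bits in product((0, 1), repeat=5):
    o2, o1, o0 = compressor_5_3(*bits)
    assert 4 * o2 + 2 * o1 + o0 == sum(bits)

print(compressor_5_3(1, 1, 1, 1, 1))  # (1, 0, 1): five ones = binary 101
```

Because the least significant output is the parity of all five inputs, it is naturally produced by a chain of XOR gates.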

Three XOR gates are required to generate O0. An XOR gate has a longer critical delay than any other gate, so the conventional 5-3 compressor has more delay and consumes more power. Further optimization of this compressor is possible, and the proposed design gives better performance than the conventional structure. It must be remarked that not all of the functions are possible in a standard p-well CMOS process, but they can be successfully implemented in twin-well CMOS or silicon on insulator (SOI) technologies. A simple change of the input configuration of the basic GDI (Gate Diffusion Input) cell corresponds to very different Boolean functions.
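The effect of changing the input configuration can be illustrated with an idealised logic-level model of the GDI cell. This Python sketch assumes the usual three-terminal description of the cell (a common gate input G, a P input at the PMOS source and an N input at the NMOS source) and the commonly published function table; neither is reproduced in this text, so the terminal assignments and function names below should be read as assumptions. Threshold drops and bulk biasing are ignored.

```python
from itertools import product

def gdi_cell(g, p, n):
    """Idealised GDI cell: the PMOS passes P when G = 0,
    the NMOS passes N when G = 1."""
    return n if g else p

def f1(a, b):
    return gdi_cell(a, b, 0)      # (NOT A) AND B

def f2(a, b):
    return gdi_cell(a, 1, b)      # (NOT A) OR B

def gdi_or(a, b):
    return gdi_cell(a, b, 1)      # A OR B

def gdi_and(a, b):
    return gdi_cell(a, 0, b)      # A AND B

def gdi_mux(a, b, c):
    return gdi_cell(a, b, c)      # B when A = 0, C when A = 1

def gdi_not(a):
    return gdi_cell(a, 1, 0)      # NOT A

for a, b in product((0, 1), repeat=2):
    assert f1(a, b) == ((1 - a) & b)
    assert f2(a, b) == ((1 - a) | b)
    assert gdi_or(a, b) == (a | b)
    assert gdi_and(a, b) == (a & b)
```

All six functions use the same two-transistor cell; only the signals tied to the P and N terminals differ.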

Most of these functions are complex (6-12 transistors) in CMOS, as well as in standard PTL implementations, but very simple (only 2 transistors per function) in the GDI (Gate Diffusion Input) design method. Most of the designed circuits are based on the F1 and F2 functions, for the following reasons.
1) Both F1 and F2 are complete logic families (they allow realization of any possible two-input logic function).
2) F1 is the only GDI function that can be realized in a standard p-well CMOS process, because the bulk of any NMOS is constantly and equally biased.
3) When the N input is driven to a high logic level and the P input is at a low logic level, the diodes between the NMOS and PMOS bulks and the output are directly polarized and there is a short between N and P, resulting in static power dissipation.

The basic architecture of the 15-4 compressor [2] is shown in fig 3. It compresses 15 partial products into four outputs.

It has five full adders, two 5-3 compressors and one parallel adder. Each full adder compresses three partial products into a sum and a carry. The sums from the five full adders are compressed by one proposed 5-3 compressor, and the carry outputs are compressed by the other proposed 5-3 compressor. The parallel adder adds the outputs of the two 5-3 compressors. Inputs B3 and A0 of the 4-bit parallel adder are grounded. The parallel adder circuit is shown in fig 4.
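The structure described above can be checked with a behavioural Python sketch. The text states only that B3 and A0 are grounded; which operand carries the compressed sums and which the compressed carries is an assumption here (the carry count, having twice the weight, is placed on A1 to A3 and the sum count on B0 to B2). All function names are hypothetical.

```python
from itertools import product

def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    total = a + b + c
    return total & 1, total >> 1

def compressor_5_3(bits):
    """Behavioural 5-3 compressor: (o2, o1, o0) = count of ones in five bits."""
    count = sum(bits)
    return (count >> 2) & 1, (count >> 1) & 1, count & 1

def parallel_adder_4bit(a, b):
    """Ripple addition of two 4-bit words given as LSB-first lists.
    Returns (sum_bits_lsb_first, carry_out)."""
    carry, out = 0, []
    for ai, bi in zip(a, b):
        s, carry = full_adder(ai, bi, carry)
        out.append(s)
    return out, carry

def compressor_15_4(bits):
    """15-4 compressor: five full adders, two 5-3 compressors, one 4-bit adder.
    Returns (o3, o2, o1, o0), the binary count of ones among 15 inputs."""
    assert len(bits) == 15
    sums, carries = [], []
    for k in range(0, 15, 3):
        s, c = full_adder(*bits[k:k + 3])
        sums.append(s)
        carries.append(c)
    s2, s1, s0 = compressor_5_3(sums)      # weight-1 count
    c2, c1, c0 = compressor_5_3(carries)   # weight-2 count
    a = [0, c0, c1, c2]   # A0 grounded: carry count shifted left by one
    b = [s0, s1, s2, 0]   # B3 grounded
    out, _ = parallel_adder_4bit(a, b)
    return out[3], out[2], out[1], out[0]

# Exhaustive check over all 2**15 input combinations.
for bits in product((0, 1), repeat=15):
    o3, o2, o1, o0 = compressor_15_4(bits)
    assert 8 * o3 + 4 * o2 + 2 * o1 + o0 == sum(bits)
```

Since at most 15 inputs can be 1, the total always fits in four bits and the carry out of the parallel adder is never set.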