Entity: butterfly

File: butterfly.v

Diagram

Description

//////////////////////////////////////////////////////////////////////////////

Filename: butterfly.v

Project: A General Purpose Pipelined FFT Implementation

Purpose: This routine caculates a butterfly for a decimation in frequency version of an FFT. Specifically, given complex Left and Right values together with a coefficient, the output of this routine is given by:

    L' = L + R     R' = (L - R)*C
 The rest of the junk below handles timing (mostly), to make certain that L' and R' reach the output at the same clock.  Further, just to make certain that is the case, an 'aux' input exists.  This aux value will come out of this routine synchronized to the values it came in with.  (i.e., both L', R', and aux all have the same delay.)  Hence, a caller of this routine may set aux on the first input with valid data, and then wait to see aux set on the output to know when to find the first output with valid data.
 All bits are preserved until the very last clock, where any more bits than OWIDTH will be quietly discarded.
 This design features no overflow checking.

Notes: CORDIC: Much as we might like, we can't use a cordic here. The goal is to accomplish an FFT, as defined, and a CORDIC places a scale factor onto the data. Removing the scale factor would cost two multiplies, which is precisely what we are trying to avoid.

3-MULTIPLIES:     It should also be possible to do this with three multiplies     and an extra two addition cycles.
     We want         R+I = (a + jb) * (c + jd)         R+I = (ac-bd) + j(ad+bc)     We multiply         P1 = ac         P2 = bd         P3 = (a+b)(c+d)     Then         R+I=(P1-P2)+j(P3-P2-P1)
     WIDTHS:     On multiplying an X width number by an     Y width number, X>Y, the result should be (X+Y)     bits, right?     -2^(X-1) <= a <= 2^(X-1) - 1     -2^(Y-1) <= b <= 2^(Y-1) - 1     (2^(Y-1)-1)*(-2^(X-1)) <= ab <= 2^(X-1)2^(Y-1)     -2^(X+Y-2)+2^(X-1) <= ab <= 2^(X+Y-2) <= 2^(X+Y-1) - 1     -2^(X+Y-1) <= ab <= 2^(X+Y-1)-1     YUP!  But just barely.  Do this and you'll really want     to drop a bit, although you will risk overflow in so     doing.
 20150602 -- The sync logic lines have been completely redone.  The     synchronization lines no longer go through the FIFO with the     left hand sum, but are kept out of memory.  This allows the     butterfly to use more optimal memory resources, while also     guaranteeing that the sync lines can be properly reset upon     any reset signal.

Creator: Dan Gisselquist, Ph.D. Gisselquist Technology, LLC

//////////////////////////////////////////////////////////////////////////////

This file is part of the general purpose pipelined FFT project.

The pipelined FFT project is free software (firmware): you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

The pipelined FFT project is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTIBILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. (It's in the $(ROOT)/doc directory. Run make with no target there if the PDF file isn't present.) If not, see http://www.gnu.org/licenses/ for a copy.

License: LGPL, v3, as defined and found on www.gnu.org, http://www.gnu.org/licenses/lgpl.html

//////////////////////////////////////////////////////////////////////////////

`default_nettype none

Generics

Generic name	Value	Description
IWIDTH	16	Public changeable parameters …
SHIFT	0
CKPCE	1	The number of clocks per each i_ce. The actual number can be more, but the algorithm depends upon at least this many for extra internal processing.
F_CHECK	MPYREMAINDER	F_CHECK will be set externally by the solver, so that we can double check that the solver is actually testing what we think it is testing. We'll set it here to MPYREMAINDER, which will essentially eliminate the check--unless overridden by the solver.

Ports

Port name	Direction	Type
i_clk	input	wire
i_reset	input	wire
i_ce	input	wire
i_coef	input	wire [(2*CWIDTH-1):0]
i_left	input	wire [(2*IWIDTH-1):0]
i_right	input	wire [(2*IWIDTH-1):0]
i_aux	input	wire
o_left	output	wire [(2*OWIDTH-1):0]
o_right	output	wire [(2*OWIDTH-1):0]
o_aux	output

Signals

Name	Type	Description
f_dlyleft_r	reg signed [IWIDTH-1:0]
f_dlyleft_i	reg signed [IWIDTH-1:0]
f_dlyright_r	reg signed [IWIDTH-1:0]
f_dlyright_i	reg signed [IWIDTH-1:0]
f_dlycoeff_r	reg signed [CWIDTH-1:0]
f_dlycoeff_i	reg signed [CWIDTH-1:0]
f_dlyaux	reg signed [F_DEPTH-1:0]
f_predifr	wire [IWIDTH:0]
f_predifi	wire [IWIDTH:0]
f_predifrx	wire [IWIDTH+CWIDTH+3-1:0]
f_predifix	wire [IWIDTH+CWIDTH+3-1:0]
f_sumcoef	wire [CWIDTH:0]
f_sumdiff	wire [IWIDTH+1:0]
f_sumr	wire [IWIDTH:0]
f_sumi	wire [IWIDTH:0]
f_sumrx	wire [IWIDTH+CWIDTH+3-1:0]
f_sumix	wire [IWIDTH+CWIDTH+3-1:0]
f_difr	wire [IWIDTH:0]
f_difi	wire [IWIDTH:0]
f_difrx	wire [IWIDTH+CWIDTH+3-1:0]
f_difix	wire [IWIDTH+CWIDTH+3-1:0]
f_widecoeff_r	wire [IWIDTH+CWIDTH+3-1:0]
f_widecoeff_i	wire [IWIDTH+CWIDTH+3-1:0]
fp_one_ic	wire [(CWIDTH):0]
fp_two_ic	wire [(CWIDTH):0]
fp_three_ic	wire [(CWIDTH):0]
f_p3c_in	wire [(CWIDTH):0]
fp_one_id	wire [(IWIDTH+1):0]
fp_two_id	wire [(IWIDTH+1):0]
fp_three_id	wire [(IWIDTH+1):0]
f_p3d_in	wire [(IWIDTH+1):0]
r_left	reg [(2*IWIDTH-1):0]
r_right	reg [(2*IWIDTH-1):0]
r_coef	reg [(2*CWIDTH-1):0]
r_coef_2	reg [(2*CWIDTH-1):0]
r_left_r	wire [(IWIDTH-1):0]
r_left_i	wire [(IWIDTH-1):0]
r_right_r	wire [(IWIDTH-1):0]
r_right_i	wire [(IWIDTH-1):0]
r_sum_r	reg signed [(IWIDTH):0]
r_sum_i	reg signed [(IWIDTH):0]
r_dif_r	reg signed [(IWIDTH):0]
r_dif_i	reg signed [(IWIDTH):0]
fifo_addr	reg [(LGDELAY-1):0]
fifo_read_addr	wire [(LGDELAY-1):0]
fifo_left	reg [(2*IWIDTH+1):0]
ir_coef_r	wire [(CWIDTH-1):0]
ir_coef_i	wire [(CWIDTH-1):0]
p_one	wire [((IWIDTH+2)+(CWIDTH+1)-1):0]
p_two	wire [((IWIDTH+2)+(CWIDTH+1)-1):0]
p_three	wire [((IWIDTH+2)+(CWIDTH+1)-1):0]
fifo_i	wire [(IWIDTH+CWIDTH):0]	These values are held in memory and delayed during the multiply. Here, we recover them. During the multiply, values were multiplied by 2^(CWIDTH-2)exp{-j2pi…}, therefore, the left_x values need to be right shifted by CWIDTH-2 as well. The additional bits come from a sign extension.
fifo_r	wire [(IWIDTH+CWIDTH):0]	These values are held in memory and delayed during the multiply. Here, we recover them. During the multiply, values were multiplied by 2^(CWIDTH-2)exp{-j2pi…}, therefore, the left_x values need to be right shifted by CWIDTH-2 as well. The additional bits come from a sign extension.
fifo_read	reg [(2*IWIDTH+1):0]
mpy_r	reg signed [(CWIDTH+IWIDTH+3-1):0]
mpy_i	reg signed [(CWIDTH+IWIDTH+3-1):0]
rnd_left_r	wire [(OWIDTH-1):0]	Let's do some rounding and remove unnecessary bits. We have (IWIDTH+CWIDTH+3) bits here, we need to drop down to OWIDTH, and SHIFT by SHIFT bits in the process. The trick is that we don't need (IWIDTH+CWIDTH+3) bits. We've accumulated them, but the actual values will never fill all these bits. In particular, we only need: IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH bits for the coefficient multiply +1 bit for the add/subtract in the complex multiply ------ (IWIDTH+CWIDTH+2) bits at full precision. However, the coefficient multiply multiplied by a maximum value of 2^(CWIDTH-2). Thus, we only have IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH-2 bits for the coefficient multiply +1 (optional) bit for the add/subtract in the cpx mpy. -------- … multiply. (This last bit may be shifted out.) (IWIDTH+CWIDTH) valid output bits. Now, if the user wants to keep any extras of these (via OWIDTH), or if he wishes to arbitrarily shift some of these off (via SHIFT) we accomplish that here.
rnd_left_i	wire [(OWIDTH-1):0]	Let's do some rounding and remove unnecessary bits. We have (IWIDTH+CWIDTH+3) bits here, we need to drop down to OWIDTH, and SHIFT by SHIFT bits in the process. The trick is that we don't need (IWIDTH+CWIDTH+3) bits. We've accumulated them, but the actual values will never fill all these bits. In particular, we only need: IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH bits for the coefficient multiply +1 bit for the add/subtract in the complex multiply ------ (IWIDTH+CWIDTH+2) bits at full precision. However, the coefficient multiply multiplied by a maximum value of 2^(CWIDTH-2). Thus, we only have IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH-2 bits for the coefficient multiply +1 (optional) bit for the add/subtract in the cpx mpy. -------- … multiply. (This last bit may be shifted out.) (IWIDTH+CWIDTH) valid output bits. Now, if the user wants to keep any extras of these (via OWIDTH), or if he wishes to arbitrarily shift some of these off (via SHIFT) we accomplish that here.
rnd_right_r	wire [(OWIDTH-1):0]	Let's do some rounding and remove unnecessary bits. We have (IWIDTH+CWIDTH+3) bits here, we need to drop down to OWIDTH, and SHIFT by SHIFT bits in the process. The trick is that we don't need (IWIDTH+CWIDTH+3) bits. We've accumulated them, but the actual values will never fill all these bits. In particular, we only need: IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH bits for the coefficient multiply +1 bit for the add/subtract in the complex multiply ------ (IWIDTH+CWIDTH+2) bits at full precision. However, the coefficient multiply multiplied by a maximum value of 2^(CWIDTH-2). Thus, we only have IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH-2 bits for the coefficient multiply +1 (optional) bit for the add/subtract in the cpx mpy. -------- … multiply. (This last bit may be shifted out.) (IWIDTH+CWIDTH) valid output bits. Now, if the user wants to keep any extras of these (via OWIDTH), or if he wishes to arbitrarily shift some of these off (via SHIFT) we accomplish that here.
rnd_right_i	wire [(OWIDTH-1):0]	Let's do some rounding and remove unnecessary bits. We have (IWIDTH+CWIDTH+3) bits here, we need to drop down to OWIDTH, and SHIFT by SHIFT bits in the process. The trick is that we don't need (IWIDTH+CWIDTH+3) bits. We've accumulated them, but the actual values will never fill all these bits. In particular, we only need: IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH bits for the coefficient multiply +1 bit for the add/subtract in the complex multiply ------ (IWIDTH+CWIDTH+2) bits at full precision. However, the coefficient multiply multiplied by a maximum value of 2^(CWIDTH-2). Thus, we only have IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH-2 bits for the coefficient multiply +1 (optional) bit for the add/subtract in the cpx mpy. -------- … multiply. (This last bit may be shifted out.) (IWIDTH+CWIDTH) valid output bits. Now, if the user wants to keep any extras of these (via OWIDTH), or if he wishes to arbitrarily shift some of these off (via SHIFT) we accomplish that here.
left_sr	wire [(CWIDTH+IWIDTH+3-1):0]
left_si	wire [(CWIDTH+IWIDTH+3-1):0]
aux_pipeline	reg [(AUXLEN-1):0]
f_startup_counter	reg [F_LGDEPTH:0]

Constants

Name	Type	Value	Description
MXMPYBITS		+2	Local/derived parameters that are calculated from the above params. Apart from algorithmic changes below, these should not be adjusted The first step is to calculate how many clocks it takes our multiply to come back with an answer within. The time in the multiply depends upon the input value with the fewest number of bits--to keep the pipeline depth short. So, let's find the fewest number of bits here.
LCLDELAY		MPYDELAY	In an environment when CKPCE > 1, the multiply delay isn't necessarily the delay felt by this algorithm--measured in i_ce's. In particular, if the multiply can operate with more operations per clock, it can appear to finish "faster". Since most of the logic in this core operates on the slower clock, we'll need to map that speed into the number of slower clock ticks that it takes.
LGDELAY
AUXLEN		undefined
MPYREMAINDER		MPYDELAY - CKPCE
F_LGDEPTH
F_DEPTH		AUXLEN
F_D	[F_LGDEPTH-1:0]	-1

Processes

unnamed: ( @(posedge i_clk) )

Type: always

Description
Set up the input to the multiply

unnamed: ( @(posedge i_clk) )

Type: always

unnamed: ( @(posedge i_clk) )

Type: always

unnamed: ( @(posedge i_clk) )

Type: always

unnamed: ( @(posedge i_clk) )

Type: always

unnamed: ( @(posedge i_clk) )

Type: always

unnamed: ( @(posedge i_clk) )

Type: always

unnamed: ( @(posedge i_clk) )

Type: always

unnamed: ( @(posedge i_clk) )

Type: always

unnamed: ( @(*) )

Type: always

unnamed: ( @(*) )

Type: always

unnamed: ( @(posedge i_clk) )

Type: always

unnamed: ( @(*) )

Type: always

Description
Let's see if we can improve our performance at all by moving our test one clock earlier. If nothing else, it should help induction finish one (or more) clocks ealier than otherwise

unnamed: ( @(*) )

Type: always

unnamed: ( @(posedge i_clk) )

Type: always

Description
Induction helpers

unnamed: ( @(*) )

Type: always

Instantiations

do_rnd_left_r: convround

do_rnd_left_i: convround

do_rnd_right_r: convround

do_rnd_right_i: convround