Entity: butterfly
- File: butterfly.v
Diagram
Description
//////////////////////////////////////////////////////////////////////////////
Filename: butterfly.v
Project: A General Purpose Pipelined FFT Implementation
Purpose: This routine caculates a butterfly for a decimation in frequency version of an FFT. Specifically, given complex Left and Right values together with a coefficient, the output of this routine is given by:
L' = L + R R' = (L - R)*C
The rest of the junk below handles timing (mostly), to make certain that L' and R' reach the output at the same clock. Further, just to make certain that is the case, an 'aux' input exists. This aux value will come out of this routine synchronized to the values it came in with. (i.e., both L', R', and aux all have the same delay.) Hence, a caller of this routine may set aux on the first input with valid data, and then wait to see aux set on the output to know when to find the first output with valid data.
All bits are preserved until the very last clock, where any more bits than OWIDTH will be quietly discarded.
This design features no overflow checking.
Notes: CORDIC: Much as we might like, we can't use a cordic here. The goal is to accomplish an FFT, as defined, and a CORDIC places a scale factor onto the data. Removing the scale factor would cost two multiplies, which is precisely what we are trying to avoid.
3-MULTIPLIES: It should also be possible to do this with three multiplies and an extra two addition cycles.
We want R+I = (a + jb) * (c + jd) R+I = (ac-bd) + j(ad+bc) We multiply P1 = ac P2 = bd P3 = (a+b)(c+d) Then R+I=(P1-P2)+j(P3-P2-P1)
WIDTHS: On multiplying an X width number by an Y width number, X>Y, the result should be (X+Y) bits, right? -2^(X-1) <= a <= 2^(X-1) - 1 -2^(Y-1) <= b <= 2^(Y-1) - 1 (2^(Y-1)-1)*(-2^(X-1)) <= ab <= 2^(X-1)2^(Y-1) -2^(X+Y-2)+2^(X-1) <= ab <= 2^(X+Y-2) <= 2^(X+Y-1) - 1 -2^(X+Y-1) <= ab <= 2^(X+Y-1)-1 YUP! But just barely. Do this and you'll really want to drop a bit, although you will risk overflow in so doing.
20150602 -- The sync logic lines have been completely redone. The synchronization lines no longer go through the FIFO with the left hand sum, but are kept out of memory. This allows the butterfly to use more optimal memory resources, while also guaranteeing that the sync lines can be properly reset upon any reset signal.
Creator: Dan Gisselquist, Ph.D. Gisselquist Technology, LLC
//////////////////////////////////////////////////////////////////////////////
Copyright (C) 2015-2019, Gisselquist Technology, LLC
This file is part of the general purpose pipelined FFT project.
The pipelined FFT project is free software (firmware): you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
The pipelined FFT project is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTIBILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this program. (It's in the $(ROOT)/doc directory. Run make with no target there if the PDF file isn't present.) If not, see http://www.gnu.org/licenses/ for a copy.
License: LGPL, v3, as defined and found on www.gnu.org, http://www.gnu.org/licenses/lgpl.html
//////////////////////////////////////////////////////////////////////////////
`default_nettype none
Generics
Generic name | Type | Value | Description |
---|---|---|---|
IWIDTH | 16 | Public changeable parameters … | |
SHIFT | 0 | ||
CKPCE | 1 | The number of clocks per each i_ce. The actual number can be more, but the algorithm depends upon at least this many for extra internal processing. | |
F_CHECK | MPYREMAINDER | F_CHECK will be set externally by the solver, so that we can double check that the solver is actually testing what we think it is testing. We'll set it here to MPYREMAINDER, which will essentially eliminate the check--unless overridden by the solver. |
Ports
Port name | Direction | Type | Description |
---|---|---|---|
i_clk | input | wire | |
i_reset | input | wire | |
i_ce | input | wire | |
i_coef | input | wire [(2*CWIDTH-1):0] | |
i_left | input | wire [(2*IWIDTH-1):0] | |
i_right | input | wire [(2*IWIDTH-1):0] | |
i_aux | input | wire | |
o_left | output | wire [(2*OWIDTH-1):0] | |
o_right | output | wire [(2*OWIDTH-1):0] | |
o_aux | output |
Signals
Name | Type | Description |
---|---|---|
f_dlyleft_r | reg signed [IWIDTH-1:0] | |
f_dlyleft_i | reg signed [IWIDTH-1:0] | |
f_dlyright_r | reg signed [IWIDTH-1:0] | |
f_dlyright_i | reg signed [IWIDTH-1:0] | |
f_dlycoeff_r | reg signed [CWIDTH-1:0] | |
f_dlycoeff_i | reg signed [CWIDTH-1:0] | |
f_dlyaux | reg signed [F_DEPTH-1:0] | |
f_predifr | wire [IWIDTH:0] | |
f_predifi | wire [IWIDTH:0] | |
f_predifrx | wire [IWIDTH+CWIDTH+3-1:0] | |
f_predifix | wire [IWIDTH+CWIDTH+3-1:0] | |
f_sumcoef | wire [CWIDTH:0] | |
f_sumdiff | wire [IWIDTH+1:0] | |
f_sumr | wire [IWIDTH:0] | |
f_sumi | wire [IWIDTH:0] | |
f_sumrx | wire [IWIDTH+CWIDTH+3-1:0] | |
f_sumix | wire [IWIDTH+CWIDTH+3-1:0] | |
f_difr | wire [IWIDTH:0] | |
f_difi | wire [IWIDTH:0] | |
f_difrx | wire [IWIDTH+CWIDTH+3-1:0] | |
f_difix | wire [IWIDTH+CWIDTH+3-1:0] | |
f_widecoeff_r | wire [IWIDTH+CWIDTH+3-1:0] | |
f_widecoeff_i | wire [IWIDTH+CWIDTH+3-1:0] | |
fp_one_ic | wire [(CWIDTH):0] | |
fp_two_ic | wire [(CWIDTH):0] | |
fp_three_ic | wire [(CWIDTH):0] | |
f_p3c_in | wire [(CWIDTH):0] | |
fp_one_id | wire [(IWIDTH+1):0] | |
fp_two_id | wire [(IWIDTH+1):0] | |
fp_three_id | wire [(IWIDTH+1):0] | |
f_p3d_in | wire [(IWIDTH+1):0] | |
r_left | reg [(2*IWIDTH-1):0] | |
r_right | reg [(2*IWIDTH-1):0] | |
r_coef | reg [(2*CWIDTH-1):0] | |
r_coef_2 | reg [(2*CWIDTH-1):0] | |
r_left_r | wire [(IWIDTH-1):0] | |
r_left_i | wire [(IWIDTH-1):0] | |
r_right_r | wire [(IWIDTH-1):0] | |
r_right_i | wire [(IWIDTH-1):0] | |
r_sum_r | reg signed [(IWIDTH):0] | |
r_sum_i | reg signed [(IWIDTH):0] | |
r_dif_r | reg signed [(IWIDTH):0] | |
r_dif_i | reg signed [(IWIDTH):0] | |
fifo_addr | reg [(LGDELAY-1):0] | |
fifo_read_addr | wire [(LGDELAY-1):0] | |
fifo_left | reg [(2*IWIDTH+1):0] | |
ir_coef_r | wire [(CWIDTH-1):0] | |
ir_coef_i | wire [(CWIDTH-1):0] | |
p_one | wire [((IWIDTH+2)+(CWIDTH+1)-1):0] | |
p_two | wire [((IWIDTH+2)+(CWIDTH+1)-1):0] | |
p_three | wire [((IWIDTH+2)+(CWIDTH+1)-1):0] | |
fifo_i | wire [(IWIDTH+CWIDTH):0] | These values are held in memory and delayed during the multiply. Here, we recover them. During the multiply, values were multiplied by 2^(CWIDTH-2)exp{-j2pi…}, therefore, the left_x values need to be right shifted by CWIDTH-2 as well. The additional bits come from a sign extension. |
fifo_r | wire [(IWIDTH+CWIDTH):0] | These values are held in memory and delayed during the multiply. Here, we recover them. During the multiply, values were multiplied by 2^(CWIDTH-2)exp{-j2pi…}, therefore, the left_x values need to be right shifted by CWIDTH-2 as well. The additional bits come from a sign extension. |
fifo_read | reg [(2*IWIDTH+1):0] | |
mpy_r | reg signed [(CWIDTH+IWIDTH+3-1):0] | |
mpy_i | reg signed [(CWIDTH+IWIDTH+3-1):0] | |
rnd_left_r | wire [(OWIDTH-1):0] | Let's do some rounding and remove unnecessary bits. We have (IWIDTH+CWIDTH+3) bits here, we need to drop down to OWIDTH, and SHIFT by SHIFT bits in the process. The trick is that we don't need (IWIDTH+CWIDTH+3) bits. We've accumulated them, but the actual values will never fill all these bits. In particular, we only need: IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH bits for the coefficient multiply +1 bit for the add/subtract in the complex multiply ------ (IWIDTH+CWIDTH+2) bits at full precision. However, the coefficient multiply multiplied by a maximum value of 2^(CWIDTH-2). Thus, we only have IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH-2 bits for the coefficient multiply +1 (optional) bit for the add/subtract in the cpx mpy. -------- … multiply. (This last bit may be shifted out.) (IWIDTH+CWIDTH) valid output bits. Now, if the user wants to keep any extras of these (via OWIDTH), or if he wishes to arbitrarily shift some of these off (via SHIFT) we accomplish that here. |
rnd_left_i | wire [(OWIDTH-1):0] | Let's do some rounding and remove unnecessary bits. We have (IWIDTH+CWIDTH+3) bits here, we need to drop down to OWIDTH, and SHIFT by SHIFT bits in the process. The trick is that we don't need (IWIDTH+CWIDTH+3) bits. We've accumulated them, but the actual values will never fill all these bits. In particular, we only need: IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH bits for the coefficient multiply +1 bit for the add/subtract in the complex multiply ------ (IWIDTH+CWIDTH+2) bits at full precision. However, the coefficient multiply multiplied by a maximum value of 2^(CWIDTH-2). Thus, we only have IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH-2 bits for the coefficient multiply +1 (optional) bit for the add/subtract in the cpx mpy. -------- … multiply. (This last bit may be shifted out.) (IWIDTH+CWIDTH) valid output bits. Now, if the user wants to keep any extras of these (via OWIDTH), or if he wishes to arbitrarily shift some of these off (via SHIFT) we accomplish that here. |
rnd_right_r | wire [(OWIDTH-1):0] | Let's do some rounding and remove unnecessary bits. We have (IWIDTH+CWIDTH+3) bits here, we need to drop down to OWIDTH, and SHIFT by SHIFT bits in the process. The trick is that we don't need (IWIDTH+CWIDTH+3) bits. We've accumulated them, but the actual values will never fill all these bits. In particular, we only need: IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH bits for the coefficient multiply +1 bit for the add/subtract in the complex multiply ------ (IWIDTH+CWIDTH+2) bits at full precision. However, the coefficient multiply multiplied by a maximum value of 2^(CWIDTH-2). Thus, we only have IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH-2 bits for the coefficient multiply +1 (optional) bit for the add/subtract in the cpx mpy. -------- … multiply. (This last bit may be shifted out.) (IWIDTH+CWIDTH) valid output bits. Now, if the user wants to keep any extras of these (via OWIDTH), or if he wishes to arbitrarily shift some of these off (via SHIFT) we accomplish that here. |
rnd_right_i | wire [(OWIDTH-1):0] | Let's do some rounding and remove unnecessary bits. We have (IWIDTH+CWIDTH+3) bits here, we need to drop down to OWIDTH, and SHIFT by SHIFT bits in the process. The trick is that we don't need (IWIDTH+CWIDTH+3) bits. We've accumulated them, but the actual values will never fill all these bits. In particular, we only need: IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH bits for the coefficient multiply +1 bit for the add/subtract in the complex multiply ------ (IWIDTH+CWIDTH+2) bits at full precision. However, the coefficient multiply multiplied by a maximum value of 2^(CWIDTH-2). Thus, we only have IWIDTH bits for the input +1 bit for the add/subtract +CWIDTH-2 bits for the coefficient multiply +1 (optional) bit for the add/subtract in the cpx mpy. -------- … multiply. (This last bit may be shifted out.) (IWIDTH+CWIDTH) valid output bits. Now, if the user wants to keep any extras of these (via OWIDTH), or if he wishes to arbitrarily shift some of these off (via SHIFT) we accomplish that here. |
left_sr | wire [(CWIDTH+IWIDTH+3-1):0] | |
left_si | wire [(CWIDTH+IWIDTH+3-1):0] | |
aux_pipeline | reg [(AUXLEN-1):0] | |
f_startup_counter | reg [F_LGDEPTH:0] |
Constants
Name | Type | Value | Description |
---|---|---|---|
MXMPYBITS | +2 | Local/derived parameters that are calculated from the above params. Apart from algorithmic changes below, these should not be adjusted The first step is to calculate how many clocks it takes our multiply to come back with an answer within. The time in the multiply depends upon the input value with the fewest number of bits--to keep the pipeline depth short. So, let's find the fewest number of bits here. |
|
LCLDELAY | MPYDELAY | In an environment when CKPCE > 1, the multiply delay isn't necessarily the delay felt by this algorithm--measured in i_ce's. In particular, if the multiply can operate with more operations per clock, it can appear to finish "faster". Since most of the logic in this core operates on the slower clock, we'll need to map that speed into the number of slower clock ticks that it takes. | |
LGDELAY | |||
AUXLEN | undefined | ||
MPYREMAINDER | MPYDELAY - CKPCE | ||
F_LGDEPTH | |||
F_DEPTH | AUXLEN | ||
F_D | [F_LGDEPTH-1:0] | -1 |
Processes
- unnamed: ( @(posedge i_clk) )
Type: always
Description
Set up the input to the multiply
- unnamed: ( @(posedge i_clk) )
Type: always
- unnamed: ( @(posedge i_clk) )
Type: always
- unnamed: ( @(posedge i_clk) )
Type: always
- unnamed: ( @(posedge i_clk) )
Type: always
- unnamed: ( @(posedge i_clk) )
Type: always
- unnamed: ( @(posedge i_clk) )
Type: always
- unnamed: ( @(posedge i_clk) )
Type: always
- unnamed: ( @(posedge i_clk) )
Type: always
- unnamed: ( @(*) )
Type: always
- unnamed: ( @(*) )
Type: always
- unnamed: ( @(posedge i_clk) )
Type: always
- unnamed: ( @(*) )
Type: always
Description
Let's see if we can improve our performance at all by moving our test one clock earlier. If nothing else, it should help induction finish one (or more) clocks ealier than otherwise
- unnamed: ( @(*) )
Type: always
- unnamed: ( @(posedge i_clk) )
Type: always
Description
Induction helpers
- unnamed: ( @(*) )
Type: always
Instantiations
- do_rnd_left_r: convround
- do_rnd_left_i: convround
- do_rnd_right_r: convround
- do_rnd_right_i: convround