Pipelining a 32-bit Multiplier in SystemVerilog: Latency, Throughput, and the Area-Speed Tradeoff
- systemverilog
- pipelining
- multiplier
- fpga
- digital-design
The combinational version tops out at 82 MHz on sky130. The 4-stage pipelined version hits 107 MHz. That 30% frequency gain costs you 24% more area and — this is the part people forget — the first result doesn't arrive until 4 cycles after you assert valid_i. Whether that tradeoff is a win depends entirely on what you're feeding the multiplier, and most tutorials don't show you both sides of the equation with real numbers. This one does.
A 32×32 unsigned multiplier is conceptually simple: generate 32 partial products, each a 64-bit shifted copy of the multiplicand gated on one bit of the multiplier, then sum them all. The trouble is that "sum them all" is not a single operation. It's a chain. The combinational module makes this structure explicit:
logic [63:0] pp [0:31];
logic [63:0] sum [0:31];
assign sum[0] = pp[0];
generate
for (genvar i = 1; i < 32; i++) begin : gen_sum
assign sum[i] = sum[i-1] + pp[i];
end
endgenerate
assign product = sum[31];
Each sum[i] depends on sum[i-1]. The synthesis tool can rebalance this into a carry-save adder tree, but it can't eliminate the fundamental depth: there are 32 partial products, and they all have to be accumulated before the output is valid. On sky130 standard cells, that chain produces a 12.16 ns critical path, which is why Fmax caps at 82 MHz. If your system clock is 100 MHz (a 10 ns period), this design doesn't close timing in a single cycle. A set_multicycle_path exception only helps if you can hold the inputs stable for two cycles, which halves your throughput.
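The excerpt above declares pp but doesn't show how it gets filled in. A minimal self-contained sketch of the whole module might look like the following. The module name mul32_comb_sketch, the port names, and the choice of a as multiplicand and b as multiplier are assumptions made for illustration, not the original source.

```systemverilog
// Sketch of a 32x32 unsigned array multiplier (names are illustrative).
module mul32_comb_sketch (
  input  logic [31:0] a,        // multiplicand (assumed role)
  input  logic [31:0] b,        // multiplier   (assumed role)
  output logic [63:0] product
);
  logic [63:0] pp  [0:31];      // partial products
  logic [63:0] sum [0:31];      // running sums

  generate
    // pp[i] is the multiplicand shifted left by i, gated on bit i of the multiplier.
    for (genvar i = 0; i < 32; i++) begin : gen_pp
      assign pp[i] = b[i] ? ({32'b0, a} << i) : 64'd0;
    end

    // Linear accumulation chain: each sum depends on the previous one.
    for (genvar i = 1; i < 32; i++) begin : gen_sum
      assign sum[i] = sum[i-1] + pp[i];
    end
  endgenerate

  assign sum[0]  = pp[0];
  assign product = sum[31];
endmodule
```

Written this way the serial dependency is visible in the source, even though the synthesizer is free to restructure the adders.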
The critical path in an array multiplier is O(N) in the number of partial products — double the word width to 64 bits and you're roughly doubling the depth again. A Wallace tree reduces that to O(log N) by using carry-save adders to compress the partial products in parallel, but it's substantially more complex to write and less legible than what we're looking at here. For a 32-bit multiplier where the goal is to understand the tradeoff, the array structure is the right starting point.
The 4-stage pipeline breaks that O(N) depth by cutting the work into four equal slices. Each stage handles 8 partial products, accumulates them into a running sum, and registers the result before handing it to the next stage. The critical path in each stage is the time to add 8 partial products (plus the incoming running sum in stages 2 through 4) and meet flip-flop setup. That is nominally about a quarter of the combinational chain, which is why the synthesizer can run it faster, though the measured gain reported below is well short of 4×.
always_ff @(posedge clk) begin
// Stage 1: accumulate pp[0..7]
s1_acc <= pp_term(a, b, 0) + pp_term(a, b, 1) + ... + pp_term(a, b, 7);
s1_a <= a;
s1_b <= b;
s1_valid <= valid_i;
end
always_ff @(posedge clk) begin
// Stage 2: add pp[8..15] to the running sum
s2_acc <= s1_acc + pp_term(s1_a, s1_b, 8) + ... + pp_term(s1_a, s1_b, 15);
s2_a <= s1_a;
s2_b <= s1_b;
s2_valid <= s1_valid;
end
Stages 3 and 4 follow the same pattern, finishing with pp[16..23] and pp[24..31] respectively. The synthesis result: 9.32 ns critical path, 107 MHz Fmax. The pipeline registers cost 1,303 additional cells over the combinational version (11,631 vs 10,328), and total area grows from 33,965 to 42,203 area units — that 24% increase is almost entirely flip-flops.
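The listing elides stages 3 and 4 and never shows pp_term. One way to fill in the gaps is sketched below. The module name mul32_pipe4_sketch, the valid_o port, the pp_sum8 helper, and the body of pp_term are all assumptions for illustration; the real module may be organized differently. Like the excerpt, the sketch has no reset.

```systemverilog
// Sketch of a 4-stage pipelined 32x32 unsigned multiplier (names are illustrative).
module mul32_pipe4_sketch (
  input  logic        clk,
  input  logic        valid_i,
  input  logic [31:0] a,
  input  logic [31:0] b,
  output logic        valid_o,   // assumed output name
  output logic [63:0] product
);
  // Assumed definition: a_in shifted left by idx, gated on bit idx of b_in.
  function automatic logic [63:0] pp_term(input logic [31:0] a_in,
                                          input logic [31:0] b_in,
                                          input int          idx);
    return b_in[idx] ? ({32'b0, a_in} << idx) : 64'd0;
  endfunction

  // Hypothetical helper: sum of the 8 partial products starting at index base.
  function automatic logic [63:0] pp_sum8(input logic [31:0] a_in,
                                          input logic [31:0] b_in,
                                          input int          base);
    logic [63:0] acc;
    acc = '0;
    for (int k = 0; k < 8; k++) acc += pp_term(a_in, b_in, base + k);
    return acc;
  endfunction

  logic [63:0] s1_acc, s2_acc, s3_acc;
  logic [31:0] s1_a, s1_b, s2_a, s2_b, s3_a, s3_b;
  logic        s1_valid, s2_valid, s3_valid;

  always_ff @(posedge clk) begin
    // Stage 1: pp[0..7]
    s1_acc <= pp_sum8(a, b, 0);
    s1_a <= a;    s1_b <= b;    s1_valid <= valid_i;
    // Stage 2: pp[8..15]
    s2_acc <= s1_acc + pp_sum8(s1_a, s1_b, 8);
    s2_a <= s1_a; s2_b <= s1_b; s2_valid <= s1_valid;
    // Stage 3: pp[16..23]
    s3_acc <= s2_acc + pp_sum8(s2_a, s2_b, 16);
    s3_a <= s2_a; s3_b <= s2_b; s3_valid <= s2_valid;
    // Stage 4: pp[24..31]; operands are not forwarded past this point.
    product <= s3_acc + pp_sum8(s3_a, s3_b, 24);
    valid_o <= s3_valid;
  end
endmodule
```

With four register stages between the inputs and product, a pair presented on one clock edge shows up at the output four edges later, matching the 4-cycle latency quoted for mul32_pipe4.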
The thing in this code that trips people up the first time: s1_a and s1_b, copies of the inputs, are forwarded through every pipeline stage. Stage 2 uses s1_a and s1_b, not the original a and b, because by the time stage 2 fires, a and b already hold the next input pair. Each stage computes its partial products fresh from the sideband-registered operands, which keeps the logic balanced across stages but means you're also registering 64 bits of operand state per stage (32 bits each for a and b). That's 64 bits × 3 forwarding stages = 192 flip-flops just for sideband bookkeeping, before you count the 64-bit accumulators. If you computed all 32 partial products in stage 1 and forwarded only the sum, you'd eliminate the sidebands but blow up the combinational depth of stage 1 and defeat the purpose. I've watched people try it, hit 11 ns on the first stage, and wonder why they didn't get the full speedup.
The measured numbers side by side:
| Variant | Fmax (sky130) | Critical Path | Area Units | Cells | Latency |
|---|---|---|---|---|---|
| mul32_comb | 82 MHz | 12.16 ns | 33,965 | 10,328 | 0 cycles |
| mul32_pipe4 | 107 MHz | 9.32 ns | 42,203 | 11,631 | 4 cycles |
30% more frequency, 24% more area. Whether those numbers represent a good trade is the actual question.
Once the pipeline is full, mul32_pipe4 produces one result per clock cycle at 107 MHz. mul32_comb produces one result per clock cycle at 82 MHz — assuming you're clocking it at 82 MHz and the rest of your design is fine with that. So for a steady stream of independent multiplications, the pipelined version is strictly faster.
But the pipeline is not full on the first four cycles. For a batch of K independent multiplications, the total wall-clock time is:
- Combinational: K × T_comb = K × (1 / 82 MHz) = K × 12.2 ns
- Pipelined: (4 + K − 1) × T_pipe = (K + 3) × (1 / 107 MHz) = (K + 3) × 9.3 ns
For K = 1: combinational takes 12.2 ns, pipelined takes 37.2 ns. The pipeline is 3× slower for a single multiply. For K = 10: combinational takes 122 ns, pipelined takes 121 ns, essentially a wash. For K = 100: combinational takes 1,220 ns, pipelined takes 958 ns, a genuine wall-clock win of about 21%. For K = 1,000: the pipeline wins by about 23.5%, and the gain keeps creeping toward the limit set by the ratio of the two clock periods. The breakeven is right around K = 10 independent operations.
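Setting the two wall-clock expressions equal gives the breakeven batch size in closed form. This is a sketch using the rounded periods from the list above (12.2 ns and 9.3 ns); the symbols D for pipeline depth and K* for the breakeven point are introduced here for illustration.

$$
K \, T_{\text{comb}} = (K + D - 1)\, T_{\text{pipe}}
\quad\Longrightarrow\quad
K^{*} = \frac{(D - 1)\, T_{\text{pipe}}}{T_{\text{comb}} - T_{\text{pipe}}}
= \frac{3 \times 9.3}{12.2 - 9.3} \approx 9.6
$$

So with D = 4 the pipelined version pulls ahead from K = 10 onward, and for large K the relative saving approaches 1 − T_pipe / T_comb = 1 − 9.3 / 12.2 ≈ 24%.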
You're paying 4 cycles up front, and you only earn them back if you keep the pipeline fed. If your use case is "multiply a coefficient by a sample, once, then wait for the result before doing anything else," the pipeline is actively harmful — you've added latency and gotten nothing back for it.
There are three situations where I'd keep the combinational version and not touch the pipeline.
Single-shot or bursty multiplications are the first: configuration-time coefficient scaling, CRC seed computation, anything where you fire one or a handful of operations and then go idle. The 4-cycle bubble isn't buying you anything because throughput isn't the bottleneck; your idle time dwarfs the compute time.
Feedback loops are the second. If your design does iterative convergence — Newton-Raphson for division, CORDIC for trigonometry, any loop where the result of cycle N feeds back as an input to cycle N+1 — a 4-stage pipeline adds 4 cycles of latency to every iteration. A 10-step Newton-Raphson convergence that takes 10 cycles combinationally now takes 40+ cycles pipelined. That's not a timing fix; that's a throughput disaster.
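As a rough worked example of that cost, assume one dependent multiply per iteration (a simplification, but it is what the 10-cycle figure implies) and the rounded clock periods from the synthesis results:

$$
t_{\text{comb}} = 10 \times 12.2\ \text{ns} = 122\ \text{ns},
\qquad
t_{\text{pipe}} = 10 \times 4 \times 9.3\ \text{ns} = 372\ \text{ns}
$$

Under those assumptions the pipelined loop is about 3× slower in wall-clock time despite the higher Fmax, because three of every four pipeline slots sit empty while the loop waits on its own result.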
The third case is the one that makes both modules in this article somewhat academic for FPGA work: DSP blocks. A Xilinx DSP48E1 implements a 25×18 signed multiply in hardware, with optional pipeline registers on the inputs, the multiplier output, and the final output already baked in, running at frequencies well above what either of our soft-logic modules can reach. If your target has DSP blocks and your multiply fits within 25×18, using LUT-based soft multipliers wastes both LUTs and performance. The synthesis tool will usually infer DSP blocks from the * operator anyway, but it's worth confirming in the utilization report that you're actually getting DSP primitives and not a soft implementation.
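A minimal sketch of the kind of code that synthesis tools can typically map onto a DSP block is shown below: a signed 25×18 multiply written with the * operator and wrapped in three register stages (input, multiplier, output). The module and signal names are made up for illustration, and whether a given tool actually packs these registers into the DSP primitive depends on the tool and its settings, so treat this as a starting point to check against the utilization report.

```systemverilog
// Sketch: registered signed 25x18 multiply intended for DSP inference.
// Names are illustrative; mapping to a DSP primitive is tool-dependent.
module mul25x18_dsp_sketch (
  input  logic               clk,
  input  logic signed [24:0] a,
  input  logic signed [17:0] b,
  output logic signed [42:0] p    // 25 + 18 = 43-bit full-precision product
);
  logic signed [24:0] a_q;        // input register for a
  logic signed [17:0] b_q;        // input register for b
  logic signed [42:0] m_q;        // multiplier output register

  always_ff @(posedge clk) begin
    a_q <= a;
    b_q <= b;
    m_q <= a_q * b_q;             // signed multiply, sized by the 43-bit target
    p   <= m_q;                   // output register
  end
endmodule
```

The result appears three clock edges after the operands are presented.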
For cases wider than the DSP can handle — 32×32 signed, 64-bit accumulation — you can build a split-multiply structure using multiple DSP48 blocks, or fall back to a soft implementation like the ones shown here. The MDPI (2016) paper on array multipliers for Xilinx FPGAs found that a pipelined soft multiplier can use 42–52% fewer LUTs than Xilinx's own LogiCORE IP for the same word width, which is a reasonable argument for rolling your own when resource budget is tight.
My decision rule: if Fmax is the constraint and I'm processing a stream of independent operand pairs, pipeline the multiplier. If the design is latency-sensitive, if there's feedback, or if DSP blocks are available and the operand widths fit, the soft-logic pipeline isn't worth the area. The sky130 numbers above give you the concrete data to plug into the breakeven formula rather than guessing.
If you want to synthesize both modules yourself and watch the timing reports, you can try them on Logicode.