· 9 min read
Synchronous FIFO Design in SystemVerilog: From Naive Counter to Extra-Bit Pointer
- systemverilog
- fifo
- rtl-design
- fpga
- digital-design
Here's a number that should stop you: the "production-quality" FIFO design synthesizes at 329 MHz on sky130. The naive tutorial design hits 410 MHz. That's 81 MHz slower for the version everyone tells you to use in real designs.
The area is nearly identical: 5539 sq-um for the naive, 5640 for the extra-bit version. So you're paying more cells and losing clock frequency. That sounds like the extra-bit pointer trick is a fraud. It isn't. But the story of why that 81 MHz gap exists is the reason this article exists, and understanding it will make you a more careful RTL designer.
What a Synchronous FIFO Actually Is
A FIFO is a queue: data goes in one end and comes out the other in order. A synchronous FIFO is the simplest case: one clock drives both the write side and the read side. No clock-domain crossing, no gray code pointers, no metastability budget. You push data in with wr_en, pull it out with rd_en, and two flags (full, empty) tell you when to stop.
Under the hood you need four things: a memory array, a write pointer, a read pointer, and a way to tell when the pointers have caught up with each other (empty) or lapped each other (full). Everything else in a FIFO implementation is a decision about how you compute those last two conditions, and those decisions have real timing consequences.
One important boundary: this article covers single-clock FIFOs only. If your write clock and read clock are different frequencies or asynchronous to each other, you need a two-clock FIFO with gray code pointers and synchronizers. That's a separate problem.
The Naive Design: Carrying a Scoreboard You Don't Need
Most tutorials implement full/empty with a separate fill counter. It's intuitive. Write increments the count, read decrements it, full fires when count equals DEPTH, empty fires when count equals zero. Here's a complete implementation:
module fifo_sync_naive #(
parameter int DATA_W = 8,
parameter int DEPTH = 16
) (
input logic clk,
input logic rst_n,
input logic wr_en,
input logic rd_en,
input logic [DATA_W-1:0] din,
output logic [DATA_W-1:0] dout,
output logic full,
output logic empty
);
logic [DATA_W-1:0] mem [0:DEPTH-1];
localparam PTR_W = $clog2(DEPTH);
logic [PTR_W-1:0] wptr;
logic [PTR_W-1:0] rptr;
logic [$clog2(DEPTH):0] count;
assign full = (count == DEPTH[PTR_W:0]);
assign empty = (count == '0);
assign dout = mem[rptr]; // combinatorial read
always_ff @(posedge clk) begin
if (wr_en && !full)
mem[wptr] <= din;
end
always_ff @(posedge clk) begin
if (!rst_n) begin
wptr <= '0;
rptr <= '0;
count <= '0;
end else begin
if (wr_en && !full)
wptr <= (wptr == PTR_W'(DEPTH - 1)) ? '0 : wptr + 1'b1;
if (rd_en && !empty)
rptr <= (rptr == PTR_W'(DEPTH - 1)) ? '0 : rptr + 1'b1;
case ({wr_en && !full, rd_en && !empty})
2'b10: count <= count + 1'b1;
2'b01: count <= count - 1'b1;
default: count <= count;
endcase
end
end
endmodule
The count register is the problem. Every cycle, the synthesizer builds a +1/-1 counter with carry-chain logic — a small adder/subtractor. For a 16-deep FIFO with 4-bit pointers this is barely noticeable. At 1024 entries with 10-bit pointers you're putting a 10-bit carry chain on your critical path for every single clock cycle, even when nothing is moving. You're paying for arithmetic you never needed.
The second problem is the assign dout = mem[rptr] line. That's a combinatorial read: every time rptr changes, the output is updated through a memory mux before the next clock edge. The synthesis tool sees a path from the flip-flop driving rptr through the memory array to the dout output. That path doesn't benefit from pipelining, and it prevents BRAM inference on any FPGA that requires registered outputs from block RAM.
There's also a subtlety in the pointer wrap logic. If DEPTH is not a power of two, the comparator wptr == DEPTH - 1 turns into a multi-bit equality check rather than a free binary overflow. Non-power-of-two depths are supported in the naive design, but at a cost.
The Extra-Bit Pointer Trick
The insight is simple. Two pointers that start at zero and walk through the memory will be equal either when the FIFO is empty (they haven't diverged) or when the write pointer has lapped the read pointer exactly once (full). With plain N-bit pointers you can't tell the difference. Add one more bit — call it the wrap parity bit — and you can.
The rule:
- Empty:
wptr == rptr(all bits equal, including the MSB) - Full: lower N bits equal, MSBs differ
That's it. No subtractor, no counter, just XOR and XNOR gates. The full condition is two comparisons on a handful of bits; the empty condition is a single N-bit equality test. Both map to a few LUTs or standard cells, and neither creates a carry chain.
For this to work, DEPTH must be a power of two. With a power-of-two depth, the (N+1)-bit pointer wraps naturally when the address bits overflow. No comparator needed. If DEPTH is not a power of two, the pointer arithmetic gets complicated and the trick breaks. That's not a real constraint in practice; almost every FIFO you'll build is power-of-two depth for exactly this reason.
Here's the relevant section of the implementation:
localparam int ADDR_W = $clog2(DEPTH);
localparam int PTR_W = ADDR_W + 1; // extra wrap bit
logic [PTR_W-1:0] wptr;
logic [PTR_W-1:0] rptr;
// empty: all bits equal
assign empty = (wptr == rptr);
// full: address bits equal, wrap bits differ
assign full = (wptr[ADDR_W] != rptr[ADDR_W]) &&
(wptr[ADDR_W-1:0] == rptr[ADDR_W-1:0]);
// Pointer advances — natural binary overflow handles wrapping
always_ff @(posedge clk) begin
if (!rst_n) begin
wptr <= '0;
rptr <= '0;
end else begin
if (wr_en && !full) wptr <= wptr + 1'b1;
if (rd_en && !empty) rptr <= rptr + 1'b1;
end
end
The pointer update is now two lines. No case statement, no conditional wrap, no DEPTH comparison. The binary overflow at bit ADDR_W gives you the wrap parity for free.
The registered read output deserves its own mention:
always_ff @(posedge clk) begin
if (rd_en && !empty)
dout <= mem[rptr[ADDR_W-1:0]];
end
You've broken the combinatorial path from rptr to dout. The memory read now happens inside a clocked process, and synthesis tools on Xilinx and Intel FPGAs will infer a BRAM from this pattern. The tradeoff is one cycle of read latency — dout is valid the clock after you assert rd_en. For the vast majority of designs, that's perfectly acceptable, and the timing benefit is real.
Why the "Better" Design Is Actually Slower
Now for the counter-intuitive part. The sky130 synthesis numbers are:
| Design | Area (sq-um) | Cells | Fmax |
|---|---|---|---|
| fifo_sync_naive | 5539 | 735 | 410 MHz |
| fifo_sync_robust | 5640 | 824 | 329 MHz |
The extra-bit design has a 0.6 ns longer critical path (3.04 ns vs 2.44 ns) and tops out 81 MHz lower. If the extra-bit trick eliminates the fill-counter subtractor, why is the timing worse?
The answer is the almost-full and almost-empty flags.
Here's the fill count computation in the extra-bit design:
logic [CNT_W-1:0] fill_count;
assign fill_count = wptr - rptr; // wraps correctly with (ADDR_W+1)-bit arithmetic
assign almost_full = (fill_count >= CNT_W'(AF_THRESH));
assign almost_empty = (fill_count <= CNT_W'(AE_THRESH));
You've reintroduced a subtraction. The wptr - rptr expression is a subtractor, and the comparisons against AF_THRESH and AE_THRESH sit on top of it. The full/empty flags are now clean — two XOR/XNOR trees, no carry. But almost_full and almost_empty are a multi-bit subtract followed by a compare, which is precisely the arithmetic the extra-bit trick was designed to avoid.
The naive design has no almost flags at all. The timing contest is unfair because the two designs don't have the same feature set.
This is the real gotcha with "production-quality" FIFO designs. You eliminate one subtractor from the critical path, then add two threshold comparisons that bring a subtractor back through a different door. The extra-bit trick genuinely works — the full/empty path is cheaper, and it does reduce logic depth for those flags. But the moment you add almost flags, the synthesis tool finds the new longest path and that becomes your Fmax.
If you strip the almost flags out of the extra-bit design and measure again, the timing improves. The technique isn't wrong; the framing of "the robust design is faster" is oversimplified.
When to Use Each Approach
Use the naive fill-counter design when you need non-power-of-two depths, or when you're writing a quick testbench scaffold and don't want to think about parameterization constraints. It's also fine for small, non-timing-critical FIFOs — 16 entries in a slow control path won't come close to the 410 MHz limit regardless.
Use the extra-bit design when you're targeting BRAM inference on an FPGA. The registered output is the bigger win there; the pointer trick is a bonus. Also use it when almost-full and almost-empty flags matter for your backpressure logic — just be aware that those flags now set your Fmax, not the pointer arithmetic.
If you need almost flags and you need maximum speed, one option is to register the almost flag outputs. You pay one cycle of flag latency but break the subtractor-plus-comparator path. Set your thresholds one count early to compensate. This is exactly what the Xilinx FIFO Generator IP does internally.
For FIFO depths above 512 or data widths above 32 bits, the fill-counter subtractor in the naive design will start appearing in your timing reports. The extra-bit trick becomes more valuable as the pointer width grows.
If you're crossing clock domains, neither design here applies. Async FIFOs need gray code pointers and synchronizer chains — a substantially different architecture.
The Comparison in Full
| fifo_sync_naive | fifo_sync_robust | |
|---|---|---|
| Full/empty logic | Fill counter (subtractor) | XOR/XNOR (no carry) |
| Read output | Combinatorial | Registered (1-cycle latency) |
| BRAM-inferrable | No | Yes |
| Almost flags | None | almost_full, almost_empty |
| Non-power-of-two depth | Supported | Not supported |
| Synthesized area (sky130) | 5539 sq-um, 735 cells | 5640 sq-um, 824 cells |
| Fmax (sky130) | 410 MHz | 329 MHz |
The Fmax gap is almost entirely explained by the almost-flag subtraction, not by the core pointer technique. The extra-bit trick does what it claims for full and empty. The production features around it are where the timing cost lives.
Both files compile clean under iverilog -g2012 with no warnings. Want to synthesize these yourself and measure the numbers on a different target? Try it on Logicode.