June 13, 2026 · 9 min read

Synchronous FIFO Design in SystemVerilog: From Naive Counter to Extra-Bit Pointer

systemverilog
fifo
rtl-design
fpga
digital-design

Here's a number that should stop you: the "production-quality" FIFO design synthesizes at 329 MHz on sky130. The naive tutorial design hits 410 MHz. That's 81 MHz slower for the version everyone tells you to use in real designs.

The area is nearly identical: 5539 sq-um for the naive, 5640 for the extra-bit version. So you're paying more cells and losing clock frequency. That sounds like the extra-bit pointer trick is a fraud. It isn't. But the story of why that 81 MHz gap exists is the reason this article exists, and understanding it will make you a more careful RTL designer.

What a Synchronous FIFO Actually Is

A FIFO is a queue: data goes in one end and comes out the other in order. A synchronous FIFO is the simplest case: one clock drives both the write side and the read side. No clock-domain crossing, no gray code pointers, no metastability budget. You push data in with wr_en, pull it out with rd_en, and two flags (full, empty) tell you when to stop.

Under the hood you need four things: a memory array, a write pointer, a read pointer, and a way to tell when the pointers have caught up with each other (empty) or lapped each other (full). Everything else in a FIFO implementation is a decision about how you compute those last two conditions, and those decisions have real timing consequences.

One important boundary: this article covers single-clock FIFOs only. If your write clock and read clock are different frequencies or asynchronous to each other, you need a two-clock FIFO with gray code pointers and synchronizers. That's a separate problem.

The Naive Design: Carrying a Scoreboard You Don't Need

Most tutorials implement full/empty with a separate fill counter. It's intuitive. Write increments the count, read decrements it, full fires when count equals DEPTH, empty fires when count equals zero. Here's a complete implementation:

module fifo_sync_naive #(
    parameter int DATA_W = 8,
    parameter int DEPTH  = 16
) (
    input  logic             clk,
    input  logic             rst_n,
    input  logic             wr_en,
    input  logic             rd_en,
    input  logic [DATA_W-1:0] din,
    output logic [DATA_W-1:0] dout,
    output logic             full,
    output logic             empty
);

    logic [DATA_W-1:0] mem [0:DEPTH-1];

    localparam PTR_W = $clog2(DEPTH);

    logic [PTR_W-1:0] wptr;
    logic [PTR_W-1:0] rptr;
    logic [$clog2(DEPTH):0] count;

    assign full  = (count == DEPTH[PTR_W:0]);
    assign empty = (count == '0);
    assign dout  = mem[rptr];   // combinatorial read

    always_ff @(posedge clk) begin
        if (wr_en && !full)
            mem[wptr] <= din;
    end

    always_ff @(posedge clk) begin
        if (!rst_n) begin
            wptr  <= '0;
            rptr  <= '0;
            count <= '0;
        end else begin
            if (wr_en && !full)
                wptr <= (wptr == PTR_W'(DEPTH - 1)) ? '0 : wptr + 1'b1;
            if (rd_en && !empty)
                rptr <= (rptr == PTR_W'(DEPTH - 1)) ? '0 : rptr + 1'b1;
            case ({wr_en && !full, rd_en && !empty})
                2'b10:   count <= count + 1'b1;
                2'b01:   count <= count - 1'b1;
                default: count <= count;
            endcase
        end
    end

endmodule

The count register is the problem. Every cycle, the synthesizer builds a +1/-1 counter with carry-chain logic — a small adder/subtractor. For a 16-deep FIFO with 4-bit pointers this is barely noticeable. At 1024 entries with 10-bit pointers you're putting a 10-bit carry chain on your critical path for every single clock cycle, even when nothing is moving. You're paying for arithmetic you never needed.

The second problem is the assign dout = mem[rptr] line. That's a combinatorial read: every time rptr changes, the output is updated through a memory mux before the next clock edge. The synthesis tool sees a path from the flip-flop driving rptr through the memory array to the dout output. That path doesn't benefit from pipelining, and it prevents BRAM inference on any FPGA that requires registered outputs from block RAM.

There's also a subtlety in the pointer wrap logic. If DEPTH is not a power of two, the comparator wptr == DEPTH - 1 turns into a multi-bit equality check rather than a free binary overflow. Non-power-of-two depths are supported in the naive design, but at a cost.

The Extra-Bit Pointer Trick

The insight is simple. Two pointers that start at zero and walk through the memory will be equal either when the FIFO is empty (they haven't diverged) or when the write pointer has lapped the read pointer exactly once (full). With plain N-bit pointers you can't tell the difference. Add one more bit — call it the wrap parity bit — and you can.

The rule:

Empty: wptr == rptr (all bits equal, including the MSB)
Full: lower N bits equal, MSBs differ

That's it. No subtractor, no counter, just XOR and XNOR gates. The full condition is two comparisons on a handful of bits; the empty condition is a single N-bit equality test. Both map to a few LUTs or standard cells, and neither creates a carry chain.

For this to work, DEPTH must be a power of two. With a power-of-two depth, the (N+1)-bit pointer wraps naturally when the address bits overflow. No comparator needed. If DEPTH is not a power of two, the pointer arithmetic gets complicated and the trick breaks. That's not a real constraint in practice; almost every FIFO you'll build is power-of-two depth for exactly this reason.

Here's the relevant section of the implementation:

localparam int ADDR_W = $clog2(DEPTH);
localparam int PTR_W  = ADDR_W + 1;      // extra wrap bit

logic [PTR_W-1:0] wptr;
logic [PTR_W-1:0] rptr;

// empty: all bits equal
assign empty = (wptr == rptr);

// full: address bits equal, wrap bits differ
assign full  = (wptr[ADDR_W]     != rptr[ADDR_W]) &&
               (wptr[ADDR_W-1:0] == rptr[ADDR_W-1:0]);

// Pointer advances — natural binary overflow handles wrapping
always_ff @(posedge clk) begin
    if (!rst_n) begin
        wptr <= '0;
        rptr <= '0;
    end else begin
        if (wr_en && !full)   wptr <= wptr + 1'b1;
        if (rd_en && !empty)  rptr <= rptr + 1'b1;
    end
end

The pointer update is now two lines. No case statement, no conditional wrap, no DEPTH comparison. The binary overflow at bit ADDR_W gives you the wrap parity for free.

The registered read output deserves its own mention:

always_ff @(posedge clk) begin
    if (rd_en && !empty)
        dout <= mem[rptr[ADDR_W-1:0]];
end

You've broken the combinatorial path from rptr to dout. The memory read now happens inside a clocked process, and synthesis tools on Xilinx and Intel FPGAs will infer a BRAM from this pattern. The tradeoff is one cycle of read latency — dout is valid the clock after you assert rd_en. For the vast majority of designs, that's perfectly acceptable, and the timing benefit is real.

Why the "Better" Design Is Actually Slower

Now for the counter-intuitive part. The sky130 synthesis numbers are:

Design	Area (sq-um)	Cells	Fmax
fifo_sync_naive	5539	735	410 MHz
fifo_sync_robust	5640	824	329 MHz

The extra-bit design has a 0.6 ns longer critical path (3.04 ns vs 2.44 ns) and tops out 81 MHz lower. If the extra-bit trick eliminates the fill-counter subtractor, why is the timing worse?

The answer is the almost-full and almost-empty flags.

Here's the fill count computation in the extra-bit design:

logic [CNT_W-1:0] fill_count;
assign fill_count = wptr - rptr;   // wraps correctly with (ADDR_W+1)-bit arithmetic

assign almost_full  = (fill_count >= CNT_W'(AF_THRESH));
assign almost_empty = (fill_count <= CNT_W'(AE_THRESH));

You've reintroduced a subtraction. The wptr - rptr expression is a subtractor, and the comparisons against AF_THRESH and AE_THRESH sit on top of it. The full/empty flags are now clean — two XOR/XNOR trees, no carry. But almost_full and almost_empty are a multi-bit subtract followed by a compare, which is precisely the arithmetic the extra-bit trick was designed to avoid.

The naive design has no almost flags at all. The timing contest is unfair because the two designs don't have the same feature set.

This is the real gotcha with "production-quality" FIFO designs. You eliminate one subtractor from the critical path, then add two threshold comparisons that bring a subtractor back through a different door. The extra-bit trick genuinely works — the full/empty path is cheaper, and it does reduce logic depth for those flags. But the moment you add almost flags, the synthesis tool finds the new longest path and that becomes your Fmax.

If you strip the almost flags out of the extra-bit design and measure again, the timing improves. The technique isn't wrong; the framing of "the robust design is faster" is oversimplified.

When to Use Each Approach

Use the naive fill-counter design when you need non-power-of-two depths, or when you're writing a quick testbench scaffold and don't want to think about parameterization constraints. It's also fine for small, non-timing-critical FIFOs — 16 entries in a slow control path won't come close to the 410 MHz limit regardless.

Use the extra-bit design when you're targeting BRAM inference on an FPGA. The registered output is the bigger win there; the pointer trick is a bonus. Also use it when almost-full and almost-empty flags matter for your backpressure logic — just be aware that those flags now set your Fmax, not the pointer arithmetic.

If you need almost flags and you need maximum speed, one option is to register the almost flag outputs. You pay one cycle of flag latency but break the subtractor-plus-comparator path. Set your thresholds one count early to compensate. This is exactly what the Xilinx FIFO Generator IP does internally.

For FIFO depths above 512 or data widths above 32 bits, the fill-counter subtractor in the naive design will start appearing in your timing reports. The extra-bit trick becomes more valuable as the pointer width grows.

If you're crossing clock domains, neither design here applies. Async FIFOs need gray code pointers and synchronizer chains — a substantially different architecture.

The Comparison in Full

	fifo_sync_naive	fifo_sync_robust
Full/empty logic	Fill counter (subtractor)	XOR/XNOR (no carry)
Read output	Combinatorial	Registered (1-cycle latency)
BRAM-inferrable	No	Yes
Almost flags	None	almost_full, almost_empty
Non-power-of-two depth	Supported	Not supported
Synthesized area (sky130)	5539 sq-um, 735 cells	5640 sq-um, 824 cells
Fmax (sky130)	410 MHz	329 MHz

The Fmax gap is almost entirely explained by the almost-flag subtraction, not by the core pointer technique. The extra-bit trick does what it claims for full and empty. The production features around it are where the timing cost lives.

Both files compile clean under iverilog -g2012 with no warnings. Want to synthesize these yourself and measure the numbers on a different target? Try it on Logicode.