Clock Domain Crossing in SystemVerilog: 2-FF Synchronizer, Async FIFO, and the Multi-Bit Trap

  • clock-domain-crossing
  • cdc
  • systemverilog
  • async-fifo
  • metastability
  • fpga

Every RTL designer learns the 2-FF synchronizer early. What they don't learn — until something fails in silicon — is that naively applying it to a multi-bit bus produces values that never existed in the source domain. The synchronizer looks correct. Simulation often misses it. The bug ships.

I want to fix that gap here: not just what the 2-FF synchronizer is, but where exactly it breaks and what to reach for instead. I'll show you the full picture — synthesized on real sky130 standard cells, not just described — so you can make an actual design decision rather than cargo-culting.

The problem: a wire is not a signal

When a flip-flop output in clock domain A feeds a flip-flop input in clock domain B, the setup-and-hold relationship that the destination FF depends on doesn't exist. The clocks are unrelated. The data could arrive at any point in the destination clock period, including right at the edge. When that happens the output of the destination FF doesn't go cleanly to 0 or 1 — it oscillates, or sits at an intermediate voltage, until thermal noise eventually kicks it one way or the other. This is metastability.

The probability that a FF remains metastable past a given settling time decays exponentially, which gives you an MTBF formula of the form:

MTBF = exp(Ts / τ) / (Tw × Fc × Fd)

where Ts is the settling time available (destination clock period minus the combinational delay to the next FF), τ and Tw are device-specific constants your fab or vendor characterizes for you, Fc is the destination clock frequency, and Fd is the data transition frequency. The exponential is doing the heavy lifting: double your settling time, and MTBF jumps by orders of magnitude. This is why a two-stage synchronizer works — the first FF is allowed to metastabilize, the second FF samples the output of the first a full destination-clock period later, by which time the probability of still being metastable is negligibly small.
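To get a feel for how steep that exponential is, here is a small simulation-only sketch that evaluates the formula for one and two stages. Every number in it (τ, Tw, the clock and data rates, and the 0.8 ns of slack in the first stage) is an illustrative assumption, not a characterized value for sky130 or any other process.

```systemverilog
// Simulation-only sketch: evaluates the MTBF formula from the text.
// All constants are illustrative assumptions, not characterized data.
module mtbf_sketch;
    // Hypothetical device constants
    localparam real TAU = 50.0e-12;   // resolution time constant (s)
    localparam real TW  = 100.0e-12;  // metastability window (s)
    // Hypothetical operating point
    localparam real FC  = 1.0e9;      // destination clock frequency (Hz)
    localparam real FD  = 100.0e6;    // data transition rate (Hz)

    // MTBF in seconds for a given settling time ts (seconds)
    function automatic real mtbf(input real ts);
        return $exp(ts / TAU) / (TW * FC * FD);
    endfunction

    initial begin
        real one_stage, two_stage;
        one_stage = mtbf(0.8e-9);            // slack left within a single cycle
        two_stage = mtbf(0.8e-9 + 1.0e-9);   // plus one full 1 ns clock period
        $display("settling 0.8 ns: MTBF = %e s", one_stage);
        $display("settling 1.8 ns: MTBF = %e s", two_stage);
        $display("improvement factor     = %e", two_stage / one_stage);
    end
endmodule
```

With these assumed numbers the single-stage MTBF comes out at under a second, and the extra clock period multiplies it by exp(20), roughly 4.85e8, which pushes it to the order of a decade. The absolute values mean nothing without real τ and Tw figures; the ratio is the point.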

The 2-FF synchronizer done right

Here's the RTL I synthesized:

module cdc_sync_2ff #(
    parameter int STAGES = 2  // 2 = standard; 3 = very-high-speed or critical path
) (
    input  logic clk_dst,   // destination clock
    input  logic rst_dst_n, // async active-low reset in destination domain
    input  logic data_src,  // asynchronous input from source domain
    output logic data_dst   // synchronised output in destination domain
);

    // Synthesis attribute: keeps the FFs as distinct, adjacent cells. It does not
    // remove the crossing from timing analysis; that takes a separate constraint.
    // Xilinx Vivado recognises ASYNC_REG; Quartus uses SYNCHRONIZER_IDENTIFICATION.
    (* ASYNC_REG = "TRUE" *) logic [STAGES-1:0] sync_ff;

    always_ff @(posedge clk_dst or negedge rst_dst_n) begin
        if (!rst_dst_n) begin
            sync_ff <= '0;
        end else begin
            // Shift register: first stage may go metastable;
            // each subsequent stage gives one full clock period to resolve.
            sync_ff <= {sync_ff[STAGES-2:0], data_src};
        end
    end

    assign data_dst = sync_ff[STAGES-1];

endmodule

The (* ASYNC_REG = "TRUE" *) attribute is not optional. Without it, Vivado is free to replicate, merge, or retime the synchronizer FFs as part of general optimization, or to place them far apart, and any of those eats into the resolution time the MTBF calculation assumes. The attribute does two things: it keeps the FFs as distinct cells that optimization will not touch, and it places them physically adjacent so that nearly the whole destination clock period is available for the first stage to settle. What it does not do is remove the launch-to-FF1 path from timing analysis.

That exclusion is the other half of the requirement, and it has to come from a timing constraint (Vivado XDC syntax shown here):

set_false_path -from [get_clocks clk_src] -to [get_cells -hierarchical -filter {ASYNC_REG == TRUE}]

or, more coarsely, set_clock_groups -asynchronous -group {clk_src} -group {clk_dst}, which cuts every path between the two clocks in both directions rather than just the synchronizer inputs. Without one of these, the timing engine tries to close timing on an intentionally asynchronous path and reports a massive violation. Then either you ignore it (dangerous, because real violations hide in the noise) or the tool burns effort and distorts placement trying to fix a path that cannot be fixed. The attribute and the constraint are a pair; one without the other is incomplete.

Sky130 synthesis on this module: 55.1 area units, 16 cells, 3030 MHz Fmax, 0.33 ns critical path. Tiny and fast, which makes it tempting to reach for it on every signal you need to cross.

The multi-bit trap

Here's the failure mode in concrete terms. Say you have an 8-bit configuration register in the source domain. Bit 7 goes from 0 to 1, and bit 0 goes from 1 to 0, in the same source clock cycle. You wire each bit through its own 2-FF synchronizer into the destination domain. Each of those eight synchronizer chains is independent. Each one resolves its metastability independently.

This means bit 7's first-stage FF might capture the new value on a given destination clock edge, so its second stage presents the new value one cycle later. Bit 0's first-stage FF, whose input arrives slightly later because of routing skew, or which goes metastable on that edge and resolves back to the old value, does not capture the new value until the following edge. For a destination clock cycle, the destination domain sees bit 7 = 1 and bit 0 = 1 simultaneously, a combination that never existed in the source domain. If that combination represents an illegal state in a protocol or an FSM, you've just silently corrupted your design.

The insidious part is how rarely this shows up in simulation. Your functional testbench runs with clocks in a known phase relationship, and typical simulators don't model metastability at all — the bit either resolves fast or slow based on what the simulation gods decide, and if your test vectors don't happen to trigger the multi-cycle capture window, you never see the bogus intermediate value.

The rule Cliff Cummings documented in SNUG 2002 is unambiguous: a 2-FF synchronizer is only safe for single-bit signals whose source is stable for at least one full destination clock period around the transition. Configuration registers, data words, addresses — none of these qualify.
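The trap can be forced into view in an ordinary simulator by injecting the skew by hand. The testbench below is a minimal sketch of that idea: the 1 ns and 3 ns per-bit delays are hypothetical stand-ins for routing skew and uneven resolution, and the module name is mine. It does not model metastability itself, only the incoherent capture that results from it.

```systemverilog
`timescale 1ns/1ps
// Simulation-only sketch of the multi-bit trap. The per-bit delays are
// hypothetical stand-ins for routing skew and uneven resolution.
module tb_multibit_trap;
    logic clk_dst = 1'b0;
    always #5 clk_dst = ~clk_dst;          // rising edges at 5, 15, 25, 35 ns ...

    logic [7:0] src_bus = 8'h01;           // old value: bit 7 = 0, bit 0 = 1
    wire  [7:0] skewed;                    // src_bus as seen at the synchronizer inputs
    logic [7:0] s1, s2;                    // one 2-FF synchronizer per bit

    assign #1 skewed[7:1] = src_bus[7:1];  // bits 7..1 arrive after 1 ns
    assign #3 skewed[0]   = src_bus[0];    // bit 0 arrives after 3 ns

    always_ff @(posedge clk_dst) begin
        s1 <= skewed;
        s2 <= s1;
    end

    // Report any destination value that the source never held.
    always @(s2)
        if (!$isunknown(s2) && !(s2 inside {8'h01, 8'h80}))
            $display("%0d ns: s2 = 0x%02h, never present in the source domain",
                     $time, s2);

    initial begin
        #23 src_bus = 8'h80;               // new value: bit 7 = 1, bit 0 = 0
        #40 $finish;
    end
endmodule
```

The source change at 23 ns is placed so that the 25 ns destination edge falls between the two arrival times. The first stage captures bit 7 new and bit 0 old, and one cycle later s2 holds 0x81 for a full clock period before settling to 0x80. Move the change a few nanoseconds either way and the check stays silent, which is exactly why a regression suite can miss it.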

The four-phase handshake

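For infrequent multi-bit transfers (a configuration register written once at startup, a command word sent every few milliseconds) the four-phase handshake extends the 2-FF approach correctly. The source holds the data stable, asserts a req signal, and waits for ack to come back. The destination passes req through a 2-FF synchronizer, reads the data only once it sees req high, by which point the data has long since settled, and then asserts ack. The source passes ack through its own 2-FF synchronizer and deasserts req; the destination sees req fall and deasserts ack, and only when the source sees ack low again can the next transfer start. Four signal transitions cross the boundary, so each transfer pays four one-way synchronizer latencies, two in each direction.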

The latency adds up — easily 8 to 10 destination clock cycles per transfer — but for truly infrequent control signals it's fine. The data is guaranteed stable across the entire transfer window because the protocol enforces it. You don't need Gray code, you don't need an extra buffer.
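Here is a minimal sketch of that protocol, assuming level-based req/ack and a single held data register. The module and port names (cdc_handshake_sketch, send, busy, valid) are my own for illustration and do not come from any library, and this version has not been through the sky130 flow.

```systemverilog
// Sketch of a four-phase req/ack handshake. Names are hypothetical.
module cdc_handshake_sketch #(
    parameter int WIDTH = 8
) (
    // Source domain
    input  logic             clk_src,
    input  logic             rst_src_n,
    input  logic             send,      // request a transfer (ignored while busy)
    input  logic [WIDTH-1:0] data_in,
    output logic             busy,
    // Destination domain
    input  logic             clk_dst,
    input  logic             rst_dst_n,
    output logic [WIDTH-1:0] data_out,
    output logic             valid      // one-cycle pulse when data_out updates
);
    logic [WIDTH-1:0] data_hold;        // held stable for the whole transfer
    logic             req, ack;

    (* ASYNC_REG = "TRUE" *) logic [1:0] req_sync;  // req into clk_dst
    (* ASYNC_REG = "TRUE" *) logic [1:0] ack_sync;  // ack into clk_src

    // Source side: raise req with the data, drop it once ack is seen.
    always_ff @(posedge clk_src or negedge rst_src_n) begin
        if (!rst_src_n) begin
            ack_sync  <= '0;
            req       <= 1'b0;
            data_hold <= '0;
        end else begin
            ack_sync <= {ack_sync[0], ack};
            if (!req && !ack_sync[1] && send) begin
                data_hold <= data_in;   // phase 1: capture data, assert req
                req       <= 1'b1;
            end else if (req && ack_sync[1]) begin
                req       <= 1'b0;      // phase 3: ack seen, deassert req
            end
        end
    end

    assign busy = req | ack_sync[1];    // idle only after ack has returned low

    // Destination side: sample data once per req, then acknowledge.
    always_ff @(posedge clk_dst or negedge rst_dst_n) begin
        if (!rst_dst_n) begin
            req_sync <= '0;
            ack      <= 1'b0;
            data_out <= '0;
            valid    <= 1'b0;
        end else begin
            req_sync <= {req_sync[0], req};
            valid    <= 1'b0;
            if (req_sync[1] && !ack) begin
                data_out <= data_hold;  // phase 2: data_hold is not changing here
                ack      <= 1'b1;
                valid    <= 1'b1;
            end else if (!req_sync[1] && ack) begin
                ack      <= 1'b0;       // phase 4: req gone, deassert ack
            end
        end
    end
endmodule
```

Note that data_hold crosses the boundary with no synchronizer at all. That is deliberate: it is only sampled while req is known to be high, and the source cannot change it again until ack has gone all the way around. The data path still needs its own timing exception, like any other crossing.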

If your data rate is at all substantial, though, the handshake becomes a bottleneck fast. That's where the async FIFO comes in.

Async FIFO with Gray-code pointers

The async FIFO solves the multi-bit problem by never synchronizing the data at all — only the pointers cross the clock boundary, and only after they're converted to Gray code.

Gray code is the key insight. In standard binary, an increment from 3 (011) to 4 (100) flips three bits simultaneously. If you try to synchronize that three-bit transition with a 2-FF synchronizer, you're back in multi-bit trap territory — the destination can capture any of several intermediate values. In Gray code, any increment changes exactly one bit. 011 to 100 in binary is 010 to 110 in Gray — one bit changes. So the 2-FF synchronizer inside the FIFO always captures either the old pointer value or the new pointer value, never a corrupted mix of the two.
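The one-bit-per-increment property is easy to check exhaustively. This is a small self-checking sketch, assuming a 3-bit pointer to match the example above. The gray2bin helper is not used by the FIFO and is only there to show that the encoding is reversible.

```systemverilog
// Simulation-only check of the Gray-code properties described in the text.
module tb_gray_code;
    localparam int W = 3;

    function automatic logic [W-1:0] bin2gray(input logic [W-1:0] b);
        return b ^ (b >> 1);
    endfunction

    function automatic logic [W-1:0] gray2bin(input logic [W-1:0] g);
        logic [W-1:0] b;
        b[W-1] = g[W-1];
        for (int i = W-2; i >= 0; i--)
            b[i] = b[i+1] ^ g[i];
        return b;
    endfunction

    initial begin
        for (int i = 0; i < 2**W; i++) begin
            logic [W-1:0] cur, nxt;
            cur = bin2gray(W'(i));
            nxt = bin2gray(W'(i + 1));      // W'(2**W) wraps to 0
            // Every increment, including the wrap, flips exactly one bit.
            assert ($countones(cur ^ nxt) == 1)
                else $error("increment from %0d changes more than one bit", i);
            // The encoding is reversible.
            assert (gray2bin(cur) == W'(i))
                else $error("round trip failed for %0d", i);
        end
        $display("binary 3 -> Gray %b, binary 4 -> Gray %b",
                 bin2gray(W'(3)), bin2gray(W'(4)));
    end
endmodule
```

The final line prints Gray 010 and 110 for binary 3 and 4, matching the example above. The wrap from the maximum count back to zero is included in the loop, which matters because FIFO pointers wrap constantly.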

The second critical choice is pointer width. Pointers are ADDR_WIDTH + 1 bits wide, one extra bit beyond what you'd need to index into the memory. This extra MSB encodes wrap-around. When the write pointer is ahead of the read pointer by exactly 2^ADDR_WIDTH entries, the binary pointers differ only in the MSB while the lower address bits match: that's full. In Gray code the same condition shows up as the top two bits inverted and the remaining bits equal, which is the comparison the RTL makes. When all bits match, the pointers are equal: that's empty. Without the extra bit, you can't distinguish a full FIFO from an empty one, because the address bits are identical in both cases.
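If the full comparison looks like a trick, it can be checked by brute force. The sketch below assumes ADDR_WIDTH = 4 and walks every read pointer and every legal fill level, confirming that the Gray-coded full and empty tests fire only at fill levels 16 and 0. The helper names is_full and is_empty are mine.

```systemverilog
// Simulation-only check of the Gray-coded full/empty comparisons.
module tb_gray_full_empty;
    localparam int ADDR_WIDTH = 4;
    localparam int PTR_WIDTH  = ADDR_WIDTH + 1;
    typedef logic [PTR_WIDTH-1:0] ptr_t;

    function automatic ptr_t bin2gray(input ptr_t b);
        return b ^ (b >> 1);
    endfunction

    // Full: top two Gray bits inverted, remaining bits equal.
    function automatic logic is_full(input ptr_t wgray, input ptr_t rgray);
        return wgray == {~rgray[PTR_WIDTH-1:PTR_WIDTH-2], rgray[PTR_WIDTH-3:0]};
    endfunction

    // Empty: Gray pointers identical.
    function automatic logic is_empty(input ptr_t wgray, input ptr_t rgray);
        return wgray == rgray;
    endfunction

    initial begin
        for (int r = 0; r < 2**PTR_WIDTH; r++) begin
            for (int fill = 0; fill <= 2**ADDR_WIDTH; fill++) begin
                ptr_t rbin, wbin;
                rbin = ptr_t'(r);
                wbin = ptr_t'(r + fill);    // wraps modulo 2**PTR_WIDTH
                assert (is_empty(bin2gray(wbin), bin2gray(rbin)) == (fill == 0))
                    else $error("empty mismatch: r=%0d fill=%0d", r, fill);
                assert (is_full(bin2gray(wbin), bin2gray(rbin)) == (fill == 2**ADDR_WIDTH))
                    else $error("full mismatch: r=%0d fill=%0d", r, fill);
            end
        end
        $display("full/empty comparisons exercised for every pointer pair");
    end
endmodule
```

The reason it works: adding 2^ADDR_WIDTH flips only the binary MSB, and because Gray is b ^ (b >> 1), that one flipped bit appears in both the top Gray bit and the one below it.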

There's a subtlety in where full and empty are computed. wfull is generated in the write clock domain, using the Gray-coded read pointer that's been synchronized into the write domain. rempty is generated in the read clock domain, using the Gray-coded write pointer synchronized into the read domain. These domains are honored strictly — you cannot compute wfull in the read domain or vice versa without breaking the correctness argument.

Here's the full RTL:

module cdc_async_fifo #(
    parameter int DATA_WIDTH = 8,
    parameter int ADDR_WIDTH = 4  // depth = 2^ADDR_WIDTH entries
) (
    // Write port
    input  logic                  wclk,
    input  logic                  wrst_n,
    input  logic                  wen,
    input  logic [DATA_WIDTH-1:0] wdata,
    output logic                  wfull,

    // Read port
    input  logic                  rclk,
    input  logic                  rrst_n,
    input  logic                  ren,
    output logic [DATA_WIDTH-1:0] rdata,
    output logic                  rempty
);

    localparam int PTR_WIDTH = ADDR_WIDTH + 1; // extra MSB for full/empty

    // Dual-port memory. The asynchronous read below maps to distributed RAM or
    // flops; block RAM inference generally requires a registered read.
    logic [DATA_WIDTH-1:0] mem [0:2**ADDR_WIDTH-1];

    // Write-domain: binary and Gray-coded write pointer
    logic [PTR_WIDTH-1:0] wbin,  wbin_next;
    logic [PTR_WIDTH-1:0] wgray, wgray_next;

    // Gray-coded read pointer synchronised into write domain
    (* ASYNC_REG = "TRUE" *) logic [PTR_WIDTH-1:0] rptr_wdom_s1, rptr_wdom;

    // Read-domain: binary and Gray-coded read pointer
    logic [PTR_WIDTH-1:0] rbin,  rbin_next;
    logic [PTR_WIDTH-1:0] rgray, rgray_next;

    // Gray-coded write pointer synchronised into read domain
    (* ASYNC_REG = "TRUE" *) logic [PTR_WIDTH-1:0] wptr_rdom_s1, wptr_rdom;

    // Binary -> Gray conversion (combinational)
    function automatic [PTR_WIDTH-1:0] bin2gray;
        input logic [PTR_WIDTH-1:0] b;
        bin2gray = b ^ (b >> 1);
    endfunction

    // Write logic
    assign wbin_next  = wbin + (wen & ~wfull);
    assign wgray_next = bin2gray(wbin_next);

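    // full when the write pointer == synchronized read pointer with the top
    // two bits inverted (write is ahead by exactly 2^ADDR_WIDTH entries).
    // Compare the registered pointer, not wgray_next: wbin_next depends on
    // wfull, so comparing the next value would form a combinational loop.
    assign wfull = (wgray ==
                    {~rptr_wdom[PTR_WIDTH-1:PTR_WIDTH-2], rptr_wdom[PTR_WIDTH-3:0]});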

    always_ff @(posedge wclk or negedge wrst_n) begin
        if (!wrst_n) begin
            wbin  <= '0;
            wgray <= '0;
        end else begin
            wbin  <= wbin_next;
            wgray <= wgray_next;
        end
    end

    always_ff @(posedge wclk) begin
        if (wen && !wfull)
            mem[wbin[ADDR_WIDTH-1:0]] <= wdata;
    end

    // 2-FF synchroniser: Gray-coded read pointer -> write domain
    always_ff @(posedge wclk or negedge wrst_n) begin
        if (!wrst_n) begin
            rptr_wdom_s1 <= '0;
            rptr_wdom    <= '0;
        end else begin
            rptr_wdom_s1 <= rgray;
            rptr_wdom    <= rptr_wdom_s1;
        end
    end

    // Read logic
    assign rbin_next  = rbin + (ren & ~rempty);
    assign rgray_next = bin2gray(rbin_next);

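    // empty when read Gray pointer == synchronized write Gray pointer.
    // Compare the registered pointer, not rgray_next: rbin_next depends on
    // rempty, so comparing the next value would form a combinational loop.
    assign rempty = (rgray == wptr_rdom);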

    always_ff @(posedge rclk or negedge rrst_n) begin
        if (!rrst_n) begin
            rbin  <= '0;
            rgray <= '0;
        end else begin
            rbin  <= rbin_next;
            rgray <= rgray_next;
        end
    end

    assign rdata = mem[rbin[ADDR_WIDTH-1:0]];

    // 2-FF synchroniser: Gray-coded write pointer -> read domain
    always_ff @(posedge rclk or negedge rrst_n) begin
        if (!rrst_n) begin
            wptr_rdom_s1 <= '0;
            wptr_rdom    <= '0;
        end else begin
            wptr_rdom_s1 <= wgray;
            wptr_rdom    <= wptr_rdom_s1;
        end
    end

endmodule

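Notice that ASYNC_REG = "TRUE" appears on both synchronizer chains: rptr_wdom_s1/rptr_wdom and wptr_rdom_s1/wptr_rdom. The same placement requirements apply to the synchronizers embedded inside the FIFO as to the standalone synchronizer, and you need timing constraints on both crossings. One difference from the single-bit case: the Gray-code argument assumes the pointer bits arrive with skew well under a clock period, so a blanket false path, which leaves bit-to-bit skew unbounded, is weaker than a bounded constraint such as set_max_delay -datapath_only in Vivado.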
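The whole scheme rests on each Gray pointer changing at most one bit per clock in its own domain, and that is cheap to assert in simulation. The checker below is a sketch. The module name, instance names, and the bind lines are illustrative, and it assumes reset is asserted at the start of simulation so that $past has a defined value.

```systemverilog
// Hypothetical checker: a Gray-coded pointer may change at most one bit
// per clock in the domain that generates it.
module gray_ptr_checker #(
    parameter int W = 5
) (
    input logic         clk,
    input logic         rst_n,
    input logic [W-1:0] gray_ptr
);
    a_one_bit_per_cycle: assert property (
        @(posedge clk) disable iff (!rst_n)
        $countones(gray_ptr ^ $past(gray_ptr)) <= 1
    ) else $error("Gray pointer changed more than one bit in a single cycle");
endmodule

// Possible use with the FIFO above (instance names are illustrative):
// bind cdc_async_fifo gray_ptr_checker #(.W(PTR_WIDTH))
//     u_wgray_chk (.clk(wclk), .rst_n(wrst_n), .gray_ptr(wgray));
// bind cdc_async_fifo gray_ptr_checker #(.W(PTR_WIDTH))
//     u_rgray_chk (.clk(rclk), .rst_n(rrst_n), .gray_ptr(rgray));
```

This only checks the source side of each crossing. It says nothing about skew on the way to the synchronizer, which is what the timing constraints are for.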

Synthesis numbers

Both modules synthesized against sky130 standard cells. These are real numbers from the tool:

| Module | Area (units) | Cells | Fmax (MHz) | Critical path (ns) |
| --- | --- | --- | --- | --- |
| cdc_sync_2ff | 55.1 | 16 | 3,030 | 0.33 |
| cdc_async_fifo (depth 16, 8-bit) | 6,222.2 | 814 | 377 | 2.65 |

The async FIFO takes about 113 times the area and tops out at roughly one-eighth the clock frequency. That's not a knock on the FIFO: it does something fundamentally different, buffering multi-bit data between two unrelated clocks. But those numbers make the trade-off concrete. If you're crossing a single enable signal, you don't want 814 cells doing it. If you're streaming audio samples between a 48 kHz sample clock and a 100 MHz processor clock, you can't do it with 16 cells.

Choosing the right technique

The decision comes down to what is crossing and how often. A single slow-changing control bit (reset, enable, mode select) uses the 2-FF synchronizer. The ASYNC_REG attribute and a timing exception on the crossing are mandatory whether the synchronizer stands alone or sits inside a larger structure; the implementation is not trustworthy without them.

Multi-bit data that changes infrequently, where latency of 8–10 destination cycles per transfer is acceptable, fits the four-phase handshake. You're holding data stable in the source domain and synchronizing only the single-bit req and ack signals, so the multi-bit trap doesn't apply.

Any multi-bit transfer that needs sustained throughput — streaming data, a wide data path between processor and peripheral, FIFO-backed AXI crossings — belongs in an async FIFO. The Gray-code pointer scheme gives you the safety of a single-bit synchronizer on each boundary crossing, and the dual-port memory buffers the actual data without any cross-domain transfer of the payload itself. The cost, as the numbers above show, is real: 6,222 area units and a 2.65 ns critical path on the full logic. Budget for it.

One more thing worth saying: CDC lint tools exist and you should run them. Synopsys SpyGlass CDC, Cadence JasperGold CDC, and Siemens (formerly Mentor) Questa CDC catch structural violations that simulation won't, including accidental multi-bit synchronizers on buses and missing timing constraints. I've seen designs that passed full regression with a latent multi-bit synchronizer bug that only showed up when a corner-case clock frequency combination hit the metastability window. The lint tool found it in ten minutes. Skipping that step is hard to justify.

If you want to run these modules yourself and check the synthesis numbers against your own parameter choices, try it on Logicode.