I2C Master in SystemVerilog: Why Your Flat FSM Breaks on Real Hardware (and the Two-Level Fix)

  • systemverilog
  • i2c
  • fpga
  • fsm
  • hardware-design
  • serial-protocol

I've read enough I2C tutorials to know exactly where they go wrong, and it's almost always the same two lines. The master sets scl_o <= 1, then on the very next tick, without checking whether SCL actually went high on the wire, it advances to the next state and shifts the data bit into its register. That works perfectly in simulation, where the bus obeys you instantly. It fails silently on real hardware the moment a slave decides to hold SCL low for 50 µs while it finishes an internal operation: the master has already moved on, the bit it sampled is garbage, and the rest of the transaction degrades from there.

The fix is one condition: qdiv_tick && scl_in instead of plain qdiv_tick. You gate the state advance on the actual bus value, not just your internal timer expiring. The flat FSM I'll show first gets this right — it checks scl_in in every SCL-high state before advancing. But being correct isn't the same as being clean, and the structural problems with a 21-state monolith become obvious the moment you try to add arbitration detection or a repeated-START.

I2C is a two-wire bus: SCL (clock) and SDA (data). Both lines are open-drain with pull-up resistors — no device drives them high, they only pull low or release. That wired-AND arrangement is what enables clock stretching and multi-master arbitration. A START condition is SDA falling while SCL is high; a STOP is SDA rising while SCL is high; every other SDA transition must happen while SCL is low. After a START the master sends a 7-bit address and a R/W bit, and the addressed slave pulls SDA low to ACK, then data bytes flow, each followed by an ACK, until a STOP releases the bus.
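In RTL, "only pull low or release" comes down to a tri-state driver that never drives a 1. Here's a minimal sketch of the pad logic, assuming the scl_o/sda_o/scl_in/sda_in names used in the rest of this post; the i2c_pads module name and the scl/sda pin names are mine, purely for illustration:

```systemverilog
// Hypothetical open-drain pad wrapper. The module name and the scl/sda pin
// names are illustrative; they are not taken from either implementation.
module i2c_pads (
    input  logic scl_o,   // 0 = pull SCL low, 1 = release
    input  logic sda_o,   // 0 = pull SDA low, 1 = release
    output logic scl_in,  // actual level on the SCL wire
    output logic sda_in,  // actual level on the SDA wire
    inout  wire  scl,     // bus pin, needs an external pull-up
    inout  wire  sda      // bus pin, needs an external pull-up
);
    // Drive low or go high-impedance; the pull-up supplies the 1.
    assign scl = scl_o ? 1'bz : 1'b0;
    assign sda = sda_o ? 1'bz : 1'b0;

    // Read back the wire itself, not the value we asked for.
    assign scl_in = scl;
    assign sda_in = sda;
endmodule
```

On a real board you'd normally also pass scl_in and sda_in through a two-flop synchronizer before the FSM looks at them; I've left that out to keep the sketch short.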

The NXP UM10204 specification defines the timing for Standard-mode (100 kHz): tLOW ≥ 4.7 µs, tHIGH ≥ 4.0 µs, tSU;DAT ≥ 250 ns setup before SCL rises. With a 50 MHz system clock, setting QDIV = CLK_FREQ / (4 × I2C_FREQ) = 125 gives a quarter-period of 2.5 µs — two of those at SCL=0 gives 5.0 µs, comfortably above spec. Clock stretching is defined in §3.1.9: a slave can hold SCL low to pause the master mid-transaction for as long as it needs. Arbitration (§3.1.8) applies when two masters start simultaneously — each reads back SDA while driving, and the one who sees a 0 when it drove a 1 has lost and must back off immediately without issuing a STOP. Those three mechanisms — clock stretching, arbitration loss, and SDA hold time — are where almost every tutorial-level implementation falls short.
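The divider arithmetic is easy to get wrong by a factor of two, so here is a sketch of a quarter-period tick generator with the Standard-mode tLOW check written down as an assertion. Only the QDIV formula and the qdiv_tick name come from the text above; the qdiv_gen module, its ports, and the assertion are my assumptions for illustration:

```systemverilog
// Hypothetical quarter-period tick generator (illustrative names).
module qdiv_gen #(
    parameter int CLK_FREQ = 50_000_000,  // system clock in Hz
    parameter int I2C_FREQ = 100_000      // SCL frequency in Hz
) (
    input  logic clk,
    input  logic rst_n,
    output logic qdiv_tick                // one-cycle pulse every quarter SCL period
);
    localparam int  QDIV     = CLK_FREQ / (4 * I2C_FREQ);      // 125 for 50 MHz / 100 kHz
    localparam int  CNT_W    = (QDIV > 1) ? $clog2(QDIV) : 1;
    // SCL is low for two quarter-periods: 2 * 125 * 20 ns = 5000 ns here.
    localparam real T_LOW_NS = 2.0 * QDIV * 1.0e9 / CLK_FREQ;

    // Simulation-only sanity check against the Standard-mode tLOW minimum (4.7 us).
    initial begin
        assert (T_LOW_NS >= 4700.0)
            else $error("tLOW = %0.1f ns is below the 4.7 us minimum", T_LOW_NS);
    end

    logic [CNT_W-1:0] cnt;

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            cnt       <= '0;
            qdiv_tick <= 1'b0;
        end else if (cnt == CNT_W'(QDIV - 1)) begin
            cnt       <= '0;       // wrap after QDIV cycles
            qdiv_tick <= 1'b1;
        end else begin
            cnt       <= cnt + 1'b1;
            qdiv_tick <= 1'b0;
        end
    end
endmodule
```

The 4.7 µs limit in the assertion applies to Standard-mode only; a Fast-mode build would need its own limit.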

i2c_master_simple.sv is a 21-state monolith. The state enum tells the whole story:

typedef enum logic [4:0] {
    S_IDLE        = 5'd0,
    S_START_HIGH  = 5'd1,
    S_START_FALL  = 5'd2,
    S_ADDR_SCL_L  = 5'd3,
    S_ADDR_SCL_H  = 5'd4,
    S_RW_SCL_L    = 5'd5,
    S_RW_SCL_H    = 5'd6,
    S_ADDR_ACK_L  = 5'd7,
    S_ADDR_ACK_H  = 5'd8,
    S_WDATA_SCL_L = 5'd9,
    S_WDATA_SCL_H = 5'd10,
    S_WACK_L      = 5'd11,
    S_WACK_H      = 5'd12,
    S_RDATA_SCL_L = 5'd13,
    S_RDATA_SCL_H = 5'd14,
    S_MACK_L      = 5'd15,
    S_MACK_H      = 5'd16,
    S_STOP_SCL_L  = 5'd17,
    S_STOP_SCL_H  = 5'd18,
    S_STOP_RISE   = 5'd19,
    S_DONE        = 5'd20
} state_t;

Every state is a distinct slot in a single 5-bit register. The clock-stretching guard appears correctly in each SCL-high state:

S_ADDR_SCL_H: begin
    scl_o <= 1'b1;
    // Wait for SCL to actually go high (handles clock stretching)
    if (qdiv_tick && scl_in) begin
        shift <= {shift[6:0], 1'b0};
        if (bit_cnt == 3'd0) begin
            state <= S_RW_SCL_L;
        end else begin
            bit_cnt <= bit_cnt - 1'b1;
            state   <= S_ADDR_SCL_L;
        end
    end
end

The double condition qdiv_tick && scl_in means that when the timer fires while the slave is holding SCL low, the transition doesn't happen; the qdiv counter keeps running, and the condition is re-checked on later ticks until scl_in goes high. This matches what NXP UM10204 §3.1.9 requires, and what fpgarelated.com practitioners describe as the essential "sample scl_in before advancing" rule. The same guard appears in S_RW_SCL_H, S_ADDR_ACK_H, S_WDATA_SCL_H, S_WACK_H, S_RDATA_SCL_H, S_MACK_H, and S_STOP_SCL_H, which makes eight states where the check must be duplicated. For a single-byte write/read, this works. But add multi-byte support and you add more _SCL_L/_SCL_H pairs, each needing the same guard. Add a repeated-START and you need START states that branch mid-transaction. The state count scales with features, not with the protocol's inherent complexity, which is actually quite small.

The deeper structural problem is that there's no clean place to detect arbitration loss. Every _SCL_H state where the master is driving a 1 needs to check sda_in == 0, and in a 21-state machine that's scattered logic with no single locus. When something goes wrong in the lab, you're reading eight separate cases in the waveform viewer wondering which one fired.

i2c_master_seq.sv separates the same logic into two cooperating FSMs. bit_engine handles one atomic I2C primitive — START, STOP, SEND_BIT, RECV_BIT, SEND_ACK, RECV_ACK — and txn_ctrl sequences those primitives to implement the transaction. The bit engine has 8 states:

typedef enum logic [2:0] {
    BE_IDLE    = 3'd0,
    BE_SETUP   = 3'd1,   // SCL low: setup data/condition on SDA
    BE_SCL_H   = 3'd2,   // SCL rising: wait for SCL_in (stretch)
    BE_HOLD    = 3'd3,   // SCL high: hold data (sampling window)
    BE_SCL_L   = 3'd4,   // SCL falling
    BE_STA_H   = 3'd5,   // START: SDA falls while SCL=1
    BE_STO_H   = 3'd6,   // STOP:  SDA rises while SCL=1
    BE_DONE    = 3'd7
} be_state_t;

The clock-stretching check lives in exactly one state:

BE_SCL_H: begin
    if (!scl_in) begin
        // Slave is stretching: just wait, don't advance timer
    end else if (qdiv_tick) begin
        // SCL is high and hold time has elapsed
        if ((cmd == CMD_RECV_BIT) || (cmd == CMD_RECV_ACK))
            rcv_bit <= sda_in;
        // Arbitration: we drove 1 but bus reads 0 → lost
        if (((cmd == CMD_SEND_BIT) || (cmd == CMD_SEND_ACK))
                && sda_bit && !sda_in)
            arb_loss <= 1'b1;
        be_state <= BE_SCL_L;
    end
end

One state, one place: that's the entire clock-stretching and arbitration-detection implementation. If scl_in is low, the engine simply holds in BE_SCL_H (the qdiv_tick branch is skipped entirely). If SCL is confirmed high and we drove a 1 (sda_bit) but the wired-AND bus reads 0 (!sda_in), someone else is pulling it low and we've lost arbitration. The arb_loss flag propagates to txn_ctrl, which aborts cleanly:

TC_ADDR: begin
    if (cmd_done) begin
        if (arb_loss) begin
            tc_state <= TC_DONE;   // arbitration lost: stop
        end else if (bit_cnt == 3'd0) begin
            // ...
        end
    end
end

The txn_ctrl itself reads almost like an I2C protocol spec table — TC_IDLE, TC_START, TC_ADDR, TC_RW, TC_ADDR_ACK, TC_WDATA, TC_WACK, TC_RDATA, TC_MACK, TC_STOP, TC_DONE. Eleven states, each issuing a command to bit_engine and waiting for cmd_done. To add a repeated-START, you add one state and one transition in txn_ctrl and the bit engine never changes.

Sky130 standard-cell synthesis measured both modules, and the result surprised me:

Module            | Area (µm²) | Cells | Fmax (MHz) | Critical path (ns)
------------------|------------|-------|------------|-------------------
i2c_master_simple | 2592.5     | 540   | 418        | 2.39
i2c_master_seq    | 3108.0     | 709   | 435        | 2.30

The flat FSM is about 17% smaller; put the other way, the sequencer is about 20% larger. The sequencer costs 515.5 µm², 169 extra cells, and in return gains 4% Fmax (17 MHz, 2.30 ns vs 2.39 ns critical path). Those extra cells go into the cmd/cmd_done handshake, the arb_loss flag capture, and the arb_reg sticky bit in txn_ctrl. Both designs handle clock stretching, so what you buy with them is a stretch check that lives in one place instead of eight, plus arbitration detection that won't corrupt another master's transaction, plus a structure you can extend without counting states. The 4% Fmax improvement is consistent with smaller per-FSM next-state cones: an 8-state machine and an 11-state machine each have shallower combinational paths than one 21-state machine, so the synthesizer finds a shorter critical path even with the added handshake overhead. I2C runs at 100–400 kHz; the 435 MHz Fmax is not the constraint that matters here. Correctness is.

On the three specific edge cases: clock stretching first. A slave can hold SCL low for milliseconds; some temperature sensors, for example, stretch while waiting for an ADC conversion. A master that doesn't gate on scl_in will advance its internal state counter and sample SDA at the wrong time, reading the previous bit value, and the slave is violating nothing: it's using a legal I2C mechanism documented in UM10204 §3.1.9. Silicon Labs AN1095 documents exactly this failure mode for masters that don't support stretching. In the flat FSM, the qdiv_tick && scl_in guard in every SCL-high state handles this. In the sequencer, BE_SCL_H's !scl_in branch handles it, and because it's one state, it's structurally impossible to introduce a regression where one SCL-high phase checks and another doesn't.
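Simulation only "obeys you instantly" if nothing else drives the wire. Here's a small sketch of how I'd model a stretching slave in a testbench, assuming a tri1 net stands in for the pulled-up bus. The names tb_stretch, master_scl_o, and slave_hold are hypothetical, and the 50 µs hold is just the figure from the intro:

```systemverilog
`timescale 1ns/1ps
// Hypothetical testbench: a wired-AND SCL with a slave that stretches.
module tb_stretch;
    logic master_scl_o = 1'b0;   // master: 0 = pull low, 1 = release
    logic slave_hold   = 1'b1;   // slave:  1 = hold SCL low (stretching)

    tri1 scl;                    // net that reads 1 when nobody drives it (pull-up)
    assign scl = master_scl_o ? 1'bz : 1'b0;
    assign scl = slave_hold   ? 1'b0 : 1'bz;

    initial begin
        #100   master_scl_o = 1'b1;            // master releases SCL
        #1     assert (scl === 1'b0)           // wire is still low: the slave holds it
                   else $error("expected SCL held low by the slave");
        #50000 slave_hold = 1'b0;              // slave lets go about 50 us later
        #1     assert (scl === 1'b1)           // only now does SCL actually rise
                   else $error("expected SCL high after release");
        $display("stretch model behaves as expected");
        $finish;
    end
endmodule
```

To exercise a real master, you'd replace master_scl_o with the DUT's scl_o and feed scl back in as scl_in; a master that ignores scl_in will have sampled its bit long before the second assertion point.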

For arbitration: when two masters start simultaneously, they both generate a START and begin clocking out address bits. As long as both drive the same values they coexist on the wired-AND bus; the moment they diverge, the one driving 1 reads back a 0 from sda_in and has lost. Per UM10204 §3.1.8, the losing master must immediately stop driving SCL and SDA but must NOT generate a STOP condition, because another master is mid-transaction and a STOP would corrupt it. The flat FSM has no arbitration detection at all — the comments say as much explicitly. The sequencer detects it in BE_SCL_H, propagates arb_loss to txn_ctrl, and the controller transitions to TC_DONE without issuing CMD_STOP.
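The wired-AND behaviour behind that rule is easy to see in a few lines of testbench. This is a sketch under my own assumptions (tb_arb, m1_sda_o, m2_sda_o, and the *_lost wires are hypothetical names); the loss expression mirrors the sda_bit && !sda_in check in BE_SCL_H, but it ignores SCL entirely, whereas the real engine only evaluates it once SCL is confirmed high:

```systemverilog
// Hypothetical testbench: two masters sharing a pulled-up SDA line.
module tb_arb;
    logic m1_sda_o = 1'b1;   // master 1 releases SDA (sends 1)
    logic m2_sda_o = 1'b0;   // master 2 pulls SDA low (sends 0)

    tri1 sda;                // reads 1 only when every driver has released it
    assign sda = m1_sda_o ? 1'bz : 1'b0;
    assign sda = m2_sda_o ? 1'bz : 1'b0;

    // A master has lost when it sent 1 but reads back 0.
    wire m1_lost = m1_sda_o && !sda;
    wire m2_lost = m2_sda_o && !sda;

    initial begin
        #1 assert (m1_lost && !m2_lost)        // bits diverge: the one sending 1 loses
               else $error("expected master 1 to lose arbitration");
        m2_sda_o = 1'b1;                       // now both masters send 1
        #1 assert (!m1_lost && !m2_lost)       // identical bits: nobody loses
               else $error("expected no arbitration loss");
        $display("arbitration model behaves as expected");
        $finish;
    end
endmodule
```

Note that the winner (master 2 in the first step) sees nothing unusual at all, which is why the loser has to release the bus silently rather than issue a STOP.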

SDA hold time is subtler. The spec sets tHD;DAT ≥ 0 ns as the floor, but real devices internally need around 300 ns for the previous bit to be unambiguously latched before SDA changes; changing SDA too soon after SCL falls can glitch into a false START or STOP. The sequencer handles this structurally: BE_SETUP is an entire quarter-period (2.5 µs at 100 kHz with QDIV = 125) where SCL is low and SDA settles before SCL rises. That 2.5 µs is well above the 300 ns internal hold requirement, and it's guaranteed by the FSM structure rather than a comment saying "don't change SDA too fast." The flat FSM handles this correctly too, in its _SCL_L states, for the same reason.

The flat FSM is fine for a classroom exercise or an FPGA prototype where you control all the slaves: if you're the only master and the feature set is frozen, the 21 states work and the roughly 17% area savings is real. I'd ship the sequencer for production hardware: any multi-master topology, any design that will keep growing features (where one missed guard among eight becomes a latent bug), any system where "it worked in simulation" isn't sufficient. The 169-cell penalty is small in absolute terms, and what you get is a design where extending to multi-byte writes or repeated-START takes a weekend instead of a refactor. If you want to go further (add a FIFO, support AXI-Stream, drive multiple byte sequences), the sequencer extends by adding states to txn_ctrl only, and bit_engine never changes. That's worth something that doesn't show up in the synthesis report.

More SystemVerilog controller designs, timing analysis write-ups, and synthesizable reference implementations live at logi-code.com.