June 19, 2026 · 11 min read

SPI in Four Modes: The One Edge Rule That Every Tutorial Gets Wrong

Synthesized on sky130, the generalized four-mode SPI master lands at 336 cells, 1134.8 μm², and a maximum frequency of 481 MHz. The mode-locked variant — Mode 0 only, CPOL and CPHA hardwired to zero — comes out at 345 cells, 1117.3 μm², and 565 MHz. That's a 17.5% frequency difference for a 1.6% area difference, and the mode-locked design is actually larger in cell count while running faster, because cutting the CPOL/CPHA mux logic shortens the critical path — 1.77 ns versus 2.08 ns — not because it eliminates flip-flops. If you're budgeting MHz headroom and you only ever talk to Mode 0 devices, the mode-locked version is the right call. If your sensor suite includes an accelerometer on Mode 3, carry the generalized design. The 84 MHz you're giving up is almost certainly not the bottleneck.

SPI is four signals and four modes, and the four modes are where the confusion lives. SCLK is the clock. MOSI is master-out slave-in. MISO is the other direction. CS_N is active-low chip select — the transaction happens while it's asserted. That part everyone gets right. The modes trip people up because the naming is genuinely misleading, not because the protocol is hard.

CPOL controls where the clock rests when idle: CPOL=0 means SCLK sits at zero between transactions, CPOL=1 means it sits at one. CPHA controls which edge the master samples MISO on: CPHA=0 means sample on the leading edge; CPHA=1 means sample on the trailing edge. The confusion is that "leading" and "trailing" are relative terms that depend entirely on CPOL. For CPOL=0, the leading edge is rising: the clock starts at zero, so the first thing it does is rise. For CPOL=1, the leading edge is falling. When a datasheet says "sample on rising edge," it's really saying Mode 0 or Mode 3, and if you assume "rising" universally, you've already broken modes 1 and 2.

The Wikipedia mode table lays this out if you read it carefully, but the "sample edge" column lists "rising" or "falling" without noting that those are the names of edges for a given CPOL. Mode 1 (CPOL=0, CPHA=1) samples on the falling edge, but that falling edge is the trailing edge, not the leading one. Mode 2 (CPOL=1, CPHA=0) also samples on the falling edge, but now falling is the leading edge. Same physical edge type, opposite semantic role — that's the mismatch that makes every "sample on falling" shorthand wrong for at least two modes.

The way through this is a single unified rule that never uses "rising" or "falling" at all.

The shift condition — when the master advances its TX shift register — is: sclk_r == ~CPOL. This fires on the trailing edge in every mode. For CPOL=0, the trailing edge is falling, meaning SCLK is currently high before it toggles — so sclk_r == 1 == ~0. For CPOL=1, trailing is rising, SCLK is currently low before toggle — sclk_r == 0 == ~1. Same expression, two modes handled automatically.

The sample condition — when the master captures MISO — is: sclk_r == (CPOL ^ CPHA). Walk through the four cases: Mode 0 (0,0) gives 0^0=0, so sample when sclk_r==0, which is the rising edge — correct. Mode 1 (0,1) gives 0^1=1, sample when sclk_r==1, which is the falling edge — correct. Mode 2 (1,0) gives 1^0=1, sample when sclk_r==1, falling edge — correct. Mode 3 (1,1) gives 1^1=0, sample when sclk_r==0, rising edge — correct. No per-mode case statements, no boolean table, just the XOR.
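
As a quick sanity check, the two expressions can be enumerated in a throwaway simulation module. This is only a sketch: the module and function names (edge_rule_demo, shift_level, sample_level, edge_name) are made up for illustration, and the mode number is assumed to be the usual {CPOL, CPHA} encoding.

```systemverilog
// Sketch: enumerate the unified edge rule for all four SPI modes.
// All names here are illustrative; none come from the spi_master source.
module edge_rule_demo;

    // SCLK level (before the toggle) at which the TX register shifts.
    // It is always the non-idle level, so the upcoming edge is the trailing one.
    function automatic logic shift_level(input logic cpol);
        return ~cpol;
    endfunction

    // SCLK level (before the toggle) at which MISO is sampled.
    function automatic logic sample_level(input logic cpol, input logic cpha);
        return cpol ^ cpha;
    endfunction

    // A level of 0 before the toggle means the upcoming edge is rising.
    function automatic string edge_name(input logic level_before);
        if (level_before) return "falling";
        return "rising";
    endfunction

    initial begin
        for (int mode = 0; mode < 4; mode++) begin
            logic cpol, cpha;
            cpol = mode[1];   // assumed encoding: mode = {CPOL, CPHA}
            cpha = mode[0];
            $display("Mode %0d (CPOL=%0b, CPHA=%0b): sample on %s edge, shift on %s edge",
                     mode, cpol, cpha,
                     edge_name(sample_level(cpol, cpha)),
                     edge_name(shift_level(cpol)));
        end
        // For CPHA=1 the sample level equals the shift level (both trailing edge).
        assert (sample_level(1'b0, 1'b1) == shift_level(1'b0));
        assert (sample_level(1'b1, 1'b1) == shift_level(1'b1));
    end
endmodule
```

The printed sample edges should match the walk-through above: rising for Modes 0 and 3, falling for Modes 1 and 2.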

Here's where it gets interesting for CPHA=1. In Mode 1 and Mode 3, both conditions fire simultaneously — the shift condition sclk_r == ~CPOL and the sample condition sclk_r == (CPOL ^ CPHA) evaluate to the same value on the trailing edge, which means the master is shifting out the next MOSI bit and sampling the current MISO bit in the same clock half-period. If you think about that naively it looks like a race: does the TX shift register update before or after the RX shift register reads MISO?

It's not a race, because of how nonblocking assignments work in SystemVerilog. Both always_ff blocks are triggered by the same sys_clk edge, and every right-hand side is evaluated before any register is updated, so the tx_shift block and the rx_shift block see the same snapshot of tx_shift and miso: the pre-toggle values. By the time tx_shift takes its new value, the rx_shift block has already read the old miso. Neither block sees the other's update within the same clock edge. The IEEE 1800-2017 LRM describes this as stratified event scheduling: right-hand sides are evaluated in the Active region, and the left-hand-side updates are committed in the NBA region. So the loopback test (assign miso = mosi) is valid even for CPHA=1.
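
A stripped-down simulation can make the same-edge behavior concrete. The sketch below is not the spi_master code: it shifts and samples on every clock edge, uses a hypothetical load signal in place of the state machine, and hardwires an 8-bit width.

```systemverilog
// Sketch: TX shift and RX sample on the same clock edge with a combinational loopback.
// nba_snapshot_demo and load are illustrative names, not part of spi_master.
module nba_snapshot_demo;
    logic       clk  = 1'b0;
    logic       load = 1'b1;          // high for the first edge to preload the registers
    logic [7:0] tx_shift, rx_shift;

    wire miso = tx_shift[7];          // loopback: MISO follows the current TX MSB

    always #5 clk = ~clk;

    // Both blocks fire on the same edge. Each right-hand side is evaluated with
    // pre-edge values; the nonblocking updates are committed afterwards.
    always_ff @(posedge clk)
        tx_shift <= load ? 8'hA5 : {tx_shift[6:0], 1'b0};

    always_ff @(posedge clk)
        rx_shift <= load ? 8'h00 : {rx_shift[6:0], miso};

    initial begin
        @(posedge clk);               // preload edge
        #1 load = 1'b0;
        repeat (8) @(posedge clk);    // eight shift-and-sample edges
        #1;
        // If RX saw the already-shifted TX value, the MSB would be lost here.
        assert (rx_shift == 8'hA5)
            else $error("rx_shift = %h, expected a5", rx_shift);
        $display("rx_shift = %h", rx_shift);
        $finish;
    end
endmodule
```

If the two updates really raced, rx_shift would end up missing the first bit; with nonblocking assignments the assertion should hold.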

There's one more piece that most tutorials skip: the SETUP state. Before any SCLK edges happen, the master needs to pre-drive the MSB on MOSI. For CPHA=0, the slave samples on the leading edge — the very first SCLK transition after CS_N asserts — so MOSI must be valid before the first edge, which means the master needs a half-period of setup time with CS_N asserted and SCLK still at its idle level. Without it, the slave for CPHA=0 modes will never see the MSB.

For CPHA=1, the slave samples on the trailing edge (the second edge), so the first edge is the shift edge, and the master does have time to prepare MOSI after the leading edge. Running SETUP unconditionally keeps the state machine uniform across all modes, and it is harmless for CPHA=1: it just adds one half-period of CS_N setup time, which the slave spec usually requires anyway.

The full state machine runs like this: IDLE transitions to SETUP on start. SETUP holds CS_N asserted and SCLK at idle level for one half-period, then moves to TRANSFER. TRANSFER runs for 2 * DATA_WIDTH half-periods, toggling SCLK and shifting/sampling data on each half-period strobe. When the last half-period completes, it moves to FINISH. FINISH pulses done, keeps CS_N asserted for one cycle, then drops back to IDLE and deasserts CS_N — the one-cycle FINISH state matters because if you deassert CS_N and assert done in the same cycle as the last edge, the slave may not have fully processed the last bit.

The SCLK register holds the current clock level. During SETUP and at reset, it loads CPOL (the idle level). During TRANSFER, it toggles on every half_done strobe. half_done is a single-cycle pulse from a clock divider counter that counts up to CLK_DIV - 1 and resets, giving SCLK a frequency of sys_clk / (2 * CLK_DIV).
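
The divider itself isn't shown in this post, so here is a minimal sketch of how such a half-period strobe could be built. The module name, the run input, and the div_cnt register are assumptions for illustration; only half_done, CLK_DIV, sys_clk, and rst_n come from the description above.

```systemverilog
// Sketch of a half-period strobe generator.
// half_period_strobe, run and div_cnt are assumed names, not taken from spi_master.
module half_period_strobe #(
    parameter int CLK_DIV = 4          // sys_clk cycles per SCLK half-period
) (
    input  logic sys_clk,
    input  logic rst_n,                // synchronous, active-low reset
    input  logic run,                  // high while the divider should count
    output logic half_done             // one-cycle pulse at the end of each half-period
);
    localparam int CNT_W = (CLK_DIV > 1) ? $clog2(CLK_DIV) : 1;
    logic [CNT_W-1:0] div_cnt;

    // Pulse when the counter reaches CLK_DIV - 1
    assign half_done = run && (div_cnt == CNT_W'(CLK_DIV - 1));

    always_ff @(posedge sys_clk) begin
        if (!rst_n || !run || half_done)
            div_cnt <= '0;             // restart at reset, when idle, and after each strobe
        else
            div_cnt <= div_cnt + 1'b1;
    end
endmodule
```

With CLK_DIV = 4, half_done pulses every fourth sys_clk cycle while run is high, and toggling SCLK on each pulse gives a period of 8 cycles, which matches sys_clk / (2 * CLK_DIV).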

Here's the core of the generalized implementation:

// TX shift register — shift on TRAILING edge, independent of CPHA
always_ff @(posedge sys_clk) begin
    if (!rst_n) begin
        tx_shift <= '0;
    end else if (start && state == IDLE) begin
        tx_shift <= tx_data;
    end else if (state == TRANSFER && half_done) begin
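        // NOTE: this comparison assumes CPOL is declared 1 bit wide (e.g. parameter bit).
        // With an untyped integer parameter, ~CPOL is 32 bits wide; use sclk_r != CPOL instead.
        if (sclk_r == ~CPOL)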
            tx_shift <= {tx_shift[DATA_WIDTH-2:0], 1'b0};
    end
end

// RX shift register — sample when sclk_r == (CPOL ^ CPHA)
always_ff @(posedge sys_clk) begin
    if (!rst_n) begin
        rx_shift <= '0;
    end else if (state == TRANSFER && half_done) begin
        if (sclk_r == (CPOL ^ CPHA))
            rx_shift <= {rx_shift[DATA_WIDTH-2:0], miso};
    end
end

Those two conditions — sclk_r == ~CPOL for shift and sclk_r == (CPOL ^ CPHA) for sample — are the entire CPOL/CPHA logic. Everything else in the module is bookkeeping: the clock divider, the edge counter, the state machine transitions, the output assignments. mosi is just tx_shift[DATA_WIDTH-1] — the MSB, always combinationally visible on MOSI so the slave sees it before the sample edge.

The SCLK register and state machine look like this:

always_ff @(posedge sys_clk) begin
    if (!rst_n) begin
        sclk_r <= CPOL;
    end else if (state == IDLE || state == FINISH) begin
        sclk_r <= CPOL;           // return to idle
    end else if (state == SETUP) begin
        sclk_r <= CPOL;           // hold idle while CS_N settles
    end else if (state == TRANSFER && half_done) begin
        sclk_r <= ~sclk_r;        // toggle each half-period
    end
end

always_ff @(posedge sys_clk) begin
    if (!rst_n) begin
        state <= IDLE;
    end else begin
        case (state)
            IDLE:     if (start) state <= SETUP;
            SETUP:    if (half_done) state <= TRANSFER;
            TRANSFER: if (half_done && edge_cnt == EDGE_W'(TOTAL_EDGES - 1))
                          state <= FINISH;
            FINISH:   state <= IDLE;
            default:  state <= IDLE;
        endcase
    end
end
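
The TRANSFER exit condition above depends on an edge counter that isn't shown. A minimal sketch of one follows, assuming TOTAL_EDGES = 2 * DATA_WIDTH (the number of half-periods in TRANSFER) and EDGE_W = $clog2(TOTAL_EDGES). The module wrapper and the in_transfer and last_edge signals are hypothetical, added only to make the fragment self-contained.

```systemverilog
// Sketch of the half-period (edge) counter used to leave TRANSFER.
// edge_counter, in_transfer and last_edge are assumed names for illustration.
module edge_counter #(
    parameter int DATA_WIDTH = 8
) (
    input  logic sys_clk,
    input  logic rst_n,
    input  logic in_transfer,   // would be (state == TRANSFER) in the full design
    input  logic half_done,     // half-period strobe from the clock divider
    output logic last_edge      // high during the final half-period of the frame
);
    localparam int TOTAL_EDGES = 2 * DATA_WIDTH;        // assumed definition
    localparam int EDGE_W      = $clog2(TOTAL_EDGES);   // assumed definition
    logic [EDGE_W-1:0] edge_cnt;

    assign last_edge = in_transfer && (edge_cnt == EDGE_W'(TOTAL_EDGES - 1));

    always_ff @(posedge sys_clk) begin
        if (!rst_n || !in_transfer)
            edge_cnt <= '0;                 // cleared outside TRANSFER
        else if (half_done)
            edge_cnt <= edge_cnt + 1'b1;    // one count per SCLK toggle
    end
endmodule
```

In this sketch, half_done && last_edge is the same condition the FSM uses to move from TRANSFER to FINISH.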

The mode-locked variant (spi_master_fast) strips all of this down to Mode 0 assumptions. The shift condition becomes sclk_r == 1'b1 and the sample condition becomes sclk_r == 1'b0 — constants now, no XOR, no parameter reference. The SETUP state disappears entirely; the fast variant goes straight from IDLE to TRANSFER because Mode 0's CPHA=0 means the MSB is pre-driven during the IDLE-to-TRANSFER transition itself, with CS_N asserting in the same cycle. It also collapses to a three-state FSM (IDLE → TRANSFER → DONE) and derives CS_N combinationally from the state rather than from a registered output. Those changes together — shorter critical path through eliminated mux logic, combinational CS_N, simplified FSM encoding — account for the 0.31 ns improvement from 2.08 to 1.77 ns.

What I didn't expect was the cell count. The fast variant has more cells (345 vs 336), not fewer. The generalized design is smaller in raw cell count despite carrying four modes — the synthesis tool found ways to share logic across the CPOL/CPHA conditions in the generalized version that the fast variant can't exploit, and the combinational derivations in the fast variant add a few cells of their own. What actually shifts is area and critical path: 1117.3 μm² versus 1134.8 μm², and 1.77 ns versus 2.08 ns. Cell count is a poor proxy for either.

Here's the measured comparison:

Variant	Cells	Area (μm²)	Fmax (MHz)	Critical Path (ns)
`spi_master` (all 4 modes)	336	1134.8	481	2.08
`spi_master_fast` (Mode 0 only)	345	1117.3	565	1.77

Both synthesized on sky130 standard cells. The 17.5% frequency difference and 1.6% area difference are the actual numbers; the cell-count reversal is real, and worth keeping in mind if you're using cell count as a cost estimate — use area instead.

The bugs I've seen most often in SPI implementations that come through code review: starting the transfer from IDLE directly without a SETUP state, which breaks CPHA=0 for every device that has CS_N setup requirements (most of them); shifting on the leading edge for CPHA=1 instead of the trailing edge, which means the slave samples stale data on every bit; and not returning SCLK to its idle level after FINISH, which leaves the clock in an indeterminate state and can trigger a false edge when the next transaction starts. That last one is subtle — it only bites when back-to-back transactions have a gap shorter than a half-period, and it's easy to miss in simulation if your testbench always idles long enough between transfers.
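
The third bug is the kind of thing a concurrent assertion can catch even when the testbench idles generously between transfers. Below is a minimal sketch of a checker, assuming the master exposes cs_n and sclk and that CPOL is available as a 1-bit parameter; the module, property, and label names are invented for this example.

```systemverilog
// Sketch: flag any cycle where SCLK is away from its idle level while CS_N is high.
// spi_idle_checker, p_sclk_idle and a_sclk_idle are illustrative names.
module spi_idle_checker #(
    parameter bit CPOL = 1'b0
) (
    input logic sys_clk,
    input logic rst_n,
    input logic cs_n,
    input logic sclk
);
    // Whenever chip select is deasserted, SCLK must rest at CPOL.
    property p_sclk_idle;
        @(posedge sys_clk) disable iff (!rst_n)
            cs_n |-> (sclk == CPOL);
    endproperty

    a_sclk_idle: assert property (p_sclk_idle)
        else $error("SCLK is not at its idle level while CS_N is deasserted");
endmodule
```

One way to use it would be to instantiate the checker next to each DUT in the testbench, or attach it with a bind statement, so the check runs on every transaction without touching the design.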

Testing the generalized design against all four modes uses a combinational loopback: assign miso = mosi. This works because the master's CLK_DIV parameter guarantees at least two system clock cycles of setup time between when MOSI updates and when the sample edge arrives, so the MOSI value is stable well before the master samples it. For the loopback to be valid you need CLK_DIV >= 2, which the parameter comment already documents as a requirement. The testbench runs four byte values (0xA5, 0x3C, 0xFF, 0x00) through all four modes: 16 transactions total, and all 16 pass. The four test vectors cover a mixed pattern of alternating bits (0xA5 = 1010_0101), a run of ones bounded by zeros (0x3C = 0011_1100), all-ones, and all-zeros, which catches shift-register off-by-one errors that a single vector wouldn't expose.

// In the testbench — per-mode DUT instantiation and loopback
spi_master #(.DATA_WIDTH(8),.CLK_DIV(4),.CPOL(0),.CPHA(0)) dut0 ( ... );
assign m0_miso = m0_mosi;   // combinational loopback slave

spi_master #(.DATA_WIDTH(8),.CLK_DIV(4),.CPOL(0),.CPHA(1)) dut1 ( ... );
assign m1_miso = m1_mosi;

All four DUTs run concurrently in simulation. Each transfer takes about (1 + 2*DATA_WIDTH) * CLK_DIV system clock cycles: one SETUP half-period plus 16 data half-periods, each lasting CLK_DIV=4 cycles. That is about 68 cycles, or roughly 680 ns at a 100 MHz system clock, so each transfer completes in under a microsecond of simulated time and the four vectors per DUT finish in around 3 μs.
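
The cycle count is easy to double-check with a few localparams. This is just the arithmetic from the paragraph above written out; the module and constant names are invented, and the 100 MHz system clock is the figure quoted in the text.

```systemverilog
// Sketch: transfer length for DATA_WIDTH = 8, CLK_DIV = 4 at a 100 MHz sys_clk.
// transfer_time_calc and the localparam names are illustrative.
module transfer_time_calc;
    localparam int  DATA_WIDTH  = 8;
    localparam int  CLK_DIV     = 4;
    localparam real SYS_CLK_MHZ = 100.0;

    // One SETUP half-period plus 2 * DATA_WIDTH data half-periods
    localparam int  CYCLES      = (1 + 2 * DATA_WIDTH) * CLK_DIV;
    localparam real TRANSFER_NS = CYCLES * 1000.0 / SYS_CLK_MHZ;
    localparam real SCLK_MHZ    = SYS_CLK_MHZ / (2 * CLK_DIV);

    initial begin
        $display("cycles per transfer = %0d", CYCLES);            // 68
        $display("transfer time       = %0.1f ns", TRANSFER_NS);  // 680.0
        $display("SCLK frequency      = %0.1f MHz", SCLK_MHZ);    // 12.5
        assert (CYCLES == 68);
    end
endmodule
```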

My recommendation is to ship the generalized design. The 84 MHz gap matters only if your system clock runs so high that the 481 MHz Fmax is the actual constraint (SCLK itself tops out at sys_clk / (2 * CLK_DIV), so about 120 MHz at CLK_DIV = 2), and if you're in that situation you're probably not using a generic parameterized SPI master anyway. The real value of the generalized version is that it parameterizes once and handles any CPOL/CPHA combination at compile time: no code duplication, no per-mode bugs to track separately. The unified edge rule (sclk_r == ~CPOL for shift, sclk_r == (CPOL ^ CPHA) for sample) is the only thing that needs to be right, and once it's right, it's right for all four modes simultaneously.
