The AXI4-Lite Write Bug Every Tutorial Misses: AWVALID, WVALID, and the Missing Backpressure Check
- axi4-lite
- systemverilog
- fpga
- rtl
- amba
- digital-design
I've watched this happen to three different engineers in the past year. Write transaction fires. Waveform looks perfect — AWVALID, WVALID, AWREADY, WREADY, BVALID all toggle in the right order. Simulation passes. You load the bitstream, and half your writes silently disappear. No bus error. No timeout. BVALID went high, BREADY accepted it, the handshake completed — but the register never changed.
The culprit is two lines in the Xilinx AXI4-Lite peripheral template, repeated across hundreds of open-source cores, that make an assumption the AXI4 spec explicitly forbids. The simulation passes because every standard testbench drives the exact happy path the broken code handles. The hardware fails because real interconnect doesn't.
The AXI4 write path has three independent channels: AW (write address), W (write data), and B (write response). "Independent" means what it says. ARM's specification (IHI0022H, §A3.3) states that a master must not wait for AWREADY before asserting WVALID, and vice versa. Either channel can arrive first. The slave is obligated to handle all three orderings: AW before W, W before AW, or both in the same cycle.
This isn't an obscure corner of the spec. It's in the chapter on basic transactions. The reason the rule exists is deadlock prevention: if a slave required both valid signals before asserting either ready signal, and the master required either ready signal before driving the other valid signal, the whole system would lock up. So the spec makes it unconditional. A slave must handle both orderings.
The Xilinx template doesn't.
The AWREADY logic in the generated peripheral looks like this:
```systemverilog
always_ff @(posedge clk or negedge rst_n) begin
  if (!rst_n) begin
    axi_awready_r <= 1'b0;
    axi_awaddr_r  <= '0;
  end else begin
    if (!axi_awready_r && s_axi_awvalid && s_axi_wvalid) begin
      // BUG 1: pulses AWREADY for one cycle, but only when W is also valid.
      axi_awready_r <= 1'b1;
      axi_awaddr_r  <= s_axi_awaddr;
    end else begin
      axi_awready_r <= 1'b0;
    end
  end
end
```
The WREADY block is the same pattern with the signals swapped. Both require s_axi_awvalid && s_axi_wvalid in the same cycle. When a Cortex-M or a well-written VIP presents AWVALID one cycle before WVALID, this slave sees AWVALID go high and does nothing, because WVALID isn't up yet. Neither READY moves until both VALIDs overlap. A master that holds AWVALID and the address steady until AWREADY, as the spec requires, gets through a cycle late. Anything upstream that withdraws or changes the address before WVALID arrives does not: the address handshake never completes for the address that was originally presented, and the data lands in the wrong register, or nowhere at all.
The second bug is quieter but nastier. The write response logic in the naive implementation asserts BVALID whenever a write completes, with no check on whether the previous BVALID was acknowledged:
```systemverilog
always_ff @(posedge clk or negedge rst_n) begin
  if (!rst_n) begin
    axi_bvalid_r <= 1'b0;
    axi_bresp_r  <= 2'b00;
  end else begin
    if (write_en) begin
      // BUG 2: Unconditionally asserts BVALID after every write,
      // even if the previous BVALID hasn't been accepted yet.
      axi_bvalid_r <= 1'b1;
      axi_bresp_r  <= 2'b00;
    end else if (s_axi_bready && axi_bvalid_r) begin
      axi_bvalid_r <= 1'b0;
    end
  end
end
```
If BREADY stays low for even one extra cycle — a slow master, a blocked interconnect, anything — and a second write completes during that window, BVALID gets re-asserted over the un-acknowledged response. The first response is gone. The master sees one BVALID pulse for two writes, which is illegal under the spec, and depending on how the master tracks outstanding transactions you either get a silent dropped response or a bus hang.
In simulation, BREADY is almost always driven high immediately. Cocotb's AXI driver does this. Xilinx's AXI VIP (PG267) does this in its default mode. Nobody tests "master holds BREADY low for three cycles while processing the response" because that's not interesting to watch in a waveform, and in hardware it's the case that kills you.
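One way to reach both failure modes in simulation is a small directed driver that skews AW against W and holds BREADY low on purpose. The following is a minimal testbench sketch under stated assumptions, not a replacement for a real VIP: the module name `axil_write_driver_sketch`, the task names, and the delay and timeout arguments are all invented here for illustration.

```systemverilog
// Testbench-only sketch, not synthesizable. Module, task, and argument names
// are assumptions for illustration and do not come from any VIP.
module axil_write_driver_sketch (
  input  logic        clk,
  output logic [31:0] awaddr,
  output logic        awvalid,
  input  logic        awready,
  output logic [31:0] wdata,
  output logic [3:0]  wstrb,
  output logic        wvalid,
  input  logic        wready,
  input  logic [1:0]  bresp,
  input  logic        bvalid,
  output logic        bready
);
  initial begin
    awaddr = '0; awvalid = 1'b0;
    wdata  = '0; wstrb   = '0;  wvalid = 1'b0;
    bready = 1'b0;
  end

  // Issue one write request. aw_delay and w_delay (in clock cycles) skew the
  // two request channels against each other, so any arrival order is reachable.
  task automatic send_write(input logic [31:0] addr, input logic [31:0] data,
                            input int aw_delay, input int w_delay);
    fork
      begin
        repeat (aw_delay) @(posedge clk);
        awaddr  <= addr;
        awvalid <= 1'b1;
        do @(posedge clk); while (!awready);  // hold AWVALID until the handshake
        awvalid <= 1'b0;
      end
      begin
        repeat (w_delay) @(posedge clk);
        wdata  <= data;
        wstrb  <= 4'hF;
        wvalid <= 1'b1;
        do @(posedge clk); while (!wready);   // hold WVALID until the handshake
        wvalid <= 1'b0;
      end
    join
  endtask

  // Collect one response. BREADY stays low for b_delay cycles after BVALID is
  // first seen. got is 0 if no response arrives within timeout cycles, or if
  // BVALID was withdrawn before the handshake.
  task automatic get_response(input int b_delay, input int timeout,
                              output bit got, output logic [1:0] resp);
    int waited = 0;
    got  = 1'b0;
    resp = 'x;
    do begin
      @(posedge clk);
      waited++;
    end while (!bvalid && waited < timeout);
    if (!bvalid) return;                 // response never showed up
    repeat (b_delay) @(posedge clk);     // BVALID high, BREADY deliberately low
    bready <= 1'b1;
    @(posedge clk);                      // handshake edge if BVALID was held
    got  = bvalid;
    resp = bresp;
    bready <= 1'b0;
  endtask
endmodule
```

Because requests and responses are separate tasks, they can run in parallel threads: one thread issues two `send_write` calls back to back while another calls `get_response` twice, the first time with a non-zero `b_delay`. Against a slave with the BVALID overwrite, the second `get_response` would be expected to time out with `got == 0`.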
The correct implementation separates AW and W acceptance into independent state bits. Here's the write channel from axil_slave_correct.sv:
```systemverilog
logic aw_pending;  // address latched, waiting for data
logic w_pending;   // data latched, waiting for address

// Response-channel stall: cannot accept new writes while BVALID is high
// and the master hasn't accepted it yet.
logic resp_stall;
assign resp_stall = s_axi_bvalid && !s_axi_bready;

always_ff @(posedge clk or negedge rst_n) begin
  if (!rst_n) begin
    aw_pending    <= 1'b0;
    aw_addr_lat   <= '0;
    s_axi_awready <= 1'b0;
  end else begin
    if (!aw_pending && !resp_stall && s_axi_awvalid) begin
      // Accept the address on its own; WVALID is not consulted.
      aw_pending    <= 1'b1;
      aw_addr_lat   <= s_axi_awaddr;
      s_axi_awready <= 1'b1;
    end else begin
      s_axi_awready <= 1'b0;
      // Both halves have arrived: the write commits this cycle.
      if (aw_pending && w_pending)
        aw_pending <= 1'b0;
    end
  end
end
```
The W channel is symmetric. Each one accepts its channel independently, latches it into aw_addr_lat or w_data_lat, and sets its pending bit. The commit fires only when both bits are set:
```systemverilog
logic write_commit;
assign write_commit = aw_pending && w_pending;
```
Walk through the AW-first case: AWVALID appears at cycle 1, WVALID appears at cycle 2. The slave pulses AWREADY at cycle 1, sets aw_pending, and waits. At cycle 2, WVALID arrives — the slave pulses WREADY, sets w_pending. Now write_commit is high, the register gets updated, BVALID asserts. The master can hold BREADY low for as long as it needs; resp_stall blocks any new acceptance until the response is acknowledged. No data dropped, no response overwritten.
The W-first case runs the same logic in reverse order, arriving at the same correct outcome.
Notice the !resp_stall gate in both the AW and W acceptance paths. This is what prevents the BVALID overwrite. A new write cannot even begin to be accepted while the response channel has an outstanding un-acknowledged BVALID. The pending bits stay clear, AWREADY and WREADY stay de-asserted, and the master has to wait. The spec allows a slave to keep READY low until it is able to accept the transfer, so this is all legal.
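Putting the pieces from this section together, a complete write path might look like the sketch below. Only `aw_pending`, `w_pending`, `resp_stall`, `write_commit`, and the latch names come from the text. The port list, the four-register map, the byte-strobe handling, and the OKAY-only response are assumptions for illustration, and this is not claimed to be the contents of axil_slave_correct.sv.

```systemverilog
// Sketch only. Ports, register map, and strobe handling are assumptions.
module axil_write_path_sketch (
  input  logic        clk,
  input  logic        rst_n,
  input  logic [3:0]  s_axi_awaddr,   // byte address into four 32-bit registers
  input  logic        s_axi_awvalid,
  output logic        s_axi_awready,
  input  logic [31:0] s_axi_wdata,
  input  logic [3:0]  s_axi_wstrb,
  input  logic        s_axi_wvalid,
  output logic        s_axi_wready,
  output logic [1:0]  s_axi_bresp,
  output logic        s_axi_bvalid,
  input  logic        s_axi_bready,
  output logic [31:0] regs [4]
);
  logic        aw_pending, w_pending;
  logic [3:0]  aw_addr_lat;
  logic [31:0] w_data_lat;
  logic [3:0]  w_strb_lat;
  logic        resp_stall, write_commit;

  assign resp_stall   = s_axi_bvalid && !s_axi_bready;
  assign write_commit = aw_pending && w_pending;

  // AW channel: accept the address without looking at WVALID.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      aw_pending    <= 1'b0;
      aw_addr_lat   <= '0;
      s_axi_awready <= 1'b0;
    end else if (!aw_pending && !resp_stall && s_axi_awvalid) begin
      aw_pending    <= 1'b1;
      aw_addr_lat   <= s_axi_awaddr;
      s_axi_awready <= 1'b1;
    end else begin
      s_axi_awready <= 1'b0;
      if (write_commit) aw_pending <= 1'b0;
    end
  end

  // W channel: mirror image, accepts data without looking at AWVALID.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      w_pending    <= 1'b0;
      w_data_lat   <= '0;
      w_strb_lat   <= '0;
      s_axi_wready <= 1'b0;
    end else if (!w_pending && !resp_stall && s_axi_wvalid) begin
      w_pending    <= 1'b1;
      w_data_lat   <= s_axi_wdata;
      w_strb_lat   <= s_axi_wstrb;
      s_axi_wready <= 1'b1;
    end else begin
      s_axi_wready <= 1'b0;
      if (write_commit) w_pending <= 1'b0;
    end
  end

  // Commit and B channel: BVALID rises on commit and falls only on handshake.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      s_axi_bvalid <= 1'b0;
      s_axi_bresp  <= 2'b00;
      for (int i = 0; i < 4; i++) regs[i] <= '0;
    end else if (write_commit) begin
      for (int b = 0; b < 4; b++)
        if (w_strb_lat[b])
          regs[aw_addr_lat[3:2]][8*b +: 8] <= w_data_lat[8*b +: 8];
      s_axi_bvalid <= 1'b1;
      s_axi_bresp  <= 2'b00;  // OKAY
    end else if (s_axi_bvalid && s_axi_bready) begin
      s_axi_bvalid <= 1'b0;
    end
  end
endmodule
```

In this sketch at most one write is in flight: a new AW or W is sampled only when its pending bit is clear and the response channel is not stalled, so `write_commit` can never fire while an earlier BVALID is still waiting.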
Now the question you're probably asking: what does this correctness actually cost?
I synthesized both modules against the sky130 standard-cell library at a 50 MHz target (20 ns period). The numbers:
| Variant | Area (µm²) | Cells | Fmax (MHz) | Critical path (ns) |
|---|---|---|---|---|
| axil_slave_naive | 6,377 | 983 | 437 | 2.29 |
| axil_slave_correct | 7,557 | 1,067 | 270 | 3.70 |
The correct implementation is 18.5% larger by area — 84 extra cells, 1,180 µm². The Fmax drops from 437 MHz to 270 MHz, a 1.41 ns increase in the critical path.
270 MHz versus 437 MHz is a real difference on paper. If you were building a 400 MHz interconnect fabric, this would matter. But AXI4-Lite isn't a high-frequency protocol. It's used for control-plane register access in ARM SoC designs, where the typical target clock is 50-100 MHz, and both of these designs meet timing at 50 MHz with more than a 5x margin (a 3.70 ns worst-case path against a 20 ns period). The timing penalty from adding resp_stall and the pending registers is irrelevant at any clock rate you'd realistically use AXI4-Lite for.
The 84-cell overhead is similarly unalarming. A single 32-bit pipeline register on your data path would cost you a comparable amount. The pending bits are two flip-flops; the stall combinational logic is a handful of gates. On an FPGA, you're looking at one or two LUT-FFs that wouldn't even show up as a utilization delta in Vivado's implementation report.
The read channel is a simpler story and both implementations get it right. AR and R are the only two channels involved: you accept ARVALID, pulse ARREADY, capture the address, then drive RVALID with data and hold it until RREADY accepts it. Because the whole request arrives on a single channel, there's no "arrived in different cycles" problem to handle. You still need to hold RVALID until RREADY fires rather than pulsing it for one cycle, but that's a more obvious requirement to a designer than the simultaneous AW/W problem.
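For comparison with the write side, here is a minimal sketch of a read path that follows that description. The module name, the port list, and the four-register map are assumptions for illustration; neither implementation's actual read logic is shown in this article.

```systemverilog
// Sketch only. Ports and register map are assumptions.
module axil_read_path_sketch (
  input  logic        clk,
  input  logic        rst_n,
  input  logic [3:0]  s_axi_araddr,   // byte address into four 32-bit registers
  input  logic        s_axi_arvalid,
  output logic        s_axi_arready,
  output logic [31:0] s_axi_rdata,
  output logic [1:0]  s_axi_rresp,
  output logic        s_axi_rvalid,
  input  logic        s_axi_rready,
  input  logic [31:0] regs [4]
);
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      s_axi_arready <= 1'b0;
      s_axi_rvalid  <= 1'b0;
      s_axi_rdata   <= '0;
      s_axi_rresp   <= 2'b00;
    end else begin
      // Pulse ARREADY for one cycle, and only when no read data is waiting,
      // so a held response can never be overwritten.
      if (!s_axi_arready && !s_axi_rvalid && s_axi_arvalid)
        s_axi_arready <= 1'b1;
      else
        s_axi_arready <= 1'b0;

      if (s_axi_arready && s_axi_arvalid) begin
        // AR handshake completes on this edge: capture data, raise RVALID.
        s_axi_rvalid <= 1'b1;
        s_axi_rdata  <= regs[s_axi_araddr[3:2]];
        s_axi_rresp  <= 2'b00;  // OKAY
      end else if (s_axi_rvalid && s_axi_rready) begin
        // RVALID drops only after the master has accepted the data.
        s_axi_rvalid <= 1'b0;
      end
    end
  end
endmodule
```

The sketch trades throughput for simplicity: it accepts a new address only after the previous data beat has been taken, which is usually acceptable for control-plane register reads.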
Before you ship any AXI4-Lite slave, run through five checks on the write path:
- AWREADY should not gate on WVALID. If it does, you have the first bug.
- WREADY should not gate on AWVALID. Same bug from the data side.
- Verify that you have separate aw_pending and w_pending state, or equivalent logic that handles both orderings.
- BVALID must stay asserted until BREADY is high. Confirm that the only thing that de-asserts BVALID is the bvalid && bready handshake, not a write-enable signal.
- AWREADY and WREADY must both be gated on !bvalid || bready to prevent new transaction acceptance while a response is outstanding.

If all five pass, your write path is spec-compliant. If one fails, you have the Xilinx pattern, and the bug is already in production on some percentage of your boards.
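The response-side checks can also be turned into assertions that run in every simulation. Below is a sketch of a checker that could be bound to the slave's write interface; the module name, the counter names, and the property labels are assumptions for illustration. It does not cover the checks about how AWREADY and WREADY are generated, since those concern logic inside the slave rather than behavior visible at the pins.

```systemverilog
// Simulation-only checker sketch. Names are assumptions for illustration.
module axil_write_checks_sketch (
  input logic clk,
  input logic rst_n,
  input logic awvalid, awready,
  input logic wvalid,  wready,
  input logic bvalid,  bready
);
  // Completed handshakes per channel.
  int aw_cnt, w_cnt, b_cnt;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      aw_cnt <= 0;
      w_cnt  <= 0;
      b_cnt  <= 0;
    end else begin
      if (awvalid && awready) aw_cnt <= aw_cnt + 1;
      if (wvalid  && wready)  w_cnt  <= w_cnt  + 1;
      if (bvalid  && bready)  b_cnt  <= b_cnt  + 1;
    end
  end

  // Once raised, BVALID must hold until BREADY accepts it.
  a_bvalid_held: assert property (@(posedge clk) disable iff (!rst_n)
    bvalid && !bready |=> bvalid);

  // The same rule on the master side, to catch a broken driver.
  a_awvalid_held: assert property (@(posedge clk) disable iff (!rst_n)
    awvalid && !awready |=> awvalid);
  a_wvalid_held: assert property (@(posedge clk) disable iff (!rst_n)
    wvalid && !wready |=> wvalid);

  // A response may only be offered for a write whose address and data
  // handshakes have both already completed.
  a_b_has_write: assert property (@(posedge clk) disable iff (!rst_n)
    bvalid |-> (b_cnt < aw_cnt) && (b_cnt < w_cnt));

  // At end of test every accepted write must have produced one response.
  // A merged or dropped BVALID shows up here as a count mismatch.
  final begin
    assert (aw_cnt == b_cnt && w_cnt == b_cnt)
      else $error("write/response mismatch: aw=%0d w=%0d b=%0d",
                  aw_cnt, w_cnt, b_cnt);
  end
endmodule
```

The end-of-test comparison is the part aimed at the overwrite bug: two accepted writes followed by a single B handshake leave `b_cnt` one short. It only fires if the testbench actually holds BREADY low at some point, so it complements backpressure stimulus rather than replacing it.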
The fix is two pending bits, one stall signal, and a commit condition. The code above is the whole thing. Default testbench setups rarely apply backpressure, so if you only ever test with Xilinx's VIP or cocotb's AXI driver on their default settings, you will ship this bug and a waveform that proves you didn't.
Want to synthesize both variants and inspect the full waveforms yourself? Try it on Logicode.