· 10 min read
Round-Robin Arbiter in SystemVerilog: Starvation, the Rotating-Mask Fix, and Real Synthesis Numbers
- systemverilog
- arbiter
- rtl
- digital-design
- round-robin
- fpga
Requester 3 hasn't been granted access in 10,000 clock cycles. Not because it stopped asking — it's been asserting req[3] the entire time. The problem is req[0], which is also always asserted and sits at index zero of a fixed-priority arbiter. The carry-chain in the combinational logic sees req[0], grants it, and never evaluates anything below. Every cycle. For as long as both requesters exist. There's no maximum wait time for req[3] — it just never wins.
That's not a pathological edge case. That's exactly what happens in a shared bus when one master has a constant stream of traffic — a DMA engine hammering a memory bus, a high-priority CPU holding a PCIe port. AXI crossbars need an arbiter per slave port to handle multiple masters competing for the same downstream channel. PCIe switch fabrics run one per output port. Network-on-chip routers embed them in each routing node to prevent a single high-traffic flow from monopolizing a link. The code is usually fifty lines of SystemVerilog. The design decision — fixed-priority vs. round-robin — has consequences that compound at runtime.
The fixed-priority implementation is almost embarrassingly simple. Here's the full RTL:
module priority_arbiter #(
parameter int N = 4
) (
input logic clk,
input logic rst_n,
input logic [N-1:0] req,
output logic [N-1:0] grant
);
// Combinational priority encoder implemented as a carry-chain:
// carry[i] = 1 if any req[j], j < i, is asserted.
// grant[i] = req[i] & ~carry[i] (win only if no higher-priority request)
//
// This maps to a ripple-carry chain of OR gates — identical to the
// structure a synthesizer generates from "find-first-set" logic.
logic [N:0] carry;
always_comb begin
carry[0] = 1'b0;
for (int i = 0; i < N; i++) begin
grant[i] = req[i] & ~carry[i];
carry[i+1] = carry[i] | req[i];
end
end
// clk and rst_n are present in the port list to match round_robin_arbiter
// and allow the synthesis tool to treat both modules identically.
// (No sequential logic needed for a stateless fixed-priority arbiter.)
endmodule
The carry chain propagates a "someone lower already won" signal through the loop. carry[0] starts at zero. grant[0] = req[0] & ~0 = req[0]. If req[0] is high, carry[1] becomes 1, which forces grant[1] to zero regardless of req[1]. That logic ripples all the way to index N-1. The synthesizer turns this into a chain of OR gates — no flip-flops, no state, purely combinational. On sky130 standard cells at N=4, synthesis reports 27.5 µm² and 28 cells with a 0.2 ns critical path, which works out to a theoretical 5000 MHz.
The starvation proof is implicit in the structure. grant[i] can only assert when carry[i] is zero, which only happens when every req[j] for j < i is deasserted. If req[0] is permanently high, carry[1] is permanently high, and req[1] through req[3] can never win. The maximum wait time for requester N-1 isn't a function of N — it's unbounded. For a DMA engine running indefinitely, "unbounded" means never.
The rotating-mask technique solves this by maintaining a one-hot state register — last_ptr — that tracks which requester won last cycle. Instead of always starting the priority encode from index zero, it constructs a mask that suppresses every slot at or below last_ptr, then runs the priority encoder on what's left. After processing the higher-indexed requesters, it wraps around. Every active requester gets a grant within N cycles, because the mask visits each slot exactly once before coming back around.
module round_robin_arbiter #(
parameter int N = 4
) (
input logic clk,
input logic rst_n,
input logic [N-1:0] req,
output logic [N-1:0] grant
);
// ---------------------------------------------------------------------------
// State: one-hot pointer to the last-granted requester
// ---------------------------------------------------------------------------
logic [N-1:0] last_ptr;
// ---------------------------------------------------------------------------
// Step 1 – double-wide request vector
// ---------------------------------------------------------------------------
logic [2*N-1:0] double_req;
assign double_req = {req, req};
// ---------------------------------------------------------------------------
// Step 2 – rotating mask
// mask[i] = 1 means slot i is eligible to win this cycle.
// We suppress all bits at index <= k, where k is the index of last_ptr.
// Equivalently, mask = all_ones << (k+1), clamped to 2N bits.
// When last_ptr == 0 (reset, no previous winner), mask = all_ones.
// ---------------------------------------------------------------------------
logic [2*N-1:0] rot_mask;
always_comb begin
rot_mask = {2*N{1'b1}}; // default: no suppression
for (int k = 0; k < N; k++) begin
if (last_ptr[k])
rot_mask = ({2*N{1'b1}} << (k + 1));
end
end
// ---------------------------------------------------------------------------
// Steps 3 & 4 – priority-encode masked window, fold 2N → N
// Priority encoder via carry chain (same structure as priority_arbiter.sv):
// carry[i] propagates "some lower bit was set" so only the first set bit
// in the masked window is selected.
// ---------------------------------------------------------------------------
logic [2*N-1:0] masked_req;
logic [2*N:0] mcarry;
logic [2*N-1:0] masked_grant_2n;
assign masked_req = double_req & rot_mask;
always_comb begin
mcarry[0] = 1'b0;
for (int i = 0; i < 2*N; i++) begin
masked_grant_2n[i] = masked_req[i] & ~mcarry[i];
mcarry[i+1] = mcarry[i] | masked_req[i];
end
end
// ---------------------------------------------------------------------------
// Step 5 – fallback priority encode on the full (unmasked) double vector
// Activated when the masked window contains no requests (wrap-around).
// ---------------------------------------------------------------------------
logic [2*N:0] pcarry;
logic [2*N-1:0] plain_grant_2n;
always_comb begin
pcarry[0] = 1'b0;
for (int i = 0; i < 2*N; i++) begin
plain_grant_2n[i] = double_req[i] & ~pcarry[i];
pcarry[i+1] = pcarry[i] | double_req[i];
end
end
// Fold from 2N → N: OR the upper and lower halves.
// At most one bit is set in each half (priority encoder guarantees one-hot
// in 2N bits), so the OR produces a one-hot N-bit result.
logic [N-1:0] masked_fold;
logic [N-1:0] plain_fold;
logic [N-1:0] grant_next;
// Fold outside always_comb to keep constant part-selects at continuous-
// assignment level (avoids iverilog elaboration warnings on static slices).
assign masked_fold = masked_grant_2n[N-1:0] | masked_grant_2n[2*N-1:N];
assign plain_fold = plain_grant_2n[N-1:0] | plain_grant_2n[2*N-1:N];
assign grant_next = |masked_grant_2n ? masked_fold : plain_fold;
assign grant = grant_next;
// ---------------------------------------------------------------------------
// Step 6 – advance the pointer
// ---------------------------------------------------------------------------
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n)
last_ptr <= '0;
else if (|grant_next)
last_ptr <= grant_next;
end
endmodule
Step 1 concatenates {req, req} to get a 2N-bit vector. The duplication is the key insight — it lets you think of the N requesters as a linear array of 2N slots where slots 0..N-1 and slots N..2N-1 represent the same requesters, so you can express "wrap around from the end back to the beginning" as a contiguous forward scan through the upper half.
Step 2 builds rot_mask. If last_ptr is one-hot at position k, the mask shifts all_ones left by k+1, suppressing bits 0..k. Bits k+1 through 2N-1 stay set, meaning those slots are eligible. When last_ptr is all-zeros (reset state), the default branch applies — no suppression.
Steps 3 and 4 run the same carry-chain priority encoder from priority_arbiter, but on the 8-bit masked_req vector instead of the 4-bit req. Only the first set bit above the suppressed window wins. Then fold the 2N result back to N by OR-ing the upper and lower halves — because the priority encoder guarantees at most one set bit in the 2N output, at most one bit will survive per half, so OR-ing gives a correct one-hot N-bit result.
Step 5 is the wrap-around fallback. If the mask suppresses everything — meaning every pending request is at or below last_ptr, so the "higher priority" window above it is empty — |masked_grant_2n will be zero, and grant_next falls through to plain_fold. The plain encoder on double_req (no mask) finds the lowest-indexed active requester, which is correct: all higher-indexed requesters were already served in the current round, so starting fresh from the lowest is exactly what round-robin should do.
Step 6 is the registered pointer. On each clock where a grant fires, last_ptr latches the winning one-hot grant — those are the 4 flip-flops in the synthesis report.
The sky130 numbers for N=4:
| Metric | Fixed-Priority | Round-Robin | Overhead |
|---|---|---|---|
| Area (µm²) | 27.5 | 270.3 | 882% more |
| Cell count | 28 | 96 | 3.4× |
| Flip-flops | 0 | 4 | N FFs added |
| Critical path (ns) | 0.2 | 1.18 | 5.9× longer |
| Fmax (MHz) | 5000 | 847 | 5.9× slower |
That 882% area overhead sounds alarming until you remember what it buys you. The fixed-priority arbiter's critical path is 0.2 ns because it's four OR gates. The round-robin critical path is 1.18 ns because it runs through the mask computation, the 8-bit carry chain, and the fold — but 847 MHz is still fast enough for nearly any bus application. AXI interfaces in most SoC designs run at 200–500 MHz. The round-robin arbiter is faster than the bus it sits in front of.
The 4 flip-flops are exactly N flip-flops for any value of N — one per requester, because last_ptr is a one-hot register of width N. The combinational logic scales roughly as 2× the priority arbiter at the same N, because you're running two separate carry chains (masked and plain) over a 2N-bit vector plus the fold logic, and that's where most of the 96-vs-28 cell difference comes from.
There are two gotchas here that trip people up in RTL reviews.
The first is reset behavior. At reset, last_ptr is all-zeros. The rot_mask logic's default branch fires — no last_ptr[k] is set, so rot_mask stays all-ones. The masked priority encoder sees double_req = {req, req} unmasked, and the carry chain grants req[0] if it's asserted. So the very first cycle after reset behaves exactly like a fixed-priority arbiter. On the second cycle, last_ptr holds the one-hot for whoever won, the mask suppresses them, and true round-robin begins. This is correct behavior, not a bug — but if you're writing a testbench that checks "req[0] and req[3] both asserted at cycle 1, expect req[3] to win," you're testing the wrong spec. The first cycle is always fixed-priority.
The second: this is a combinational grant output. grant is driven by grant_next, which is a combinational function of req and last_ptr, meaning grant can change mid-cycle if req glitches. For a bus master that samples grant on the clock edge, this is usually fine. But if downstream logic uses grant as a level-sensitive enable — directly gating a FIFO write or a register update — you've got a potential glitch path. The fix is a registered grant: add a flip-flop on the output, drive grant from always_ff, and accept the one-cycle latency between request and grant. The combinational version is faster; the registered version is safer in sloppy integration contexts. Know which one you've implemented before you're on the other side of a timing closure meeting explaining why a FIFO corrupted.
There's also the question of what "wrap-around" means when all four requesters are simultaneously asserting at reset. last_ptr is zero, mask is all-ones, masked priority encoder fires on the unmasked 8-bit vector starting from index 0 — so req[0] wins. Next cycle, last_ptr points to requester 0, mask suppresses bit 0, and req[1] wins. Then 2, then 3. After four cycles, everyone's been served once. Not "probably N cycles" — exactly N, in the worst case of all requesters simultaneously active.
I'd ship round-robin for any shared bus. Yes, the area is 9.8× larger than fixed priority, but the absolute numbers are tiny — 270 µm² on sky130 — and the behavioral guarantee matters enormously at system level. Starvation bugs in bus fabrics are the kind that only manifest under load, pass all unit tests, and then cause a production system to hang under a specific traffic mix nobody thought to test. The ~243 extra µm² of silicon is cheap insurance.
If you want to synthesize this yourself and see the numbers directly, the RTL compiles clean with iverilog and maps straightforwardly to standard cells — try it on Logicode.