← Back to the blog

· 10 min read

Round-Robin Arbiter in SystemVerilog: Starvation, the Rotating-Mask Fix, and Real Synthesis Numbers

  • systemverilog
  • arbiter
  • rtl
  • digital-design
  • round-robin
  • fpga

Requester 3 hasn't been granted access in 10,000 clock cycles. Not because it stopped asking — it's been asserting req[3] the entire time. The problem is req[0], which is also always asserted and sits at index zero of a fixed-priority arbiter. The carry-chain in the combinational logic sees req[0], grants it, and never evaluates anything below. Every cycle. For as long as both requesters exist. There's no maximum wait time for req[3] — it just never wins.

That's not a pathological edge case. That's exactly what happens in a shared bus when one master has a constant stream of traffic — a DMA engine hammering a memory bus, a high-priority CPU holding a PCIe port. AXI crossbars need an arbiter per slave port to handle multiple masters competing for the same downstream channel. PCIe switch fabrics run one per output port. Network-on-chip routers embed them in each routing node to prevent a single high-traffic flow from monopolizing a link. The code is usually fifty lines of SystemVerilog. The design decision — fixed-priority vs. round-robin — has consequences that compound at runtime.

The fixed-priority implementation is almost embarrassingly simple. Here's the full RTL:

module priority_arbiter #(
  parameter int N = 4
) (
  input  logic        clk,
  input  logic        rst_n,
  input  logic [N-1:0] req,
  output logic [N-1:0] grant
);

  // Combinational priority encoder implemented as a carry-chain:
  //   carry[i] = 1 if any req[j], j < i, is asserted.
  //   grant[i] = req[i] & ~carry[i]   (win only if no higher-priority request)
  //
  // This maps to a ripple-carry chain of OR gates — identical to the
  // structure a synthesizer generates from "find-first-set" logic.
  logic [N:0] carry;

  always_comb begin
    carry[0] = 1'b0;
    for (int i = 0; i < N; i++) begin
      grant[i]    = req[i] & ~carry[i];
      carry[i+1]  = carry[i] | req[i];
    end
  end

  // clk and rst_n are present in the port list to match round_robin_arbiter
  // and allow the synthesis tool to treat both modules identically.
  // (No sequential logic needed for a stateless fixed-priority arbiter.)

endmodule

The carry chain propagates a "someone lower already won" signal through the loop. carry[0] starts at zero. grant[0] = req[0] & ~0 = req[0]. If req[0] is high, carry[1] becomes 1, which forces grant[1] to zero regardless of req[1]. That logic ripples all the way to index N-1. The synthesizer turns this into a chain of OR gates — no flip-flops, no state, purely combinational. On sky130 standard cells at N=4, synthesis reports 27.5 µm² and 28 cells with a 0.2 ns critical path, which works out to a theoretical 5000 MHz.

The starvation proof is implicit in the structure. grant[i] can only assert when carry[i] is zero, which only happens when every req[j] for j < i is deasserted. If req[0] is permanently high, carry[1] is permanently high, and req[1] through req[3] can never win. The maximum wait time for requester N-1 isn't a function of N — it's unbounded. For a DMA engine running indefinitely, "unbounded" means never.

The rotating-mask technique solves this by maintaining a one-hot state register — last_ptr — that tracks which requester won last cycle. Instead of always starting the priority encode from index zero, it constructs a mask that suppresses every slot at or below last_ptr, then runs the priority encoder on what's left. After processing the higher-indexed requesters, it wraps around. Every active requester gets a grant within N cycles, because the mask visits each slot exactly once before coming back around.

module round_robin_arbiter #(
  parameter int N = 4
) (
  input  logic        clk,
  input  logic        rst_n,
  input  logic [N-1:0] req,
  output logic [N-1:0] grant
);

  // ---------------------------------------------------------------------------
  // State: one-hot pointer to the last-granted requester
  // ---------------------------------------------------------------------------
  logic [N-1:0] last_ptr;

  // ---------------------------------------------------------------------------
  // Step 1 – double-wide request vector
  // ---------------------------------------------------------------------------
  logic [2*N-1:0] double_req;
  assign double_req = {req, req};

  // ---------------------------------------------------------------------------
  // Step 2 – rotating mask
  // mask[i] = 1  means slot i is eligible to win this cycle.
  // We suppress all bits at index <= k, where k is the index of last_ptr.
  // Equivalently, mask = all_ones << (k+1), clamped to 2N bits.
  // When last_ptr == 0 (reset, no previous winner), mask = all_ones.
  // ---------------------------------------------------------------------------
  logic [2*N-1:0] rot_mask;

  always_comb begin
    rot_mask = {2*N{1'b1}};           // default: no suppression
    for (int k = 0; k < N; k++) begin
      if (last_ptr[k])
        rot_mask = ({2*N{1'b1}} << (k + 1));
    end
  end

  // ---------------------------------------------------------------------------
  // Steps 3 & 4 – priority-encode masked window, fold 2N → N
  // Priority encoder via carry chain (same structure as priority_arbiter.sv):
  //   carry[i] propagates "some lower bit was set" so only the first set bit
  //   in the masked window is selected.
  // ---------------------------------------------------------------------------
  logic [2*N-1:0] masked_req;
  logic [2*N:0]   mcarry;
  logic [2*N-1:0] masked_grant_2n;

  assign masked_req = double_req & rot_mask;

  always_comb begin
    mcarry[0] = 1'b0;
    for (int i = 0; i < 2*N; i++) begin
      masked_grant_2n[i] = masked_req[i] & ~mcarry[i];
      mcarry[i+1]        = mcarry[i] | masked_req[i];
    end
  end

  // ---------------------------------------------------------------------------
  // Step 5 – fallback priority encode on the full (unmasked) double vector
  // Activated when the masked window contains no requests (wrap-around).
  // ---------------------------------------------------------------------------
  logic [2*N:0]   pcarry;
  logic [2*N-1:0] plain_grant_2n;

  always_comb begin
    pcarry[0] = 1'b0;
    for (int i = 0; i < 2*N; i++) begin
      plain_grant_2n[i] = double_req[i] & ~pcarry[i];
      pcarry[i+1]       = pcarry[i] | double_req[i];
    end
  end

  // Fold from 2N → N: OR the upper and lower halves.
  // At most one bit is set in each half (priority encoder guarantees one-hot
  // in 2N bits), so the OR produces a one-hot N-bit result.
  logic [N-1:0] masked_fold;
  logic [N-1:0] plain_fold;
  logic [N-1:0] grant_next;

  // Fold outside always_comb to keep constant part-selects at continuous-
  // assignment level (avoids iverilog elaboration warnings on static slices).
  assign masked_fold = masked_grant_2n[N-1:0] | masked_grant_2n[2*N-1:N];
  assign plain_fold  = plain_grant_2n[N-1:0]  | plain_grant_2n[2*N-1:N];

  assign grant_next  = |masked_grant_2n ? masked_fold : plain_fold;

  assign grant = grant_next;

  // ---------------------------------------------------------------------------
  // Step 6 – advance the pointer
  // ---------------------------------------------------------------------------
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)
      last_ptr <= '0;
    else if (|grant_next)
      last_ptr <= grant_next;
  end

endmodule

Step 1 concatenates {req, req} to get a 2N-bit vector. The duplication is the key insight — it lets you think of the N requesters as a linear array of 2N slots where slots 0..N-1 and slots N..2N-1 represent the same requesters, so you can express "wrap around from the end back to the beginning" as a contiguous forward scan through the upper half.

Step 2 builds rot_mask. If last_ptr is one-hot at position k, the mask shifts all_ones left by k+1, suppressing bits 0..k. Bits k+1 through 2N-1 stay set, meaning those slots are eligible. When last_ptr is all-zeros (reset state), the default branch applies — no suppression.

Steps 3 and 4 run the same carry-chain priority encoder from priority_arbiter, but on the 8-bit masked_req vector instead of the 4-bit req. Only the first set bit above the suppressed window wins. Then fold the 2N result back to N by OR-ing the upper and lower halves — because the priority encoder guarantees at most one set bit in the 2N output, at most one bit will survive per half, so OR-ing gives a correct one-hot N-bit result.

Step 5 is the wrap-around fallback. If the mask suppresses everything — meaning every pending request is at or below last_ptr, so the "higher priority" window above it is empty — |masked_grant_2n will be zero, and grant_next falls through to plain_fold. The plain encoder on double_req (no mask) finds the lowest-indexed active requester, which is correct: all higher-indexed requesters were already served in the current round, so starting fresh from the lowest is exactly what round-robin should do.

Step 6 is the registered pointer. On each clock where a grant fires, last_ptr latches the winning one-hot grant — those are the 4 flip-flops in the synthesis report.

The sky130 numbers for N=4:

Metric Fixed-Priority Round-Robin Overhead
Area (µm²) 27.5 270.3 882% more
Cell count 28 96 3.4×
Flip-flops 0 4 N FFs added
Critical path (ns) 0.2 1.18 5.9× longer
Fmax (MHz) 5000 847 5.9× slower

That 882% area overhead sounds alarming until you remember what it buys you. The fixed-priority arbiter's critical path is 0.2 ns because it's four OR gates. The round-robin critical path is 1.18 ns because it runs through the mask computation, the 8-bit carry chain, and the fold — but 847 MHz is still fast enough for nearly any bus application. AXI interfaces in most SoC designs run at 200–500 MHz. The round-robin arbiter is faster than the bus it sits in front of.

The 4 flip-flops are exactly N flip-flops for any value of N — one per requester, because last_ptr is a one-hot register of width N. The combinational logic scales roughly as 2× the priority arbiter at the same N, because you're running two separate carry chains (masked and plain) over a 2N-bit vector plus the fold logic, and that's where most of the 96-vs-28 cell difference comes from.

There are two gotchas here that trip people up in RTL reviews.

The first is reset behavior. At reset, last_ptr is all-zeros. The rot_mask logic's default branch fires — no last_ptr[k] is set, so rot_mask stays all-ones. The masked priority encoder sees double_req = {req, req} unmasked, and the carry chain grants req[0] if it's asserted. So the very first cycle after reset behaves exactly like a fixed-priority arbiter. On the second cycle, last_ptr holds the one-hot for whoever won, the mask suppresses them, and true round-robin begins. This is correct behavior, not a bug — but if you're writing a testbench that checks "req[0] and req[3] both asserted at cycle 1, expect req[3] to win," you're testing the wrong spec. The first cycle is always fixed-priority.

The second: this is a combinational grant output. grant is driven by grant_next, which is a combinational function of req and last_ptr, meaning grant can change mid-cycle if req glitches. For a bus master that samples grant on the clock edge, this is usually fine. But if downstream logic uses grant as a level-sensitive enable — directly gating a FIFO write or a register update — you've got a potential glitch path. The fix is a registered grant: add a flip-flop on the output, drive grant from always_ff, and accept the one-cycle latency between request and grant. The combinational version is faster; the registered version is safer in sloppy integration contexts. Know which one you've implemented before you're on the other side of a timing closure meeting explaining why a FIFO corrupted.

There's also the question of what "wrap-around" means when all four requesters are simultaneously asserting at reset. last_ptr is zero, mask is all-ones, masked priority encoder fires on the unmasked 8-bit vector starting from index 0 — so req[0] wins. Next cycle, last_ptr points to requester 0, mask suppresses bit 0, and req[1] wins. Then 2, then 3. After four cycles, everyone's been served once. Not "probably N cycles" — exactly N, in the worst case of all requesters simultaneously active.

I'd ship round-robin for any shared bus. Yes, the area is 9.8× larger than fixed priority, but the absolute numbers are tiny — 270 µm² on sky130 — and the behavioral guarantee matters enormously at system level. Starvation bugs in bus fabrics are the kind that only manifest under load, pass all unit tests, and then cause a production system to hang under a specific traffic mix nobody thought to test. The ~243 extra µm² of silicon is cheap insurance.

If you want to synthesize this yourself and see the numbers directly, the RTL compiles clean with iverilog and maps straightforwardly to standard cells — try it on Logicode.