· 8 min read
CRC-32 in Hardware: Why the Serial LFSR Fails at Line Rate and How the Parallel XOR Tree Fixes It
- crc32
- systemverilog
- fpga
- ethernet
- hardware-design
- digital-logic
The serial CRC-32 LFSR synthesizes to 1,295 µm² and runs at 575 MHz — and it's completely useless for Gigabit Ethernet. One bit per clock at 125 MHz is 125 Mb/s, eight times too slow for a 1 Gb/s link. Every tutorial I've seen teaches you the serial form first, which is fine for understanding the math, but then stops there as if the job is done. It isn't, and the gap matters: the parallel 8-bit version isn't a different algorithm or a black-box generator output — it's the same LFSR recurrence unrolled into combinational logic, and once you see that, you don't need to copy equations from the internet ever again.
CRC-32 is polynomial long division over GF(2) — the field where addition is XOR and multiplication is AND. You treat your data as a polynomial, divide it by the generator polynomial, and keep the remainder. That remainder is your checksum. The generator polynomial for Ethernet (IEEE 802.3), PCIe, SATA, gzip, and PNG is the same one: degree 32, often written as 0x04C11DB7 in its normal form. Here is where almost every tutorial quietly lies by omission.
Ethernet transmits bits LSB-first on the wire. That reversal propagates all the way into the hardware implementation — you need the bit-reflected form of the polynomial, which is 0xEDB88320. If you implement a left-shifting LFSR with 0x04C11DB7, your hardware computes a mathematically valid CRC-32, just not the one that Ethernet, gzip, or every other tool in existence expects. You'll get a completely wrong FCS, and you'll spend a morning staring at Wireshark wondering if your FPGA is haunted. Use 0xEDB88320 and shift right.
The initial register value is 0xFFFFFFFF. After all data bytes are processed, you XOR the register one more time with 0xFFFFFFFF to get the final FCS. Both of those inversions are specified in IEEE 802.3 clause 3.2.9 and exist to catch runs of leading or trailing zero bytes that would otherwise be invisible to a naive CRC.
The serial module looks like this:
module crc32_serial (
input logic clk,
input logic rst_n,
input logic valid, // process data_bit this cycle
input logic data_bit, // one payload bit, LSB of byte first
output logic [31:0] crc // running CRC state; XOR with 32'hFFFF_FFFF for FCS
);
localparam logic [31:0] POLY = 32'hEDB8_8320;
// Feedback: XOR the incoming bit with the current LSB of the shift register.
// Because this is the bit-reflected (right-shifting) LFSR, tapping crc[0] and
// shifting right matches the bit ordering that Ethernet hardware uses.
logic fb;
assign fb = crc[0] ^ data_bit;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n)
crc <= 32'hFFFF_FFFF; // prescribed initial value for IEEE 802.3
else if (valid)
crc <= {1'b0, crc[31:1]} ^ (fb ? POLY : 32'h0);
end
endmodule
Mechanically: every clock where valid is high, you XOR bit 0 of the current register with the incoming data bit. If that result is 1, shift right and XOR in the polynomial; if it's 0, shift right and XOR in zero. Sky130 synthesis comes out to 1,295 µm², 256 cells, and 575 MHz with a 1.74 ns critical path — the timing is dominated by flip-flop setup plus a couple of XOR stages, about as fast as this structure can be.
The problem isn't the frequency. The problem is that 575 MHz × 1 bit/cycle = 575 Mb/s, and that's the absolute ceiling of what this module can ever do. In practice you'll be running at 125 MHz with an 8-bit memory bus feeding you one byte per clock, which means you either stall for 7 cycles after each byte — destroying your MAC pipeline — or you accept that this design simply doesn't fit the interface. Gigabit Ethernet needs 8 bits into the CRC engine on every 125 MHz cycle. The serial LFSR delivers one.
The fix comes from a property of GF(2) arithmetic: the operations are linear, so you can compose eight serial steps algebraically. To make that concrete: after one serial step, bit j of the new CRC state is an XOR of exactly two inputs — crc[j+1] (the shifted-in bit) plus fb & POLY[j] (the feedback contribution from the polynomial tap at position j). After two steps, you substitute the first step's result into the second, and bit j is now an XOR of three or four original CRC bits and data bits. By step eight, each of the 32 output bits is an XOR of somewhere between 10 and 20 bits drawn from the original 32 CRC bits and the 8 incoming data bits — the exact set depends on which polynomial taps overlap which shift positions across the eight unrolled steps. The generate loop encodes this derivation procedurally rather than forcing you to precompute and hardwire all 32 equations by hand. Instead of computing crc_next one feedback bit at a time, you compute what crc will look like after eight steps and collapse the whole thing into combinational logic between the registers — no extra pipeline stage, no stall, one byte per clock:
module crc32_parallel (
input logic clk,
input logic rst_n,
input logic valid, // process data_byte this cycle
input logic [7:0] data_byte, // one payload byte, bit 0 = first serial bit
output logic [31:0] crc // running CRC state; XOR with 32'hFFFF_FFFF for FCS
);
localparam logic [31:0] POLY = 32'hEDB8_8320;
// Eight intermediate CRC states: s[0] = current crc, s[8] = next crc.
// Each stage is one serial step: right-shift then conditionally XOR POLY.
// Synthesis unrolls these 8 assign statements into an XOR tree — no registers,
// just combinational logic between the flip-flops.
wire [31:0] s [0:8];
wire fb [0:7];
genvar i;
assign s[0] = crc;
generate
for (i = 0; i < 8; i = i + 1) begin : step
assign fb[i] = s[i][0] ^ data_byte[i];
assign s[i+1] = {1'b0, s[i][31:1]} ^ (fb[i] ? POLY : 32'h0);
end
endgenerate
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n)
crc <= 32'hFFFF_FFFF;
else if (valid)
crc <= s[8];
end
endmodule
The generate loop writes out eight assign statements. Each describes one serial step — shift right, conditional XOR — but they're all combinational wires connecting s[0] through s[8], no intermediate registers. The synthesizer sees eight levels of XOR logic, applies constant propagation and Boolean optimization across all eight stages simultaneously, flattens the result into a two-level XOR tree where each output bit is driven directly from a subset of the input CRC and data bits, and the whole chain settles to a stable s[8] before the next clock edge. This is exactly what a CRC generator from an online tool produces — it just pre-computes the XOR expressions analytically rather than deriving them through the loop. I prefer the loop because it's obviously correct and reads like what it is.
Sky130 synthesis on both variants:
| Variant | Area (µm²) | Cells | fmax (MHz) | Critical path (ns) | Throughput |
|---|---|---|---|---|---|
crc32_serial |
1,295 | 256 | 575 | 1.74 | 575 Mb/s |
crc32_parallel |
1,873 | 421 | 413 | 2.42 | 3.3 Gb/s |
The parallel form is 45% larger in area and 39% slower in max frequency — the XOR tree adds about 0.68 ns of combinational delay beyond the flip-flop path. That's the cost. The gain is 8× the data throughput per clock cycle: at 125 MHz you get exactly 1 Gb/s, and at 413 MHz you can push 3.3 Gb/s before timing closes. For anything running a GigE MAC or a PCIe framer, the parallel version is the only one you'd ship.
The 45% area delta deserves a second look because 45% sounds alarming until you see where it comes from. The serial module has 32 data flip-flops plus roughly 15 XOR gates — one per polynomial tap, there are 14 set bits in 0xEDB88320, plus one XOR gate for the fb feedback bit itself. The parallel module has the same 32 flip-flops plus an XOR tree encoding 8 composed steps, each touching up to 20-odd bits of the state — that tree comes out to 165 additional cells. At 1,873 µm², the whole thing is a rounding error in any design that has memory or serious arithmetic. The tradeoff is not symmetric: you pay a small area tax and get an 8× throughput multiplier.
You can verify both modules against the standard test vector: feed the ASCII bytes for "123456789" (0x31 through 0x39) into a register initialized to 0xFFFFFFFF, process each byte either one bit at a time with crc32_serial or one byte at a time with crc32_parallel, then XOR the result with 0xFFFFFFFF. Both give 0xCBF43926 — the canonical value from Koopman & Chakravarty's 2004 CMU paper on CRC polynomial selection, and the same result every CRC-32 tool on the internet expects. The serial path takes 72 clocks to process all nine bytes; the parallel path takes nine. Same output, different throughput. I simulated both; both pass.
Wiring crc32_parallel into a GigE MAC is straightforward: connect it to the byte-wide data path, tie valid to the data-valid strobe, and when the frame ends deassert valid, read crc ^ 32'hFFFFFFFF, and append the four-byte FCS — LSByte first, as Ethernet specifies. The same structure generalizes: swap to a 16-bit polynomial for USB packet CRC-16, keep the full 32 bits for SATA FIS frames or PCIe TLP and DLLP payloads. Different polynomials, same generate-loop derivation, same one-byte-per-clock result.
Want to paste in a variant and see the sky130 area and fmax immediately? Logicode runs the synthesis for you.