UART Transmitter and Receiver in SystemVerilog: Mid-Bit Sampling, Metastability, and 8N1 by the Numbers
- uart
- systemverilog
- fpga
- serial-communication
- rtl-design
- state-machine
The most common mistake in the UART receiver tutorials I've encountered isn't the state machine; those are all fine. It's a single decision: leaving the START state for DATA at the mid-bit false-start check, with the oversample counter reset to zero at that point. That one early transition puts every data bit sample half a bit period early, which means you're reading the transition edge instead of the stable center. Most simulations still pass because they use ideal waveforms with zero rise time and no clock drift. Move to real hardware, add a few nanoseconds of cable skew, bump the baud rate to 115200, and your data is corrupted in a way that's genuinely hard to trace back to a counter reset.
This article builds a complete, synthesized, simulation-verified 8N1 UART in SystemVerilog that gets the oversampling right and adds the two-FF metastability synchronizer that many tutorial-level examples skip. The full TX+RX design synthesizes to 3540.9 μm² across 763 cells with a 2.03 ns critical path and an fmax of 492.6 MHz on sky130 standard cells, which is plenty of headroom for a 50 MHz FPGA clock.
A UART frame is 10 bits: one start bit (logic 0), eight data bits sent LSB first, and one stop bit (logic 1). The line idles high. Both ends must agree on the baud rate in advance because there is no clock signal. The receiver can tolerate its sample point slipping by up to ±50% of one bit period, which sounds generous until you realize the error accumulates across all 10 bits. At 115200 baud on a 50 MHz clock, CLKS_PER_BIT = 434 (the exact value is 434.03, truncated), so the transmitted bit period is short by about 0.006%, which adds up to less than a third of a clock over the frame. The larger error comes from the second rounding in CLKS_PER_OVS: 434 / 16 = 27 with a remainder of 2, so the receiver counts a bit as 16 × 27 = 432 clocks instead of 434. That is about 0.46% per bit, or roughly 20 clocks of drift over the full 10-bit frame, which is about 4.6% of a bit period and still well within the ±50% window. Each oversample tick drifts slightly, but the sample at tick 7 of each bit still lands comfortably near the center.
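As a quick check of that arithmetic, the divisors can be recomputed with the same integer division the RTL uses. This is only a sketch: the module name `baud_budget_check` and the localparams beyond CLKS_PER_BIT and CLKS_PER_OVS are illustrative and not part of the design.

```systemverilog
// Sketch: recompute the baud divisors and the receiver's accumulated drift.
// All names except CLKS_PER_BIT / CLKS_PER_OVS are hypothetical helpers.
module baud_budget_check;
  localparam int CLK_FREQ_HZ = 50_000_000;
  localparam int BAUD_RATE   = 115_200;
  localparam int OVERSAMPLE  = 16;
  localparam int FRAME_BITS  = 10;

  localparam int CLKS_PER_BIT    = CLK_FREQ_HZ / BAUD_RATE;      // 434 (434.03 truncated)
  localparam int CLKS_PER_OVS    = CLKS_PER_BIT / OVERSAMPLE;    // 27 (remainder 2 is dropped)
  localparam int RX_BIT_CLKS     = CLKS_PER_OVS * OVERSAMPLE;    // 432 clocks per bit as the RX counts it
  localparam int DRIFT_PER_BIT   = CLKS_PER_BIT - RX_BIT_CLKS;   // 2 clocks
  localparam int DRIFT_PER_FRAME = DRIFT_PER_BIT * FRAME_BITS;   // 20 clocks
  // Drift in tenths of a percent of one bit period: 46, i.e. about 4.6%
  localparam int DRIFT_PERMILLE  = (DRIFT_PER_FRAME * 1000) / CLKS_PER_BIT;

  initial begin
    $display("CLKS_PER_BIT=%0d CLKS_PER_OVS=%0d drift=%0d clocks (%0d per mille of a bit)",
             CLKS_PER_BIT, CLKS_PER_OVS, DRIFT_PER_FRAME, DRIFT_PERMILLE);
    // The sample point must stay within half a bit period of the center.
    assert (2 * DRIFT_PER_FRAME < CLKS_PER_BIT)
      else $error("accumulated drift exceeds half a bit period");
  end
endmodule
```

Changing CLK_FREQ_HZ or BAUD_RATE here is a cheap way to see whether an unusual clock/baud pair still leaves enough margin before touching the RTL.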
The TX side is the simpler half. Four states — IDLE, START, DATA, STOP — driven by a 16-bit baud counter that fires every CLKS_PER_BIT clocks. In IDLE the line is held high and tx_ready is asserted. A byte is accepted on the cycle where tx_valid && tx_ready are both true (AXI-Stream handshake), the data gets latched into a shift register, tx_ready drops, and the machine enters START. The start bit holds the line low for one full bit period, then DATA shifts out shift_reg[bit_idx] for bits 0 through 7, LSB first. STOP drives the line high for one bit period, then returns to IDLE and reasserts tx_ready.
module uart_tx #(
parameter int CLK_FREQ_HZ = 50_000_000,
parameter int BAUD_RATE = 9_600,
parameter int CLKS_PER_BIT = CLK_FREQ_HZ / BAUD_RATE // 5208 @ default
) (
input logic clk,
input logic rst_n,
// AXI-S-style handshake: byte accepted on the cycle where valid & ready
input logic tx_valid,
input logic [7:0] tx_data,
output logic tx_ready,
// serial output (idle-high)
output logic tx_serial
);
typedef enum logic [2:0] {
TX_IDLE = 3'd0,
TX_START = 3'd1,
TX_DATA = 3'd2,
TX_STOP = 3'd3
} tx_state_t;
tx_state_t state;
logic [15:0] clk_cnt; // baud-period counter (fits CLKS_PER_BIT <= 65535)
logic [2:0] bit_idx; // data bit index (0..7)
logic [7:0] shift_reg; // data being clocked out
// One pulse per baud period
logic baud_tick;
assign baud_tick = (clk_cnt == (CLKS_PER_BIT - 1));
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
state <= TX_IDLE;
clk_cnt <= '0;
bit_idx <= '0;
shift_reg <= '0;
tx_serial <= 1'b1;
tx_ready <= 1'b1;
end else begin
case (state)
TX_IDLE: begin
tx_serial <= 1'b1;
tx_ready <= 1'b1;
clk_cnt <= '0;
bit_idx <= '0;
if (tx_valid && tx_ready) begin
shift_reg <= tx_data;
tx_ready <= 1'b0;
state <= TX_START;
end
end
TX_START: begin
tx_serial <= 1'b0; // start bit (logic 0)
if (baud_tick) begin
clk_cnt <= '0;
state <= TX_DATA;
end else begin
clk_cnt <= clk_cnt + 1'b1;
end
end
TX_DATA: begin
tx_serial <= shift_reg[bit_idx]; // LSB first
if (baud_tick) begin
clk_cnt <= '0;
if (bit_idx == 3'd7) begin
bit_idx <= '0;
state <= TX_STOP;
end else begin
bit_idx <= bit_idx + 1'b1;
end
end else begin
clk_cnt <= clk_cnt + 1'b1;
end
end
TX_STOP: begin
tx_serial <= 1'b1; // stop bit (logic 1)
if (baud_tick) begin
clk_cnt <= '0;
tx_ready <= 1'b1;
state <= TX_IDLE;
end else begin
clk_cnt <= clk_cnt + 1'b1;
end
end
default: begin
state <= TX_IDLE;
tx_serial <= 1'b1;
tx_ready <= 1'b1;
end
endcase
end
end
endmodule
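One way to convince yourself the transmitter produces a proper 8N1 frame is a small self-checking testbench that samples tx_serial at the middle of each bit. The sketch below is not the article's testbench: the module name `uart_tx_tb`, the shortened divisor of 16, and the `expected`/`seen` variables are assumptions chosen to keep the run short.

```systemverilog
`timescale 1ns/1ps
// Sketch: send one byte through uart_tx and compare the 10 mid-bit samples
// against the expected frame. CLKS_PER_BIT is shortened for simulation speed.
module uart_tx_tb;
  localparam int CLKS_PER_BIT = 16;

  logic       clk = 1'b0;
  logic       rst_n = 1'b0;
  logic       tx_valid = 1'b0;
  logic [7:0] tx_data = '0;
  logic       tx_ready, tx_serial;
  logic [9:0] expected, seen;

  uart_tx #(.CLKS_PER_BIT(CLKS_PER_BIT)) dut (.*);

  always #10 clk = ~clk;  // 50 MHz

  initial begin
    // Bit 0 is the start bit, bits 1..8 are data (LSB first), bit 9 is the stop bit.
    expected = {1'b1, 8'hA5, 1'b0};

    repeat (2) @(negedge clk);
    rst_n = 1'b1;

    // Present the byte for one clock; tx_ready is high out of reset.
    @(negedge clk);
    tx_data  = 8'hA5;
    tx_valid = 1'b1;
    @(negedge clk);
    tx_valid = 1'b0;

    // Align to the middle of the start bit, then step one bit period at a time.
    @(negedge tx_serial);
    repeat (CLKS_PER_BIT / 2) @(posedge clk);
    for (int i = 0; i < 10; i++) begin
      seen[i] = tx_serial;
      if (i < 9) repeat (CLKS_PER_BIT) @(posedge clk);
    end

    assert (seen == expected)
      else $error("frame mismatch: saw %b, expected %b", seen, expected);

    // One bit period after the middle of the stop bit the TX should be idle again.
    repeat (CLKS_PER_BIT) @(posedge clk);
    assert (tx_ready && tx_serial)
      else $error("transmitter did not return to idle");
    $finish;
  end
endmodule
```

Stimulus changes on the falling clock edge so that it never races the DUT's rising-edge sampling.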
The RX side uses 16x oversampling: each bit period is divided into 16 oversample ticks, where one tick equals CLKS_PER_OVS = CLKS_PER_BIT / 16 clocks. The receiver has two counters: clk_cnt, which counts raw clock cycles from 0 to CLKS_PER_OVS - 1 and fires an ovs_tick pulse on the last one, and ovs_cnt, which counts ticks 0 through 15 within the current bit period.
The state machine has five states: IDLE, START, DATA, STOP, DONE. IDLE waits for rx_sync to go low (the start bit's falling edge). On that edge, it enters START with both counters at zero.
What separates the correct design from the broken majority is what the START state actually does with those 16 ticks. It counts all 16 oversample ticks of the start bit. At tick 7 (the mid-point, with HALF_OVS defined as 8 and the check at ovs_cnt == HALF_OVS - 1) it samples for a false start — if rx_sync is already back to 1, the falling edge was a glitch and the machine returns to IDLE silently. Otherwise it keeps counting. At tick 15 (end of the start bit), it resets ovs_cnt to zero and enters DATA.
Because ovs_cnt is reset to zero at the end of the start bit, the DATA state begins with a fresh counter starting at 0. D0 will therefore be sampled at its own tick 7, the true center of the first data bit, one and a half bit periods after the start bit's falling edge. Every subsequent data bit inherits the same alignment.
The broken version that most examples ship: reset ovs_cnt to zero at tick 7 of the START state (right after the false-start check), then immediately enter DATA. Now DATA begins half a bit period too early, in the middle of the start bit. The first sample of D0 fires at tick 7 of a counter that was cleared at the start bit's mid-point, which corresponds to tick 15 of the start bit period: the boundary between the start bit and D0 rather than the center of D0. You're reading the transition region. The bit values you get depend entirely on how fast your signal edges are.
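To make the alignment concrete, it helps to count oversample tick periods from the detected falling edge. The sketch below is a simplified model, not part of the design: `rx_sample_pkg` and `sample_tick` are hypothetical names, and the model ignores the two-cycle synchronizer latency. It relies on the fact that the ovs_cnt == HALF_OVS - 1 check fires on the eighth ovs_tick after the counter was cleared.

```systemverilog
// Sketch: where does the sample of data bit n land, in tick periods from the edge?
package rx_sample_pkg;
  localparam int OVERSAMPLE = 16;
  localparam int HALF_OVS   = OVERSAMPLE / 2;

  // start_exit_ticks: tick periods spent in START before entering DATA.
  // In DATA, each bit is sampled HALF_OVS tick periods after ovs_cnt was cleared.
  function automatic int sample_tick(input int start_exit_ticks, input int n);
    return start_exit_ticks + n * OVERSAMPLE + HALF_OVS;
  endfunction
endpackage

module rx_sample_demo;
  import rx_sample_pkg::*;
  initial begin
    // Data bit n spans ticks (n+1)*16 to (n+2)*16; its center is (n+1)*16 + 8.
    $display("leave START after 16 ticks: D0 sampled at tick %0d (D0 center is %0d)",
             sample_tick(OVERSAMPLE, 0), OVERSAMPLE + HALF_OVS);
    $display("leave START after  8 ticks: D0 sampled at tick %0d (D0 begins at %0d)",
             sample_tick(HALF_OVS, 0), OVERSAMPLE);
    // With the full 16-tick START state, every data bit is sampled at its center.
    for (int n = 0; n < 8; n++)
      assert (sample_tick(OVERSAMPLE, n) == (n + 1) * OVERSAMPLE + HALF_OVS);
  end
endmodule
```

Leaving START after 16 ticks puts the D0 sample at tick 24, its center. Leaving after 8 ticks puts it at tick 16, exactly where the line switches from the start bit to D0.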
module uart_rx #(
parameter int CLK_FREQ_HZ = 50_000_000,
parameter int BAUD_RATE = 9_600,
parameter int CLKS_PER_BIT = CLK_FREQ_HZ / BAUD_RATE // 5208
) (
input logic clk,
input logic rst_n,
// asynchronous serial input (idle-high; from external source)
input logic rx_serial,
// received byte and status
output logic rx_valid, // pulses one cycle when rx_data is ready
output logic [7:0] rx_data,
output logic rx_frame_err // pulses one cycle on framing error (stop != 1)
);
localparam int OVERSAMPLE = 16;
localparam int CLKS_PER_OVS = CLKS_PER_BIT / OVERSAMPLE;
localparam int HALF_OVS = OVERSAMPLE / 2; // 8: mid-bit sample point
// -------------------------------------------------------
// 2-FF metastability synchronizer on the asynchronous RX pin.
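// ASYNC_REG tells Vivado these FFs form a synchronizer chain: keep
// them intact and place them close together. It does not by itself
// create a timing exception for the asynchronous input path.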
// -------------------------------------------------------
(* ASYNC_REG = "TRUE" *) logic rx_sync_ff0, rx_sync_ff1;
logic rx_sync;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
rx_sync_ff0 <= 1'b1;
rx_sync_ff1 <= 1'b1;
end else begin
rx_sync_ff0 <= rx_serial;
rx_sync_ff1 <= rx_sync_ff0;
end
end
assign rx_sync = rx_sync_ff1;
typedef enum logic [2:0] {
RX_IDLE = 3'd0,
RX_START = 3'd1,
RX_DATA = 3'd2,
RX_STOP = 3'd3,
RX_DONE = 3'd4
} rx_state_t;
rx_state_t state;
logic [15:0] clk_cnt;
logic [3:0] ovs_cnt;
logic [2:0] bit_idx;
logic [7:0] shift_reg;
logic ovs_tick;
assign ovs_tick = (clk_cnt == (CLKS_PER_OVS - 1));
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
state <= RX_IDLE;
clk_cnt <= '0;
ovs_cnt <= '0;
bit_idx <= '0;
shift_reg <= '0;
rx_valid <= 1'b0;
rx_data <= '0;
rx_frame_err <= 1'b0;
end else begin
rx_valid <= 1'b0;
rx_frame_err <= 1'b0;
case (state)
RX_IDLE: begin
clk_cnt <= '0;
ovs_cnt <= '0;
bit_idx <= '0;
if (!rx_sync) begin
state <= RX_START;
end
end
RX_START: begin
if (ovs_tick) begin
clk_cnt <= '0;
if (ovs_cnt == (HALF_OVS - 1)) begin
if (rx_sync != 1'b0) begin
state <= RX_IDLE;
ovs_cnt <= '0;
end else begin
ovs_cnt <= ovs_cnt + 1'b1;
end
end else if (ovs_cnt == (OVERSAMPLE - 1)) begin
ovs_cnt <= '0;
state <= RX_DATA;
end else begin
ovs_cnt <= ovs_cnt + 1'b1;
end
end else begin
clk_cnt <= clk_cnt + 1'b1;
end
end
RX_DATA: begin
if (ovs_tick) begin
clk_cnt <= '0;
if (ovs_cnt == (HALF_OVS - 1)) begin
shift_reg <= {rx_sync, shift_reg[7:1]};
end
if (ovs_cnt == (OVERSAMPLE - 1)) begin
ovs_cnt <= '0;
if (bit_idx == 3'd7) begin
bit_idx <= '0;
state <= RX_STOP;
end else begin
bit_idx <= bit_idx + 1'b1;
end
end else begin
ovs_cnt <= ovs_cnt + 1'b1;
end
end else begin
clk_cnt <= clk_cnt + 1'b1;
end
end
RX_STOP: begin
if (ovs_tick) begin
clk_cnt <= '0;
if (ovs_cnt == (HALF_OVS - 1)) begin
if (rx_sync != 1'b1) begin
rx_frame_err <= 1'b1;
end
end
if (ovs_cnt == (OVERSAMPLE - 1)) begin
ovs_cnt <= '0;
state <= RX_DONE;
end else begin
ovs_cnt <= ovs_cnt + 1'b1;
end
end else begin
clk_cnt <= clk_cnt + 1'b1;
end
end
RX_DONE: begin
rx_valid <= 1'b1;
rx_data <= shift_reg;
state <= RX_IDLE;
end
default: state <= RX_IDLE;
endcase
end
end
endmodule
The shift register accumulates bits with shift_reg <= {rx_sync, shift_reg[7:1]}: the new bit gets inserted at the MSB and the existing bits slide right. Since UART sends LSB first, D0 arrives first and gets pushed all the way down to shift_reg[0] by the time D7 arrives. After 8 bits, shift_reg[7] holds D7 and shift_reg[0] holds D0. The byte is correct without any bit-reversal step.
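The bit ordering is easy to check in isolation. This minimal sketch (the module name `rx_shift_order_demo` and the constant SENT are illustrative) feeds the bits of a byte LSB first through the same concatenation and confirms the byte comes out unchanged.

```systemverilog
// Sketch: shifting LSB-first data in at the MSB reassembles the original byte.
module rx_shift_order_demo;
  localparam logic [7:0] SENT = 8'hA5;
  logic [7:0] shift_reg;

  initial begin
    shift_reg = '0;
    // Bit i of SENT is the i-th bit on the wire (LSB first).
    for (int i = 0; i < 8; i++)
      shift_reg = {SENT[i], shift_reg[7:1]};

    assert (shift_reg == SENT)
      else $error("bit order wrong: got %h, expected %h", shift_reg, SENT);
    $display("sent=%h received=%h", SENT, shift_reg);
  end
endmodule
```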
The RX serial input is asynchronous — it has no timing relationship with the FPGA clock whatsoever. Connecting it directly to a flip-flop's data input means that input can change arbitrarily close to a clock edge, violating setup and hold time. When that happens, the flip-flop can enter a metastable state where its output is neither a clean 0 nor a clean 1 and can take an arbitrarily long time to resolve. If it resolves to the wrong value without a synchronizer in place, that corrupt output feeds directly into your state machine and the machine will do something unpredictable.
The fix is two flip-flops in series, both clocked by the same clock, with the (* ASYNC_REG = "TRUE" *) attribute on both. The first FF takes the hit: it may go metastable. The second FF samples it one cycle later, by which point the first FF has had nearly a full clock period to resolve. The ASYNC_REG attribute tells Vivado that these registers form a synchronizer chain, so it keeps them from being optimized away and places them close together, minimizing routing delay between them and maximizing the resolution time. It does not by itself remove the path from rx_serial to rx_sync_ff0 from static timing analysis; that path has no meaningful timing requirement, so it is normally covered by a separate timing exception in the constraints. Cummings' CDC paper works through the MTBF math: with a 50 MHz clock and moderate switching activity, a 2-FF synchronizer gives MTBF on the order of thousands of years. You'll never see a metastability failure in practice with this in place.
Omitting the synchronizer, which I've seen in roughly half the UART examples I've looked at, including the widely-referenced Nandland implementation, is not a theoretical risk. It's a real failure mode that becomes more likely on fast clocks and at high baud rates: a faster clock leaves less time for a metastable output to resolve, and a higher baud rate means more input edges per second that can land near a clock edge.
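If you need the same protection on other asynchronous inputs, the synchronizer can be pulled out into its own module. The sketch below is one possible shape, not something the design above uses: the name `sync_ff` and the STAGES and RESET_VALUE parameters are assumptions. Timing constraints for the asynchronous input are tool-specific and not shown.

```systemverilog
// Sketch: parameterized N-stage synchronizer for a single asynchronous bit.
module sync_ff #(
  parameter int STAGES      = 2,     // number of flip-flops in the chain (>= 2)
  parameter bit RESET_VALUE = 1'b1   // idle level of the input (1 for a UART line)
) (
  input  logic clk,
  input  logic rst_n,
  input  logic async_in,
  output logic sync_out
);
  // Vivado-specific hint: treat the chain as synchronizer registers.
  (* ASYNC_REG = "TRUE" *) logic [STAGES-1:0] sync_chain;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)
      sync_chain <= {STAGES{RESET_VALUE}};
    else
      // async_in enters at bit 0 and moves one stage up per clock.
      sync_chain <= {sync_chain[STAGES-2:0], async_in};
  end

  // Only the last stage is safe to use in downstream logic.
  assign sync_out = sync_chain[STAGES-1];

  initial assert (STAGES >= 2) else $error("sync_ff needs at least 2 stages");
endmodule
```

With the defaults, `sync_ff u_sync (.clk, .rst_n, .async_in(rx_serial), .sync_out(rx_sync));` would behave like the two flip-flops written out in uart_rx.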
The top-level wrapper connects TX and RX and passes parameters through:
module uart #(
parameter int CLK_FREQ_HZ = 50_000_000,
parameter int BAUD_RATE = 9_600,
parameter int CLKS_PER_BIT = CLK_FREQ_HZ / BAUD_RATE
) (
input logic clk,
input logic rst_n,
input logic tx_valid,
input logic [7:0] tx_data,
output logic tx_ready,
output logic tx_serial,
input logic rx_serial,
output logic rx_valid,
output logic [7:0] rx_data,
output logic rx_frame_err
);
uart_tx #(
.CLK_FREQ_HZ (CLK_FREQ_HZ),
.BAUD_RATE (BAUD_RATE),
.CLKS_PER_BIT (CLKS_PER_BIT)
) u_tx (
.clk (clk), .rst_n (rst_n),
.tx_valid (tx_valid), .tx_data (tx_data),
.tx_ready (tx_ready), .tx_serial (tx_serial)
);
uart_rx #(
.CLK_FREQ_HZ (CLK_FREQ_HZ),
.BAUD_RATE (BAUD_RATE),
.CLKS_PER_BIT (CLKS_PER_BIT)
) u_rx (
.clk (clk), .rst_n (rst_n),
.rx_serial (rx_serial),
.rx_valid (rx_valid), .rx_data (rx_data),
.rx_frame_err(rx_frame_err)
);
endmodule
The testbench wires tx_serial directly back to rx_serial for loopback testing. It sends 0xA5, waits for 12 bit periods, sends 0x3C, then waits 30 more bit periods and checks both received bytes against what was sent. Running it at 115200 baud on a 50 MHz clock (CLKS_PER_BIT = 434) keeps simulation time manageable while exercising all ten bits of each frame through the full state machine path.
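A loopback testbench along those lines might look like the following. This is a reconstruction from the description rather than the actual testbench: the names `uart_loopback_tb`, `send_byte`, `wait_bits`, `received`, and `frame_errors` are assumptions.

```systemverilog
`timescale 1ns/1ps
// Sketch: loop tx_serial back to rx_serial, send two bytes, check what arrives.
module uart_loopback_tb;
  localparam int CLK_FREQ_HZ  = 50_000_000;
  localparam int BAUD_RATE    = 115_200;
  localparam int CLKS_PER_BIT = CLK_FREQ_HZ / BAUD_RATE;  // 434

  logic       clk = 1'b0;
  logic       rst_n = 1'b0;
  logic       tx_valid = 1'b0;
  logic [7:0] tx_data = '0;
  logic       tx_ready, rx_valid, rx_frame_err;
  logic [7:0] rx_data;
  logic       serial;               // TX output wired straight to RX input
  logic [7:0] received [$];         // bytes seen on rx_valid pulses
  int         frame_errors = 0;

  uart #(.CLK_FREQ_HZ(CLK_FREQ_HZ), .BAUD_RATE(BAUD_RATE)) dut (
    .clk, .rst_n, .tx_valid, .tx_data, .tx_ready,
    .tx_serial(serial), .rx_serial(serial),
    .rx_valid, .rx_data, .rx_frame_err
  );

  always #10 clk = ~clk;  // 50 MHz

  // Collect every received byte and count framing errors.
  always @(posedge clk) begin
    if (rx_valid)     received.push_back(rx_data);
    if (rx_frame_err) frame_errors++;
  end

  // Hold tx_valid for one clock once the transmitter is ready.
  task automatic send_byte(input logic [7:0] b);
    @(negedge clk);
    while (!tx_ready) @(negedge clk);
    tx_data  = b;
    tx_valid = 1'b1;
    @(negedge clk);
    tx_valid = 1'b0;
  endtask

  task automatic wait_bits(input int n);
    repeat (n * CLKS_PER_BIT) @(posedge clk);
  endtask

  initial begin
    repeat (4) @(negedge clk);
    rst_n = 1'b1;

    send_byte(8'hA5);
    wait_bits(12);
    send_byte(8'h3C);
    wait_bits(30);

    assert (received.size() == 2 && received[0] == 8'hA5 && received[1] == 8'h3C)
      else $error("loopback mismatch: got %0d byte(s)", received.size());
    assert (frame_errors == 0)
      else $error("%0d framing error(s)", frame_errors);
    $finish;
  end
endmodule
```

Stimulus is driven on the falling clock edge so the handshake is never racing the DUT's rising-edge logic.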
Synthesis results on sky130 standard-cell, full TX+RX:
| Module | Area (μm²) | Cells | fmax (MHz) | Critical path (ns) |
|---|---|---|---|---|
| uart (TX+RX) | 3540.9 | 763 | 492.6 | 2.03 |
492.6 MHz fmax against a 50 MHz target is nearly 10× frequency margin, so no path in the design comes close to limiting a 50 MHz clock. The 2.03 ns critical path likely runs through the baud counter comparator logic. For a design that will only ever run at 50 MHz, you could fold the TX and RX clk_cnt registers into a single counter and save some area, but at 763 cells there's no reason to.
The two things that break most UART receivers: resetting the oversample counter at the wrong time (the counter must run all the way to tick 15 in the START state before entering DATA, not tick 7), and skipping the synchronizer on the RX input. The baud quantization error is genuinely not worth worrying about at standard baud rates with a 50 MHz clock: roughly 4.6% of a bit period accumulated over ten bits still leaves about 45% of margin on each side. If you want to cover fractional divisors for unusual clock/baud combinations, Bresenham-style baud generation is the right direction, but for everything from 9600 to 115200 on a 50 MHz system clock the integer divisor is fine.
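For reference, a Bresenham-style tick generator can be sketched with an accumulator that adds the target tick rate every clock and subtracts the clock frequency whenever it overflows. This is not part of the design above; the module name `baud_gen_frac` and its parameters are assumptions, and it requires TICK_RATE_HZ to be lower than CLK_FREQ_HZ.

```systemverilog
// Sketch: fractional tick generator. Over any CLK_FREQ_HZ clocks it emits
// exactly TICK_RATE_HZ ticks, with at most one clock of jitter per tick.
module baud_gen_frac #(
  parameter int CLK_FREQ_HZ  = 50_000_000,
  parameter int TICK_RATE_HZ = 115_200      // use 16 * baud for a 16x oversample tick
) (
  input  logic clk,
  input  logic rst_n,
  output logic tick
);
  localparam int ACC_W = $clog2(CLK_FREQ_HZ);          // acc holds 0 .. CLK_FREQ_HZ - 1
  localparam logic [ACC_W:0] STEP    = TICK_RATE_HZ;
  localparam logic [ACC_W:0] MODULUS = CLK_FREQ_HZ;

  logic [ACC_W-1:0] acc;
  logic [ACC_W:0]   sum;                               // one extra bit for the overflow

  assign sum = {1'b0, acc} + STEP;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      acc  <= '0;
      tick <= 1'b0;
    end else if (sum >= MODULUS) begin
      acc  <= ACC_W'(sum - MODULUS);                   // keep the remainder, so no error is lost
      tick <= 1'b1;
    end else begin
      acc  <= ACC_W'(sum);
      tick <= 1'b0;
    end
  end
endmodule
```

Because the remainder is carried forward instead of discarded, the rounding error never accumulates across bits the way it does with a truncated integer divisor.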