← Back to the blog

· 10 min read

Carry-lookahead isn't the fast adder you were sold

  • digital-design
  • systemverilog
  • adders
  • carry-lookahead
  • rtl-synthesis
  • timing

At 16 bits, on a sky130 standard-cell flow, the carry-lookahead adder runs at 245 MHz and the ripple-carry runs at 161. That's the whole textbook story confirmed in one line: lookahead is 52% faster. It costs you 7% more area — 473 cells of equivalent area against 440 — and for that you buy a critical path that drops from 6.2 ns to 4.08. If you only ever build 16-bit adders, stop reading, because at that width carry-lookahead simply wins and you should ship it.

But the number that should bother you isn't the 52%, it's the 7%, and the reason it should bother you is that 7% area for 52% speed is a steal — the textbook promised me something close to that scaling. CLA is O(log n) delay against ripple's O(n), so at 16 bits that gap should already be yawning open, and the area cost should be the only thing standing between me and free speed. And the synthesized area gap is almost nothing, which means the carry logic isn't expensive yet. The interesting question is what happens when it gets expensive, and the answer is the reason no production datapath you've ever used is a flat wide CLA.

Let me back up to where this starts. A ripple-carry adder is the most honest circuit in digital design: one full adder per bit, carry-out of each bit wired straight into carry-in of the next. The sum bit is two XORs, the carry is an AND-OR, and there's nothing clever or hidden anywhere in it.

module ripple_carry_adder #(
    parameter int WIDTH = 16
) (
    input  logic [WIDTH-1:0] a,
    input  logic [WIDTH-1:0] b,
    input  logic             cin,
    output logic [WIDTH-1:0] sum,
    output logic             cout
);
    logic [WIDTH:0] carry;
    assign carry[0] = cin;

    genvar i;
    generate
        for (i = 0; i < WIDTH; i++) begin : g_fa
            // full adder: sum and carry built from gates the way an RCA does it.
            assign sum[i]     = a[i] ^ b[i] ^ carry[i];
            assign carry[i+1] = (a[i] & b[i]) | (carry[i] & (a[i] ^ b[i]));
        end
    endgenerate

    assign cout = carry[WIDTH];
endmodule

Look at carry[i+1]. It depends on carry[i]. Which depends on carry[i-1]. Which depends on the one below it, all the way down. The critical path is carry walking the entire width — carry[0] to carry[16], fifteen full adders deep — and the synthesizer can't fix that. It's not an implementation choice it can optimize around. It's the data dependency you wrote down. 6.2 ns of critical path at 16 bits is that carry chain, end to end. Double the width and you roughly double the path. That's the O(n) you were warned about.

The carry-lookahead idea is to refuse to wait. For every bit you compute two signals up front: propagate, p = a ^ b (this bit passes an incoming carry through), and generate, g = a & b (this bit makes a carry no matter what comes in). Once you have p and g for every bit, the carry into any position is a flat boolean function of the p's and g's below it and the original carry-in. No rippling — every carry computed in parallel.

In theory, anyway. In theory every carry pops out at once and the whole add is two gate delays and a sum XOR. In practice the word "flat boolean function" is doing an enormous amount of quiet, expensive work, and you only see how much once you write the equations down and stare at the fan-in. So let's write them down. Here's the actual textbook 16-bit build — four 4-bit lookahead blocks stitched together by a second-level lookahead carry unit. Watch the carry equations, not the loop.

module carry_lookahead_adder #(
    parameter int WIDTH = 16  // fixed to 16 here: 4 blocks of 4 bits
) (
    input  logic [WIDTH-1:0] a,
    input  logic [WIDTH-1:0] b,
    input  logic             cin,
    output logic [WIDTH-1:0] sum,
    output logic             cout
);
    // Per-bit propagate / generate.
    logic [WIDTH-1:0] p, g;
    assign p = a ^ b;
    assign g = a & b;

    // Block-level group propagate / generate (one per 4-bit block).
    logic [3:0] PG, GG;
    // Carry into each block, produced by the lookahead carry unit.
    logic [3:0] bcin;

    genvar blk;
    generate
        for (blk = 0; blk < 4; blk++) begin : g_blk
            localparam int B = blk * 4;

            // Intra-block lookahead carries: each carry is a flat function of
            // the block carry-in and the p/g bits below it. The fan-in climbs
            // term by term -- c3 inside the block is a 4-deep AND-OR.
            logic [3:0] c;  // c[k] = carry into bit B+k
            assign c[0] = bcin[blk];
            assign c[1] = g[B+0] | (p[B+0] & bcin[blk]);
            assign c[2] = g[B+1] | (p[B+1] & g[B+0])
                                 | (p[B+1] & p[B+0] & bcin[blk]);
            assign c[3] = g[B+2] | (p[B+2] & g[B+1])
                                 | (p[B+2] & p[B+1] & g[B+0])
                                 | (p[B+2] & p[B+1] & p[B+0] & bcin[blk]);

            // Sums are cheap once the carries exist.
            assign sum[B+3:B] = p[B+3:B] ^ c;

            // Group propagate/generate for this 4-bit block.
            assign PG[blk] = &p[B+3:B];
            assign GG[blk] = g[B+3]
                           | (p[B+3] & g[B+2])
                           | (p[B+3] & p[B+2] & g[B+1])
                           | (p[B+3] & p[B+2] & p[B+1] & g[B+0]);
        end
    endgenerate

    // Second-level lookahead carry unit: same AND-OR shape, one level up. This
    // is the extra gate level a flat CLA pretends it doesn't pay for.
    assign bcin[0] = cin;
    assign bcin[1] = GG[0] | (PG[0] & cin);
    assign bcin[2] = GG[1] | (PG[1] & GG[0])
                           | (PG[1] & PG[0] & cin);
    assign bcin[3] = GG[2] | (PG[2] & GG[1])
                           | (PG[2] & PG[1] & GG[0])
                           | (PG[2] & PG[1] & PG[0] & cin);

    // Carry-out of the whole adder, top of the LCU.
    assign cout = GG[3] | (PG[3] & GG[2])
                        | (PG[3] & PG[2] & GG[1])
                        | (PG[3] & PG[2] & PG[1] & GG[0])
                        | (PG[3] & PG[2] & PG[1] & PG[0] & cin);
endmodule

Count the terms in c[3]. Four product terms, the widest of them a 4-input AND, all OR'd together. Now GG[blk] — the group-generate — same shape, four products feeding one OR. Now cout at the bottom: five product terms, the widest a 5-input AND. Every single carry you compute "in parallel" is a sum-of-products whose fan-in climbs term by term as you walk up the bit positions, so the cheap, one-wire carry at the bottom of a block and the fat AND-OR tree at the top of it are the same idea charged at wildly different prices. The bottom carry of a block is barely a wire while the top carry is a whole tree, and that's exactly why the block is four bits wide and not sixteen — it's the thing the O(log n) handwave hides.

A flat 16-bit CLA — one level, every carry a direct function of all sixteen p/g pairs — would need product terms up to 17 inputs wide, and no standard cell has a 17-input gate. So the synthesizer decomposes it into a tree of smaller gates, and now your "one gate level" is several, and the fan-out on the low-order p and g signals is enormous, because they feed into every carry above them. Fan-in and fan-out are not free. They're delay and they're area. So nobody builds a flat wide CLA. You build the hierarchy you just read: 4-bit blocks, then a lookahead carry unit on top to produce the block carry-ins.

And that LCU is another gate level. The "log n" in O(log n) is literally counting these levels — each level of lookahead is a stage, and a wider adder needs more stages. A 64-bit CLA isn't two gate delays. It's a tree of 4-bit blocks under two or three tiers of lookahead units, and every tier adds the same fan-in-heavy AND-OR you saw in cout, plus the wiring to carry group signals up and block carries back down — and all of that compounds.

Here's what that does to the curve. At 16 bits the hierarchy is shallow — four blocks, one LCU on top — so the overhead is tiny, and you get the 52% / 7% trade I opened with.

adder area (sky130) cells f_max critical path
ripple-carry 440.4 196 161 MHz 6.2 ns
carry-lookahead 473.0 228 245 MHz 4.08 ns

Now scale it. The ripple-carry critical path grows linearly and predictably — every bit is one more full adder of delay, and you can read the trend straight off these two designs. The CLA's advantage does not grow to match. Add a third tier of lookahead, watch the top-level carry equations widen, watch the fan-out on the shared group-propagate signals climb, watch wiring start to dominate. The lookahead logic doesn't get simpler as the adder gets wider. It gets fatter. And the area that was a flat 7% at 16 bits stops being flat. Ripple area is linear in width. Full CLA area runs off the top of the chart. The speedup you measured at 16 bits is the best case, not the trend.

So no, I would not reach for a flat wide carry-lookahead at 32 or 64 bits, and neither does anyone shipping a real datapath. They reach for carry-select, which builds two ripple sums per block — one assuming carry-in 0, one assuming 1 — and muxes the right one once the real block carry arrives, trading area for a carry that hops blocks instead of bits. Or they go parallel-prefix, which is CLA's idea done as a clean log-depth tree of associative carry operators. Kogge-Stone is the famous one: minimum logic depth, low fan-out at every stage, fast — and it pays in area and a wiring mess that gets congested at width. Brent-Kung goes the other way, far less area and wiring, at the cost of higher fan-out and a couple more logic levels. Han-Carlson splits the difference, Kogge-Stone in the middle, Brent-Kung at the edges. Which one wins is a real area-speed-power decision tied to your width and your cell library, and "it depends" is the honest answer — but the choice is between those, not between ripple and flat CLA. Flat CLA past about a nibble is a thing you learn, not a thing you build.

One more wrinkle, and it flips the whole thing, and it's the one that actually bites people: none of this holds on an FPGA. FPGAs have a dedicated hard carry chain wired between adjacent logic cells — silicon built specifically to make ripple-carry fast. The fabric will route a + b straight down that chain at speeds your soft lookahead logic can't touch, and your hand-built CLA just burns LUTs re-implementing in slow soft logic what the carry chain does for free. On an FPGA, write assign sum = a + b, let the synthesizer hit the carry primitive, and walk away. On an ASIC the carry chain doesn't exist and the area-timing curves above are what you actually get, so pick by target. Building a CLA on an FPGA because a textbook called it "the fast adder" is how you ship something slower than the plus sign.

So here's the honest summary, the thing the O(log n) marketing leaves out. Lookahead is a real, measurable win at narrow widths, a shrinking win as the hierarchy deepens, and a design you should retire entirely once carry-select and parallel-prefix exist. The textbook isn't wrong. It's just selling you the 16-bit case as if it were the asymptote.

Want to build both of these, push them through synthesis, and see the area and f_max numbers for yourself instead of taking mine? Try it on Logicode.