Back to blog

The Wallet That Couldn't Count: Scaling Transactions the Hard Way

How we went from race conditions to sub-millisecond wallet updates - a journey through mistakes, over-engineering, and finally letting the database do its job.

Octopus EngineeringDecember 10, 20254 min read

Our wallet system is fast, correct, and runs on a single PostgreSQL function that executes in under a millisecond. We're pretty proud of it.

But it didn't start that way. There's a special kind of dread that comes with realising your wallet system can't count. Not "off by a penny" can't count - "a client just made three purchases simultaneously and two of them went through on the same balance" can't count. You're running a cards platform. Money moves through your system. Real money. Client money.

We operate under the same constraints as a bank. If a client has £10,000 and spends £3,000, that balance had better be exactly £7,000. Not eventually. Not usually. Always. This is how we got there - and like the rest of our stack, the answer turned out to be the most boring one imaginable.

The Read-Update-Write

The simplest thing that could possibly work:

func (r *WalletRepo) Debit(ctx context.Context, walletID int64, amount decimal.Decimal) error {
    tx, err := r.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer func() { _ = tx.Rollback() }()

    var balance decimal.Decimal
    err = tx.QueryRowContext(ctx,
        "SELECT balance FROM wallets WHERE id = $1", walletID,
    ).Scan(&balance)
    if err != nil {
        return err
    }

    newBalance := balance.Sub(amount)
    if newBalance.IsNegative() {
        return errors.InsufficientFunds
    }

    _, err = tx.ExecContext(ctx,
        "UPDATE wallets SET balance = $1 WHERE id = $2",
        newBalance, walletID,
    )
    if err != nil {
        return err
    }

    return tx.Commit()
}

Shipped it. Worked beautifully in testing. Then traffic happened.

Two requests arrive at the same time. Both start a transaction and read the balance as £10,000. Both subtract £3,000. Both write £7,000. The client spent £6,000 but was only charged £3,000. We'd invented a money printer.

What worked: Dead simple. Fast. Good enough at low concurrency.

What didn't: Fundamentally broken under concurrent load. No atomicity between the read and write - that gap is where money disappears. No audit trail either.

We needed a better solution. FAST!

The Append-Only Ledger

So we over-engineered it. Instead of a mutable balance, we'd keep an append-only ledger - every transaction gets a new row, and the balance is computed by reading every entry. We were emulating addAndGet of atomic operations at the application layer, with a full scan on every operation.

func (r *WalletRepo) Debit(ctx context.Context, walletID int64, amount decimal.Decimal) error {
    // Application-level lock - no two goroutines touch the same wallet
    unlock := r.lockManager.Lock(walletID)
    defer unlock()

    tx, err := r.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer func() { _ = tx.Rollback() }()

    // Read ALL ledger entries for this wallet
    rows, err := tx.QueryContext(ctx,
        "SELECT amount, type FROM wallet_ledger WHERE wallet_id = $1",
        walletID,
    )
    if err != nil {
        return err
    }
    defer func() { _ = rows.Close() }()

    // Manually compute balance by looping through every entry
    balance := decimal.Zero
    for rows.Next() {
        var entryAmount decimal.Decimal
        var entryType string
        if err := rows.Scan(&entryAmount, &entryType); err != nil {
            return err
        }
        if entryType == "credit" {
            balance = balance.Add(entryAmount)
        } else {
            balance = balance.Sub(entryAmount)
        }
    }

    if balance.Sub(amount).IsNegative() {
        return errors.InsufficientFunds
    }

    _, err = tx.ExecContext(ctx,
        `INSERT INTO wallet_ledger (wallet_id, amount, type)
         VALUES ($1, $2, 'debit')`,
        walletID, amount,
    )
    if err != nil {
        return err
    }

    return tx.Commit()
}

Look at that code. We're acquiring an application-level mutex, pulling every ledger row into Go memory, iterating one by one, doing arithmetic that SQL can do natively, and holding a lock the entire time.

What worked: Race conditions gone. Complete audit trail. Auditors loved it.

What didn't: Application-level locks held connections open while Go did arithmetic. Every operation scanned the entire ledger - O(n) and growing. p99 climbed to 200-500ms. We were emulating atomicity at the wrong layer, reinventing what PostgreSQL already does natively.

The ledger wasn't wrong. Our implementation was just in the wrong place and orders of magnitude slower.

The Stored Procedure

We'll be honest - we resisted this. Stored procedures have a reputation problem. They feel like something your dad's enterprise team wrote in 2004 at a company where deployments involved a change advisory board and a three-week lead time. Thousand-line PL/SQL monsters buried in Oracle databases that nobody dares touch because the one person who understood them retired in 2017.

But biases are always worth questioning. When we looked at what we actually needed - read one row, do arithmetic, insert one row, update one row - it was embarrassingly obvious that this is what stored procedures were made for. Sometimes a tool survives for decades because it's genuinely good at what it does.

CREATE OR REPLACE FUNCTION wallet_debit(
    p_wallet_id BIGINT, p_amount NUMERIC, p_reference TEXT
) RETURNS NUMERIC AS $$
DECLARE
    v_balance NUMERIC;
BEGIN
    -- Row-level lock on just this wallet
    SELECT balance INTO v_balance
    FROM wallets WHERE id = p_wallet_id
    FOR UPDATE;

    -- validate, compute new balance ...

    -- Still keep the full audit trail
    INSERT INTO wallet_ledger (wallet_id, amount, type, reference)
    VALUES (p_wallet_id, p_amount, 'debit', p_reference);

    -- update balance, return ...
END;
$$ LANGUAGE plpgsql;

The magic is FOR UPDATE - a row-level lock on just the wallet being modified. The procedure reads the balance, validates it, appends a ledger entry, updates the wallet, and returns the new balance. All inside one transaction, entirely within the database process. No round trips.

The Go side collapsed to almost nothing:

func (r *WalletRepo) Debit(ctx context.Context, walletID int64, amount decimal.Decimal, ref string) (decimal.Decimal, error) {
    tx, err := r.db.BeginTx(ctx, nil)
    if err != nil {
        return decimal.Zero, err
    }
    defer func() { _ = tx.Rollback() }()

    var newBalance decimal.Decimal
    err = tx.QueryRowContext(ctx,
        "SELECT wallet_debit($1, $2, $3)",
        walletID, amount, ref,
    ).Scan(&newBalance)
    if err != nil {
        return decimal.Zero, err
    }

    return newBalance, tx.Commit()
}

The key insight was FOR UPDATE - a row-level lock on just the wallet being modified. Not a table lock. Not an application mutex. Wallet A never blocks wallet B. The entire operation - lock, validate, ledger insert, balance update - runs in under a millisecond, entirely inside the database.

What worked: ~1ms execution. True atomicity. Row-level locking. Full ledger audit trail. The Go code is trivial.

What didn't: Honestly? Nothing yet. We keep waiting for the catch.

The Numbers

ApproachCorrectnessp99 LatencyConcurrency
Read-Update-WriteRace conditions~5msBroken under load
Append-Only LedgerCorrect~200-500msSequential per wallet
DB ProcedureCorrect~1msParallel across wallets

Boring Is Still Beautiful

We wrote a whole post about choosing boring technology. This wallet journey might be the best example of that philosophy in action.

Too simple - ignored the problem. Too clever - solved it in the wrong place. The answer was a plpgsql function that would have been unremarkable in 2005. Nothing about it would get upvotes on Hacker News. But it runs in under a millisecond, it's been in production for months without a single incident, and when a new engineer asks "how do wallet transactions work?" we point them at one SQL function and one Go method.

We didn't need a distributed lock service, event sourcing, CQRS, or a saga orchestrator. We needed PostgreSQL to do what it's been doing since before most of our team started programming. We handle money - not page views, not recommendations. When a client loads £100,000 into their wallet, that number has to be sacred. A stored procedure running in ~1ms gives us not just speed, but confidence. The confidence to pass an audit without crossing our fingers.

Sometimes the best engineering is getting out of the way and letting proven, boring technology do its thing.


This is part of our engineering series. Read about how we built the platform with boring technology, or learn how digital gift cards work from the user's perspective.

License

This article is licensed under CC BY-NC-SA 4.0. You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial — You may not use the material for commercial purposes.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Buy what this post is about

Ready to get a card?

Browse Octopus Cards on Driffle — gaming top-ups, mobile recharges, and travel eSIMs at a discount, delivered instantly.

Related posts