Breaking Our Own System on Purpose: Mocks, Chaos, and Fuzz

There's a particular flavour of dread that comes with being on the other end of someone else's outage.

You're sitting at your desk, sipping coffee, watching the dashboards. Everything is green. And then one of your vendors - some upstream system you don't own and can't fix - decides today is the day they're going to return a 200 OK with the body <html>504 Gateway Timeout</html>. Or worse: they just hold the connection open for 90 seconds and then quietly drop it. Or, our personal favourite, they return a perfectly valid JSON response with a field that used to be a string and is now an integer because someone shipped a "small refactor" on a Friday.

We integrate with dozens of external vendors. Gift card suppliers. Mobile top-up providers. eSIM aggregators. Each one with their own API quirks, their own definition of "uptime," and their own preferred way of breaking. On any given day, at least one of them is having a bad time. The trick is making sure their bad day doesn't become ours - or yours.

This is the story of how we test that. It's a love letter to three of our favourite tools: mocks, chaos, and fuzz. Like the rest of our stack, they're boring in the best possible way.

The Problem: We Don't Own the Other End

Most testing advice assumes you control both sides of the wire. Write a unit test, mock the database, assert the function returns the right thing. Easy.

Now imagine your "database" is a third-party REST API in another country, owned by another company, with a documentation page last updated in 2022 and a support inbox that responds in 4-7 business days. Imagine you have forty of those, each with their own auth scheme, their own retry semantics, and their own opinions about what HTTP status codes mean.

You can't pin them. You can't replay them. You can't even trust them. The contract is a moving target, and the only feedback loop is real money flowing through real customer orders.

We needed a way to test the system as if every vendor were actively trying to ruin our day. Not because we think they are - they're lovely people, mostly - but because assuming good behaviour is how you wake up to a Slack alert at 4 AM.

Layer One: Mocks That Actually Look Like the Real Thing

The first instinct when testing external integrations is to mock them. The second instinct is to do it badly. We've made every mistake here.

We started with the classic: a Go interface that wraps the vendor client, plus a fake implementation in tests that returns hardcoded responses. It worked. For about three days. Then we hit the predictable wall - our tests passed beautifully, our mocks returned exactly what we told them to, and production failed the moment a vendor returned a response shape our mock never imagined.

Mocks at the function boundary lie. They let you assert that your code is consistent with itself, which is a profoundly useless property.

So we built mocky-balboa - an HTTP-level mock server that sits where the vendor would. Our code makes real HTTP requests, hits real URLs, parses real JSON. The only difference is the server on the other end is one we control.

// In our test setup:
//   - Code under test is configured to point at http://localhost:8788
//   - mocky-balboa serves canned responses that match the vendor's real schema
//   - Every test hits the same network stack production uses

vendor := vouchers.NewCardVendor(vouchers.Config{
    BaseURL: testhelpers.MockyBaseURL(),
    APIKey:  "test-key",
})

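order, err := vendor.PlaceOrder(ctx, req)
if err != nil {
    t.Fatalf("PlaceOrder against mocky-balboa failed: %v", err)
}
// ...then assert on the fields of order, exactly as for a real vendor.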

The seed data inside mocky-balboa is copied from real vendor responses we've captured. When a vendor changes their schema, we update mocky-balboa, and the failures show up in CI - not in a customer's order.

This sounds obvious. It is obvious. We're including it because we know exactly how many teams skip this step and pay for it later. If your mocks live below the HTTP layer, you're not testing the integration. You're testing your imagination of the integration.
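To make the HTTP-level idea concrete, here is a minimal sketch using only Go's standard library. It is not mocky-balboa: the `orderResponse` shape, the `/v2/orders` route, and the helper names are assumptions chosen for illustration. The point is that the code under test sends a real request and decodes a real body, so a canned response with the wrong field type fails the same way production would.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// orderResponse is a hypothetical vendor response shape, for illustration only.
type orderResponse struct {
	OrderID string `json:"order_id"`
	Status  string `json:"status"`
}

// newMockVendor starts an in-process HTTP server that replays cannedBody
// for POST /v2/orders. The caller is responsible for closing it.
func newMockVendor(cannedBody string) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/v2/orders", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, cannedBody)
	})
	return httptest.NewServer(mux)
}

// placeOrder stands in for the code under test: it makes a real HTTP
// request and decodes the real response body.
func placeOrder(baseURL string) (orderResponse, error) {
	var out orderResponse
	resp, err := http.Post(baseURL+"/v2/orders", "application/json", strings.NewReader(`{}`))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return out, fmt.Errorf("vendor returned status %d", resp.StatusCode)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

func main() {
	// Happy path: the canned body matches the shape we expect.
	ok := newMockVendor(`{"order_id":"ord_123","status":"pending"}`)
	defer ok.Close()
	order, err := placeOrder(ok.URL)
	fmt.Println(order.OrderID, order.Status, err)

	// Schema drift: order_id has become an integer, so decoding fails.
	drift := newMockVendor(`{"order_id":123,"status":"pending"}`)
	defer drift.Close()
	_, err = placeOrder(drift.URL)
	fmt.Println("drift error:", err != nil)
}
```

A function-level fake would never surface the second case, because nothing would ever run the decoder against the drifted body.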

Layer Two: Chaos, Because Vendors Will Betray You

Mocks tell you what happens when the vendor behaves. Chaos tells you what happens when they don't.

We built a small chaos management API on top of mocky-balboa - call it /_chaos. From any test, we can arm a rule that says "the next time someone hits the supplier's /v2/orders endpoint, return a 504 instead of the happy-path response." Then we run the code under test and assert that we handled it correctly.

chaos := mockychaos.NewForTest(t)

chaos.ArmForTest(t, mockychaos.Rule{
    Vendor:        "card-supplier",
    PathSubstring: "/v2/orders",
    Mode:          "one_shot",
    FailureType:   "status",
    StatusCode:    mockychaos.IntPtr(504),
    Label:         "card-supplier create-order one-shot 504",
})

// Now exercise the create-order flow and assert:
//   - We didn't double-charge the wallet
//   - We retried with the same idempotency key
//   - We didn't tell the customer their order was complete when it wasn't

The rule library is small but pointed:

Failure Type	What It Does
`status`	Returns a specific HTTP status (4xx or 5xx) instead of success
`timeout`	Hangs for N seconds, then drops the connection
`malformed`	Returns syntactically valid HTTP with garbage JSON inside
`drop`	Closes the TCP connection mid-response
`delay`	Adds latency to simulate a slow vendor having a bad day

Each one maps to a real failure we've seen in the wild. The malformed mode is especially fun - we once had a vendor return a response that was valid JSON, valid against their published schema, and also contained a numeric field encoded as a string with a leading zero. Our JSON parser was thrilled. Our order pipeline was not.
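For readers wondering what a one-shot rule might look like on the server side, here is a rough sketch. It is an assumption about one possible design, not the actual /_chaos implementation: the `rule` and `chaosHandler` types are hypothetical and only the `status` failure type is modelled. An armed rule is consumed by the first matching request, and every later request falls through to the happy-path handler.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// rule is a hypothetical, simplified chaos rule: match on a path
// substring and answer with a fixed status code, once.
type rule struct {
	PathSubstring string
	StatusCode    int
}

// chaosHandler wraps a happy-path handler and injects armed failures.
type chaosHandler struct {
	mu    sync.Mutex
	rules []rule
	next  http.Handler
}

// Arm queues a one-shot rule.
func (c *chaosHandler) Arm(r rule) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.rules = append(c.rules, r)
}

// take removes and returns the first rule matching path, if any.
func (c *chaosHandler) take(path string) (rule, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for i, r := range c.rules {
		if strings.Contains(path, r.PathSubstring) {
			c.rules = append(c.rules[:i], c.rules[i+1:]...)
			return r, true
		}
	}
	return rule{}, false
}

func (c *chaosHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	if r, ok := c.take(req.URL.Path); ok {
		http.Error(w, "chaos: injected failure", r.StatusCode)
		return
	}
	c.next.ServeHTTP(w, req)
}

// statusSequence arms a one-shot 504 on /v2/orders and records the
// status code of n consecutive requests to that path.
func statusSequence(n int) []int {
	happy := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, `{"status":"ok"}`)
	})
	ch := &chaosHandler{next: happy}
	srv := httptest.NewServer(ch)
	defer srv.Close()

	ch.Arm(rule{PathSubstring: "/v2/orders", StatusCode: http.StatusGatewayTimeout})

	var got []int
	for i := 0; i < n; i++ {
		resp, err := http.Get(srv.URL + "/v2/orders")
		if err != nil {
			got = append(got, -1) // transport error, not expected here
			continue
		}
		resp.Body.Close()
		got = append(got, resp.StatusCode)
	}
	return got
}

func main() {
	fmt.Println(statusSequence(2)) // first request is sabotaged, second is not
}
```

Consuming the rule under a mutex matters when tests run in parallel: two concurrent requests must not both see the same one-shot failure.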

The thing that makes chaos testing useful rather than theatrical is that we wire it into the same orchestration tests that cover the happy path. Our reference vendor's orchestration suite has 28 subtests - chaos, malformed responses, async polling, signed webhooks, idempotency, retry caps, the works. When we onboard a new supplier, we copy that structure. The expectation is: every vendor adapter ships with its own chaos suite, in the same PR.

If you can't break it on purpose in a test, you'll break it by accident in production. We'd rather find out at our desks than in our inbox.

A Story About Idempotency

We had a memorable bug early on. Vendor returns a timeout. Our code retries. Vendor was actually fine - the response was just slow - and now we've created two orders against the same wallet debit. The customer got two gift cards. We ate the cost.

It happened once. We added a chaos test that simulates the exact sequence - slow vendor, retry, eventual success on the original request - and asserts that we use the same idempotency key on the retry and that the vendor's deduplication kicks in. That test has caught the same class of bug at least four times since, in different vendor adapters, before any of them touched production.

The bug doesn't repeat. The category of bug does.
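The sequence in that story can be reproduced in a few dozen lines. The sketch below is illustrative only: the `Idempotency-Key` header name, the `fakeVendor` type, and the timings are assumptions, and real vendors differ in how they deduplicate. The fake vendor is slow on the first call, the client gives up and retries with the same key, and the assertion that matters is that only one order exists afterwards.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// fakeVendor records one order per idempotency key and is deliberately
// slow on the first request it receives.
type fakeVendor struct {
	mu     sync.Mutex
	orders map[string]bool
	calls  int
}

func (v *fakeVendor) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	key := r.Header.Get("Idempotency-Key") // hypothetical header name
	v.mu.Lock()
	v.calls++
	first := v.calls == 1
	v.orders[key] = true // same key, same order: this is the deduplication
	v.mu.Unlock()
	if first {
		time.Sleep(600 * time.Millisecond) // the order exists, the reply is late
	}
	w.WriteHeader(http.StatusOK)
}

// placeWithRetry sends the same idempotency key on every attempt and
// returns how many attempts were made.
func placeWithRetry(c *http.Client, url, key string, maxAttempts int) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		req, err := http.NewRequest(http.MethodPost, url, nil)
		if err != nil {
			return attempt, err
		}
		req.Header.Set("Idempotency-Key", key)
		resp, err := c.Do(req)
		if err != nil {
			lastErr = err // timeout or transport error: retry
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return attempt, nil
		}
		lastErr = fmt.Errorf("vendor returned status %d", resp.StatusCode)
	}
	return maxAttempts, lastErr
}

// run plays out the scenario: slow first response, client timeout, retry.
func run() (attempts, ordersCreated int, err error) {
	v := &fakeVendor{orders: map[string]bool{}}
	srv := httptest.NewServer(v)
	defer srv.Close() // waits for the slow first handler to finish

	client := &http.Client{Timeout: 200 * time.Millisecond}
	attempts, err = placeWithRetry(client, srv.URL, "order-42", 3)

	v.mu.Lock()
	defer v.mu.Unlock()
	return attempts, len(v.orders), err
}

func main() {
	attempts, orders, err := run()
	fmt.Printf("attempts=%d orders=%d err=%v\n", attempts, orders, err)
}
```

If the retry generated a fresh key on each attempt, the same run would end with two orders, which is exactly the bug described above.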

Layer Three: Fuzz, Because Inputs Are a Disaster

Mocks and chaos cover the network side. The other half of the problem is inputs - the things customers, partners, and our own admin team type into forms and POST to APIs.

Input validation is one of those topics that everyone agrees is important and almost nobody does well. It's not glamorous. It's not the kind of thing that gets you a conference talk. It's also the thing that, when skipped, ends up on the front page of Hacker News.

We were reminded of this when CVE-2026-41940 hit cPanel/WHM earlier this month - an authentication bypass that let unauthenticated, remote attackers gain "elevated control of the control panel," weaponised by multiple threat actors (Mirai variants, ransomware crews) within 24 hours of disclosure.

The details are different every time, but the pattern is depressingly familiar: somewhere in a sprawling codebase, the code assumed an input was well-formed, an attacker found a shape it hadn't anticipated, and a check that was supposed to mean "this person is allowed to do this" silently waved them through. Auth bypasses, injection bugs, parser exploits - they all rhyme. Somebody, somewhere, trusted a string. The fix is always some flavour of "validate the input properly." The problem is that "properly" requires finding every place input flows, and that's where humans lose to machines.

So we let machines do it.

Go's Built-In Fuzzer

Go 1.18 shipped a fuzzer that lives inside the standard testing package. We use it everywhere we accept untrusted input - which, given that we're a payments platform, is basically everywhere.

import (
    "testing"
    "unicode/utf8"
    // plus the package that provides vouchers.ParseGamerID
)

func FuzzParseGamerID(f *testing.F) {
    // Seed with known-good inputs
    f.Add("123456789")
    f.Add("user@platform")
    f.Add("a-b-c-1-2-3")

    f.Fuzz(func(t *testing.T, input string) {
        result, err := vouchers.ParseGamerID(input)
        if err != nil {
            return // Errors are fine. Panics are not.
        }
        // Whatever we parsed must at least be valid UTF-8 for downstream code.
        if !utf8.ValidString(result) {
            t.Fatalf("parser produced invalid UTF-8 for input %q", input)
        }
    })
}

The fuzzer generates millions of input mutations - empty strings, strings full of NUL bytes, strings full of emoji, strings with embedded SQL, strings with malformed UTF-8, strings that are 50 KB of random bytes. The contract is simple: the parser is allowed to return an error. It is not allowed to panic, hang, or produce a result that downstream code can't handle.

The first time we ran the fuzzer against our Gamer ID parser, it found a panic in 14 seconds. Fourteen. We'd been running that parser in production for months. The input that broke it was a 200-character string of zero-width joiners. Nobody types that on purpose. Eventually, somebody would have, and we'd have been the ones explaining it.

What We Fuzz

We fuzz the boundaries:

Webhook payloads. Every vendor sends us webhooks. Every webhook is a potential injection vector. We fuzz the parser with the spec-compliant fields, plus random bytes, plus deliberately malformed timestamps and signatures.
Public API inputs. Card codes, PINs, gamer IDs, phone numbers, email addresses. Anything a customer types or a partner POSTs.
Admin-side inputs. The admin panel is internal, but "internal" is not a synonym for "trustworthy." We fuzz the bulk-upload parsers because a malformed CSV uploaded by accident is just as bad as one uploaded on purpose.
Vendor response parsing. Yes - even responses we technically "trust." We fuzz the JSON decoders that turn vendor responses into our internal types. If a vendor accidentally ships malformed JSON for an hour, our parser shrugs instead of panicking the whole worker.

A worker that panics is a worker that gets restarted. Restarts mean re-processing. Re-processing means duplicate orders if you're not careful. Fuzz testing is the cheapest possible insurance against that whole chain of consequences.
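As an illustration of the vendor-response case, the sketch below shows the kind of property a decoder fuzz target could check. The `vendorOrder` type and its fields are hypothetical, and the example runs a handful of hostile inputs from `main` so it works as a standalone program. In a real `_test.go` file the same `checkDecode` function would be called from inside `f.Fuzz` with a `[]byte` argument, with any returned error reported through `t.Fatal`.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// vendorOrder is a hypothetical internal type that vendor JSON decodes into.
type vendorOrder struct {
	OrderID     string `json:"order_id"`
	AmountMinor int64  `json:"amount_minor"`
}

// decodeVendorOrder turns raw vendor bytes into our type, or returns an error.
// It must never panic, whatever the bytes are.
func decodeVendorOrder(data []byte) (vendorOrder, error) {
	var o vendorOrder
	if err := json.Unmarshal(data, &o); err != nil {
		return vendorOrder{}, fmt.Errorf("decode vendor order: %w", err)
	}
	if o.OrderID == "" {
		return vendorOrder{}, errors.New("decode vendor order: missing order_id")
	}
	if o.AmountMinor < 0 {
		return vendorOrder{}, errors.New("decode vendor order: negative amount")
	}
	return o, nil
}

// checkDecode is the property a fuzz target would assert: rejecting input
// is fine, but anything accepted must survive an encode/decode round trip.
func checkDecode(data []byte) error {
	o, err := decodeVendorOrder(data)
	if err != nil {
		return nil // errors are an acceptable outcome
	}
	out, err := json.Marshal(o)
	if err != nil {
		return fmt.Errorf("re-encode failed for input %q: %w", data, err)
	}
	again, err := decodeVendorOrder(out)
	if err != nil || again != o {
		return fmt.Errorf("round trip changed the value for input %q", data)
	}
	return nil
}

func main() {
	inputs := [][]byte{
		[]byte(`{"order_id":"ord_1","amount_minor":100}`),
		[]byte(`{"order_id":"ord_1","amount_minor":"0100"}`), // number sent as a string
		[]byte(`{"order_id":7}`),
		[]byte("\xff\xfe"),
		nil,
	}
	for _, in := range inputs {
		_, decodeErr := decodeVendorOrder(in)
		fmt.Printf("input=%q accepted=%v property_ok=%v\n",
			in, decodeErr == nil, checkDecode(in) == nil)
	}
}
```

With the target in place, `go test -fuzz=<TargetName> -fuzztime=30s` runs the mutation engine for a bounded time, which suits a CI step.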

The Three Together

Each layer is fine on its own. Together, they let us answer the question that actually matters: what happens when several things go wrong at once?

A real-world scenario from our test suite:

Customer submits a redemption with a gamer ID that contains a slightly weird Unicode character (fuzz-discovered class of input).
Vendor's order endpoint is up, but the inventory endpoint is timing out (chaos rule).
Our retry kicks in, the second attempt succeeds, but the response shape is subtly different because the vendor's load balancer routed it to a different region (mocky-balboa serves the alternate shape).
Webhook arrives a few seconds later with a signature that's technically valid but missing a field (fuzz-discovered).

In a saner industry, this would be paranoid overkill. In ours, this is Tuesday. The test passes. We move on.

Layer	What It Catches
Mocks	Schema drift, contract mismatches, "the integration as documented" vs "the integration as built"
Chaos	Timeouts, 5xx storms, malformed responses, dropped connections, slow vendors
Fuzz	Panics, infinite loops, parser bugs, injection-shaped inputs, the things humans don't think to type

Boring Tools, Beautiful Outcomes

None of this is novel. Mocks have been around forever. Chaos engineering has a Wikipedia page. Go's fuzzer has been stable since 2022. The trick isn't using exotic tools - it's actually using the boring ones, consistently, on every integration, in every PR.

Our rule is straightforward: a new vendor adapter doesn't merge without its mocky-balboa routes, its chaos suite, and fuzz coverage on whatever parsing it introduces. Not because we're rigid, but because every time we've made an exception, we've paid for it - usually within a fortnight, usually on a weekend.

We sleep better for it. We pass audits without crossing our fingers. And when a vendor has a bad day, our customers don't notice. Their order goes through, the gift card arrives, and the chaos stays where it belongs - in our test suite.

That's the goal. That's always been the goal. Not heroic incident response, not war rooms, not 4 AM pages. Just a system that's been broken on purpose so many times in CI that production feels like a holiday.

If you enjoyed this, you might like Around the World in 80 Days - the story of the stack underneath all this testing - or The Wallet That Couldn't Count, where we break a different system on purpose and rebuild it boring.

License

This article is licensed under CC BY-NC-SA 4.0. You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial — You may not use the material for commercial purposes.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.