Sizing up your test suite

Sam Starling·June 16, 2026

It’s an age-old question: would you rather fight one horse-sized duck or 100 duck-sized horses? Every team that automates end-to-end testing eventually hits a different, and yet equally divisive, question: do you write one big end-to-end test that walks the whole journey, or many small focused tests that each check one thing?

There's a classic software testing answer, straight from the testing pyramid. Start out with lots of unit tests: they should be cheap, fast and isolated. Then come integration tests. You have fewer of them because they're harder to write: there are more dependencies, things to mock, processes to co-ordinate...

Then there's end-to-end testing of your application, from the outside, just as users would experience it. You probably don't have many of these tests: not because they're not valuable, but because of the cost – the effort to write them, the ongoing maintenance as your app changes, and the time spent debugging when they break. That cost is the entire reason the pyramid looks the way it does – but what happens when the cost changes?

The pyramid paradox

The most valuable tests are the ones that are closest to reality, but those are the tests that we're told to write the fewest of. We don't ration them because they're low value, but because they're expensive. In return, we accept a gap in our coverage at the layer that matters most to our users.

The pyramid starts to feel a bit more like a budget than a statement about what's worth testing. That budget exists for a reason: selectors and locators break as your app shifts over time. You end up babysitting retries, and re-authoring tests in lockstep with your changes.

The cost is shifting

Is this the part where I tell you that Semaloop magically solves everything? Not quite. Instead, the point I want to make is that the cost is shifting rapidly, and the whole pyramid is up for re-negotiation.

The main driver in this shift is AI. Brittle selectors give way to natural language, and the vision capabilities of models mean tests can take UI changes in their stride, while still providing high signal.

If end-to-end tests become cheap to write, cheap to maintain, and high signal when they run, the reasons for covering only your most important features start to evaporate. Of course, unit tests will always win on raw speed and pinpointing logic bugs: we're not suggesting you throw the entire pyramid in the bin.

Even when tests get cheaper, some awkward facts remain: flakiness compounds with length, so a long enough journey never passes reliably. When a forty-step test dies at step thirty-seven, you've just paid for thirty-six steps to learn... almost nothing. The question was never just about affording the coverage; it's also about sizing each test so it stays trustworthy.
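To put rough numbers on how flakiness compounds, here's a back-of-the-envelope sketch. It assumes every step passes independently with the same probability, which real suites won't match exactly, and the 99% per-step figure is purely illustrative rather than a measurement.

```python
def pass_probability(step_reliability: float, steps: int) -> float:
    """Chance that every step in a test passes.

    Assumes each step succeeds independently with the same probability,
    which is a simplification: real steps tend to fail in clusters.
    """
    return step_reliability ** steps


# Illustrative only: 99% per-step reliability is an assumed figure.
for steps in (5, 10, 40):
    print(f"{steps:>2} steps: {pass_probability(0.99, steps):.1%}")
# prints:
#  5 steps: 95.1%
# 10 steps: 90.4%
# 40 steps: 66.9%
```

Under those assumptions, a forty-step journey fails roughly one run in three even though each individual step looks dependable on its own.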

One or many

Say you can cover all of your functionality end-to-end. What does that actually look like? One of the main benefits of these tests is that they're realistic. The ways that users roam around your app are complex, and never as pristine as you might hope. You could argue that one test that does many things is closer to how users actually use your app. But push that to its conclusion and you get the forty-step monster from earlier: realistic, but flaky and impossible to debug when it breaks.

Hundreds of single-click tests swing too far the other way. When one breaks you know exactly what failed, but they're so far from real usage that they miss the bugs that live in the seams between features.

Size each test to a single goal that's meaningful to users. Not one tap, not the whole app, but one coherent thing a person sets out to do.

There's no magic bullet here, but some things to consider are:

  • Does the test have a meaningful goal that is easily understood? If your boss's boss saw that "Background audio conversation (Premium user)" had failed, would that make sense to them? What about "Subscribe to premium"?
  • Are there tests that pass and fail together? This can often be a sign that they're actually giving you the same signal, and should be merged.
  • Can you name the test in a single sentence? If you can't without reaching for "and", it's probably doing too much, and a failure won't tell you what actually broke.
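The "pass and fail together" check is one you can roughly automate from your run history. Here's a minimal sketch, assuming you can export each test's results as a list of pass/fail booleans in run order. The `redundant_pairs` helper and the "Upgrade from settings" test name are made up for illustration.

```python
from itertools import combinations


def redundant_pairs(history: dict[str, list[bool]]) -> list[tuple[str, str]]:
    """Return pairs of tests whose results matched on every recorded run.

    Tests that never failed are skipped: two always-green tests match
    trivially, which tells us nothing about whether they overlap.
    """
    pairs = []
    for (a, runs_a), (b, runs_b) in combinations(sorted(history.items()), 2):
        if runs_a == runs_b and not all(runs_a):
            pairs.append((a, b))
    return pairs


# Hypothetical run history: True means the test passed on that run.
history = {
    "Subscribe to premium": [True, False, True, True, False],
    "Upgrade from settings": [True, False, True, True, False],
    "Background audio conversation (Premium user)": [True, True, False, True, True],
}

print(redundant_pairs(history))
# prints: [('Subscribe to premium', 'Upgrade from settings')]
```

A match over a handful of runs is only a hint, not proof. Treat any pair this flags as a prompt to look at what the two tests actually exercise before merging them.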

Wrapping up

The original question isn't really about ducks and horses: it's about picking the option where you stand the best chance. The goal isn't to have big tests, or small tests, but to have trustworthy tests that tell you whether your app is actually working for real people. These days, that's something you can have a lot more of.

We're building Semaloop with all of this in mind, and we'd love to show you more.

