How to write a great test

Charlie Kingston · June 19, 2026

There are a hundred ways to write a test, and people will argue about all of them. Most of it is a distraction from the one thing that actually separates a good test from a bad one: a good test only fails for a useful reason. It tells you the product is broken, not that someone renamed a button.

How we think about tests at Semaloop

Semaloop runs your tests on real devices using an AI agent that drives the app the way a person would. That one fact changes what a test is.

You don't write a script that says tap here, then here, then here. You describe the goal and the result, and the agent works out the path itself. The principle the whole industry has been reaching toward is "write tests that resemble how the software is used". Here it stops being an aspiration you maintain through discipline and becomes the default way the tool works.

It also means a test survives the things that break traditional suites. When a team renames every CSS class, redesigns a screen, or runs an A/B test, the agent doesn't care. It isn't looking for .btn-primary. It's looking for the way to add a track to a playlist, the same way a person would, whatever the button is called this week.

Here are three key principles we use to write great tests:

Describe the goal, not every tap

Tell the agent the destination, not the turn-by-turn directions.

Good: Add a track to a playlist and check it appears in that playlist.
Avoid: Tap the Search icon, search for a track, tap the three dots, tap "Add to playlist", select a playlist, tap Done, then open the playlist and check the track is there.

Both cover the same flow end to end, but only the first one survives a redesign. A useful gut check: if your team redesigned that screen tomorrow, would the test still describe a valid thing to verify? If yes, it's durable.
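If you want a rough, automated version of that gut check, here is a minimal sketch in Python. It is not part of Semaloop. The verb list, the threshold, and the function names are all assumptions made up for illustration. It simply counts low-level UI verbs in a test description and flags the ones that read like turn-by-turn directions.

```python
import re

# Hypothetical list of UI-action verbs; tune it for your own app and vocabulary.
STEP_VERBS = {"tap", "click", "press", "swipe", "scroll", "type", "select", "wait"}


def count_step_verbs(description: str) -> int:
    """Count the words in a description that name a low-level UI action."""
    words = re.findall(r"[a-z]+", description.lower())
    return sum(1 for word in words if word in STEP_VERBS)


def reads_like_a_script(description: str, max_steps: int = 2) -> bool:
    """Flag descriptions with more UI actions than max_steps (an arbitrary threshold)."""
    return count_step_verbs(description) > max_steps


good = "Add a track to a playlist and check it appears in that playlist."
avoid = (
    "Tap the Search icon, search for a track, tap the three dots, "
    'tap "Add to playlist", select a playlist, tap Done, then open the '
    "playlist and check the track is there."
)

print(reads_like_a_script(good))   # False
print(reads_like_a_script(avoid))  # True
```

A word count like this is a crude proxy. It can nudge you to reread a description, but it can't tell you whether the goal itself is a good one.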

One workflow per test

Keep each test focused on a single user goal. When one test bundles several jobs together, a failure is hard to digest. Did sign-up break, or search, or checkout? Split them, and a failure points straight at the problem.

Good (three separate tests): (1) Create a new playlist and check it appears in your playlists. (2) Add a track to a playlist and check it appears in that playlist. (3) Delete a note and check it no longer appears in the notes list.
Avoid: Sign in, create a playlist, add three tracks, share it with a friend, then delete an old note and update your profile photo and check everything worked.

One exception worth knowing: if the interaction between two things is the point, keep it as one test. For example, two users share a workspace, one edits a document, and the other should see the change.

Be specific about what success means

A vague check is one the agent can't reliably judge. "Looks right", "loads quickly" and "everything works" can't be measured the same way twice. Point to a concrete, named thing you can actually observe.

Good: Switch the app language to Spanish and check the home screen heading reads "Inicio".
Avoid: Switch the language to Spanish and make sure all the text is in Spanish and the page looks right.
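You can catch the most obvious vague checks before a test ever runs. The sketch below is a hypothetical helper, not a Semaloop feature. It assumes a small phrase list seeded with the three examples above and reports any that appear in a description.

```python
# Hypothetical phrase list, seeded with the vague checks named above; extend as needed.
VAGUE_PHRASES = ("looks right", "loads quickly", "everything works")


def find_vague_phrases(description: str) -> list[str]:
    """Return the vague phrases found in a description, ignoring case."""
    lowered = description.lower()
    return [phrase for phrase in VAGUE_PHRASES if phrase in lowered]


good = 'Switch the app language to Spanish and check the home screen heading reads "Inicio".'
avoid = (
    "Switch the language to Spanish and make sure all the text is in Spanish "
    "and the page looks right."
)

print(find_vague_phrases(good))   # []
print(find_vague_phrases(avoid))  # ['looks right']
```

An empty result doesn't prove the check is specific. It only means none of the listed phrases turned up, so the real test is still whether you can name the thing you expect to see.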

A test, brittle and then durable

Here's the whole philosophy in one before-and-after:

Before: Open the app and wait for it to load. Tap the Search tab at the bottom. Type "tennis" into the search box and wait for results. Tap the second result in the list. On the detail screen, tap the Play button. While it's playing, tap the Summary icon in the top right. Wait for the loading spinner to disappear, then check the green "Summary ready" banner appears and make sure the summary text looks correct.

This dictates every tap. It depends on the second result being a specific item. It checks a spinner and a banner that both vanish quickly. And it ends on a vague "looks correct" judgement. Four different ways to fail without anything actually being wrong.

After: Play audio about tennis and check that a summary is shown for it.

Same intent, but far less likely to fail for the wrong reason. Most importantly, it stays valid even if the search tab moves, the results reorder, or the banner copy changes.

The checklist

Before you push a test into production, read it back:

Does it describe a goal, not a sequence of taps?
Is it focused on one workflow?
Have you cut everything you can without losing the goal, the data, or the check?
Is the success check specific and observable?
Would it still be valid if the screen were redesigned tomorrow?

If they're all yes, you've written a test that only fails for useful reasons.

Want to see what testing looks like when this is the default? Find out more and book a demo at semaloop.com.
