Week 73 | Managing AI Coding Agents

Last Week

Last week was about accepting the practical shape of the App Store review problem. Apple had rejected the iOS submission again over the short-duration access products, and after a direct call with App Review, the path forward became clearer: stop defending the non-renewing subscription interpretation and rebuild the one-, two-, and three-day passes as consumable in-app purchases.

That work still needs to happen. Android remains live on Google Play, iOS is still blocked by App Store review, and the short-pass conversion will touch RevenueCat, the paywall, the app code, the backend, and the database. But this week I stepped back from App Review to focus on a broader development-process issue that affects how I build Shokken every day: AI coding agents are powerful, but they are not trustworthy by default.

The trick is not to trust them. The trick is to manage them.

Agents Are Employees, Not Oracles

I use AI coding agents constantly on Shokken. They help with implementation, refactoring, tests, documentation, and the general grind of moving a Kotlin Multiplatform product forward. I also do not trust them.

That sounds contradictory only if you think of an AI coding agent as a deterministic tool. A compiler is deterministic. A formatter is deterministic. A type checker is deterministic. Given the same input, you expect the same output, and if the tool says something is valid, you can usually build a process around that assumption.

Coding agents do not behave like that in practice.

They are closer to employees. Sometimes very productive employees, but still employees. They have context gaps. They misunderstand instructions. They overgeneralize from patterns that are not actually relevant. They make judgment calls. They can produce good work one minute and questionable work the next. They can also become dangerously confident while doing the wrong thing.

That is why the usual question, “Can I trust AI to code for me?” is the wrong question.

The better question is: if I had a large team of imperfect developers writing code in the same product, what system would I need around them so their work becomes useful instead of chaotic?

That is the mental model I use. Every new agent session is a new hire with no lived memory of the project. It may read the repo, infer patterns, and follow instructions, but it does not have the institutional context of someone who has been living inside the codebase for months. So I need onboarding material, objective checks, and review loops.

Those are the same things a normal engineering organization needs.

What does it mean in English?

AI coding agents can write useful code, but they should not be treated like magic.

If you ask an agent to build something and then merge whatever it produces, you are depending on a system that can misunderstand the task, ignore a rule, use the wrong library, skip a test, or accidentally leak something sensitive. That is not a reason to avoid the tool. It is a reason to wrap the tool in process.

The process is familiar: give clear instructions, run tests the agent cannot quietly bypass, and review the code with fresh eyes before it lands.

That is how teams already manage human developers. You do not assume every employee will remember every convention perfectly. You write down standards, automate what can be automated, and ask other people to review changes. AI agents need the same treatment.

Used that way, they can add real productivity. Used without guardrails, they can produce a mess faster than you can understand it.

Nerdy Details

The criticism is partly right

There is a lot of frustration around AI coding tools right now.

Some of it is about the larger industry shift. People can feel that these tools are changing the economics and expectations of software development, and that creates understandable anxiety. I am not going to solve that debate in a weekly Shokken update.

The more practical criticism is harder to dismiss: people have used these tools and gotten bad results.

That criticism is legitimate. Agents can commit secrets. They can invent APIs. They can generate code that compiles but does not fit the architecture. They can produce awkward, overfit, non-standard solutions. They can reach for random dependencies instead of using the components the project already has. They can turn a simple fix into a pile of incidental complexity.

Those are not theoretical problems. They are exactly the kinds of problems that keep software from being production quality.

But I do not think the conclusion should be “therefore, never use them.” The conclusion should be “therefore, stop treating them like deterministic tools.”

If I hire a junior engineer, I do not expect that person to absorb every architectural rule by intuition. I give them onboarding material. I put tests in front of their changes. I review their pull requests. I expect mistakes, and I build the development process so those mistakes are caught before they become production behavior.

That is the same frame I use for agents.

Every new session starts with no lived context

The biggest practical issue is context.

Every time I start a fresh coding-agent session, I am effectively bringing in a new developer. That agent can inspect files, read instructions, search the repo, and build a working model of the codebase. But it does not naturally inherit the full experience of every previous agent session.

That matters because software projects are full of tacit knowledge:

which patterns are preferred
which APIs are deprecated
which shortcuts are unacceptable
which tests are authoritative
which modules own which behavior
which libraries are already approved
which naming conventions carry meaning
which parts of the system are fragile

Human teams solve this with documentation, onboarding, architecture decision records, code review norms, CI, and a lot of repeated correction. Agent workflows need the same shape.

That is why the first guardrail is written instruction.

In this repo, the important file is agents.md. In other coding tools, it might be CLAUDE.md or another project instruction file. The exact filename is less important than the job it performs: it tells a fresh agent what kind of project it is entering, what style to follow, what not to do, and how work should be validated.

The instruction file is not magic. It does not make the agent deterministic. But it gives the agent a starting point that is better than guessing from scattered files.

Instructions are necessary, not sufficient

Clear instructions are the first layer, but they are not enforcement.

This is where a lot of AI-assisted workflows break down. The user writes a rule like “do not modify tests” or “do not suppress warnings” or “use the existing design system,” and then assumes the agent will obey it.

Sometimes it will. Sometimes it will not.

That is not even uniquely an AI problem. Human developers also violate conventions. Sometimes they forget. Sometimes they misunderstand. Sometimes they decide a rule is less important than the thing they are trying to accomplish. Sometimes they are just moving too quickly.

So the instruction file should be treated like onboarding documentation, not like a compiler. It tells the agent what the expectations are. It does not prove the final change followed them.

That is why the next layer is automated checks.

CI is where trust starts becoming mechanical

If an agent can change code, it can also change the local environment.

That sounds obvious, but it matters. If the only validation happens locally, the agent may be able to make the validation less meaningful. It can skip a test. It can change a test. It can suppress a warning. It can report a command as “mostly passing” while hiding the part that failed.

The answer is to run important checks somewhere the agent cannot quietly edit its way around.

For Shokken, that means CI/CD. The repository uses a set of checks appropriate for a Compose Multiplatform and Kotlin Multiplatform project: architecture tests, static analysis, normal repository tests, build verification, and other safeguards around the codebase. The exact suite changes over time, but the principle stays the same: an agent’s output has to pass a set of checks that are not just vibes.
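As a rough illustration of a check that is not just vibes, here is a minimal Python sketch that flags any change set touching tests or CI configuration so a human has to sign off. The glob patterns and file paths are hypothetical placeholders, not Shokken's actual layout, and a real pipeline would feed in the changed paths from its own diff step.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; a real project would list its own test and CI paths.
PROTECTED_PATTERNS = [
    "*/src/*Test/*",
    ".github/workflows/*",
]


def protected_changes(changed_paths, patterns=PROTECTED_PATTERNS):
    """Return the changed paths that match a protected pattern, in input order."""
    return [
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


# Example change set (made-up paths for illustration).
changed = [
    "shared/src/commonMain/kotlin/Queue.kt",
    "shared/src/commonTest/kotlin/QueueTest.kt",
    ".github/workflows/ci.yml",
]
flagged = protected_changes(changed)
for path in flagged:
    print(f"needs human sign-off: {path}")
```

The check does not decide whether the change is wrong. It only makes sure an edit to the validation itself cannot slip through unnoticed.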

Some checks are about behavior. Some are about style. Some are about architecture. Some are about safety.

Secret scanning belongs in the safety category too. If one of the nightmare scenarios is an agent committing an API key, then the answer is not “hope the agent remembers not to.” The answer is to run a tool such as Gitleaks in the validation path so secret-looking material is caught before it reaches the wrong place.
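To show the idea behind secret scanning, here is a toy Python sketch that matches a few secret-shaped patterns line by line. It is not how Gitleaks is implemented or configured, and the three rules are illustrative assumptions. A maintained scanner carries a much larger rule set and should be what actually runs in CI.

```python
import re

# Illustrative rules only; real scanners ship far larger, maintained rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "assigned_api_key": re.compile(
        r"\b(?:api[_-]?key|secret|token)\b\s*[:=]\s*[\"'][A-Za-z0-9_\-]{16,}[\"']",
        re.IGNORECASE,
    ),
}


def scan_text(text):
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings


# Made-up source text; the key below is a dummy value.
sample = "\n".join([
    'val baseUrl = "https://example.com"',
    'val apiKey = "abcd1234efgh5678ijkl"',
    'val region = "us-east-1"',
])
findings = scan_text(sample)
for number, rule in findings:
    print(f"line {number}: possible secret ({rule})")
```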

The point is to move as much judgment as possible from prose into executable policy.

Instructions say, “do not do this.” CI says, “you did it, and the build is red.”

Tests need to cover more than happy paths

The value of tests goes up when agents are writing code because the failure modes get stranger.

A human developer usually has a reasonably stable mental model of why they made a change. An agent may produce something that looks plausible without having the same causal understanding. That means a narrow test suite can be misleading. If the tests only cover the happy path, the agent may optimize directly for that path and leave broken edges around it.

The better pattern is layered validation:

unit tests for local behavior
integration tests for module boundaries
architecture tests for dependency direction
static analysis for style and unsafe patterns
build verification for platform targets
secret scanning for accidental credential leaks
warnings or checks around suppression usage
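The architecture-test layer is the least familiar item in that list, so here is a minimal Python sketch of a dependency-direction check. It assumes a hypothetical rule that code in a `domain` package must not import from a `ui` package, and the package names are made up. A Kotlin project would more likely express this with a dedicated architecture-testing library than with regular expressions.

```python
import re

# Hypothetical layering rule: `domain` packages must not import from `ui` packages.
FORBIDDEN = {"domain": ("ui",)}

IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)", re.MULTILINE)
PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)", re.MULTILINE)


def layer_violations(source):
    """Return imports in one Kotlin source string that break the FORBIDDEN rule."""
    package = PACKAGE_RE.search(source)
    if package is None:
        return []
    own_segments = package.group(1).split(".")
    violations = []
    for layer, banned in FORBIDDEN.items():
        if layer not in own_segments:
            continue
        for imported in IMPORT_RE.findall(source):
            if any(segment in banned for segment in imported.split(".")):
                violations.append(imported)
    return violations


# Two made-up Kotlin files: one respects the rule, one breaks it.
good = """package com.example.app.domain.queue
import com.example.app.data.QueueRepository
"""
bad = """package com.example.app.domain.queue
import com.example.app.ui.QueueScreen
"""
print(layer_violations(good))
print(layer_violations(bad))
```

A check like this turns "domain code does not depend on UI code" from a sentence in the instruction file into something that fails the build.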

The suppression point matters. Sometimes suppressions are legitimate. Kotlin projects occasionally need them. Static-analysis tools sometimes need local exceptions. But suppressions are also an escape hatch. If agents start adding them casually, the codebase slowly teaches the tooling to stop complaining.

So I do not necessarily want every suppression to hard-fail the build. But I do want visibility. A warning, report, or review cue can be enough to force a human decision: is this suppression justified, or did the agent silence a useful signal?
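A visibility-only check along those lines could look like the following Python sketch. It collects Kotlin `@Suppress` and `@file:Suppress` annotations and prints a warning per file without failing anything. The file names and contents are invented for the example, and the regular expression is a simplification that ignores comments and unusual formatting.

```python
import re

# Matches @Suppress("A", "B") and @file:Suppress("A"); captures the argument list.
SUPPRESS_RE = re.compile(r"@(?:file:)?Suppress\(([^)]*)\)")


def suppression_report(files):
    """Map each file name to the sorted rule names it suppresses."""
    report = {}
    for name, source in files.items():
        rules = []
        for args in SUPPRESS_RE.findall(source):
            rules.extend(re.findall(r'"([^"]+)"', args))
        if rules:
            report[name] = sorted(rules)
    return report


# Made-up file contents keyed by file name.
files = {
    "Queue.kt": '@Suppress("UNCHECKED_CAST", "unused")\nfun load() {}\n',
    "Clock.kt": "fun now() = 0L\n",
}
report = suppression_report(files)
for name, rules in report.items():
    # A warning rather than a failure: the human decides if it is justified.
    print(f"warning: {name} suppresses {', '.join(rules)}")
```

Comparing this report between the base branch and the change would show exactly which suppressions an agent added.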

Code review still matters

The third layer is code review.

That may sound odd if the whole point of using an agent is to get implementation help. Why ask another agent, or another pass, to review the work?

Because review is not the same activity as implementation.

Human teams already understand this. The person who wrote the code is not the best person to review it. They carry the assumptions and compromises that produced the implementation. A reviewer approaches the change with different context, different expectations, and a different failure-finding mindset.

The same applies to coding agents.

If one agent implemented a change, I do not want that exact same context to be the only review context. It has already built a story for why the change makes sense. It may overlook the same mistake twice. A fresh agent session, even using the same model, can notice problems because it starts from a different prompt history and reads the diff with different attention.

Using a different model can also be useful, but I do not think it is strictly required. The important property is fresh context. The reviewer should not simply be the implementation agent congratulating itself.

This has caught real issues for me. A second pass can notice a missing edge case, a bad assumption about a shared API, an accidental architectural violation, or a test that looks impressive but does not actually prove the behavior that changed.

Code review is cheap compared with the cost of quietly merging a bad abstraction.

Agent reviews are still not a substitute for ownership

There is an important caveat: an agent review is not final authority.

It is a filter. A useful one, but still a filter.

The responsibility stays with the human maintaining the product. I still need to understand the changes that matter. I still need to know what tradeoffs I am accepting. I still need to decide whether the tests prove enough, whether the architecture is drifting, and whether the code fits the product I am building.

That is especially true for Shokken because this is not a toy app anymore. It has real store-review constraints, paid access, backend state, messaging behavior, and business users I am trying to serve. A bad change in the wrong place can become much more than an ugly diff.

The right goal is not to remove human judgment. The right goal is to spend human judgment where it matters most.

Agents can do a lot of routine work. Tests can catch many obvious failures. Static analysis can enforce many mechanical rules. Review agents can flag suspicious implementation choices. All of that reduces the surface area I personally have to inspect line by line.

But the final decision still belongs to me.

The workflow I actually want

The practical workflow looks like this:

Give the agent clear project instructions.
Keep the task bounded.
Let the agent implement the change.
Run local checks.
Run CI checks that the agent cannot casually bypass.
Review the diff myself.
Use a fresh agent context for another review when the change is large or risky.
Fix whatever the tests or review uncover.
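The validation half of that list behaves like an ordered series of gates, where a failure sends the work back for fixes before later steps run. Here is a minimal Python sketch of that shape. The gate names mirror the steps above, and the lambda checks are placeholders standing in for real test runs, CI status, and reviews.

```python
def run_gates(gates):
    """Run (name, check) pairs in order; stop at the first check that returns False."""
    results = []
    for name, check in gates:
        passed = bool(check())
        results.append((name, passed))
        if not passed:
            break
    return results


# Placeholder checks; real ones would call the test runner, CI, and review steps.
gates = [
    ("local checks", lambda: True),
    ("ci checks", lambda: False),
    ("human diff review", lambda: True),
    ("fresh-context agent review", lambda: True),
]
results = run_gates(gates)
for name, passed in results:
    print(f"{name}: {'pass' if passed else 'fail, fix before continuing'}")
```

Stopping at the first failure is a design choice in this sketch. It keeps review time from being spent on a change that the automated checks have already rejected.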

That is not as glamorous as “AI builds the app for you.” It is also much closer to reality.

The value is not that the agent becomes trustworthy. The value is that the surrounding system makes untrusted work usable.

That distinction matters. If I expect perfection, I will either be disappointed or, worse, fooled. If I expect a fast but imperfect contributor, I can design the workflow accordingly.

The same management principles scale down

The big-company analogy may sound exaggerated for a one-person project, but it is the right mental model.

If I had a thousand developers working under me, I would not manage quality by personally trusting everyone. I would manage quality through standards, automation, review, and release controls. I would assume variation in skill, attention, and judgment. I would build a system that turns inconsistent individual output into a coherent product.

AI coding agents create a miniature version of that problem.

Every fresh session is another contributor. It may be capable, but it is not aligned by default. It needs context. It needs constraints. It needs validation. It needs review.

Once I treat it that way, the tool becomes much more useful. I can use agents to move faster without pretending they are infallible. I can let them do implementation work while preserving the engineering standards that keep Shokken from turning into a pile of shortcuts.

That is the core lesson of the week: do not trust agents as authorities. Manage them as contributors.

Next Week

Next week should return to the iOS release path.

The App Store review problem is still waiting for concrete implementation work: converting the short passes into consumable purchases, updating RevenueCat, changing the paywall, and making sure the backend records and enforces the access window correctly. This week’s tooling discipline matters because that work touches billing, entitlement state, and platform-specific behavior. It is exactly the kind of change where agents can help, but only if the tests, CI checks, and review process keep them honest.
