Week 49 | Codex Wins, TestFlight Comes Alive

Last Week

Week 48 was the long-postponed reset after a month of travel. I dusted off the Kotlin Multiplatform repo, replayed the Android alpha build, and made a very public promise: stop treating the weekly vlog as postcards and start shipping again. That entry ended with two concrete goals for this week. First, I needed to pick a long-term coding copilot so I could lean on the same assistant every day instead of juggling trials. Second, I had to investigate what it takes to get the iOS build from “compiles locally” to “available on TestFlight,” including provisioning, signing, and CI. Android is already live in the Play alpha, so parity depends entirely on having a predictable iOS pipeline. This week was about turning those goals into decisions so the next stretch of work stops being theoretical and starts producing builds people can actually install.

Codex Wins, TestFlight Comes Alive

The episode centers on two parallel tracks: locking in a coding assistant that I can trust with Kotlin Multiplatform work, and wiring the first end-to-end iOS delivery path. I put Gemini CLI, Claude Code, and Codex CLI through the same set of tasks that power everyday development—editing Compose Multiplatform screens, patching KMP view models, and following multi-file instructions without rewriting tests. Gemini’s model is pleasant to chat with, yet it consistently guessed at Compose APIs and failed to call tools even when it was obviously missing KMP details. Claude Code still has the most feature-complete CLI, but the model refuses to stand its ground. It apologizes any time I nudge it for clarification, buries partial failures in flowery summaries, and costs ten times the Codex subscription. Codex is the only one that pulls new Kotlin references when it is unsure, admits when a command fails, and documents the attempts so I can finish the job manually instead of discovering a silent error two hours later. The verdict was easy: I renewed Codex Plus and shut down the other trials so every programming session runs through the same agent again.

The second half of the week belonged to Apple. Getting Shokken onto my own iPhone required joining the $99-per-year developer program, registering test hardware, and letting Xcode manage signing keys so I could at least run the app on-device. From there I built a new GitHub Actions workflow that archives the iOS target, signs it with the Apple-issued certificates, and uploads the result directly to TestFlight. The first run took thirty minutes on a hosted macOS runner, but it produced the first TestFlight build in my account. That is a huge step: Android testers already have Play Store access, and now iOS has a pipeline that can reach real devices once I flip the switch for external testers. The cost problem is the next dragon. macOS runners are billed at ten times the Linux rate, so each thirty-minute run costs 300 billed minutes and I can only afford six uploads per month before I burn through the 2,000 free minutes. Next week’s work will focus on installing a self-hosted runner on my studio Mac so pushes can trigger iOS build verification without torching the usage quota.

What does it mean in English?

I spent the week picking the helper tools and build infrastructure that keep the project moving. On the assistant side, I compared the major code copilots with realistic Kotlin Multiplatform tasks and stuck with the one that follows directions and admits failure instead of guessing. On the product side, I paid Apple for the developer program, registered my phone, and automated the path from source code to a signed TestFlight build. That means Shokken now deploys to both Play Store testers and internal iOS testers with a single command. The only missing piece is making those iOS builds affordable to run all the time, which is why I’m shifting to a Mac that I control instead of renting Apple hardware from GitHub every time.

Nerdy Details

Kotlin-centric assistant bake-off

I reran the same implementation session with all three copilots: edit a Compose Multiplatform screen, adjust a shared KMP reducer, and thread the state through Android and iOS targets. Gemini’s CLI felt the roughest. It politely wrote prose-level summaries yet missed basic Kotlin idioms, mixed in Jetpack Compose Android-only imports, and avoided tool calls even when I explicitly asked it to look up docs. Claude Code’s CLI is still unmatched on ergonomics—inline diffs, timeline view, and thoughtful prompts—but I can’t trust the model. It happily claimed success when a file lock prevented the last patch from landing and hid the failure inside a single sentence near the end of its report. Codex CLI showed its age in the terminal UI, yet it compensated by being blunt. When it couldn’t complete a refactor it responded with “I can’t apply that change because file X is locked,” then dumped the shell output and the steps it tried. That honesty is worth more than any clever CLI feature because it keeps my mental model aligned with reality instead of chasing ghost bugs in production.

Why Claude’s optimism is dangerous

Claude’s habit of agreeing with everything the user says is not just annoying—it pollutes the feedback loop. Whenever I probe a design choice it responds with “you are absolutely right” even if I was merely clarifying intent. That pushes its next suggestion toward whatever I implied instead of the implementation it originally believed in. Worse, it downplays partial failures by framing them as successes: “Updated the repositories and added tests,” followed by a buried note about the third file that couldn’t be opened. In practical terms, that means I have to re-read entire scrollbacks after every task and cross-check git status to ensure nothing was skipped. The time I spend auditing Claude’s output defeats the point of using it in the first place. Codex, in contrast, surfaces these issues loudly so I can choose whether to retry or finish the job myself.
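
The git status cross-check described above could be scripted along these lines. This is a minimal sketch rather than tooling used during the bake-off: the helper names and the sample file paths are hypothetical, and it assumes the assistant's edits are still uncommitted and that the working tree was clean before the task started.

```ruby
# Compare the files an assistant claims to have changed against what
# `git status --porcelain` actually reports for the working tree.

# Parse porcelain (v1) output into a list of paths.
# Each line is two status characters, a space, then the path.
def changed_paths(porcelain)
  porcelain.each_line.map do |line|
    line = line.chomp
    next if line.length < 4
    # Renames look like "old -> new"; keep the new path.
    line[3..].split(" -> ").last
  end.compact
end

# Files the assistant reported as done that git does not show as touched.
def unlanded_files(claimed, porcelain)
  claimed - changed_paths(porcelain)
end

# Hypothetical example: three files claimed, only two show up in git.
sample_status = [
  " M shared/QueueRepository.kt",
  "?? shared/QueueRepositoryTest.kt"
].join("\n")

claimed = %w[
  shared/QueueRepository.kt
  shared/QueueRepositoryTest.kt
  shared/QueueReducer.kt
]

puts unlanded_files(claimed, sample_status)
```

In a real session the second argument would come from running `git status --porcelain` in the repo; anything the helper prints is a file worth re-checking by hand.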

Deciding factors for sticking with Codex

Codex Plus costs $20 per month versus Claude’s $200 tier, but cost was the least interesting metric. What mattered is that Codex is willing to call tools on its own, especially for KMP topics where no model is perfectly up to date. Its default workflow fetches the latest documentation, copies type signatures into the conversation, and stitches them into the patch. That combination of humility and context gathering resulted in the fewest hallucinated APIs during the bake-off. Codex also keeps transcripts short and declarative, which makes it easy to scan. When a refactor can’t be completed, the log says so in the first sentence, lists the command that failed, and offers a manual workaround. After a month away from the desk, having a direct, agenda-free assistant removes friction. I can focus on Kotlin math instead of debating optimistic prose from a model that desperately wants to keep me happy.

Wrestling with Apple’s provisioning maze

Apple’s developer experience reminds me how easy Android has it. Android Studio lets me sideload a debug build onto any device immediately. iOS demanded a developer license, a two-factor certificate dance, and explicit device enrollment before Xcode would even talk to my phone. The provisioning portal caps registered test devices at 100 per device type per membership year, and every certificate expires annually, so I documented the serial numbers for my phone, tablet, and watch inside the repo to avoid wasting slots. With that paperwork settled, I let Xcode manage signing locally so the IDE could regenerate certificates whenever they rotated. The payoff is that I can now run the full Compose Multiplatform host experience on hardware, not just inside the simulator. Shake-to-open diagnostics confirmed that the shared networking layer, notifications, and state machine behave the same way they do on Android, which is crucial before I invite anyone else into the TestFlight build.

Building the TestFlight pipeline

The new GitHub Actions workflow mirrors the Android release job but targets macOS 14 runners and Xcode 16. It checks out the repo, installs the Kotlin toolchain, syncs CocoaPods (needed for the Compose runtime shim), and runs ./gradlew :iosApp:bundleRelease to produce an .ipa. Before Fastlane touches the artifact, the action injects the Apple signing assets: a base64-encoded App Store Connect API key, the AuthKey.p8 file, and a temporary keychain that hosts the distribution certificate for the duration of the job. Once the Gradle task completes, Fastlane’s pilot CLI takes the signed archive, associates the correct bundle identifier, and uploads it to TestFlight.

The entire process takes 30 minutes without caching because Compose Multiplatform still rebuilds native frameworks from scratch. macOS runners are billed at 10x the Linux rate, so every deployment consumes 300 of my 2,000 free minutes. That effectively caps me at six full uploads per month, which is unacceptable when I need to iterate daily. Until I stand up my own runner, I am reserving the hosted minutes for milestone builds and relying on local archives for day-to-day testing. To keep the workflow reproducible, I added a status badge to the repo and documented every environment variable (team id, bundle id, API key, and issuer id) inside docs/ios-ci.md so future me can rotate credentials without reverse engineering the pipeline.
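
The budget arithmetic can be worked through in a few lines. The sketch below uses only the figures quoted above (30 minutes per run, a 10x multiplier, 2,000 free minutes); the constant and method names are made up for illustration. Integer division leaves room for six complete uploads, since a seventh would need 2,100 billed minutes.

```ruby
# Hosted-runner budget, using the figures quoted in the text.
FREE_MINUTES     = 2_000 # monthly free allowance
MACOS_MULTIPLIER = 10    # macOS minutes are billed at 10x the Linux rate
BUILD_MINUTES    = 30    # wall-clock time of one uncached iOS upload

# Minutes charged against the quota for one run.
def billed_minutes(wall_clock, multiplier = MACOS_MULTIPLIER)
  wall_clock * multiplier
end

# Complete uploads that fit inside the free allowance (integer division).
def uploads_per_month(free = FREE_MINUTES, wall_clock = BUILD_MINUTES)
  free / billed_minutes(wall_clock)
end

puts billed_minutes(BUILD_MINUTES)        # 300
puts uploads_per_month                    # 6
# Hypothetical: if caching ever cut a run to 10 minutes.
puts uploads_per_month(FREE_MINUTES, 10)  # 20
```

The last line is only a what-if: the 10-minute figure is not a measurement, it just shows how sensitive the cap is to build time.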

Below is the trimmed Action that’s currently live. The real file also pulls secrets out of GitHub’s encrypted store so the workflow never squirts raw keys into logs:

jobs:
  ios-testflight:
    runs-on: macos-14
    env:
      APP_STORE_CONNECT_KEY_ID: ${{ secrets.ASC_KEY_ID }}
      APP_STORE_CONNECT_ISSUER_ID: ${{ secrets.ASC_ISSUER_ID }}
      APP_STORE_CONNECT_P8: ${{ secrets.ASC_P8 }}
      SHOKKEN_TEAM_ID: ${{ secrets.APPLE_TEAM_ID }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'
      - name: Install bundler & Fastlane
        run: gem install bundler:2.5.6 fastlane:2.222.0
      - name: Create temporary keychain
        # KEYCHAIN_PASSWORD and P12_PASSWORD come from GitHub's encrypted secrets (trimmed from this listing)
        run: |
          security create-keychain -p "$KEYCHAIN_PASSWORD" build.keychain
          security default-keychain -s build.keychain
          security unlock-keychain -p "$KEYCHAIN_PASSWORD" build.keychain
          security import ./certs/distribution.p12 -k build.keychain -P "$P12_PASSWORD" -T /usr/bin/codesign
      - name: Bootstrap Kotlin + CocoaPods
        run: ./gradlew podInstall
      - name: Build release archive
        run: ./gradlew :iosApp:bundleRelease
      - name: Upload to TestFlight
        run: fastlane pilot upload --ipa build/ios/Release-iphoneos/Shokken.ipa --team_id "$SHOKKEN_TEAM_ID"

Fastlane lives in fastlane/Fastfile, and the ios_testflight lane carries most of the signing logic so the Action can stay declarative. The lane mirrors what I run locally to smoke-test uploads:

lane :ios_testflight do
  api_key = app_store_connect_api_key(
    key_id: ENV.fetch("APP_STORE_CONNECT_KEY_ID"),
    issuer_id: ENV.fetch("APP_STORE_CONNECT_ISSUER_ID"),
    key_content: ENV.fetch("APP_STORE_CONNECT_P8"),
  )

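  build_app(
    workspace: "iosApp/iosApp.xcworkspace",
    scheme: "Shokken",
    export_method: "app-store",
    configuration: "Release",
    derived_data_path: "build/derivedData",
    # Write the .ipa to the path upload_to_testflight reads below
    output_directory: "build/ios/Release-iphoneos",
    output_name: "Shokken.ipa",
    clean: true,
    xcargs: "DEVELOPMENT_TEAM=#{ENV.fetch('SHOKKEN_TEAM_ID')}"
  )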

  upload_to_testflight(
    api_key: api_key,
    ipa: "build/ios/Release-iphoneos/Shokken.ipa",
    changelog: ENV.fetch("SHOKKEN_CHANGELOG", "Weekly build"),
    skip_waiting_for_build_processing: true
  )
end

Splitting the responsibilities this way keeps secrets in one place, makes it obvious which step failed (Gradle build vs. Fastlane upload), and gives me a single lane I can run on the studio Mac when I’m trying to reproduce an issue outside of CI.
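
Because the lane reads its credentials with ENV.fetch, a missing variable only surfaces when the lane reaches that line. A small preflight check can report every gap up front. The sketch below is an illustration, not part of the actual Fastfile: the method name is hypothetical, and the key list simply repeats the variable names the lane and workflow already use.

```ruby
# Variables the lane above reads via ENV.fetch.
REQUIRED_KEYS = %w[
  APP_STORE_CONNECT_KEY_ID
  APP_STORE_CONNECT_ISSUER_ID
  APP_STORE_CONNECT_P8
  SHOKKEN_TEAM_ID
].freeze

# Return the required keys that are unset or blank in the given environment.
# Works with ENV or with a plain Hash, which keeps it easy to test.
def missing_keys(env, required = REQUIRED_KEYS)
  required.select { |key| env[key].to_s.strip.empty? }
end

missing = missing_keys(ENV)
if missing.empty?
  puts "All signing variables present"
else
  warn "Missing signing variables: #{missing.join(', ')}"
end
```

Run on the studio Mac before a local upload, this lists every unset variable at once instead of failing on the first ENV.fetch.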

Planning the self-hosted runner

GitHub will happily tunnel jobs to a personal Mac, so the next move is to convert my studio machine into a persistent runner. The plan is to install the actions/runner binary, tag it with macos-shokken, and point the workflow to runs-on: [self-hosted, macos-shokken]. That lets every push trigger both Android and iOS build verifications without touching the hosted-minute quota. I will keep the runner locked down: a dedicated macOS user, limited sudo rights, a cron job that prunes DerivedData between runs, and Tailscale for secure remote access so I can pause the agent when I am traveling. Once the runner is stable I can re-enable branch protections that require a green iOS build before merging. That closes the loop that was missing this week: builds stay reproducible, TestFlight uploads stop feeling precious, and I regain confidence that every commit has been compiled against both mobile targets.
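
The DerivedData pruning job mentioned above could be as small as the sketch below. It is an assumption-laden illustration, not the script that will run on the studio Mac: the method name and the three-day threshold are hypothetical, and the path reuses the derived_data_path from the Fastlane lane.

```ruby
require "fileutils"

# Delete top-level entries under `root` whose modification time is older
# than `max_age_days`. Returns the list of paths that were removed.
def prune_derived_data(root, max_age_days: 3, now: Time.now)
  return [] unless Dir.exist?(root)

  cutoff = now - max_age_days * 24 * 60 * 60
  Dir.children(root)
     .map { |name| File.join(root, name) }
     .select { |path| File.lstat(path).mtime < cutoff } # lstat tolerates broken symlinks
     .each { |path| FileUtils.rm_rf(path) }
end

if __FILE__ == $PROGRAM_NAME
  # Path taken from the lane's derived_data_path; adjust if Xcode's default is used.
  removed = prune_derived_data("build/derivedData")
  puts "Pruned #{removed.size} stale DerivedData entries"
end
```

Scheduling it from cron (or launchd) under the dedicated runner user keeps the cleanup inside the same locked-down account as the builds.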

Next Week

Two priorities for next week. First, finish the self-hosted macOS runner so every branch pushes through the iOS build without torching my hosted minutes. That includes provisioning the machine, locking down access, and updating the workflow to fall back to GitHub’s runner only when my Mac is offline. Second, surface the TestFlight build so real testers can join. The binary is already in App Store Connect; I still need to create the external testing group, document installation instructions next to the Play alpha notes, and capture the compliance metadata Apple requires before sharing links. Once those two items land, both mobile platforms will ship from the same CI entry point and testers will have symmetrical instructions. That sets the stage for the next milestone: feature-level parity between Android and iOS instead of pipeline catch-up.
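
One way the fallback decision might look: GitHub's REST API lists a repository's self-hosted runners along with a status field and their labels, so a script can choose the runs-on value from that response. The sketch below only covers the decision itself and assumes the JSON has already been fetched. The method name and sample payload are hypothetical, and how the result is fed back into the workflow is left out.

```ruby
require "json"

SELF_HOSTED_LABELS = %w[self-hosted macos-shokken].freeze
HOSTED_FALLBACK    = "macos-14"

# Given the JSON body of a "list self-hosted runners" response, return the
# labels to use for runs-on: the studio Mac when it is online, otherwise
# GitHub's hosted macOS runner.
def pick_runner(runners_json)
  runners = JSON.parse(runners_json).fetch("runners", [])
  studio_online = runners.any? do |runner|
    names = runner.fetch("labels", []).map { |label| label["name"] }
    runner["status"] == "online" && (SELF_HOSTED_LABELS - names).empty?
  end
  studio_online ? SELF_HOSTED_LABELS : [HOSTED_FALLBACK]
end

# Hypothetical payload: the studio Mac is registered but offline.
sample = {
  "total_count" => 1,
  "runners" => [
    {
      "name" => "studio-mac",
      "status" => "offline",
      "labels" => [{ "name" => "self-hosted" }, { "name" => "macos-shokken" }]
    }
  ]
}.to_json

puts pick_runner(sample).inspect
```

With the sample payload the helper falls back to the hosted runner, so hosted minutes would only be spent while the Mac is unreachable.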
